9 — Understanding Word Embeddings in NLP

Aysel Aydin
3 min read · Jun 30, 2024


In this article, we will talk about word embeddings, the main techniques for creating them, and where they are used.

If you want to read more articles about NLP, don’t forget to stay tuned :)

Word Embedding

Word embedding is a technique that represents words with meaningful numbers. It maps words or phrases to vectors in a space with several dimensions, and it is a popular approach for learning numeric representations of text.

Why do we need Word Embeddings?

  • Computers understand only numbers.
  • Word embeddings are text converted into numbers.
  • A vector representation of a word may be a one-hot encoded vector like [0,0,0,1,0,0], but word embeddings are dense vectors like [0.2, -0.1, 0.4, 0.7] (see the sketch after this list).
  • Word embeddings are capable of capturing the context of a word in a document, semantic and syntactic similarity, relationships with other words, etc.
  • A word embedding is a learned representation for text where words that have the same meaning have similar representations.
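
To make the contrast between one-hot vectors and dense embeddings concrete, here is a minimal sketch in plain Python with NumPy. The vocabulary and the embedding values are invented for illustration; a real model would learn them from data.

```python
import numpy as np

# A toy vocabulary; in practice this comes from your corpus.
vocab = ["the", "cat", "sat", "on", "mat", "dog"]

def one_hot(word, vocab):
    # One-hot: a sparse vector as long as the vocabulary, with a single 1.
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("cat", vocab))  # [0. 1. 0. 0. 0. 0.]

# Dense embedding: a short vector of learned real numbers.
# These values are made up here; a trained model would produce them.
embeddings = {
    "cat": np.array([0.2, -0.1, 0.4, 0.7]),
    "dog": np.array([0.25, -0.05, 0.38, 0.65]),
}
print(embeddings["cat"])  # similar words end up with similar vectors
```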

Applications of Word Embedding

Word embedding can be used for a variety of NLP tasks. Here are some word embedding use cases:

  • Text classification
  • Named Entity Recognition (NER)
  • Semantic similarity and clustering (see the sketch after this list)
  • Sentiment Analysis
  • Machine translation
  • Information retrieval
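
As a small illustration of the semantic similarity use case, the sketch below measures how close two embeddings are with cosine similarity. The vectors are invented for the example; real ones would come from a trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: closer to 1.0 means more similar.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented example vectors; real embeddings come from a trained model.
cat = np.array([0.2, -0.1, 0.4, 0.7])
dog = np.array([0.25, -0.05, 0.38, 0.65])
car = np.array([-0.6, 0.8, 0.1, -0.2])

print(cosine_similarity(cat, dog))  # high: related words
print(cosine_similarity(cat, car))  # lower: unrelated words
```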

Word Embeddings Techniques

Below I briefly explain the most common techniques; I will cover each of them in detail in subsequent articles.

Word embeddings can be created using a variety of techniques, and choosing one depends on the specific requirements of the task: you have to consider the size of the dataset, the domain of the data, and the complexity of the language. Here is how some of the more popular word embedding techniques work:

1 — Word2Vec: Word2Vec is a widely used method in NLP. It is a type of word embedding that was introduced by Google in 2013. Word2Vec uses a neural network to learn word embeddings by predicting the context of a given word. Word2Vec aims to capture semantic relationships between words based on their co-occurrence patterns in a large corpus of text data.
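
As a concrete illustration, here is a minimal sketch of training Word2Vec with the gensim library (assuming gensim 4.x is installed). The tiny corpus is made up, so the resulting vectors are not meaningful; real training needs a large amount of text.

```python
from gensim.models import Word2Vec

# A toy corpus: a list of tokenized sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=1 selects the skip-gram architecture (sg=0 would be CBOW).
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

print(model.wv["cat"][:5])           # first dimensions of the learned vector
print(model.wv.most_similar("cat"))  # nearest neighbours by cosine similarity
```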

2 — GloVe: GloVe is another popular word embedding technique, introduced in 2014. Whereas Word2Vec captures only the local context of a word (during training it considers just the neighboring words), GloVe considers the entire corpus and builds a large matrix that captures the co-occurrence of words across the whole corpus.
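
GloVe vectors are usually consumed as pretrained files rather than trained from scratch. Below is a minimal loading sketch; the file name glove.6B.100d.txt is just an example of a pretrained file you would download separately from the Stanford GloVe page.

```python
import numpy as np

def load_glove(path):
    # Each line of a GloVe text file is: word value1 value2 ... valueN
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

glove = load_glove("glove.6B.100d.txt")  # example path, downloaded separately
print(glove["cat"].shape)                # (100,) for the 100-dimensional variant
```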

3 — FastText: FastText is another word embedding method, an extension of the Word2Vec model introduced by Facebook in 2016. By breaking words into subword units (character n-grams), FastText can capture morphological information and handle out-of-vocabulary words effectively.

FastText is particularly useful for morphologically rich languages and tasks requiring word similarity based on subword units.
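
To show the subword behaviour, here is a minimal sketch with gensim's FastText implementation (gensim 4.x assumed, toy corpus invented): a word that never appeared in training still gets a vector, composed from its character n-grams.

```python
from gensim.models import FastText

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# min_n and max_n control the size of the character n-grams used as subwords.
model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=5, epochs=50)

print("cats" in model.wv.key_to_index)  # False: "cats" was never seen in training
print(model.wv["cats"][:5])             # still works, built from subword n-grams
```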

4 — BERT: BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based deep learning model introduced by Google in 2018. Unlike the static embeddings above, it learns contextual word embeddings: the same word receives a different vector depending on the sentence in which it appears.
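
Here is a minimal sketch of extracting such contextual embeddings with the Hugging Face transformers library (assuming transformers and PyTorch are installed; bert-base-uncased is one publicly available checkpoint).

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The bank raised interest rates."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token, 768 dimensions for bert-base.
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)  # e.g. torch.Size([1, number_of_tokens, 768])
```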

I have explained One-Hot Encoding, Bag of Words and TF-IDF, the traditional approaches to representing text as numbers, in my previous articles.

Conclusion

In conclusion, word embeddings represent words with meaningful numbers that capture their semantic relationships. They have a wide range of applications, from text classification to machine translation and question answering.

Follow for more upcoming articles about NLP, ML & DL ❤️
Contact Accounts: Twitter, LinkedIn
