7 — Understanding N-Grams in NLP
Hello! In this article we will talk about n-grams. If you want to read more articles about NLP, stay tuned :)
Definition & Importance of N-Grams in NLP
N-grams are contiguous sequences of n items from a given text. These items can be words, letters, or symbols. For example, in the phrase “Natural Language Processing”:
- 1-grams (unigrams) are “Natural”, “Language” and “Processing”
- 2-grams (bigrams) are “Natural Language” and “Language Processing”
- 3-gram (trigram) is “Natural Language Processing”
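The list above comes from sliding a window of size n over the words. Here is a minimal sketch in plain Python; the `ngrams` helper below is our own illustration, not a library function:

```python
def ngrams(words, n):
    """Slide a window of size n over the word list and collect each window."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "Natural Language Processing".split()
print(ngrams(words, 1))  # [('Natural',), ('Language',), ('Processing',)]
print(ngrams(words, 2))  # [('Natural', 'Language'), ('Language', 'Processing')]
print(ngrams(words, 3))  # [('Natural', 'Language', 'Processing')]
```

The same function works for any n, which is why unigrams, bigrams, and trigrams are all just special cases of one idea.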
N-grams are important in NLP because they help analyze and predict the next item in a sequence. They are used in various applications like text prediction, speech recognition and machine translation. By understanding n-grams, we can improve the accuracy and efficiency of these applications.
Frequencies and Distributions
N-grams are studied by looking at how often they appear in a text. This helps us understand common patterns in language.
Calculating N-Gram Frequencies
To find n-gram frequencies, we break the text into n-grams and count each one.
Example: “I love natural language processing”
Unigrams (1-grams):
- I: 1
- love: 1
- natural: 1
- language: 1
- processing: 1
Bigrams (2-grams):
- I love: 1
- love natural: 1
- natural language: 1
- language processing: 1
Trigrams (3-grams):
- I love natural: 1
- love natural language: 1
- natural language processing: 1
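The counts listed above can be reproduced with a short script. This is a sketch using only the standard library (`collections.Counter`); every n-gram in this short sentence appears exactly once:

```python
from collections import Counter

tokens = "I love natural language processing".split()

# Count unigrams, bigrams, and trigrams with the same sliding window.
for n in (1, 2, 3):
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    print(f"{n}-grams:", Counter(grams))
```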
N-gram distributions show us how often different n-grams appear. In most languages, some n-grams appear very often, while many others appear rarely.
Knowing n-gram frequencies and distributions helps us build better NLP tools, understand language patterns, and predict text more accurately.
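This skew shows up even on a toy example. In the made-up sentence below, a couple of bigrams repeat while most occur only once:

```python
from collections import Counter

words = "the cat sat on the mat and the dog sat on the rug".split()
bigrams = [tuple(words[i:i + 2]) for i in range(len(words) - 1)]

# The repeated patterns ("sat on", "on the") rise to the top of the count.
print(Counter(bigrams).most_common(3))
```

On real corpora the effect is far stronger: a handful of n-grams dominate, and the rest form a long tail of rare sequences.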
Practical Examples with Python
In this section, we will look at some practical examples of using n-grams with Python. We will use a popular library called NLTK (Natural Language Toolkit).
```python
import nltk
from nltk import ngrams
from collections import Counter

nltk.download("punkt", quiet=True)  # tokenizer data used by word_tokenize

text = "I love natural language processing. It is very interesting."
tokens = nltk.word_tokenize(text)

bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print(Counter(bigrams))   # bigram frequencies
print(Counter(trigrams))  # trigram frequencies
```
Running this code tokenizes the text, builds the bigrams and trigrams, and prints how often each one occurs. By experimenting with it, you can see how n-grams are created and how their frequencies are calculated.
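As a small extension, the same bigram counts can power a toy next-word predictor, which is the idea behind the text-prediction applications mentioned earlier. This is a sketch: the corpus and the `predict_next` helper are made up for illustration:

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; real predictors are trained on far more text.
corpus = "i love nlp . i love python . i use python".split()

# Map each word to a Counter of the words observed right after it.
following = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    following[w1][w2] += 1

def predict_next(word):
    """Return the most frequent word seen after `word`, or None if unseen."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("i"))  # "love" (seen twice, vs. "use" once)
```

Real systems smooth these counts and back off to shorter n-grams for unseen contexts, but the core mechanism is this same frequency table.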