7 — Understanding N-Grams in NLP

Aysel Aydin
2 min read · May 19, 2024


Hello! In this article, we will talk about n-grams. If you want to read more of my articles about NLP, don’t forget to stay tuned :)

Definition & Importance of N-Grams in NLP

N-grams are sequences of n items from a given text. These items can be words, letters, or symbols. For example, in the phrase “Natural Language Processing”:

  • The 1-grams (unigrams) are “Natural”, “Language” and “Processing”
  • The 2-grams (bigrams) are “Natural Language” and “Language Processing”
  • The 3-gram (trigram) is “Natural Language Processing”

N-grams are important in NLP because they help analyze and predict the next item in a sequence. They are used in various applications like text prediction, speech recognition and machine translation. By understanding n-grams, we can improve the accuracy and efficiency of these applications.
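To make the prediction idea concrete, here is a minimal sketch (the corpus and the `predict_next` helper are my own illustration, not from any particular library) that picks the next word with the highest bigram count:

```python
from collections import Counter, defaultdict

# Toy corpus; a real predictor would be trained on far more text.
corpus = "I love natural language processing and I love learning".split()

# For each word, count how often each other word follows it.
next_word_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_word_counts[current][nxt] += 1

def predict_next(word):
    """Return the word most frequently seen after `word`, or None."""
    counts = next_word_counts.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("I"))  # "love" follows "I" twice in the corpus
```

This is the core of simple text prediction: count which items follow which, then suggest the most frequent follower.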

Frequencies and Distributions

N-grams are studied by looking at how often they appear in a text. This helps us understand common patterns in language.

Calculating N-Gram Frequencies
To find n-gram frequencies, we break the text into n-grams and count each one.

Example: “I love natural language processing”

Unigrams (1-grams):

  • I: 1
  • love: 1
  • natural: 1
  • language: 1
  • processing: 1

Bigrams (2-grams):

  • I love: 1
  • love natural: 1
  • natural language: 1
  • language processing: 1

Trigrams (3-grams):

  • I love natural: 1
  • love natural language: 1
  • natural language processing: 1
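The counts above can be reproduced in a few lines of plain Python (a minimal sketch with no external libraries; the `ngram_counts` helper name is my own):

```python
from collections import Counter

text = "I love natural language processing"
tokens = text.split()

def ngram_counts(tokens, n):
    """Slide a window of size n over the tokens and count each n-gram."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

print(ngram_counts(tokens, 1))  # five unigrams, each appearing once
print(ngram_counts(tokens, 2))  # "I love", "love natural", ...
print(ngram_counts(tokens, 3))  # three trigrams
```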

N-gram distributions show us how often different n-grams appear. In most languages, some n-grams appear very often, while many others appear rarely.
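To get a feel for such a skewed distribution, `Counter.most_common` sorts n-grams by frequency. A small sketch on a toy sentence with one repeated bigram:

```python
from collections import Counter

tokens = "the cat sat on the mat and the cat ran".split()
bigrams = [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

# Most bigrams occur once; "the cat" occurs twice - a tiny
# example of the uneven shape of n-gram distributions.
for bigram, count in Counter(bigrams).most_common(3):
    print(bigram, count)
```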

Knowing n-gram frequencies and distributions helps us build better NLP tools, understand language patterns, and predict text more accurately.

Practical Examples with Python

In this section, we will look at practical examples of using n-grams with Python, using the popular NLTK (Natural Language Toolkit) library.

import nltk
from nltk import ngrams
from collections import Counter

# nltk.download('punkt')  # uncomment on first run to fetch the tokenizer data

text = "I love natural language processing. It is very interesting."

tokens = nltk.word_tokenize(text)

bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

bigram_freq = Counter(bigrams)
trigram_freq = Counter(trigrams)

print(bigram_freq)
print(trigram_freq)

This code tokenizes the text, builds bigrams and trigrams with nltk.ngrams, and counts them with Counter. By running it, you can see how n-grams are created and how their frequencies are calculated.

In this article, we covered what n-grams are and worked through examples to understand them better.

I hope it will be a useful article for you. If you stayed with me until the end, thank you for reading! Happy coding 🤞

Contact Accounts: Twitter, LinkedIn
