5 — TF-IDF: A Traditional Approach to Feature Extraction in NLP using Python
In the last article, we covered Bag of Words (BoW), a Natural Language Processing technique used to convert a text document into numerical features that machine learning models can use.
The BoW method is simple and works well, but it has a weakness: it treats all words equally, so we can’t distinguish very common words from rare, informative ones. TF-IDF comes into play at this stage to solve this problem.
Unlike the bag of words model, the TF-IDF representation takes the importance of each word in a document into account. TF stands for term frequency, and IDF stands for inverse document frequency.
To understand TF-IDF, let’s first cover the two terms separately:
- Term frequency (TF)
- Inverse document frequency (IDF)
Term Frequency (TF)
Term frequency refers to how often a word occurs in a document. For a given word, it is defined as the ratio of the number of times the word appears in a document to the total number of words in that document:

TF(t, d) = (number of times t appears in d) / (total number of words in d)

where:
- t is the word or token.
- d is the document.
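As a quick illustration, here is a minimal sketch of this formula in plain Python (the function name term_frequency is ours, introduced just for this example):

def term_frequency(t, d):
    # TF(t, d) = (count of t in d) / (total number of words in d)
    words = d.lower().split()
    return words.count(t) / len(words)

print(term_frequency('product', 'really disappointed product'))  # 1/3 ≈ 0.33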
Inverse Document Frequency (IDF)
Inverse document frequency measures the importance of a word in the corpus. It looks at how common the word is across all the documents: the more documents a word appears in, the lower its IDF. A common formulation is:

IDF(t) = log(N / df(t))

where N is the total number of documents in the corpus and df(t) is the number of documents that contain the term t.
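Here is a minimal sketch of this formula as well (using the natural logarithm; the function name inverse_document_frequency is ours):

import math

def inverse_document_frequency(t, corpus):
    # IDF(t) = log(N / df(t)), where df(t) counts the documents containing t
    df = sum(1 for d in corpus if t in d.lower().split())
    return math.log(len(corpus) / df)

corpus = ['really disappointed product', 'love product', 'good feature']
print(inverse_document_frequency('product', corpus))  # log(3/2) ≈ 0.41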
TF-IDF Score
The TF-IDF score for a term in a document is calculated by multiplying its TF and IDF values:

TF-IDF(t, d) = TF(t, d) × IDF(t)

This score reflects how important the term is within the document and how distinctive it is across the entire corpus. Terms with higher TF-IDF scores are considered more significant.
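Putting the two pieces together, here is a self-contained sketch that scores terms in one document of a toy three-document corpus (the corpus and function name are ours, chosen for illustration):

import math

def tf_idf(t, d, corpus):
    words = d.lower().split()
    tf = words.count(t) / len(words)                           # term frequency
    df = sum(1 for doc in corpus if t in doc.lower().split())  # document frequency
    return tf * math.log(len(corpus) / df)

corpus = ['really disappointed product', 'love product', 'good feature']
# 'product' occurs in 2 of the 3 documents, so it is down-weighted;
# 'disappointed' occurs in only 1, so it scores higher in the same document.
print(tf_idf('product', corpus[0], corpus))       # (1/3) * log(3/2) ≈ 0.14
print(tf_idf('disappointed', corpus[0], corpus))  # (1/3) * log(3)   ≈ 0.37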
To understand this concept in depth, let’s revisit the example we used for bag of words in the previous article, this time with TF-IDF.
Imagine a social media platform that wants to analyze customer reviews and understand how popular its services are among users. The platform decides to employ the TF-IDF method to process these reviews.
Step 1: Data Preprocessing
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Required NLTK data (download once):
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

def preprocessing_text(text):
    lemmatizer = WordNetLemmatizer()
    # Matches emoji and other non-ASCII symbols so they can be stripped from the text.
    emoji_pattern = r'(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){1,2}|(?:\ud83d[\udc00-\ude4f]){1,2}|[\ud800-\udbff][\udc00-\udfff]|[\u0021-\u002f\u003a-\u0040\u005b-\u0060\u007b-\u007e]|\u3299|\u3297|\u303d|\u3030|\u24c2|\ud83c[\udd70-\udd71]|\ud83c[\udd7e-\udd7f]|\ud83c\udd8e|\ud83c[\udd91-\udd9a]|\ud83c[\udde6-\uddff]|\ud83c[\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|\ud83c[\ude32-\ude3a]|\ud83c[\ude50-\ude51]|\u203c|\u2049|\u25aa|\u25ab|\u25b6|\u25c0|\u25fb|\u25fc|\u25fd|\u25fe|\u2600|\u2601|\u260e|\u2611|[^\u0000-\u007F])+'
    text = text.lower()
    # Remove stopwords and lemmatize the remaining tokens.
    stop_words = set(stopwords.words('english'))
    words = [lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words]
    text = ' '.join(words)
    text = re.sub(r'[0-9]+', '', text)      # drop digits
    text = re.sub(r'[^\w\s]', '', text)     # drop punctuation
    text = re.sub(emoji_pattern, '', text)  # drop emoji and other non-ASCII symbols
    text = re.sub(r'\s+', ' ', text)        # collapse repeated whitespace
    return text
comments = """I am really disappointed this product.
I would not use it again. It has really bad feature.
I love this product! It has some good features"""
# Split the raw text into sentences; each sentence becomes one document in the corpus.
sentences_list = nltk.sent_tokenize(comments)
corpus = [preprocessing_text(sentence) for sentence in sentences_list]
print(corpus)
Output:
[
'really disappointed product',
'would use again',
'really bad feature',
'love product',
'good feature'
]
Unique Word List:
['again' 'bad' 'disappointed' 'feature' 'good' 'love' 'product' 'really'
'use' 'would']
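The code above doesn’t print this list itself; it is the vocabulary the vectorizer learns from the corpus. A minimal sketch of how to obtain it (using scikit-learn’s TfidfVectorizer, which Step 2 relies on as well):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit([
    'really disappointed product',
    'would use again',
    'really bad feature',
    'love product',
    'good feature'
])
print(vectorizer.get_feature_names_out())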
Step 2: Calculating the Product of Term Frequency and Inverse Document Frequency
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'really disappointed product',
    'would use again',
    'really bad feature',
    'love product',
    'good feature'
]

# Learn the vocabulary and transform the corpus into a TF-IDF matrix.
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Label each column with its vocabulary term for readability.
terms = tfidf_vectorizer.get_feature_names_out()
df = pd.DataFrame(tfidf_matrix.toarray(), columns=terms)
print(df)
Output:
(a 5 × 10 DataFrame of TF-IDF weights: one row per review, one column per vocabulary term)
The TF-IDF calculated by Scikit-learn’s TfidfTransformer and TfidfVectorizer is slightly different from the standard calculation. A constant 1 is added to the numerator and denominator of the IDF, as if an extra document containing every term in the collection exactly once had been seen, which prevents zero divisions; another 1 is added to the resulting IDF, and each row of the matrix is then L2-normalized:

IDF(t) = ln((1 + N) / (1 + df(t))) + 1

The standard calculation presented earlier doesn’t include these constants.
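To make this concrete, here is a small sketch (assuming numpy alongside scikit-learn) that recomputes the matrix by hand with the smoothed formula and checks that it matches TfidfVectorizer’s output:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'really disappointed product',
    'would use again',
    'really bad feature',
    'love product',
    'good feature'
]

vectorizer = TfidfVectorizer()
sklearn_matrix = vectorizer.fit_transform(corpus).toarray()
terms = vectorizer.get_feature_names_out()

# Smoothed IDF: idf(t) = ln((1 + N) / (1 + df(t))) + 1
n = len(corpus)
tf = np.array([[doc.split().count(t) for t in terms] for doc in corpus])
df = np.count_nonzero(tf, axis=0)
idf = np.log((1 + n) / (1 + df)) + 1

# Multiply TF by IDF, then L2-normalize each row (scikit-learn's default norm='l2').
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

print(np.allclose(manual, sklearn_matrix))  # True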
Conclusion
Through this article, we have explained Term Frequency — Inverse Document Frequency (TF-IDF) and how to use it with Python and NLP techniques.
In the next article, we will create a word cloud using NLP and TF-IDF in Python.
I hope it will be a useful article for you. If you stayed with me until the end, thank you for reading! Happy coding 🤞