4 — Bag of Words Model in NLP
In this article, we will cover the Bag of Words model, continuing my series of articles on NLP.
Bag of Words (BoW) is a Natural Language Processing strategy for converting a text document into numbers that can be used by a computer program. This method involves converting text into a vector based on the frequency of words in the text, without considering the order or context of the words.
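The idea can be sketched in a few lines of plain Python before we bring in any libraries (the two toy sentences below are made up for illustration):

```python
from collections import Counter

# Two toy documents (made up for illustration)
docs = ["great service great price", "bad service"]

# Step 1: build the vocabulary -- every unique word, in a fixed order
vocab = sorted(set(word for doc in docs for word in doc.split()))

# Step 2: represent each document as a vector of word counts,
# ignoring word order and context
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocab])

print(vocab)    # ['bad', 'great', 'price', 'service']
print(vectors)  # [[0, 2, 1, 1], [1, 0, 0, 1]]
```

Note that "great" appearing twice in the first document shows up as a count of 2, while its position in the sentence is lost entirely.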
Let’s examine the example we gave for tokenization in our previous article with BoW.
Imagine a social media platform that aims to analyze customer reviews and understand the popularity of services among users. This platform decides to employ the Bag of Words method for processing customer reviews.
Data Collection: The first step involves collecting and storing customer reviews, which consist of text written by customers about various services.
Preprocessing: Text data is cleaned by removing punctuation marks, numbers and unnecessary whitespace.
Creating a Word List: A word list is created for BoW. This list includes all the unique words in the dataset.
Text Representation: Each customer review is represented using the BoW method: each word's count is stored at the position corresponding to that word in the word list. For example, the BoW representation of the phrase “great service” could look like [service: 1, great: 1, other_words: 0].
Analysis and Classification: With this representation method, the platform can analyze how popular services are among customers and identify which services receive positive or negative reviews. For instance, if a service’s BoW representation frequently includes positive terms like “high quality” and “affordable,” it can be inferred that the service receives positive feedback.
Improvement: Based on the results obtained, the platform can take steps to optimize its service portfolio and enhance the overall customer experience.
In this way, the BoW method enables the social media platform to analyze customer reviews, monitor service performance and make improvements effectively.
Let’s apply the above steps to a sample text.
Firstly, let’s write a function to preprocess our text. This function removes stop words, emojis, numbers, punctuation marks, and excess whitespace, and converts all characters to lowercase. Note that lowercasing happens last, so capitalized stop words such as “I” and “It” pass through the stop-word filter, which is why they still appear in the output below.
import nltk
from nltk.stem import WordNetLemmatizer
import re
from nltk.corpus import stopwords

# First run may require downloading these NLTK resources:
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

def preprocessing_text(text):
    lemmatizer = WordNetLemmatizer()
    emoji_pattern = r'^(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){1,2}|(?:\ud83d[\udc00-\ude4f]){1,2}|[\ud800-\udbff][\udc00-\udfff]|[\u0021-\u002f\u003a-\u0040\u005b-\u0060\u007b-\u007e]|\u3299|\u3297|\u303d|\u3030|\u24c2|\ud83c[\udd70-\udd71]|\ud83c[\udd7e-\udd7f]|\ud83c\udd8e|\ud83c[\udd91-\udd9a]|\ud83c[\udde6-\uddff]|\ud83c[\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|\ud83c[\ude32-\ude3a]|\ud83c[\ude50-\ude51]|\u203c|\u2049|\u25aa|\u25ab|\u25b6|\u25c0|\u25fb|\u25fc|\u25fd|\u25fe|\u2600|\u2601|\u260e|\u2611|[^\u0000-\u007F])+$'
    stop_words = set(stopwords.words('english'))
    # Drop stop words and reduce each remaining word to its lemma
    text = text.split()
    text = [lemmatizer.lemmatize(word) for word in text if word not in stop_words]
    text = ' '.join(text)
    # Strip numbers, punctuation and emojis
    text = re.sub(r'[0-9]+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(emoji_pattern, '', text)
    # Collapse extra whitespace, then lowercase
    text = re.sub(r'\s+', ' ', text)
    text = text.lower().strip()
    return text
paragraph = """I am really disappointed this product. I would not use it again. It has really bad feature.
I love this product! It has some good features"""
sentences_list = nltk.sent_tokenize(paragraph)
corpus = [preprocessing_text(sentence) for sentence in sentences_list]
print(corpus)
Output:
[
'i really disappointed product',
'i would use again',
'it really bad feature',
'i love product',
'it good feature'
]
Now, we will create a Bag of Words model using the CountVectorizer class from scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names_out()
X_array = X.toarray()
print("Unique Word List: \n", feature_names)
print("Bag of Words Matrix: \n", X_array)
Unique Word List:
['again' 'bad' 'disappointed' 'feature' 'good' 'it' 'love' 'product'
'really' 'use' 'would']
Bag of Words Matrix:
[[0 0 1 0 0 0 0 1 1 0 0]
[1 0 0 0 0 0 0 0 0 1 1]
[0 1 0 1 0 1 0 0 1 0 0]
[0 0 0 0 0 0 1 1 0 0 0]
[0 0 0 1 1 1 0 0 0 0 0]]
Let’s put the result into a DataFrame to display it more readably, with the sentences as rows and the vocabulary words as columns.
import pandas as pd
df = pd.DataFrame(data=X_array, columns=feature_names, index=corpus)
print(df)
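To make the “Analysis and Classification” step above concrete, the BoW matrix can be fed directly into a standard classifier. The sketch below is only an illustration: the sentiment labels are hypothetical, and Multinomial Naive Bayes is just one reasonable choice, not something this example prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# The preprocessed reviews from the example above
corpus = [
    'i really disappointed product',
    'i would use again',
    'it really bad feature',
    'i love product',
    'it good feature',
]
# Hypothetical sentiment labels: 0 = negative, 1 = positive
labels = [0, 0, 0, 1, 1]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Train a simple classifier on the BoW vectors
clf = MultinomialNB().fit(X, labels)

# New reviews must be encoded with the SAME fitted vectorizer;
# words outside the learned vocabulary are simply ignored
new_review = vectorizer.transform(['love good feature'])
pred = clf.predict(new_review)
print(pred)  # [1] -> classified as positive
```

Reusing the fitted vectorizer for new text is the key detail here: it guarantees the new vector lines up column-for-column with the training matrix.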
Conclusion
Through this article, we have learned about the Bag of Words model.
In summary, Bag of Words converts the words of a text into a matrix representation by extracting word-count features: it tells us which words occur in each sentence and how often, in a form that machine learning algorithms can consume.
In the next article, we will cover the TF-IDF topic.
I hope it will be a useful article for you. If you stayed with me until the end, thank you for reading! Happy coding 🤞