6 — Creating a Word Cloud using TF-IDF in Python

Aysel Aydin
3 min read · Apr 8, 2024


In this article, we will create a word cloud using TF-IDF in Python. Before we start, I recommend reading my previous article on TF-IDF.

What is a Word Cloud?

A word cloud is a visual way to show words from a piece of writing. In a word cloud, words are displayed in different sizes and colors. The bigger and bolder words are the ones that appear more often in the text.

The main idea of a word cloud is to quickly see which words are most important or common in a text. It helps to understand the main topics or themes of the writing based on word frequency. Word clouds are used in different areas:

  1. Text Summarization: Quickly identifying the most important or common words in a document.
  2. Keyword Analysis: Understanding the prominent terms in a collection of documents or articles.
  3. Sentiment Analysis: Visualizing the sentiment-related words in a corpus to gauge overall sentiment.
  4. Content Exploration: Discovering themes or patterns within large volumes of text data.

To make a word cloud, you take the text and count how many times each word appears. Then, you use a library such as wordcloud to draw the cloud, where words that appear more often are bigger and more noticeable. You can change the colors and layout to make the word cloud look interesting and meaningful.
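The counting step can be sketched with the standard library alone. This is a minimal illustration, assuming a naive whitespace split after lowercasing; the sample sentence and the `word_counts` name are made up for the example.

```python
from collections import Counter

# Illustrative sample text (not the dataset used later in the article)
sample = "python is popular and python is readable"

# Naive tokenization: lowercase, then split on whitespace
word_counts = Counter(sample.lower().split())

# Words with higher counts would be drawn larger in the cloud
print(word_counts.most_common(3))
```

A dictionary like `word_counts` (word to count) is the kind of input a word cloud tool turns into font sizes.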

Let’s create and clean some example text data that we’ll work with. You can get the text to be used to create the word cloud from a JSON file or a CSV file. In this article, we will read our text from a list.
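If your text lives in a file instead, loading it might look roughly like the sketch below. It uses in-memory strings in place of real files so it runs as-is. The JSON layout (a plain list of strings) and the CSV column name `text` are assumptions for illustration; adapt them to your own data.

```python
import csv
import io
import json

# In-memory stand-ins for real files; with files on disk you would use open(path)
json_source = io.StringIO('["Python is popular.", "NLP is fun."]')
csv_source = io.StringIO("text\nPython is popular.\nNLP is fun.\n")

# JSON: assumes the file holds a plain list of strings
texts_from_json = json.load(json_source)

# CSV: assumes a header row with a column named "text" (hypothetical name)
texts_from_csv = [row["text"] for row in csv.DictReader(csv_source)]

print(texts_from_json)
print(texts_from_csv)
```

Either way you end up with a list of strings, which is the shape the rest of this article works with.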

texts = [
"Python is a popular programming language.",
"NLP is a field of artificial intelligence that focuses on the interaction between computers and human language.",
"Sentiment analysis is the process of classifying the emotional intent of text.",
"Machine learning is an important application of AI.",
"Natural Language Processing is used for text analysis.",
"Python libraries like scikit-learn and NLTK are used in NLP.",
"AI and machine learning are transforming industries.",
"If you are interested in NLP, stay tuned!"
]

We’ll also clean the data to remove unnecessary characters and ensure consistency.

import re

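def clean_text(text):
    # Lowercase so that "Python" and "python" count as the same word
    clean_txt = text.lower()
    # Remove everything except digits, letters (including Turkish letters) and whitespace
    clean_txt = re.sub(r'[^0-9a-zçğıiöşü\s]', '', clean_txt,
                       flags=re.IGNORECASE)

    # Collapse repeated whitespace; the original word order is kept
    return ' '.join(clean_txt.split())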


cleaned_texts = [clean_text(text) for text in texts]

Now, let’s use the cleaned text data to create TF-IDF vectors. Each vector assigns a weight to every word in a text: the weight rises with how often the word appears in that text and falls with how many of the texts contain it.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

tfidf_matrix = vectorizer.fit_transform(cleaned_texts)
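To make the scores less of a black box, here is a pure-Python sketch of the calculation for a single document. It assumes the formula that, as far as I know, scikit-learn applies with default settings: idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector. The tiny corpus and the `tfidf_for` helper are hypothetical and only serve the illustration.

```python
import math
from collections import Counter

# Tiny illustrative corpus, already cleaned and lowercased
docs = [
    "python is popular",
    "nlp is fun",
    "python and nlp",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_for(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(tokens)
    # Raw weight = term count * smoothed inverse document frequency
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    # Scale the vector to unit length
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

scores = tfidf_for(tokenized[0])
print(scores)
```

In the first document, "popular" occurs in only one text, so it ends up with a higher weight than "python" and "is", which each occur in two.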

The steps we have followed so far are as follows:

  1. Clean the dataset
  2. Fit TF-IDF model

Now, let’s generate the word cloud!

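from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np

# vocabulary_ maps each term to its column index, not to a score,
# so sum every term's TF-IDF weight across all documents instead
weights = np.asarray(tfidf_matrix.sum(axis=0)).ravel()
frequencies = dict(zip(vectorizer.get_feature_names_out(), weights))

wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(frequencies)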

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Output: (word cloud image)

In this article, we covered how to create a word cloud using TF-IDF with Python.

I hope you find it useful. If you stayed with me until the end, thank you for reading! Happy coding 🤞

Contact Accounts: Twitter, LinkedIn


Written by Aysel Aydin

Master Expert AI & ML Engineer @Turkcell
