Word Cloud
This tutorial demonstrates how to create a word cloud from a corpus of tweets written in Spanish.
First, set up the necessary modules from the package:
# Import modules from the tidyX package for text preprocessing, normalization, and visualization
from tidyX import TextPreprocessor as tp
from tidyX import TextNormalization as tn
from tidyX import TextVisualizer as tv
# Import auxiliary libraries for data manipulation and visualization
import os
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import spacy # Natural Language Processing library
# Load a dataframe that contains 1000 tweets from Colombia discussing Venezuela
tweets = tp.load_data(file="spanish")
tweets.head() # Display the first few rows of the dataframe
| Tweet | |
|---|---|
| 0 | RT @emilsen_manozca ¿Me regala una moneda pa u... |
| 1 | RT @CriptoNoticias Banco venezolano activa ser... |
| 2 | Capturado venezolano que asesinó a comerciante... |
| 3 | RT @PersoneriaVpar @PersoneriaVpar acompaña al... |
| 4 | Bueno ya sacaron la carta de "amenaza de atent... |
For illustrative purposes, generate a word cloud without preprocessing the text:
# Combine all individual tweets into one large text string
text = " ".join(doc for doc in tweets['Tweet'])
# Generate a word cloud from the combined text
wordcloud = WordCloud(background_color="white", width=800, height=400).generate(text)
# Visualize and display the generated word cloud
plt.figure(figsize=(10, 5))
plt.title("WordCloud before tidyX")
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off") # Hide the axis for better aesthetics
Next, preprocess the tweets using techniques outlined in the Stemming and Lemmatizing Texts Efficiently tutorial:
# Clean the text: remove emojis, stopwords, and apply other preprocessing steps
tweets['clean'] = tweets['Tweet'].apply(lambda x: tp.preprocess(x,
delete_emojis=False,
remove_stopwords=True,
language_stopwords="spanish"))
# Tokenize the cleaned text to create a dictionary for normalization
dictionary_normalization = tp.unnest_tokens(df=tweets.copy(), input_column="clean", id_col=None, unique=True)
# Load the Spanish language model from spacy for lemmatization
model_es = spacy.load("es_core_news_sm")
# Apply lemmatization to the tokens to reduce words to their base form
dictionary_normalization["lemma"] = dictionary_normalization["clean"].apply(lambda x: tn.lemmatizer(token=x, model=model_es))
# Remove any stopwords introduced after lemmatization
dictionary_normalization["lemma"] = dictionary_normalization["lemma"].apply(lambda x: tp.remove_words(x, remove_stopwords=True, language="spanish"))
# Reconstruct the original tweets using lemmatized tokens
tweets_long = tp.unnest_tokens(df=tweets.copy(), input_column="clean", id_col=None, unique=False)
tweets_normalized = tweets_long \
.merge(dictionary_normalization, how="left", on="clean") \
.groupby(["id_x", "Tweet"])["lemma"] \
.agg(lambda x: " ".join(x)) \
.reset_index()
| id_x | Tweet | lemma | |
|---|---|---|---|
| 0 | 0 | RT @emilsen_manozca ¿Me regala una moneda pa u... | regalar moneda pa cafar venezolano tuitero ah... |
| 1 | 1 | RT @CriptoNoticias Banco venezolano activa ser... | banco venezolano activo servicio usuario cript... |
| 2 | 2 | Capturado venezolano que asesinó a comerciante... | capturado venezolano asesino comerciante merca... |
| 3 | 3 | RT @PersoneriaVpar @PersoneriaVpar acompaña al... | acompa grupo especial migratorio cesar reunion... |
| 4 | 4 | Bueno ya sacaron la carta de "amenaza de atent... | bueno sacar cartar amenazar atentado president... |
for i in range(3):
print("-"*50)
print("Example", i + 1)
print("Original tweet:", tweets_normalized.loc[i, "Tweet"])
print("Lemmatized tweet:", tweets_normalized.loc[i, "lemma"])
--------------------------------------------------
Example 1
Original tweet: RT @emilsen_manozca ¿Me regala una moneda pa un café? -¿Eres venezolano? Noo! Tuitero. -Ahhh 😂😂😂👋
Lemmatized tweet: regalar moneda pa cafar venezolano tuitero ah 😂 👋
--------------------------------------------------
Example 2
Original tweet: RT @CriptoNoticias Banco venezolano activa servicio para usuarios de criptomonedas #ServiciosFinancieros https://t.co/1r2rZIUdlo
Lemmatized tweet: banco venezolano activo servicio usuario criptomoneda
--------------------------------------------------
Example 3
Original tweet: Capturado venezolano que asesinó a comerciante del Mercado Público https://t.co/XrmWKVYMR8 https://t.co/CfMLaB25jI
Lemmatized tweet: capturado venezolano asesino comerciante mercado publico
Lastly, generate a word cloud using the preprocessed and lemmatized text:
# Combine all lemmatized tweets into one large text string
text = " ".join(doc for doc in tweets_normalized['lemma'])
# Generate a word cloud from the lemmatized text
wordcloud = WordCloud(background_color="white", width=800, height=400).generate(text)
# Visualize and display the new word cloud
plt.figure(figsize=(10, 5))
plt.title("WordCloud after tidyX")
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off") # Hide the axis for better aesthetics