Topic Modelling
Introduction
In the age of social media, Twitter has become a fertile ground for data mining, sentiment analysis, and various other natural language processing (NLP) tasks. However, dealing with Spanish tweets adds another layer of complexity due to language-specific nuances, slang, abbreviations, and other colloquial expressions. tidyX aims to streamline the preprocessing pipeline for Spanish tweets, making them ready for various NLP tasks such as text classification, topic modeling, sentiment analysis, and more. In this tutorial, we will focus on a classification task based on Topic Modelling, showing preprocessing, modeling and results with real data snippets.
Context
Using data provided by Barómetro de Xenofobia, a non-profit organization that quantifies hate speech against migrants on social media, we aim to classify the overall conversation related to migrants. This is a common NLP task that involves preprocessing poorly written social media posts. The processed posts are then fed into an unsupervised topic model, Latent Dirichlet Allocation (LDA), to identify an optimal number of topics. This helps reveal the main discussion points concerning Venezuelan migrants in Colombia.
# Import the tidyX modules used in this tutorial
from tidyX import TextPreprocessor as tp
from tidyX import TextNormalization as tn
# Import other libraries needed in this tutorial
import pandas as pd
import os
import gensim
from gensim import corpora
from gensim.models import CoherenceModel
import tqdm
import numpy as np
import itertools
from collections import Counter
import pprint
import pyLDAvis
pyLDAvis.enable_notebook()
import pyLDAvis.gensim_models
import spacy
# Load a dataframe that contains 1000 tweets from Colombia discussing Venezuela
tweets = tp.load_data(file = "spanish")
tweets.head()
| | Snippet |
|---|---|
| 0 | RT @emilsen_manozca ¿Me regala una moneda pa u... |
| 1 | RT @CriptoNoticias Banco venezolano activa ser... |
| 2 | Capturado venezolano que asesinó a comerciante... |
| 3 | RT @PersoneriaVpar @PersoneriaVpar acompaña al... |
| 4 | Bueno ya sacaron la carta de "amenaza de atent... |
| 5 | @IvanDuque es muy bueno que se le dé respaldo ... |
| 6 | RT @RafaelG10099924 @mluciaramirez @Eganbernal... |
| 7 | #ParaVenezuelaPropongo que se levante el bloqu... |
| 8 | RT @geoduque La diferencia entre la preocupaci... |
| 9 | RT @PanamericanaTV ¡No le abrió la puerta de s... |
Preprocessing Tweets
We then use the preprocess() function to clean the sample and prepare it for modelling.
cleaning_process = lambda x: tp.preprocess(x, delete_emojis = True, extract = False, remove_stopwords = True, language_stopwords = "spanish")
# The raw text lives in the 'Snippet' column shown by tweets.head()
tweets['Clean_tweets'] = tweets['Snippet'].apply(cleaning_process)
Here is a random sample of tweets before and after cleaning:
# You can change the random_state for different samples
sample_tweets = tweets.sample(5, random_state = 1)
print("Before and After Text Cleaning:")
print('-' * 40)
for index, row in sample_tweets.iterrows():
    print(f"Original: {row['Snippet']}")
    print(f"Cleaned: {row['Clean_tweets']}")
    print('-' * 40)
Before and After Text Cleaning:
----------------------------------------
Original: Antes el pasaporte venezolano permitía la entrada en en sinfín de países del mundo. Hoy cada día estamos más limitados gracias al socialismo del siglo 21. Hasta Cuba, que saquea a Venezuela, nos impone una visa. #PeroTodoTieneSuFinal
Cleaned: pasaporte venezolano permitia entrada sinfin paises mundo hoy cada dia limitados gracias socialismo siglo cuba saquea venezuela impone visa
----------------------------------------
Original: @VickyDavilaH Bueno y si @AlvaroUribeVel se proclama presidente de una vez por todas y nombra a @IvanDuque ministro de guerra y lo deja que solito libere al pueblo venezolano, ¿será que le prestan atención a la grave crisis que vive el Chocó, que parece que solo cuentan con el Esmad ?
Cleaned: bueno proclama presidente vez todas nombra ministro guerra deja solito libere pueblo venezolano prestan atencion grave crisis vive choco parece solo cuentan esmad
----------------------------------------
Original: @zonacero Nomás quieren Telesur y Venezolana de Televisión, super imparcialicimos.
Cleaned: nomas quieren telesur venezolana television super imparcialicimos
----------------------------------------
Original: RT @XiomaryUrbaez Sr @jguaido yo, venezolana y residente en el país, SÍ QUIERO INTERVENCIÓN. Le agradezco que sin haber hecho una consulta pública sobre algo tan importante, no hable por mí. Gracias.
Cleaned: sr venezolana residente pais quiero intervencion agradezco haber hecho consulta publica tan importante hable gracias
----------------------------------------
Original: Y también las grandes masas de venezolanos queriendo refugiarse en Colombia, de verdad que esto es una gran insensatez descarada y cruel, porque todo está premeditadamente calculado.
Cleaned: grandes masas venezolanos queriendo refugiarse colombia verdad gran insensatez descarada cruel premeditadamente calculado
----------------------------------------
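To make the cleaning steps visible in the output above more concrete, here is a minimal standard-library sketch that approximates them: dropping the retweet prefix, mentions, hashtags, accents, digits, punctuation and stopwords. It is not the tidyX implementation. The function names `strip_accents` and `rough_clean` and the tiny stopword list are assumptions made for illustration, and `tp.preprocess()` may differ in details (for instance in how it treats "ñ" or emojis).

```python
import re
import unicodedata

# Tiny illustrative stopword list (an assumption); a real pipeline would use a full Spanish list.
STOPWORDS = {"el", "la", "los", "las", "de", "del", "en", "que", "y", "a",
             "al", "un", "una", "se", "nos", "por", "hasta"}

def strip_accents(text):
    # Decompose characters and drop combining marks, e.g. "í" -> "i".
    # Note: in this sketch "ñ" also becomes "n".
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def rough_clean(tweet):
    text = re.sub(r"^RT\s+", "", tweet)          # drop the retweet prefix
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)         # drop mentions and hashtags
    text = strip_accents(text.lower())           # lowercase, remove accents
    text = re.sub(r"[^a-z\s]", " ", text)        # keep letters only
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

example = "RT @user Hasta Cuba, que saquea a Venezuela, nos impone una visa. #PeroTodoTieneSuFinal"
print(rough_clean(example))  # cuba saquea venezuela impone visa
```

The order of the steps matters: mentions and hashtags are removed before punctuation is stripped, otherwise "@" and "#" would disappear and the handles would survive as ordinary words.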
Tokenize and lemmatize tweets in the dataset
We use the unnest_tokens() function to split each tweet into multiple rows, one token per row. This structure lets us aggregate identical terms into an auxiliary dataframe that acts as a dictionary of lemmas, so each distinct token is lemmatized only once.
We want an iterable of lemmatized, non-stopword tokens so that we can rebuild a cleaner version of each tweet. To achieve that, we call tn.lemmatizer(), which returns the base form (lemma) of a token according to the given language model.
# Load a pretrained spaCy model for Spanish
model_es = spacy.load("es_core_news_sm") # choose the model that fits your needs, see https://spacy.io/models
# Create a dictionary of tokens to lemmatize
word_dict = tp.unnest_tokens(df = tweets.copy(), input_column = 'Clean_tweets', id_col = None, unique = True)
# Lemmatize the tokens
word_dict["lemmatized_tweets"] = word_dict["Clean_tweets"].apply(lambda x: tn.lemmatizer(token = x, model = model_es))
# Rebuild the tweets using the lemmatized tokens
rebuild_tweets = tp.unnest_tokens(df = tweets.copy(), input_column = "Clean_tweets", id_col = None, unique = False)
tokenized_cleaned_tweets = rebuild_tweets \
    .merge(word_dict, how = "left", on = "Clean_tweets") \
    .groupby(["id_x", "Snippet"])[["lemmatized_tweets"]] \
    .agg(lambda x: " ".join(x)) \
    .reset_index()
tokenized_cleaned_tweets.head(3)
| | id_x | Snippet | lemmatized_tweets |
|---|---|---|---|
| 0 | 0 | RT @emilsen_manozca ¿Me regala una moneda pa u... | regalar moneda pa cafe venezolano no tuitero ah |
| 1 | 1 | RT @CriptoNoticias Banco venezolano activa ser... | banco venezolano activo servicio usuario cript... |
| 2 | 2 | Capturado venezolano que asesinó a comerciante... | capturado venezolano asesino comerciante merca... |
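The reason for building a token dictionary first is that lemmatization is the expensive step, so it pays to run it once per distinct token rather than once per occurrence. The sketch below shows the same idea with plain Python structures. `TOY_LEMMAS`, `toy_lemmatize` and `lemmatize_corpus` are hypothetical names, and the lookup table is a stand-in for a real lemmatizer such as `tn.lemmatizer()` backed by spaCy.

```python
# Hypothetical stand-in for a real lemmatizer (assumption for illustration only).
TOY_LEMMAS = {"venezolanos": "venezolano", "paises": "pais", "quieren": "querer"}

def toy_lemmatize(token):
    # Return the known lemma, or the token unchanged if it is not in the table.
    return TOY_LEMMAS.get(token, token)

def lemmatize_corpus(docs, lemmatize):
    # 1. "Unnest": collect the distinct tokens across all documents.
    vocabulary = {token for doc in docs for token in doc.split()}
    # 2. Lemmatize each distinct token exactly once.
    lemma_of = {token: lemmatize(token) for token in vocabulary}
    # 3. Rebuild every document by dictionary lookup, preserving token order.
    return [" ".join(lemma_of[token] for token in doc.split()) for doc in docs]

docs = ["venezolanos quieren paises", "paises venezolanos"]
print(lemmatize_corpus(docs, toy_lemmatize))
# ['venezolano querer pais', 'pais venezolano']
```

In this toy corpus the lemmatizer is called three times for five token occurrences. On real tweets, where a few words dominate, the saving is typically much larger.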
Here is a random sample of tweets before and after lemmatization:
tweets['lemmatized_tweets'] = tokenized_cleaned_tweets['lemmatized_tweets']
sample_tweets = tweets.sample(5, random_state=1) # You can change the random_state for different samples
print("Before and After Text Cleaning:")
print('-' * 40)
for index, row in sample_tweets.iterrows():
    print(f"Original: {row['Snippet']}")
    print(f"Cleaned: {row['lemmatized_tweets']}")
    print('-' * 40)
Before and After Text Cleaning:
----------------------------------------
Original: Antes el pasaporte venezolano permitía la entrada en en sinfín de países del mundo. Hoy cada día estamos más limitados gracias al socialismo del siglo 21. Hasta Cuba, que saquea a Venezuela, nos impone una visa. #PeroTodoTieneSuFinal
Cleaned: pasaporte venezolano permitia entrada sinfin pais mundo hoy cada diar limitado gracias socialismo siglo cuba saquea venezuela imponer vis
----------------------------------------
Original: @VickyDavilaH Bueno y si @AlvaroUribeVel se proclama presidente de una vez por todas y nombra a @IvanDuque ministro de guerra y lo deja que solito libere al pueblo venezolano, ¿será que le prestan atención a la grave crisis que vive el Chocó, que parece que solo cuentan con el Esmad ?
Cleaned: bueno proclamar presidente vez todo nombra ministro guerra dejar solitir liberar pueblo venezolano prestar atencion grave crisis vivir choco parecer solo contar esmad
----------------------------------------
Original: @zonacero Nomás quieren Telesur y Venezolana de Televisión, super imparcialicimos.
Cleaned: noma querer telesur venezolano television super imparcialicir
----------------------------------------
Original: RT @XiomaryUrbaez Sr @jguaido yo, venezolana y residente en el país, SÍ QUIERO INTERVENCIÓN. Le agradezco que sin haber hecho una consulta pública sobre algo tan importante, no hable por mí. Gracias.
Cleaned: sr venezolano residente pai querer intervencion agradecer haber hecho consulta publicar tanto importante hablar gracias
----------------------------------------
Original: Y también las grandes masas de venezolanos queriendo refugiarse en Colombia, de verdad que esto es una gran insensatez descarada y cruel, porque todo está premeditadamente calculado.
Cleaned: grande masa venezolano querer refugiar el colombia verdad gran insensatez descarado cruel premeditadamente calculado
----------------------------------------
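Topic models such as LDA usually do not read raw strings. They take a bag-of-words corpus in which each document is a list of (token id, count) pairs. The sketch below builds that representation with the standard library only, to show what the lemmatized tweets turn into before modelling. `build_vocabulary` and `to_bow` are hypothetical helper names. gensim's `corpora.Dictionary` and its `doc2bow` method produce a comparable structure, although the ids they assign may differ.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    # Assign an integer id to each distinct token, in order of first appearance.
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bow(doc, token_to_id):
    # Count occurrences per token id, ignoring tokens outside the vocabulary.
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())

lemmatized = [
    "grande masa venezolano querer refugiar",
    "pueblo venezolano grave crisis crisis",
]
tokenized = [doc.split() for doc in lemmatized]
vocab = build_vocabulary(tokenized)
corpus = [to_bow(doc, vocab) for doc in tokenized]
print(corpus[1])  # [(2, 1), (5, 1), (6, 1), (7, 2)]
```

Word order is discarded at this point, which is why the cleaning and lemmatization steps matter so much: they decide which surface forms are counted as the same term.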
Conclusion and Analysis of Toy Topic Modelling Results
In unsupervised classification, Topic Modelling is a powerful tool: it uncovers the underlying topics in a corpus of text and surfaces the narratives running through large textual data. It is particularly useful on a heterogeneous collection of tweets. With a more substantial and varied dataset, the results can be enhanced. Here, we made only a basic approximation of Topic Modelling to show how tidyX can help process social media data and prepare it for NLP tasks like this one.
For those eager to delve deeper into this subject, we recommend reaching out to Barómetro de Xenofobia, whose comprehensive data can greatly support research in this field.
With λ = 0.5 and a dataset of 1000 tweets, our toy exploration yielded the following topics:
Topic 1: Migrant Necessities and Frontier Struggles
Some Interesting Relevant Words: ayuda, pedir, querer, primero, niño, frontera, gobierno, vida.
This topic unveils the urgent necessities and pleas echoed by the migrants. It portrays a vivid picture of their journey, marked by vulnerability and struggle, especially among children at the frontiers. The narrative fluctuates between government interventions and intrinsic human endeavors for survival.
Topic 2: Migrant Flows and Perceived Economic Competition for Resources
Some Interesting Relevant Words: Bogotá, migrante, venir, salir, regresar, cualquiera, trabajar, quitar, sector, mil
Focusing on the economic dynamics within principal cities such as Bogotá, this topic captures the perceived competition for resources and the flow of the migrant population. It highlights employment, competition, and the economic changes associated with the presence of migrants. There is a vast literature that studies these labour market externalities and evaluates the impact of migrants seeking jobs in a foreign country.
Topic 3: Advocacy and Critique: The Landscape of Migrant Rights and Initiatives
Some Interesting Relevant Words: mal, poder, decir, ser, deber, criticar, derecho, igual, tratar, programa, fundación, presencia
This topic covers a lively discourse around rights, responsibilities, and critiques of migrants in Colombia. It captures a conversation about the political and social agendas related to the migrant context in Colombia.
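The λ = 0.5 setting mentioned before the topic list is, we assume, the relevance weight used by pyLDAvis to rank the words shown for each topic. Relevance is defined as λ · log p(w | t) + (1 − λ) · log(p(w | t) / p(w)): λ = 1 ranks words by their probability within the topic, while λ = 0 ranks them by lift, favouring words that are distinctive to the topic. The sketch below computes this ranking for made-up probabilities. The numbers and the function names `relevance` and `rank_terms` are illustrative assumptions, not output of the model in this tutorial.

```python
import math

def relevance(p_w_given_t, p_w, lam=0.5):
    # lam weighs in-topic probability against lift (topic probability / corpus probability).
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

def rank_terms(topic_probs, corpus_probs, lam=0.5):
    # Sort the words of one topic from most to least relevant.
    scores = {w: relevance(p, corpus_probs[w], lam) for w, p in topic_probs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Made-up probabilities for a single topic (assumption for illustration).
topic_probs = {"venezolano": 0.20, "frontera": 0.10, "ayuda": 0.08}
corpus_probs = {"venezolano": 0.20, "frontera": 0.02, "ayuda": 0.01}

print(rank_terms(topic_probs, corpus_probs, lam=1.0))  # ['venezolano', 'frontera', 'ayuda']
print(rank_terms(topic_probs, corpus_probs, lam=0.0))  # ['ayuda', 'frontera', 'venezolano']
```

A word like "venezolano", which is frequent in every topic of this corpus, ranks first by raw probability but last once lift is taken into account. An intermediate λ balances the two, which is why the word lists above contain terms that are both reasonably frequent and specific to each topic.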
In closing, our analysis, though constrained by the size of the dataset, serves as a gateway to the wide range of modelling tasks that tidyX attempts to address. It invites further exploration, promising a richer and evolving way to analyze social media data.