Topic Modelling

Introduction

In the age of social media, Twitter has become a fertile ground for data mining, sentiment analysis, and various other natural language processing (NLP) tasks. However, dealing with Spanish tweets adds another layer of complexity due to language-specific nuances, slang, abbreviations, and other colloquial expressions. tidyX aims to streamline the preprocessing pipeline for Spanish tweets, making them ready for various NLP tasks such as text classification, topic modeling, sentiment analysis, and more. In this tutorial, we will focus on a classification task based on Topic Modelling, showing preprocessing, modeling and results with real data snippets.

Context

Using data provided by Barómetro de Xenofobia, a non-profit organization that quantifies the amount of hate speech against migrants on social media, we aim to classify the overall conversation related to migrants. This is a common NLP task that involves preprocessing poorly-written social media posts. Subsequently, these processed posts are fed into an unsupervised Topic Classification Model (LDA) to identify an optimal number of cluster topics. This helps reveal the main discussion points concerning Venezuelan migrants in Colombia.

# Import TidyX and other libraries.
from tidyX import TextPreprocessor as tp
from tidyX import TextNormalization as tn

# Import other libraries needed in this tutorial
import pandas as pd
import os
import gensim
from gensim import corpora
from gensim.models import CoherenceModel
import tqdm
import numpy as np
import itertools
from collections import Counter
import pprint
import pyLDAvis
pyLDAvis.enable_notebook()
import pyLDAvis.gensim_models
import spacy

# Load a dataframe that contains 1000 tweets from Colombia discussing Venezuela
tweets = tp.load_data(file = "spanish")
tweets.head()
Snippet
0 RT @emilsen_manozca ¿Me regala una moneda pa u...
1 RT @CriptoNoticias Banco venezolano activa ser...
2 Capturado venezolano que asesinó a comerciante...
3 RT @PersoneriaVpar @PersoneriaVpar acompaña al...
4 Bueno ya sacaron la carta de "amenaza de atent...
5 @IvanDuque es muy bueno que se le dé respaldo ...
6 RT @RafaelG10099924 @mluciaramirez @Eganbernal...
7 #ParaVenezuelaPropongo que se levante el bloqu...
8 RT @geoduque La diferencia entre la preocupaci...
9 RT @PanamericanaTV ¡No le abrió la puerta de s...

Preprocessing Tweets

We then use the preprocess() function to clean the sample and prepare it for modelling.

cleaning_process = lambda x: tp.preprocess(x, delete_emojis = True, extract = False, remove_stopwords = True, language_stopwords = "spanish")
tweets['Clean_tweets'] = tweets['Snippet'].apply(cleaning_process)

Here is a random sample of tweets before and after cleaning:

# You can change the random_state for different samples
sample_tweets = tweets.sample(5, random_state = 1)

print("Before and After Text Cleaning:")
print('-' * 40)
for index, row in sample_tweets.iterrows():
    print(f"Original: {row['Snippet']}")
    print(f"Cleaned: {row['Clean_tweets']}")
    print('-' * 40)
Before and After Text Cleaning:
----------------------------------------
Original: Antes el pasaporte venezolano permitía la entrada en en sinfín de países del mundo. Hoy cada día estamos más limitados gracias al socialismo del siglo 21. Hasta Cuba, que saquea a Venezuela, nos impone una visa. #PeroTodoTieneSuFinal
Cleaned: pasaporte venezolano permitia entrada sinfin paises mundo hoy cada dia limitados gracias socialismo siglo cuba saquea venezuela impone visa
----------------------------------------
Original: @VickyDavilaH Bueno y si @AlvaroUribeVel se proclama presidente de una vez por todas y nombra a @IvanDuque ministro de guerra y lo deja que solito libere al pueblo venezolano, ¿será que le prestan atención a la grave crisis que vive el Chocó, que parece que solo cuentan con el Esmad ?
Cleaned: bueno proclama presidente vez todas nombra ministro guerra deja solito libere pueblo venezolano prestan atencion grave crisis vive choco parece solo cuentan esmad
----------------------------------------
Original: @zonacero Nomás quieren Telesur y Venezolana de Televisión, super imparcialicimos.
Cleaned: nomas quieren telesur venezolana television super imparcialicimos
----------------------------------------
Original: RT @XiomaryUrbaez Sr @jguaido yo, venezolana y residente en el país, SÍ QUIERO INTERVENCIÓN. Le agradezco que sin haber hecho una consulta pública sobre algo tan importante, no hable por mí. Gracias.
Cleaned: sr venezolana residente pais quiero intervencion agradezco haber hecho consulta publica tan importante hable gracias
----------------------------------------
Original: Y también las grandes masas de venezolanos queriendo refugiarse en Colombia, de verdad que esto es una gran insensatez descarada y cruel, porque todo está premeditadamente calculado.
Cleaned: grandes masas venezolanos queriendo refugiarse colombia verdad gran insensatez descarada cruel premeditadamente calculado
----------------------------------------
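
To make the cleaning step more concrete, the sketch below reproduces by hand the kind of transformations visible in the output above: dropping mentions, hashtags and URLs, lowercasing, stripping accents, and removing stopwords. It is not tidyX's implementation. The function name rough_clean and the tiny stopword set are assumptions made for illustration, and tp.preprocess() may handle edge cases differently.

```python
import re
import unicodedata

# Tiny illustrative stopword set (assumption; a real list is much larger)
STOPWORDS = {"el", "la", "de", "que", "y", "a", "en", "del", "al", "una", "nos", "hasta"}

def rough_clean(text):
    """Hypothetical helper that mimics common tweet-cleaning steps."""
    text = re.sub(r"^RT\s+", "", text)          # drop the retweet marker
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)        # drop mentions and hashtags
    text = text.lower()
    # Strip accents: decompose characters, then drop combining marks
    # (note that this also turns "ñ" into "n")
    text = "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )
    text = re.sub(r"[^a-z\s]", " ", text)       # keep only letters and spaces
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

cleaned = rough_clean(
    "Hasta Cuba, que saquea a Venezuela, nos impone una visa. #PeroTodoTieneSuFinal"
)
print(cleaned)
```

Keeping each step as a separate regular expression makes it easy to switch individual steps on or off, which is roughly what the keyword arguments of preprocess() appear to do.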

Tokenize and lemmatize tweets in the dataset

We use the unnest_tokens() function to split each tweet into multiple rows, one token per row. This structure lets us aggregate identical terms and build an auxiliary dataframe that acts as a dictionary of lemmas.

We want an iterable of lemmatized, non-stopword tokens so that we can rebuild a cleaner version of each tweet. To get it, we call tn.lemmatizer(), which returns the base form (lemma) of a token according to the language model supplied.

# load a spaCy model, depending on language, out-of-the-box
model_es = spacy.load("es_core_news_sm") # depends on your needs. Please visit: https://spacy.io/models
# Create a dictionary of tokens to lemmatize
word_dict = tp.unnest_tokens(df = tweets.copy(), input_column = 'Clean_tweets', id_col = None, unique = True)
# Lemmatize the tokens
word_dict["lemmatized_tweets"] = word_dict["Clean_tweets"].apply(lambda x: tn.lemmatizer(token = x, model = model_es))
# Rebuild the tweets using the lemmatized tokens
rebuild_tweets = tp.unnest_tokens(df = tweets.copy(), input_column = "Clean_tweets", id_col = None, unique = False)
tokenized_cleaned_tweets = (
    rebuild_tweets
    .merge(word_dict, how = "left", on = "Clean_tweets")
    .groupby(["id_x", "Snippet"])[["lemmatized_tweets"]]
    .agg(lambda x: " ".join(x))
    .reset_index()
)
tokenized_cleaned_tweets.head(3)
id_x Snippet lemmatized_tweets
0 0 RT @emilsen_manozca ¿Me regala una moneda pa u... regalar moneda pa cafe venezolano no tuitero ah
1 1 RT @CriptoNoticias Banco venezolano activa ser... banco venezolano activo servicio usuario cript...
2 2 Capturado venezolano que asesinó a comerciante... capturado venezolano asesino comerciante merca...
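
The unnest, lemmatize, and rebuild pattern used above can be sketched with plain pandas. This is only an illustration of the idea: the toy_lemmas lookup stands in for the spaCy model, and the column names are assumptions rather than the ones tidyX produces. The point is that each distinct token is lemmatized once and the result is joined back to every occurrence.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [0, 1],
    "clean": ["venezolanos queriendo refugiarse", "grandes masas venezolanos"],
})

# "Unnest": one row per token, keeping the id of the source document
tokens = df.assign(token=df["clean"].str.split()).explode("token")[["id", "token"]]

# Lemmatize each distinct token once (toy lookup standing in for a real lemmatizer)
toy_lemmas = {
    "venezolanos": "venezolano",
    "queriendo": "querer",
    "grandes": "grande",
    "masas": "masa",
}
vocab = pd.DataFrame({"token": tokens["token"].unique()})
vocab["lemma"] = vocab["token"].map(lambda t: toy_lemmas.get(t, t))

# Join the lemmas back and rebuild each document in its original token order
rebuilt = (
    tokens.merge(vocab, on="token", how="left")
    .groupby("id")["lemma"]
    .agg(" ".join)
    .reset_index()
)
print(rebuilt)
```

Lemmatizing the vocabulary instead of every row matters most on large corpora, where the same token appears many times.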

Here is a random sample of tweets before and after lemmatization:

tweets['lemmatized_tweets'] = tokenized_cleaned_tweets['lemmatized_tweets']
sample_tweets = tweets.sample(5, random_state=1)  # You can change the random_state for different samples
print("Before and After Text Cleaning:")
print('-' * 40)
for index, row in sample_tweets.iterrows():
    print(f"Original: {row['Snippet']}")
    print(f"Cleaned: {row['lemmatized_tweets']}")
    print('-' * 40)
Before and After Text Cleaning:
----------------------------------------
Original: Antes el pasaporte venezolano permitía la entrada en en sinfín de países del mundo. Hoy cada día estamos más limitados gracias al socialismo del siglo 21. Hasta Cuba, que saquea a Venezuela, nos impone una visa. #PeroTodoTieneSuFinal
Cleaned: pasaporte venezolano permitia entrada sinfin pais mundo hoy cada diar limitado gracias socialismo siglo cuba saquea venezuela imponer vis
----------------------------------------
Original: @VickyDavilaH Bueno y si @AlvaroUribeVel se proclama presidente de una vez por todas y nombra a @IvanDuque ministro de guerra y lo deja que solito libere al pueblo venezolano, ¿será que le prestan atención a la grave crisis que vive el Chocó, que parece que solo cuentan con el Esmad ?
Cleaned: bueno proclamar presidente vez todo nombra ministro guerra dejar solitir liberar pueblo venezolano prestar atencion grave crisis vivir choco parecer solo contar esmad
----------------------------------------
Original: @zonacero Nomás quieren Telesur y Venezolana de Televisión, super imparcialicimos.
Cleaned: noma querer telesur venezolano television super imparcialicir
----------------------------------------
Original: RT @XiomaryUrbaez Sr @jguaido yo, venezolana y residente en el país, SÍ QUIERO INTERVENCIÓN. Le agradezco que sin haber hecho una consulta pública sobre algo tan importante, no hable por mí. Gracias.
Cleaned: sr venezolano residente pai querer intervencion agradecer haber hecho consulta publicar tanto importante hablar gracias
----------------------------------------
Original: Y también las grandes masas de venezolanos queriendo refugiarse en Colombia, de verdad que esto es una gran insensatez descarada y cruel, porque todo está premeditadamente calculado.
Cleaned: grande masa venezolano querer refugiar el colombia verdad gran insensatez descarado cruel premeditadamente calculado
----------------------------------------

Handling rarely used words and poor social media spelling

You may have noticed in the processed tweets above that some rarely used or Out-of-Vocabulary (OOV) words became evident after cleaning and lemmatization. These words can result from misspellings, which are common on social media, from abbreviations, or from other language rules.

Here we propose one method to handle this limitation. Research on this topic offers several local solutions; we invite you to try this approach and to explore other resources for processing the resulting lemmas. Additional work on handling OOV words can be found in:

  1. FastText

  2. Kaggle NER Bi-LSTM

  3. Contextual Spell Check

We use our create_bol() function to compute distances between lemmas. The premise is that rarely used lemmas are far from the rest of the corpus vocabulary and appear only a few times in it.

# Flatten the lemmatized tweets into a single list of lemmas
flattened_list = list(itertools.chain.from_iterable(tokenized_cleaned_tweets['lemmatized_tweets'].str.split(" ")))
# Now we count the number of times each lemma appears in the list and sort the list in descending order
word_count = Counter(flattened_list)
sorted_words = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
sorted_words_only = [word for word, count in sorted_words]
numpy_array = np.array(sorted_words_only)
# Now we create our bag of lemmas
bol_df = tp.create_bol(numpy_array, verbose=True)
bol_df.head(10)
bow_id bow_name lemma similarity threshold
0 1 venezolano venezolano 100 85
1 1 venezolano venezolana 90 85
2 1 venezolano venezolan 95 85
3 1 venezolano venezolanado 91 85
4 2 el el 100 88
5 3 pai pai 100 87
6 4 colombia colombia 100 85
7 4 colombia colombiano 89 85
8 5 hacer hacer 100 86
9 6 ser ser 100 87
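
The grouping idea behind a bag of lemmas can be sketched with the standard library. This is a rough illustration under stated assumptions, not the algorithm inside create_bol(): the helper name group_lemmas is hypothetical, difflib's SequenceMatcher is swapped in as the similarity measure, and a fixed threshold of 85 is used for every group. Lemmas are visited from most to least frequent, and each one either joins the first group whose representative is similar enough or starts a new group.

```python
from difflib import SequenceMatcher

def group_lemmas(lemmas_by_freq, threshold=85):
    """Greedy grouping: frequent lemmas become group representatives."""
    groups = {}  # representative lemma -> list of member lemmas
    for lemma in lemmas_by_freq:
        for rep in groups:
            score = 100 * SequenceMatcher(None, rep, lemma).ratio()
            if score >= threshold:
                groups[rep].append(lemma)
                break
        else:
            # No existing representative is close enough: start a new group
            groups[lemma] = [lemma]
    return groups

groups = group_lemmas(
    ["venezolano", "colombia", "venezolana", "venezolan", "colombiano"]
)
print(groups)
```

Because the input is sorted by frequency, rarer spellings are mapped onto the more common form rather than the other way around.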

Now we want to keep a subset of words that excludes probable OOV or new words. We replace lemmas using an 85% threshold so that we can infer what the author intended to write.

# Replace each lemma in the original list of lists with its bow_name
lemma_to_bow = dict(zip(bol_df['lemma'], bol_df['bow_name']))
replaced_lemmas = [[lemma_to_bow.get(lemma, lemma) for lemma in doc] for doc in tokenized_cleaned_tweets['lemmatized_tweets'].str.split(" ")]

Here are some random examples with the new mapping, so you can inspect the differences in the lemmas:

tweets['new_clean_lemmas'] = replaced_lemmas
sample_tweets = tweets.sample(10, random_state=1)  # You can change the random_state for different samples
print("Before and After Text Cleaning:")
print('-' * 40)
for index, row in sample_tweets.iterrows():
    print(f"Original: {row['Snippet']}")
    print(f"Cleaned: {row['new_clean_lemmas']}")
    print('-' * 40)
Before and After Text Cleaning:
----------------------------------------
Original: Antes el pasaporte venezolano permitía la entrada en en sinfín de países del mundo. Hoy cada día estamos más limitados gracias al socialismo del siglo 21. Hasta Cuba, que saquea a Venezuela, nos impone una visa. #PeroTodoTieneSuFinal
Cleaned: ['pasaporte', 'venezolano', 'permitir', 'entrada', 'sinfin', 'pais', 'mundo', 'hoy', 'cada', 'diar', 'limitado', 'gracias', 'socialismo', 'siglo', 'cuba', 'saquear', 'venezuela', 'imponer', 'vis']
----------------------------------------
Original: @VickyDavilaH Bueno y si @AlvaroUribeVel se proclama presidente de una vez por todas y nombra a @IvanDuque ministro de guerra y lo deja que solito libere al pueblo venezolano, ¿será que le prestan atención a la grave crisis que vive el Chocó, que parece que solo cuentan con el Esmad ?
Cleaned: ['buen', 'proclamar', 'presidente', 'vez', 'todo', 'nombra', 'ministro', 'guerra', 'dejar', 'solitir', 'liderar', 'pueblo', 'venezolano', 'presentar', 'atencion', 'grave', 'crisis', 'vivir', 'choco', 'parecer', 'solo', 'contar', 'esmad']
----------------------------------------
Original: @zonacero Nomás quieren Telesur y Venezolana de Televisión, super imparcialicimos.
Cleaned: ['noma', 'querer', 'telesur', 'venezolano', 'television', 'super', 'imparcialicir']
----------------------------------------
Original: RT @XiomaryUrbaez Sr @jguaido yo, venezolana y residente en el país, SÍ QUIERO INTERVENCIÓN. Le agradezco que sin haber hecho una consulta pública sobre algo tan importante, no hable por mí. Gracias.
Cleaned: ['sr', 'venezolano', 'presidente', 'pai', 'querer', 'intervencion', 'agradecer', 'haber', 'hecho', 'consulta', 'publicar', 'tanto', 'importante', 'hablar', 'gracias']
----------------------------------------
Original: Y también las grandes masas de venezolanos queriendo refugiarse en Colombia, de verdad que esto es una gran insensatez descarada y cruel, porque todo está premeditadamente calculado.
Cleaned: ['grande', 'masa', 'venezolano', 'querer', 'refugiar', 'el', 'colombia', 'verdad', 'gran', 'insensatez', 'descarado', 'cruel', 'premeditadamente', 'calculado']
----------------------------------------
Original: RT @fernandoperezm #Metro de Madrid #FelizViernesATodos talento venezolano https://t.co/Pe4wuvq6eU
Cleaned: ['madrid', 'talento', 'venezolano']
----------------------------------------
Original: Para que dejen de estar creyendo en los medios oficiales venezolanos. Hacen el mismo trabajo de varios medios colombianos de lavarle la cara al gobierno. https://t.co/msWCOzeCdH
Cleaned: ['dejar', 'creer', 'medio', 'oficial', 'venezolano', 'hacer', 'mismo', 'trabajar', 'varios', 'medio', 'colombia', 'lavar', 'el', 'cara', 'gobierno']
----------------------------------------
Original: RT @Crisantemonegro He visto hace dias venezolanos vendiendo cosas o trabajando, pero hoy por primera vez se me acercaron dos a pedirme dinero porque no tenian para comer; iban cargados con maletas y se les veia el recorrido que llevan. Fué inevitable no llorar frente a tan triste situación.
Cleaned: ['visto', 'hacer', 'dia', 'venezolano', 'vender', 'cosa', 'trabajar', 'hoy', 'primero', 'vez', 'acercar', 'dos', 'pedir', 'yo', 'dinero', 'comer', 'ir', 'cargado', 'maleta', 'veiar', 'recorrido', 'llevar', 'inevitable', 'llorar', 'frente', 'tanto', 'triste', 'situacion']
----------------------------------------
Original: El canciller uruguayo, Ernesto Talvi, aseguró que no “ha sido posible” enviar vuelos humanitarios al país para repatriar a los ciudadanos venezolanos Entérate24.com- E l canciller Jorge Arreaza, aseguró que Venezuela no ha recibido ninguna solicitud de vuelo por parte de Uruguay para repatriar a los ciudadanos venezolanos que se encuentran...
Cleaned: ['canciller', 'uruguay', 'ernesto', 'talvi', 'asegurar', 'ser', 'posible', 'enviar', 'vuelo', 'humanitario', 'pai', 'repatriar', 'ciudadano', 'venezolano', 'enterate', 'com', 'l', 'canciller', 'jorge', 'arreazar', 'asegurar', 'venezuela', 'recibido', 'ningun', 'solicitud', 'vuelo', 'parte', 'uruguay', 'repatriar', 'ciudadano', 'venezolano', 'encontrar']
----------------------------------------
Original: Si tan solo crearan la infraestructura necesaria para que llegara a cada hogar venezolano la mayoría no tendria falta de gas ... pero la 5ta no hace nada productivo ni constructivo😒 sino todo lo contrario...
Cleaned: ['tanto', 'solo', 'crearar', 'infraestructura', 'necesario', 'llegar', 'cada', 'hogar', 'venezolano', 'mayorio', 'falta', 'gas', 'ta', 'hacer', 'producto', 'constructivo', 'sino', 'contrario']
----------------------------------------

From here, you can use these processed tweets to train different models and build your own empirical NLP applications with social media data. Below, we show a simple application of Topic Modelling using the data we processed. For more information about this methodology, the following links help explain this type of unsupervised classification.

  1. Practical guide for Topic Modelling

  2. An example of a fully developed real pipeline for Topic Modelling

  3. Topic Modelling used in a Kaggle competition

  4. Real Research derived from Topic Modelling

Now we can plug these processed documents into a toy model to see some topics about Venezuelan migrants in Colombia.

The procedure has three steps:

  1. We iterate over combinations of the hyperparameters alpha, beta, and number of topics, computing the Coherence Score and Perplexity of each LDA model.

  2. We filter the results and pick the model with the best coherence.

  3. We display a visualization of the topics found in the toy model.

NOTE: This code takes a lot of time iterating over different combinations of hyperparameters, expect long kernel runs. You may adjust it for your use case.

# Now we create our initial variables for Topic Modeling
# Create Dictionary
dictionary = corpora.Dictionary(replaced_lemmas)
corpus = [dictionary.doc2bow(text) for text in replaced_lemmas]
# A function that resolves our hyperparameters using a corpus and a dictionary
def compute_coherence_perplexity_values(corpus, dictionary, k, a, b):

    lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=k,
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=a,
                                           eta=b,
                                           workers=7)

    coherence_model_lda = CoherenceModel(model=lda_model, texts=replaced_lemmas, dictionary=dictionary, coherence='c_v')

    return (coherence_model_lda.get_coherence(),lda_model.log_perplexity(corpus))

grid = {}
grid['Validation_Set'] = {}

# Topics range
min_topics = 2
max_topics = 4
step_size = 1
topics_range = range(min_topics, max_topics, step_size)

# Alpha parameter
alpha = list(np.arange(0.01, 1, 0.3))
alpha.append('symmetric')
alpha.append('asymmetric')

# Beta parameter
beta = list(np.arange(0.01, 1, 0.3))
beta.append('symmetric')

# Validation sets
num_of_docs = len(corpus)
corpus_sets = [# gensim.utils.ClippedCorpus(corpus, num_of_docs*0.25),
               # gensim.utils.ClippedCorpus(corpus, num_of_docs*0.5),
               gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.75)),
               corpus]
corpus_title = ['75% Corpus', '100% Corpus']
model_results = {'Validation_Set': [],
                 'Topics': [],
                 'Alpha': [],
                 'Beta': [],
                 'Coherence': [],
                 'Perplexity': []
                }

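# Can take a long time to run
# Folder where the tuning results are written (assumed location; adjust to your project)
data_folder = "data"
os.makedirs(data_folder, exist_ok=True)
if 1 == 1:
    # Total number of hyperparameter combinations to evaluate
    pbar = tqdm.tqdm(total = len(corpus_sets) * len(topics_range) * len(alpha) * len(beta))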

    # iterate through validation corpuses
    for i in range(len(corpus_sets)):
        # iterate through number of topics
        for k in topics_range:
            # iterate through alpha values
            for a in alpha:
                # iterate through beta values
                for b in beta:
                    # get the coherence score for the given parameters
                    (cv, pp) = compute_coherence_perplexity_values(corpus=corpus_sets[i], dictionary=dictionary,
                                                  k=k, a=a, b=b)
                    # Save the model results
                    model_results['Validation_Set'].append(corpus_title[i])
                    model_results['Topics'].append(k)
                    model_results['Alpha'].append(a)
                    model_results['Beta'].append(b)
                    model_results['Coherence'].append(cv)
                    model_results['Perplexity'].append(pp)
                    pbar.update(1)
    pd.DataFrame(model_results).to_csv(os.path.join(data_folder,"lda_tuning_results.csv"), index=False)
    pbar.close()
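
Before launching a search like the one above, it can help to count how many models will be trained. The sketch below rebuilds the same grid with itertools.product, using the ranges defined in the tutorial code; the variable name grid_points is an assumption introduced here.

```python
import itertools
import numpy as np

# Same ranges as in the tuning code above
topics_range = range(2, 4, 1)
alpha = list(np.arange(0.01, 1, 0.3)) + ["symmetric", "asymmetric"]
beta = list(np.arange(0.01, 1, 0.3)) + ["symmetric"]
corpus_title = ["75% Corpus", "100% Corpus"]

# Every (validation set, topics, alpha, beta) combination the nested loops visit
grid_points = list(itertools.product(corpus_title, topics_range, alpha, beta))
print(len(grid_points))  # 2 sets x 2 topic counts x 6 alphas x 5 betas = 120
```

Each grid point trains a full LDA model, so the run time grows with the product of the list lengths; widening any one range multiplies the total cost.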

Now we want to find the optimal model to train. Let's look at the results of our training process:

# Pre-trained models available in Github data folder, we recommend retraining the model with your own data
tabla_tunning = pd.read_csv(os.path.join(data_folder,"lda_tuning_results.csv"))
tabla_tunning = tabla_tunning.sort_values(by = 'Coherence', ascending = False)
tabla_tunning
Validation_Set Topics Alpha Beta Coherence Perplexity
117 100% Corpus 3 asymmetric 0.61 0.417986 -7.722477
58 75% Corpus 3 asymmetric 0.9099999999999999 0.406760 -7.942932
32 75% Corpus 3 0.01 0.61 0.399778 -7.888492
43 75% Corpus 3 0.61 0.9099999999999999 0.391599 -8.025149
39 75% Corpus 3 0.31 symmetric 0.383571 -7.980128
... ... ... ... ... ... ...
63 100% Corpus 2 0.01 0.9099999999999999 0.292411 -7.677486
108 100% Corpus 3 0.9099999999999999 0.9099999999999999 0.291343 -7.848078
104 100% Corpus 3 0.61 symmetric 0.288476 -7.852673
102 100% Corpus 3 0.61 0.61 0.281319 -7.834575
85 100% Corpus 2 asymmetric 0.01 0.270411 -8.548538

120 rows × 6 columns
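
Two things are worth checking when reading this table; the sketch below uses a few rows copied from it. First, the row with the highest coherence can be selected programmatically. Second, gensim's log_perplexity() returns a per-word likelihood bound rather than the perplexity itself, so the negative values in the Perplexity column can be converted with 2 ** (-bound). The column name Perplexity_value is an assumption added for illustration.

```python
import pandas as pd

# A few rows copied from the tuning table above
results = pd.DataFrame({
    "Validation_Set": ["100% Corpus", "75% Corpus", "75% Corpus", "100% Corpus"],
    "Topics": [3, 3, 3, 2],
    "Alpha": ["asymmetric", "asymmetric", "0.01", "asymmetric"],
    "Beta": ["0.61", "0.9099999999999999", "0.61", "0.01"],
    "Coherence": [0.417986, 0.406760, 0.399778, 0.270411],
    "Perplexity": [-7.722477, -7.942932, -7.888492, -8.548538],
})

# Convert the per-word likelihood bound into an actual perplexity
results["Perplexity_value"] = 2 ** (-results["Perplexity"])

# Row with the highest coherence score
best = results.loc[results["Coherence"].idxmax()]
print(best[["Validation_Set", "Topics", "Alpha", "Beta", "Coherence"]])
```

Higher coherence is better, and so is a lower converted perplexity (a bound closer to zero). The two metrics do not always agree, which is one reason to inspect the topics by eye as well.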

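Let’s train the model! We pick a high-coherence configuration from the validation table created in the last step: 3 topics, alpha = 0.01, and eta = 0.61. Note that this is the third row of the table, not the top-ranked one (which uses alpha = 'asymmetric'). You may want to revisit such alternatives and keep whichever model yields the most interpretable topics.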

lda_final_model = gensim.models.LdaMulticore(corpus=corpus,
                                             id2word=dictionary,
                                             num_topics=3,
                                             random_state=100,
                                             chunksize=100,
                                             passes=10,
                                             alpha=0.01,
                                             eta=0.61,
                                             workers=7)

Now that we have trained a tuned version of our toy model, we want to inspect the derived topics visually and look for patterns in the way people speak about Venezuelan migrants in Colombia.

[[(dictionary[id], freq) for id, freq in cp] for cp in corpus[:1]]

pprint.pprint(lda_final_model.print_topics())
doc_lda = lda_final_model[corpus]

visxx = pyLDAvis.gensim_models.prepare(topic_model=lda_final_model, corpus=corpus, dictionary=dictionary)
pyLDAvis.display(visxx)
[(0,
  '0.049*"venezolano" + 0.015*"colombia" + 0.008*"el" + 0.005*"ver" + '
  '0.004*"hacer" + 0.004*"ir" + 0.004*"ayuda" + 0.004*"decir" + 0.004*"pai" + '
  '0.003*"ser"'),
 (1,
  '0.035*"venezolano" + 0.006*"pai" + 0.006*"colombia" + 0.005*"ser" + '
  '0.005*"decir" + 0.005*"poder" + 0.004*"hacer" + 0.003*"pueblo" + 0.003*"el" '
  '+ 0.003*"ir"'),
 (2,
  '0.039*"venezolano" + 0.007*"pai" + 0.007*"hacer" + 0.006*"colombia" + '
  '0.004*"migrant" + 0.004*"ser" + 0.004*"el" + 0.004*"poder" + 0.004*"ver" + '
  '0.004*"bogota"')]
# Save the visualization as an HTML file so it can be opened later
pyLDAvis.save_html(visxx, "lda_final_model.html")
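
To turn the fitted topics into a classification, each tweet can be assigned its most probable topic. The sketch below uses a hand-written stand-in for the model output, assuming each document comes back as a list of (topic_id, probability) pairs, which is the shape gensim LDA models normally return when indexed with a bag-of-words document. The helper name dominant_topic and the toy probabilities are assumptions.

```python
# Toy stand-in for per-document LDA output:
# each document is a list of (topic_id, probability) pairs
doc_topics = [
    [(0, 0.72), (1, 0.20), (2, 0.08)],
    [(1, 0.55), (2, 0.45)],
    [(2, 0.91)],
]

def dominant_topic(topic_probs):
    """Return the id of the topic with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])[0]

labels = [dominant_topic(doc) for doc in doc_topics]
print(labels)
```

In the tutorial, iterating over doc_lda should yield lists of this shape, so the same helper could be mapped over it to label every tweet.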

Conclusion and Analysis of toy-Topic Modelling Results

In the realm of unsupervised classification, Topic Modelling is a powerful tool: it uncovers the underlying topics within a corpus of text, and it is particularly useful when applied to a heterogeneous collection of tweets. By incorporating a more substantial and varied dataset, the results can be enhanced. Here, we made only a basic approximation to Topic Modelling to show how tidyX can help process social media data and prepare it for NLP tasks like this one.

For those eager to delve deeper into this subject, we recommend reaching out to Barómetro de Xenofobia, a reservoir of comprehensive data that can greatly augment research in this field.

With the relevance parameter of the pyLDAvis visualization set to λ = 0.5, and working with a dataset of 1000 tweets, our toy exploration yielded the following topics:

Topic 1: Migrant Necessities and Frontier Struggles

Some Interesting Relevant Words: ayuda, pedir, querer, primero, niño, frontera, gobierno, vida.

This topic unveils the urgent necessities and pleas echoed by the migrants. It portrays a vivid picture of their journey, marked by vulnerability and struggle, especially among children at the frontiers. The narrative fluctuates between government interventions and intrinsic human endeavors for survival.

Topic 2: Migrant Flows and Perceived Economic Competition for Resources

Some Interesting Relevant Words: Bogotá, migrante, venir, salir, regresar, cualquiera, trabajar, quitar, sector, mil

Focusing on the economic dynamics within principal cities such as Bogotá, this topic highlights the perceived competition for resources and the flow of the migrant population. It touches on employment, competition, and the economic changes brought about by the presence of migrants. There is a large literature that studies these labour market externalities and evaluates the impact of migrants seeking jobs in a foreign country.

Topic 3: Advocacy and Critique—The Landscape of Migrant Rights and Initiatives

Some Interesting Relevant Words: mal, poder, decir, ser, deber, criticar, derecho, igual, tratar, programa, fundación, presencia

This topic covers a discourse revolving around the rights and responsibilities of migrants in Colombia, and critiques directed at them. It captures a conversation about political and social agendas related to the migrant context in Colombia.
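
The λ = 0.5 setting used to read these topics is the weight in the relevance score that the LDAvis method uses to rank terms: relevance = λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)). The sketch below computes it for two terms. The topic probabilities are taken from the printed topics above, while the overall corpus probabilities p(w) are made-up values for illustration only.

```python
import math

def relevance(p_w_given_t, p_w, lam=0.5):
    """Relevance of a term for a topic.

    lam = 1 ranks by probability within the topic;
    lam = 0 ranks by lift, i.e. how specific the term is to the topic.
    """
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# (p(w|t), p(w)); the p(w) values are hypothetical
terms = {
    "venezolano": (0.049, 0.041),  # frequent everywhere, so low lift
    "ayuda": (0.004, 0.0015),      # rarer overall, but more topic-specific
}

for lam in (0.0, 0.5, 1.0):
    ranked = sorted(terms, key=lambda w: relevance(*terms[w], lam=lam), reverse=True)
    print(lam, ranked)
```

Lower values of λ push topic-specific words such as ayuda up the ranking, which is why the relevant words listed for each topic differ from the raw top terms, dominated by venezolano, in the printed model output.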

In closing, our analysis, though constrained by the size of the dataset, serves as a gateway to the wide range of modelling tasks that tidyX attempts to address. It invites further exploration and offers a richer, evolving way to analyze social media data.