Stemming and Lemmatizing Texts Efficiently

The stemmer() and lemmatizer() functions each accept a single token as input. To normalize an entire text or corpus, we would therefore have to call them once per token, which is wasteful when the same words appear many times.

This tutorial shows how to use the unnest_tokens() function so that the normalization functions run only once per unique word.
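Before turning to tidyX, the underlying idea can be sketched in plain Python: build the vocabulary, normalize each distinct word once, and map the results back onto every text. The toy_stem function and the sample texts below are hypothetical stand-ins, not part of tidyX; the call counter only makes the saving visible.

```python
from collections import Counter

calls = Counter()

def toy_stem(token):
    """Hypothetical stand-in for a costly normalizer such as tn.stemmer()."""
    calls["n"] += 1
    # Crude rule for illustration only: drop a final "o", "a" or "s".
    if len(token) > 3 and token.endswith(("o", "a", "s")):
        return token[:-1]
    return token

# Made-up sample texts with repeated words
texts = [
    "banco venezolano activa servicio",
    "venezolano capturado",
    "banco venezolano",
]

# 1. Collect the distinct tokens across all texts
vocabulary = sorted({tok for text in texts for tok in text.split()})

# 2. Normalize each distinct token exactly once
lookup = {tok: toy_stem(tok) for tok in vocabulary}

# 3. Rebuild every text from the lookup table
normalized = [" ".join(lookup[tok] for tok in text.split()) for text in texts]

total_tokens = sum(len(text.split()) for text in texts)
print("normalizer calls:", calls["n"], "for", total_tokens, "tokens")
print(normalized)
```

The larger and more repetitive the corpus, the bigger the gap between the number of tokens and the number of normalizer calls.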

# Import tidyX modules
from tidyX import TextPreprocessor as tp
from tidyX import TextNormalization as tn
from tidyX import TextVisualizer as tv

# Import auxiliary libraries
import pandas as pd

# First, load a dataframe containing 1000 tweets from Colombia discussing Venezuela.
tweets = tp.load_data(file = "spanish")
tweets.head()
Tweet
0 RT @emilsen_manozca ¿Me regala una moneda pa u...
1 RT @CriptoNoticias Banco venezolano activa ser...
2 Capturado venezolano que asesinó a comerciante...
3 RT @PersoneriaVpar @PersoneriaVpar acompaña al...
4 Bueno ya sacaron la carta de "amenaza de atent...
# Next, clean the text with the preprocess function
tweets['clean'] = tweets['Tweet'].apply(lambda x: tp.preprocess(x,
                                                                delete_emojis = False,
                                                                remove_stopwords = True,
                                                                language_stopwords = "spanish"))
tweets.head()
Tweet clean
0 RT @emilsen_manozca ¿Me regala una moneda pa u... regala moneda pa cafe venezolano no tuitero ah...
1 RT @CriptoNoticias Banco venezolano activa ser... banco venezolano activa servicio usuarios crip...
2 Capturado venezolano que asesinó a comerciante... capturado venezolano asesino comerciante merca...
3 RT @PersoneriaVpar @PersoneriaVpar acompaña al... acompa grupo especial migratorio cesar reunion...
4 Bueno ya sacaron la carta de "amenaza de atent... bueno sacaron carta amenaza atentado president...

In this step, we use the unnest_tokens() function to split each tweet into multiple rows, one token per row. This structure lets us aggregate identical terms and build an auxiliary dataframe that acts as a dictionary for lemmas and stems.

dictionary_normalization = tp.unnest_tokens(df = tweets.copy(), input_column = "clean", id_col = None, unique = True)
dictionary_normalization
clean id
0 246
1 abajo 352, 577
2 abandonar 337, 509
3 abarrotarse 993
4 abiertos 72
... ... ...
5878 🤪 519
5879 🤬 483, 520, 908, 908
5880 🤯 615
5881 🤷 482, 736, 841, 947, 947, 947
5882 🥺 833, 851

5883 rows × 2 columns
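To make the shape of this dictionary concrete, here is a rough pandas-only sketch that produces a similar table from a toy dataframe. It is an approximation written for illustration, not tidyX's actual implementation; the column name token and the sample rows are assumptions.

```python
import pandas as pd

# Made-up sample of already-cleaned tweets
toy = pd.DataFrame(
    {"clean": ["banco venezolano activa", "venezolano capturado", "banco banco"]}
)

# Long format: one row per token, keeping the index of the source tweet
long = (
    toy.assign(id=toy.index, token=toy["clean"].str.split())
       .explode("token")
)

# Dictionary format: one row per distinct token, with the ids of the
# tweets that contain it joined into a single string
dictionary = (
    long.groupby("token")["id"]
        .agg(lambda ids: ", ".join(str(i) for i in ids))
        .reset_index()
)
print(dictionary)
```

In this sketch a token that occurs twice in the same tweet contributes that tweet's index twice, which matches repeated indices such as 908, 908 in the output above.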

Note that the id column lists the indices of the tweets that contain each token in the clean column. Now we can apply the stemmer() and lemmatizer() functions to add new columns to dictionary_normalization.

# Apply the stemmer function to stem each token
dictionary_normalization["stemm"] = dictionary_normalization["clean"].apply(lambda x: tn.stemmer(token = x, language = "spanish"))

Don’t forget to download the corresponding spaCy model for lemmatization. For Spanish lemmatization, we suggest the es_core_news_sm model:

!python -m spacy download es_core_news_sm

For English lemmatization, we suggest the en_core_web_sm model:

!python -m spacy download en_core_web_sm

To see the full list of available models for different languages, visit spaCy’s documentation.

import spacy

# Load model
model_es = spacy.load("es_core_news_sm")

# Apply lemmatizer function to lemmatize the token
dictionary_normalization["lemma"] = dictionary_normalization["clean"].apply(lambda x: tn.lemmatizer(token = x, model = model_es))

# Lemmatizing can produce stopwords, so we apply the remove_words function to the lemmas
dictionary_normalization["lemma"] = dictionary_normalization["lemma"].apply(lambda x: tp.remove_words(x, remove_stopwords = True, language = "spanish"))

dictionary_normalization
clean id stemm lemma
0 246
1 abajo 352, 577 abaj abajo
2 abandonar 337, 509 abandon abandonar
3 abarrotarse 993 abarrot abarrotar
4 abiertos 72 abiert abierto
... ... ... ... ...
5878 🤪 519 🤪 🤪
5879 🤬 483, 520, 908, 908 🤬 🤬
5880 🤯 615 🤯 🤯
5881 🤷 482, 736, 841, 947, 947, 947 🤷 🤷
5882 🥺 833, 851 🥺 🥺

5883 rows × 4 columns

To rebuild the original tweets, we use the unnest_tokens() function again, this time with unique = False, and then merge the resulting tokens with the dictionary.

tweets_long = tp.unnest_tokens(df = tweets.copy(), input_column = "clean", id_col = None, unique = False)
tweets_long
Tweet clean id
0 RT @emilsen_manozca ¿Me regala una moneda pa u... regala 0
0 RT @emilsen_manozca ¿Me regala una moneda pa u... moneda 0
0 RT @emilsen_manozca ¿Me regala una moneda pa u... pa 0
0 RT @emilsen_manozca ¿Me regala una moneda pa u... cafe 0
0 RT @emilsen_manozca ¿Me regala una moneda pa u... venezolano 0
... ... ... ...
999 RT infopresidencia: "Sin lugar a dudas hay uno... recibido 999
999 RT infopresidencia: "Sin lugar a dudas hay uno... cerca 999
999 RT infopresidencia: "Sin lugar a dudas hay uno... venezolanos 999
999 RT infopresidencia: "Sin lugar a dudas hay uno... presidente 999
999 RT infopresidencia: "Sin lugar a dudas hay uno... i 999

13557 rows × 3 columns

# Both dataframes have an id column, so the merge renames the left one to id_x
tweets_normalized = tweets_long \
    .merge(dictionary_normalization, how = "left", on = "clean") \
    .groupby(["id_x", "Tweet"])[["lemma", "stemm"]] \
    .agg(lambda x: " ".join(x)) \
    .reset_index()
tweets_normalized.head()
id_x Tweet lemma stemm
0 0 RT @emilsen_manozca ¿Me regala una moneda pa u... regalar moneda pa cafar venezolano tuitero ah... regal moned pa caf venezolan no tuiter ah 😂 👋
1 1 RT @CriptoNoticias Banco venezolano activa ser... banco venezolano activo servicio usuario cript... banc venezolan activ servici usuari criptomoned
2 2 Capturado venezolano que asesinó a comerciante... capturado venezolano asesino comerciante merca... captur venezolan asesin comerci merc public
3 3 RT @PersoneriaVpar @PersoneriaVpar acompaña al... acompa grupo especial migratorio cesar reunion... acomp grup especial migratori ces reunion real...
4 4 Bueno ya sacaron la carta de "amenaza de atent... bueno sacar cartar amenazar atentado president... buen sac cart amenaz atent president duqu func...
for i in range(3):
    print("-"*50)
    print("Example", i + 1)
    print("Original tweet:", tweets_normalized.loc[i, "Tweet"])
    print("Lemmatized tweet:", tweets_normalized.loc[i, "lemma"])
    print("Stemmed tweet:", tweets_normalized.loc[i, "stemm"])
--------------------------------------------------
Example 1
Original tweet: RT @emilsen_manozca ¿Me regala una moneda pa un café? -¿Eres venezolano? Noo! Tuitero. -Ahhh 😂😂😂👋
Lemmatized tweet: regalar moneda pa cafar venezolano  tuitero ah 😂 👋
Stemmed tweet: regal moned pa caf venezolan no tuiter ah 😂 👋
--------------------------------------------------
Example 2
Original tweet: RT @CriptoNoticias Banco venezolano activa servicio para usuarios de criptomonedas #ServiciosFinancieros https://t.co/1r2rZIUdlo
Lemmatized tweet: banco venezolano activo servicio usuario criptomoneda
Stemmed tweet: banc venezolan activ servici usuari criptomoned
--------------------------------------------------
Example 3
Original tweet: Capturado venezolano que asesinó a comerciante del Mercado Público https://t.co/XrmWKVYMR8 https://t.co/CfMLaB25jI
Lemmatized tweet: capturado venezolano asesino comerciante mercado publico
Stemmed tweet: captur venezolan asesin comerci merc public
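
The double space in the lemmatized version of Example 1 most likely comes from a token whose lemma became an empty string after stopword removal ("no" is kept by the stemmer but is missing from the lemmas). A minimal sketch of one way to avoid such gaps is to join only the non-empty strings. The data and the join_non_empty helper below are hypothetical and not part of tidyX.

```python
import pandas as pd

# Toy long-format table: one row per token, with the source tweet index
tweets_long = pd.DataFrame(
    {
        "id": [0, 0, 0, 1, 1],
        "clean": ["cafe", "no", "tuitero", "banco", "usuarios"],
    }
)

# Toy dictionary in which one lemma was emptied by stopword removal
lookup = pd.DataFrame(
    {
        "clean": ["banco", "cafe", "no", "tuitero", "usuarios"],
        "lemma": ["banco", "cafe", "", "tuitero", "usuario"],
    }
)

def join_non_empty(values):
    """Join strings with single spaces, skipping empty or missing entries."""
    return " ".join(v for v in values if isinstance(v, str) and v)

rebuilt = (
    tweets_long.merge(lookup, how="left", on="clean")
    .groupby("id")["lemma"]
    .agg(join_non_empty)
    .reset_index()
)
print(rebuilt)
```

The isinstance check also skips missing values, which would otherwise make " ".join fail if a token had no match in the dictionary.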