Stemming and Lemmatizing Texts Efficiently

The stemmer() and lemmatizer() functions each accept a single token as input. Normalizing an entire text or corpus therefore means calling them once per token occurrence, which is inefficient when the same words appear many times.

This tutorial demonstrates how to utilize the unnest_tokens() function to apply normalization functions just once for every unique word.
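To make the saving concrete, here is a minimal sketch that does not use tidyX. The toy_stem function is a hypothetical stand-in for a real stemmer (it only strips a final vowel) and the three short texts are invented for illustration. It compares the number of calls needed by the naive approach with the number needed when each unique word is normalized once and then looked up.

```python
def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips one final vowel."""
    toy_stem.calls += 1  # count how often the (expensive) function runs
    return token[:-1] if len(token) > 3 and token[-1] in "aeiou" else token

toy_stem.calls = 0

# Invented toy corpus with repeated words
texts = [
    "banco venezolano activa servicio",
    "capturado venezolano asesino comerciante",
    "venezolano banco servicio",
]

# Naive approach: one call per token occurrence
naive = [" ".join(toy_stem(tok) for tok in text.split()) for text in texts]
naive_calls = toy_stem.calls

# Dictionary approach: one call per unique token, then a cheap lookup
toy_stem.calls = 0
vocabulary = {tok for text in texts for tok in text.split()}
lookup = {tok: toy_stem(tok) for tok in vocabulary}
cached = [" ".join(lookup[tok] for tok in text.split()) for text in texts]
cached_calls = toy_stem.calls

print("calls (naive):", naive_calls)
print("calls (unique words only):", cached_calls)
print("same result:", naive == cached)
```

Both approaches return the same normalized texts; the gap between the two call counts grows with the amount of repetition in the corpus.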

# Import tidyX modules
from tidyX import TextPreprocessor as tp
from tidyX import TextNormalization as tn
from tidyX import TextVisualizer as tv

# Import auxiliary libraries
import pandas as pd

# First, load a dataframe containing 1000 tweets from Colombia discussing Venezuela.
tweets = tp.load_data(file = "spanish")
tweets.head()

	Tweet
0	RT @emilsen_manozca ¿Me regala una moneda pa u...
1	RT @CriptoNoticias Banco venezolano activa ser...
2	Capturado venezolano que asesinó a comerciante...
3	RT @PersoneriaVpar @PersoneriaVpar acompaña al...
4	Bueno ya sacaron la carta de "amenaza de atent...

# Next, clean the text with the preprocess function
tweets['clean'] = tweets['Tweet'].apply(lambda x: tp.preprocess(x,
                                                                delete_emojis = False,
                                                                remove_stopwords = True,
                                                                language_stopwords = "spanish"))
tweets.head()

	Tweet	clean
0	RT @emilsen_manozca ¿Me regala una moneda pa u...	regala moneda pa cafe venezolano no tuitero ah...
1	RT @CriptoNoticias Banco venezolano activa ser...	banco venezolano activa servicio usuarios crip...
2	Capturado venezolano que asesinó a comerciante...	capturado venezolano asesino comerciante merca...
3	RT @PersoneriaVpar @PersoneriaVpar acompaña al...	acompa grupo especial migratorio cesar reunion...
4	Bueno ya sacaron la carta de "amenaza de atent...	bueno sacaron carta amenaza atentado president...

In this step, we use the unnest_tokens() function to split each tweet into multiple rows, one token per row. This structure lets us aggregate identical terms and build an auxiliary dataframe that acts as a dictionary for lemmas or stems.

dictionary_normalization = tp.unnest_tokens(df = tweets.copy(), input_column = "clean", id_col = None, unique = True)
dictionary_normalization

	clean	id
0		246
1	abajo	352, 577
2	abandonar	337, 509
3	abarrotarse	993
4	abiertos	72
...	...	...
5878	🤪	519
5879	🤬	483, 520, 908, 908
5880	🤯	615
5881	🤷	482, 736, 841, 947, 947, 947
5882	🥺	833, 851

5883 rows × 2 columns

Note that the id column lists the indices of the tweets that contain each token from the clean column. Now we can apply the stemmer() and lemmatizer() functions to create new columns in dictionary_normalization.
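The shape of this dictionary can be reproduced with plain Python. The sketch below is not the tidyX implementation, which is not shown here; it uses invented toy texts and assumes, based on entries such as "908, 908" in the output above, that an index is recorded once per occurrence of the token.

```python
from collections import defaultdict

# Invented toy data: text index -> cleaned text
clean_texts = {
    0: "regala moneda cafe venezolano",
    1: "banco venezolano activa servicio",
    2: "moneda moneda banco",
}

# Collect, for every token, the indices of the texts it appears in.
# Assumption: an index is appended once per occurrence of the token.
token_ids = defaultdict(list)
for text_id, text in clean_texts.items():
    for token in text.split():
        token_ids[token].append(text_id)

# One row per unique token, sorted alphabetically, ids joined as a string
dictionary = sorted(
    (token, ", ".join(str(i) for i in ids)) for token, ids in token_ids.items()
)

for token, ids in dictionary:
    print(f"{token}\t{ids}")
```

Each unique token now appears exactly once, so any per-token function only has to run over these rows.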

# Apply the stemmer function to stem each unique token
dictionary_normalization["stemm"] = dictionary_normalization["clean"].apply(lambda x: tn.stemmer(token = x, language = "spanish"))

Don’t forget to download the corresponding spaCy model before lemmatizing. For Spanish lemmatization, we suggest the es_core_news_sm model:

!python -m spacy download es_core_news_sm

For English lemmatization, we suggest the en_core_web_sm model:

!python -m spacy download en_core_web_sm

For the full list of models available for other languages, see spaCy’s documentation.

import spacy

# Load model
model_es = spacy.load("es_core_news_sm")

# Apply the lemmatizer function to lemmatize each unique token
dictionary_normalization["lemma"] = dictionary_normalization["clean"].apply(lambda x: tn.lemmatizer(token = x, model = model_es))

# Lemmatizing can produce stopwords, so we apply the remove_words function to the lemmas
dictionary_normalization["lemma"] = dictionary_normalization["lemma"].apply(lambda x: tp.remove_words(x, remove_stopwords = True, language = "spanish"))

dictionary_normalization

	clean	id	stemm	lemma
0		246
1	abajo	352, 577	abaj	abajo
2	abandonar	337, 509	abandon	abandonar
3	abarrotarse	993	abarrot	abarrotar
4	abiertos	72	abiert	abierto
...	...	...	...	...
5878	🤪	519	🤪	🤪
5879	🤬	483, 520, 908, 908	🤬	🤬
5880	🤯	615	🤯	🤯
5881	🤷	482, 736, 841, 947, 947, 947	🤷	🤷
5882	🥺	833, 851	🥺	🥺

5883 rows × 4 columns

To rebuild the original tweets, we call unnest_tokens() again, this time with unique = False, so that every token keeps its own row and its tweet index:

tweets_long = tp.unnest_tokens(df = tweets.copy(), input_column = "clean", id_col = None, unique = False)
tweets_long

	Tweet	clean	id
0	RT @emilsen_manozca ¿Me regala una moneda pa u...	regala	0
0	RT @emilsen_manozca ¿Me regala una moneda pa u...	moneda	0
0	RT @emilsen_manozca ¿Me regala una moneda pa u...	pa	0
0	RT @emilsen_manozca ¿Me regala una moneda pa u...	cafe	0
0	RT @emilsen_manozca ¿Me regala una moneda pa u...	venezolano	0
...	...	...	...
999	RT infopresidencia: "Sin lugar a dudas hay uno...	recibido	999
999	RT infopresidencia: "Sin lugar a dudas hay uno...	cerca	999
999	RT infopresidencia: "Sin lugar a dudas hay uno...	venezolanos	999
999	RT infopresidencia: "Sin lugar a dudas hay uno...	presidente	999
999	RT infopresidencia: "Sin lugar a dudas hay uno...	i	999

13557 rows × 3 columns

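# Attach the lemma and stem of each token, then join them back per tweet.
# Both dataframes have an id column, so the merge renames them id_x and id_y.
tweets_normalized = tweets_long \
    .merge(dictionary_normalization, how = "left", on = "clean") \
    .groupby(["id_x", "Tweet"])[["lemma", "stemm"]] \
    .agg(lambda x: " ".join(x)) \
    .reset_index()
tweets_normalized.head()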

	id_x	Tweet	lemma	stemm
0	0	RT @emilsen_manozca ¿Me regala una moneda pa u...	regalar moneda pa cafar venezolano tuitero ah...	regal moned pa caf venezolan no tuiter ah 😂 👋
1	1	RT @CriptoNoticias Banco venezolano activa ser...	banco venezolano activo servicio usuario cript...	banc venezolan activ servici usuari criptomoned
2	2	Capturado venezolano que asesinó a comerciante...	capturado venezolano asesino comerciante merca...	captur venezolan asesin comerci merc public
3	3	RT @PersoneriaVpar @PersoneriaVpar acompaña al...	acompa grupo especial migratorio cesar reunion...	acomp grup especial migratori ces reunion real...
4	4	Bueno ya sacaron la carta de "amenaza de atent...	bueno sacar cartar amenazar atentado president...	buen sac cart amenaz atent president duqu func...

for i in range(3):
    print("-"*50)
    print("Example", i + 1)
    print("Original tweet:", tweets_normalized.loc[i, "Tweet"])
    print("Lemmatized tweet:", tweets_normalized.loc[i, "lemma"])
    print("Stemmed tweet:", tweets_normalized.loc[i, "stemm"])

--------------------------------------------------
Example 1
Original tweet: RT @emilsen_manozca ¿Me regala una moneda pa un café? -¿Eres venezolano? Noo! Tuitero. -Ahhh 😂😂😂👋
Lemmatized tweet: regalar moneda pa cafar venezolano  tuitero ah 😂 👋
Stemmed tweet: regal moned pa caf venezolan no tuiter ah 😂 👋
--------------------------------------------------
Example 2
Original tweet: RT @CriptoNoticias Banco venezolano activa servicio para usuarios de criptomonedas #ServiciosFinancieros https://t.co/1r2rZIUdlo
Lemmatized tweet: banco venezolano activo servicio usuario criptomoneda
Stemmed tweet: banc venezolan activ servici usuari criptomoned
--------------------------------------------------
Example 3
Original tweet: Capturado venezolano que asesinó a comerciante del Mercado Público https://t.co/XrmWKVYMR8 https://t.co/CfMLaB25jI
Lemmatized tweet: capturado venezolano asesino comerciante mercado publico
Stemmed tweet: captur venezolan asesin comerci merc public
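The lemmatized tweet in Example 1 contains a double space where the stemmed tweet has "no". A plausible explanation, offered here as an assumption rather than a confirmed description of tidyX, is that the lemma was removed as a stopword and left an empty string, which " ".join() still separates with spaces. The plain-Python sketch below reproduces that effect on invented toy data and shows one way to skip empty entries when joining.

```python
# Long format: one (tweet index, token) pair per row, as in tweets_long.
# Invented toy data for illustration.
long_rows = [
    (0, "regala"), (0, "moneda"), (0, "no"), (0, "tuitero"),
    (1, "banco"), (1, "activa"),
]

# Toy normalization dictionary. Assumption: a lemma removed as a
# stopword is stored as an empty string.
lemma_lookup = {
    "regala": "regalar",
    "moneda": "moneda",
    "no": "",
    "tuitero": "tuitero",
    "banco": "banco",
    "activa": "activo",
}

# Look up each token and group the lemmas by tweet index
rebuilt = {}
for tweet_id, token in long_rows:
    rebuilt.setdefault(tweet_id, []).append(lemma_lookup.get(token, token))

# Joining everything keeps the gap left by the empty string
joined_raw = {i: " ".join(parts) for i, parts in rebuilt.items()}

# Skipping empty strings avoids the double space
joined_clean = {i: " ".join(p for p in parts if p) for i, parts in rebuilt.items()}

print(repr(joined_raw[0]))
print(repr(joined_clean[0]))
```

In the pandas pipeline above, the same idea would amount to filtering out empty strings inside the aggregation function before joining.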