Stemming and Lemmatizing Texts Efficiently
The stemmer() and lemmatizer() functions each accept a single token as input. To normalize an entire text or corpus, we would therefore have to call them once for every token occurrence, which is wasteful when the input contains repeated words.
This tutorial shows how to use the unnest_tokens() function so that each normalization function is applied only once per unique word.
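To make the saving concrete, the sketch below counts how often a normalizer is called with and without deduplication. The function toy_stem is a hypothetical stand-in that only strips trailing vowels and "s"; it is not tidyX's stemmer(), and the three short texts are made up for illustration.

```python
def toy_stem(token, counter):
    """Hypothetical stand-in for an expensive normalizer; counts its own calls."""
    counter["calls"] += 1
    return token.rstrip("aeious")

texts = [
    "banco venezolano activa servicio",
    "venezolano banco servicio",
    "banco activa",
]

# Naive approach: one normalizer call per token occurrence.
naive = {"calls": 0}
naive_out = [" ".join(toy_stem(t, naive) for t in text.split()) for text in texts]

# Dictionary approach: one call per unique token, then a cheap lookup per occurrence.
dedup = {"calls": 0}
vocabulary = sorted({t for text in texts for t in text.split()})
lookup = {t: toy_stem(t, dedup) for t in vocabulary}
dedup_out = [" ".join(lookup[t] for t in text.split()) for text in texts]

print(naive["calls"], dedup["calls"])  # 9 4
```

Both approaches produce the same normalized texts; only the number of normalizer calls differs, and the gap grows with the amount of repetition in the corpus.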
# Import tidyX modules
from tidyX import TextPreprocessor as tp
from tidyX import TextNormalization as tn
from tidyX import TextVisualizer as tv
# Import auxiliary libraries
import pandas as pd
# First, load a dataframe containing 1000 tweets from Colombia discussing Venezuela.
tweets = tp.load_data(file = "spanish")
tweets.head()
| | Tweet |
|---|---|
| 0 | RT @emilsen_manozca ¿Me regala una moneda pa u... |
| 1 | RT @CriptoNoticias Banco venezolano activa ser... |
| 2 | Capturado venezolano que asesinó a comerciante... |
| 3 | RT @PersoneriaVpar @PersoneriaVpar acompaña al... |
| 4 | Bueno ya sacaron la carta de "amenaza de atent... |
# First, clean the text with the preprocess function
tweets['clean'] = tweets['Tweet'].apply(lambda x: tp.preprocess(x,
                                                                delete_emojis = False,
                                                                remove_stopwords = True,
                                                                language_stopwords = "spanish"))
tweets.head()
| | Tweet | clean |
|---|---|---|
| 0 | RT @emilsen_manozca ¿Me regala una moneda pa u... | regala moneda pa cafe venezolano no tuitero ah... |
| 1 | RT @CriptoNoticias Banco venezolano activa ser... | banco venezolano activa servicio usuarios crip... |
| 2 | Capturado venezolano que asesinó a comerciante... | capturado venezolano asesino comerciante merca... |
| 3 | RT @PersoneriaVpar @PersoneriaVpar acompaña al... | acompa grupo especial migratorio cesar reunion... |
| 4 | Bueno ya sacaron la carta de "amenaza de atent... | bueno sacaron carta amenaza atentado president... |
In this step, we use the unnest_tokens() function to split
each tweet into multiple rows, with one token per row. This
structure lets us aggregate identical terms and so build an
auxiliary dataframe that acts as a dictionary for lemmas or stems.
dictionary_normalization = tp.unnest_tokens(df = tweets.copy(), input_column = "clean", id_col = None, unique = True)
dictionary_normalization
| | clean | id |
|---|---|---|
| 0 | | 246 |
| 1 | abajo | 352, 577 |
| 2 | abandonar | 337, 509 |
| 3 | abarrotarse | 993 |
| 4 | abiertos | 72 |
| ... | ... | ... |
| 5878 | 🤪 | 519 |
| 5879 | 🤬 | 483, 520, 908, 908 |
| 5880 | 🤯 | 615 |
| 5881 | 🤷 | 482, 736, 841, 947, 947, 947 |
| 5882 | 🥺 | 833, 851 |
5883 rows × 2 columns
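The structure of this dictionary can be mimicked in plain Python. The following is only a sketch of the idea, not tidyX's implementation of unnest_tokens(); the helper name build_token_index and the sample texts are hypothetical. It assumes an index is repeated once per occurrence, which appears to be what repeated ids such as 908, 908 in the output above reflect.

```python
from collections import defaultdict

def build_token_index(texts):
    """Map each unique token to the ids of the texts in which it occurs."""
    index = defaultdict(list)
    for text_id, text in enumerate(texts):
        for token in text.split():
            # One entry per occurrence, so a token used twice in a text repeats its id.
            index[token].append(str(text_id))
    # Sort by token so the result resembles a sorted dictionary table.
    return {token: ", ".join(ids) for token, ids in sorted(index.items())}

clean_texts = ["banco venezolano", "venezolano venezolano capturado", "banco"]
token_index = build_token_index(clean_texts)
for token, ids in token_index.items():
    print(token, "->", ids)
```

Each key of token_index plays the role of a row in the clean column, so a normalizer only needs to visit the keys once.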
Note that the id column holds the indices of the tweets that
contain each token in the clean column. We can now apply
the stemmer() and lemmatizer() functions to add new columns
to dictionary_normalization.
# Apply the stemmer function to stem each unique token
dictionary_normalization["stemm"] = dictionary_normalization["clean"].apply(lambda x: tn.stemmer(token = x, language = "spanish"))
Don’t forget to download the corresponding spaCy model before
lemmatizing. For Spanish lemmatization, we suggest the
es_core_news_sm model:
!python -m spacy download es_core_news_sm
For English lemmatization, we suggest the en_core_web_sm model:
!python -m spacy download en_core_web_sm
For the full list of models available for other languages, see spaCy’s documentation.
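Before loading a model, it can help to check whether it has been downloaded. The sketch below relies on the assumption that models fetched with spacy download are installed as ordinary Python packages named after the model; the helper model_is_installed is hypothetical and uses only the standard library, so it does not replace spaCy's own tooling.

```python
import importlib.util

def model_is_installed(model_name):
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(model_name) is not None

for name in ("es_core_news_sm", "en_core_web_sm"):
    if model_is_installed(name):
        print(name, "-> installed")
    else:
        print(name, "-> missing, run: python -m spacy download", name)
```

Running this check first avoids reaching the lemmatization step only to have the model load fail.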
import spacy
# Load model
model_es = spacy.load("es_core_news_sm")
# Apply lemmatizer function to lemmatize the token
dictionary_normalization["lemma"] = dictionary_normalization["clean"].apply(lambda x: tn.lemmatizer(token = x, model = model_es))
# Lemmatizing can produce stopwords, so we apply the remove_words function to the lemmas
dictionary_normalization["lemma"] = dictionary_normalization["lemma"].apply(lambda x: tp.remove_words(x, remove_stopwords = True, language = "spanish"))
dictionary_normalization
| | clean | id | stemm | lemma |
|---|---|---|---|---|
| 0 | | 246 | | |
| 1 | abajo | 352, 577 | abaj | abajo |
| 2 | abandonar | 337, 509 | abandon | abandonar |
| 3 | abarrotarse | 993 | abarrot | abarrotar |
| 4 | abiertos | 72 | abiert | abierto |
| ... | ... | ... | ... | ... |
| 5878 | 🤪 | 519 | 🤪 | 🤪 |
| 5879 | 🤬 | 483, 520, 908, 908 | 🤬 | 🤬 |
| 5880 | 🤯 | 615 | 🤯 | 🤯 |
| 5881 | 🤷 | 482, 736, 841, 947, 947, 947 | 🤷 | 🤷 |
| 5882 | 🥺 | 833, 851 | 🥺 | 🥺 |
5883 rows × 4 columns
To rebuild the original tweets, we use the unnest_tokens() function
again, this time with unique = False so that every token occurrence keeps its own row:
tweets_long = tp.unnest_tokens(df = tweets.copy(), input_column = "clean", id_col = None, unique = False)
tweets_long
| | Tweet | clean | id |
|---|---|---|---|
| 0 | RT @emilsen_manozca ¿Me regala una moneda pa u... | regala | 0 |
| 0 | RT @emilsen_manozca ¿Me regala una moneda pa u... | moneda | 0 |
| 0 | RT @emilsen_manozca ¿Me regala una moneda pa u... | pa | 0 |
| 0 | RT @emilsen_manozca ¿Me regala una moneda pa u... | cafe | 0 |
| 0 | RT @emilsen_manozca ¿Me regala una moneda pa u... | venezolano | 0 |
| ... | ... | ... | ... |
| 999 | RT infopresidencia: "Sin lugar a dudas hay uno... | recibido | 999 |
| 999 | RT infopresidencia: "Sin lugar a dudas hay uno... | cerca | 999 |
| 999 | RT infopresidencia: "Sin lugar a dudas hay uno... | venezolanos | 999 |
| 999 | RT infopresidencia: "Sin lugar a dudas hay uno... | presidente | 999 |
| 999 | RT infopresidencia: "Sin lugar a dudas hay uno... | i | 999 |
13557 rows × 3 columns
tweets_normalized = tweets_long \
.merge(dictionary_normalization, how = "left", on = "clean") \
.groupby(["id_x", "Tweet"])[["lemma", "stemm"]] \
.agg(lambda x: " ".join(x)) \
.reset_index()
tweets_normalized.head()
| | id_x | Tweet | lemma | stemm |
|---|---|---|---|---|
| 0 | 0 | RT @emilsen_manozca ¿Me regala una moneda pa u... | regalar moneda pa cafar venezolano tuitero ah... | regal moned pa caf venezolan no tuiter ah 😂 👋 |
| 1 | 1 | RT @CriptoNoticias Banco venezolano activa ser... | banco venezolano activo servicio usuario cript... | banc venezolan activ servici usuari criptomoned |
| 2 | 2 | Capturado venezolano que asesinó a comerciante... | capturado venezolano asesino comerciante merca... | captur venezolan asesin comerci merc public |
| 3 | 3 | RT @PersoneriaVpar @PersoneriaVpar acompaña al... | acompa grupo especial migratorio cesar reunion... | acomp grup especial migratori ces reunion real... |
| 4 | 4 | Bueno ya sacaron la carta de "amenaza de atent... | bueno sacar cartar amenazar atentado president... | buen sac cart amenaz atent president duqu func... |
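The merge and groupby step above amounts to a lookup followed by a join per tweet. The plain-Python sketch below illustrates that logic under stated assumptions: the data and the lemma_of dictionary are made up, an empty string marks a lemma removed as a stopword, and, unlike the pandas pipeline, the sketch skips empty lemmas and falls back to the raw token when the dictionary has no entry.

```python
# Long format: one (text_id, token) pair per token occurrence, in original order.
tokens_long = [
    (0, "bancos"), (0, "venezolanos"), (0, "no"),
    (1, "capturado"), (1, "venezolanos"),
]

# Hypothetical normalization dictionary, computed once per unique token.
lemma_of = {
    "bancos": "banco",
    "venezolanos": "venezolano",
    "no": "",  # removed as a stopword after lemmatization
    "capturado": "capturado",
}

rebuilt = {}
for text_id, token in tokens_long:
    lemma = lemma_of.get(token, token)  # assumption: keep the raw token if unmapped
    if lemma:  # drop tokens whose lemma was emptied
        rebuilt.setdefault(text_id, []).append(lemma)

# Join the lemmas of each text back into a single string.
normalized = {text_id: " ".join(lemmas) for text_id, lemmas in rebuilt.items()}
print(normalized)  # {0: 'banco venezolano', 1: 'capturado venezolano'}
```

Because Python dictionaries preserve insertion order, the tokens of each text are joined in the order they appeared in the long table.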
for i in range(3):
print("-"*50)
print("Example", i + 1)
print("Original tweet:", tweets_normalized.loc[i, "Tweet"])
print("Lemmatized tweet:", tweets_normalized.loc[i, "lemma"])
print("Stemmed tweet:", tweets_normalized.loc[i, "stemm"])
--------------------------------------------------
Example 1
Original tweet: RT @emilsen_manozca ¿Me regala una moneda pa un café? -¿Eres venezolano? Noo! Tuitero. -Ahhh 😂😂😂👋
Lemmatized tweet: regalar moneda pa cafar venezolano tuitero ah 😂 👋
Stemmed tweet: regal moned pa caf venezolan no tuiter ah 😂 👋
--------------------------------------------------
Example 2
Original tweet: RT @CriptoNoticias Banco venezolano activa servicio para usuarios de criptomonedas #ServiciosFinancieros https://t.co/1r2rZIUdlo
Lemmatized tweet: banco venezolano activo servicio usuario criptomoneda
Stemmed tweet: banc venezolan activ servici usuari criptomoned
--------------------------------------------------
Example 3
Original tweet: Capturado venezolano que asesinó a comerciante del Mercado Público https://t.co/XrmWKVYMR8 https://t.co/CfMLaB25jI
Lemmatized tweet: capturado venezolano asesino comerciante mercado publico
Stemmed tweet: captur venezolan asesin comerci merc public