TextNormalization

class tidyX.text_normalization.TextNormalization

static is_emoji(s: str) → bool

Checks whether a given string is an emoji.

static lemmatizer(token: str, model: Language) → str

Lemmatizes a given token using a spaCy language model, typically one of the Spanish models listed below.

Lemmatization is the process of reducing a word to its base or dictionary form. For example, the word “running” would be lemmatized to “run”. Lemmatization takes into account the meaning of the word in the sentence, leveraging vocabulary and morphological analysis.

Note: Before using this function, download a spaCy model with python -m spacy download name_of_model. Available models for Spanish are: “es_core_news_sm”, “es_core_news_md”, “es_core_news_lg”, “es_dep_news_trf”. For more information, visit https://spacy.io/models/

Args:
    token (str): The token to be lemmatized.
    model (spacy.language.Language): A spaCy language model object.

Returns:
    str: The lemmatized version of the token, with accents removed.
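
The return value is described as having accents removed. The following is a minimal sketch of what such a post-processing step could look like, using only the Python standard library. The helper name strip_accents is hypothetical and is not part of tidyX; the library's own implementation may differ.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining accent marks (hypothetical helper, not tidyX's API)."""
    # NFD splits each accented character into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and keep everything else.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("canción"))  # cancion
```

One consequence of this approach is that “ñ” is also mapped to “n”, because it decomposes into “n” plus a combining tilde. Whether tidyX behaves the same way is not stated here.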

stemmer(language: str = 'spanish') → str

Stems a given token using the Snowball stemmer.

Stemming is the process of reducing a word to its base or root form, usually by stripping suffixes. For instance, the word “running” might be stemmed to “run”. Unlike lemmatization, stemming does not always produce a valid word and does not consider the word’s meaning in context.

This function uses the Snowball stemmer, which supports multiple languages including Spanish.

Note: This function requires NLTK. If it is not already installed, run pip install nltk.

Args:
    token (str): The token to be stemmed.
    language (str, optional): The language of the token. Defaults to “spanish”.

Returns:
    str: The stemmed version of the token.
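
To illustrate why stemming does not always produce a valid word, here is a deliberately naive suffix stripper. It is a toy, not the Snowball algorithm and not tidyX code; the name toy_stem and the suffix list are assumptions made up for this example.

```python
# Illustrative suffix list only; real stemmers such as Snowball use far richer rules.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


for word in ("running", "jumped", "cats"):
    print(word, "->", toy_stem(word))
# running -> runn
# jumped -> jump
# cats -> cat
```

The result “runn” is not a dictionary word, which is the kind of output a lemmatizer avoids by consulting vocabulary and morphology instead of cutting suffixes mechanically.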