Welcome to tidyX’s documentation!

tidyX is a Python package for cleaning and preprocessing text for machine learning applications, especially text written in Spanish and originating from social networks. It provides a pipeline that removes unwanted characters, normalizes text, and groups similar terms, among other steps, to prepare text for NLP applications.

Installation

Install the package using pip:

pip install tidyX

Make sure you have the necessary dependencies installed. If you plan on lemmatizing, you’ll need spaCy along with the appropriate language models. For Spanish lemmatization, we recommend downloading the es_core_news_sm model:

python -m spacy download es_core_news_sm

For English lemmatization, we suggest the en_core_web_sm model:

python -m spacy download en_core_web_sm

For a full list of available models for other languages, see spaCy’s documentation.
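Before running a lemmatization step, it can help to confirm that the spaCy models are actually installed. The following is a minimal sketch that uses only the Python standard library and relies on spaCy models being installed as regular importable packages. The helper name `missing_spacy_models` is hypothetical and is not part of tidyX or spaCy.

```python
from importlib.util import find_spec

# Model package names taken from the download commands in this section.
SPACY_MODELS = ("es_core_news_sm", "en_core_web_sm")


def missing_spacy_models(names=SPACY_MODELS):
    """Return the model packages in `names` that cannot be imported.

    Hypothetical helper for illustration; not part of tidyX or spaCy.
    """
    # find_spec returns None for a top-level package that is not installed,
    # which is the case for a model that has not been downloaded yet.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    for name in missing_spacy_models():
        print(f"Missing model, run: python -m spacy download {name}")
```

Running the script prints one download command per missing model and prints nothing when both models are available.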

Usage

Tutorials

User Documentation

Contributing

Contributions to enhance tidyX are welcome! Feel free to open issues for bug reports or feature requests, or to submit pull requests in our GitHub repo. If this package has been helpful, please give us a star :D