Group similar terms

When working with a corpus sourced from social networks, it’s common to encounter texts with grammatical errors or words that aren’t formally included in dictionaries. These irregularities can pose challenges when creating Term Frequency matrices for NLP algorithms. To address this, we developed the create_bol() function, which allows you to create specific bags of terms to cluster related terms.

from tidyX import TextPreprocessor as tp
import numpy as np

# Create a numpy array of words to cluster
words = np.array(['apple', 'aple', 'apples', 'banana', 'banan', 'bananas', 'cherry', 'cheri', 'cherries'])

# Apply create_bol function to group similar words
bol_df = tp.create_bol(lemmas = words)

print(bol_df)

bow_id	bow_name	lemma	similarity	threshold
1	apple	apple	100	86
1	apple	aple	89	86
1	apple	apples	91	86
2	banana	banana	100	85
2	banana	banan	91	85
2	banana	bananas	92	85
3	cherry	cherry	100	85
4	cheri	cheri	100	86
5	cherries	cherries	100	85

Note that bol_df is a dataframe where each row corresponds to a word from the words array. In this case, the function groups all the words into three categories: apple, banana, and cherry.