Group similar terms

When working with a corpus sourced from social networks, it is common to encounter texts with grammatical errors or words that are not formally included in dictionaries. These irregularities pose a challenge when building Term Frequency matrices for NLP algorithms, because each misspelling or variant is counted as a separate term. To address this, we developed the create_bol() function, which clusters related terms into bags of terms.

from tidyX import TextPreprocessor as tp
import numpy as np

# Create a numpy array of words to cluster
words = np.array(['apple', 'aple', 'apples', 'banana', 'banan', 'bananas', 'cherry', 'cheri', 'cherries'])

# Apply create_bol function to group similar words
bol_df = tp.create_bol(lemmas = words)

print(bol_df)

bow_id  bow_name  lemma     similarity  threshold
1       apple     apple     100         86
1       apple     aple      89          86
1       apple     apples    91          86
2       banana    banana    100         85
2       banana    banan     91          85
2       banana    bananas   92          85
3       cherry    cherry    100         85
4       cheri     cheri     100         86
5       cherries  cherries  100         85

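Note that bol_df is a dataframe where each row corresponds to a word from the words array. In the output shown above, the apple variants are collected into one bag (bow_id 1) and the banana variants into another (bow_id 2), while cherry, cheri, and cherries each end up in a bag of their own (bow_id 3, 4, and 5). The result therefore contains five bags rather than three.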
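To connect the grouping back to Term Frequency matrices, the sketch below replaces each token with its bag name before counting. It is only an illustration, not part of tidyX: the lemma_to_bag dictionary is copied by hand from the lemma and bow_name columns of the output above, and the docs list and the term_frequencies() helper are hypothetical names made up for this example.

```python
from collections import Counter

# Lemma -> bag name, copied by hand from the lemma and bow_name
# columns of the output above (not produced by calling tidyX here).
lemma_to_bag = {
    "apple": "apple", "aple": "apple", "apples": "apple",
    "banana": "banana", "banan": "banana", "bananas": "banana",
    "cherry": "cherry", "cheri": "cheri", "cherries": "cherries",
}

# Hypothetical tokenized documents, invented for this illustration.
docs = [
    ["aple", "banana", "apples"],
    ["bananas", "banan", "cherry"],
]

def term_frequencies(tokens, mapping):
    """Count tokens after replacing each one with its bag name.

    Tokens that are missing from the mapping are counted unchanged.
    """
    return Counter(mapping.get(token, token) for token in tokens)

tf = [term_frequencies(doc, lemma_to_bag) for doc in docs]

print(tf[0])  # Counter({'apple': 2, 'banana': 1})
print(tf[1])  # Counter({'banana': 2, 'cherry': 1})
```

Assuming bol_df has the column names shown in the printed output, the same dictionary could probably be built with dict(zip(bol_df["lemma"], bol_df["bow_name"])) instead of typing it out.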