Clustering text in Python

from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

Data reading

The input file input.txt is assumed to contain one document per line.

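# Assumes input.txt is UTF-8 encoded
with open('input.txt', encoding='utf-8') as f:
    # Strip newlines and skip blank lines (an empty document has no cosine distance)
    org_sentences = [line.strip() for line in f if line.strip()]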

Morphological analysis

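Split each document into words and join the words with single-byte (ASCII) spaces, so that the vectorizer can treat each document as a space-separated sequence of tokens.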

t = Tokenizer()
sentences = []
for s in org_sentences:
    # wakati=True yields only the surface forms (word segmentation mode)
    tmp = ' '.join(t.tokenize(s, wakati=True))
    sentences.append(tmp)

Vectorization

Here, TF-IDF is used to vectorize the documents. Other options include BoW, LSI, LDA, averaged Word2Vec vectors, Doc2Vec, averaged FastText vectors, and BERT.

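# token_pattern is overridden so that single-character words are kept as tokens
vectorizer = TfidfVectorizer(use_idf=True, token_pattern=r'(?u)\b\w+\b')
vecs = vectorizer.fit_transform(sentences)
# Convert the sparse matrix to a dense array for linkage()
v = vecs.toarray()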
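The token_pattern argument matters because the scikit-learn default pattern only keeps tokens of two or more characters, and that would drop single-character words such as Japanese particles. The following is a minimal sketch of the difference. The documents in toy_docs are hypothetical ASCII stand-ins for tokenized text, not data from this article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-tokenized documents (words already joined by spaces).
toy_docs = ["i like a cat", "i like a dog"]

# Default pattern: tokens must have at least two word characters.
default_vec = TfidfVectorizer()
# Pattern used in this article: one or more word characters.
custom_vec = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')

default_vec.fit(toy_docs)
custom_vec.fit(toy_docs)

default_vocab = sorted(default_vec.vocabulary_)
custom_vocab = sorted(custom_vec.vocabulary_)

print(default_vocab)  # single-character tokens are missing
print(custom_vocab)   # single-character tokens are kept
```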

Distance calculation

Cosine distance, which is commonly used in natural language processing tasks, is used as the distance between vectors. Based on that distance, the documents are merged into clusters by hierarchical clustering (single linkage, which is the default method of linkage).

# method='single' is the default; it is written out here for clarity
z = linkage(v, method='single', metric='cosine')
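To make the metric concrete: cosine distance is 1 minus the cosine similarity, so vectors pointing in the same direction have a distance near 0 and orthogonal vectors have a distance of 1. Below is a minimal pure-Python sketch. The function name cosine_distance and the sample vectors are illustrative only and are not part of SciPy. The distance is undefined for an all-zero vector, so the sketch assumes non-zero inputs.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Same direction, different length: distance is (approximately) 0.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
# No shared terms (orthogonal vectors): distance is 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

Because vector length is ignored, a long document and a short document that use the same words in the same proportions are treated as close.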

Clustering

In this example, the final clusters are determined using a distance of 0.2 as the threshold. When the number of documents is large, the distance calculation takes a considerable amount of time, so if you want to try multiple thresholds, save the result of the distance calculation above (z) with pickle once and reuse it. It is also possible to specify the number of clusters instead of a distance threshold (criterion='maxclust'). The cluster number of each document is stored in group.

group = fcluster(z, 0.2, criterion='distance')
print(group)
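The workflow described above (save the linkage result once, then try several thresholds or a fixed number of clusters) could be sketched as follows. The array toy_v is a hypothetical stand-in for the TF-IDF matrix, and the file name linkage.pkl and the threshold values are assumptions chosen for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical document vectors standing in for the TF-IDF matrix.
toy_v = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

# The expensive step: compute the linkage matrix once.
toy_z = linkage(toy_v, method='single', metric='cosine')

# Save it with pickle and load it back (a temporary directory is used here).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'linkage.pkl')
    with open(path, 'wb') as f:
        pickle.dump(toy_z, f)
    with open(path, 'rb') as f:
        loaded_z = pickle.load(f)

# Try several distance thresholds without recomputing the distances.
cluster_counts = {}
for threshold in (0.001, 0.2, 0.9):
    labels = fcluster(loaded_z, threshold, criterion='distance')
    cluster_counts[threshold] = len(set(labels))
    print(threshold, labels)

# Alternatively, ask for at most 2 clusters instead of giving a distance.
two_clusters = fcluster(loaded_z, 2, criterion='maxclust')
print(two_clusters)
```

A smaller threshold yields more, tighter clusters, and a larger one merges more documents together.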
