[PYTHON] How to use x-means

Summary of the k-means family

- k-means: Minimizes the squared error from each cluster's centroid.
- k-medoids: Runs the EM-style procedure so as to minimize the sum of dissimilarities to each cluster's medoid (the member point that minimizes the total dissimilarity within the cluster).
- x-means: Controls cluster splitting based on BIC.
- g-means: Controls cluster splitting with the Anderson-Darling test, assuming the data follows a normal distribution.
- gx-means: An extension combining the two above.
- etc. (See the pyclustering README; there are many more.)
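To make the difference between the first two objectives concrete, here is a small illustrative sketch (toy data, plain NumPy; not taken from pyclustering) that evaluates the k-means cost (sum of squared distances to the centroid) and the k-medoids cost (sum of distances to the best medoid) for a single cluster:

```python
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])

# k-means objective: squared distances to the centroid (the mean)
centroid = points.mean(axis=0)
kmeans_cost = np.sum((points - centroid) ** 2)

# k-medoids objective: distances to the best medoid, which must be
# one of the data points itself
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
kmedoids_cost = dists.sum(axis=0).min()
```

Because the medoid is constrained to be an actual data point, k-medoids tends to be less sensitive to outliers (such as the point at (10, 10) above) than k-means.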

Determining the number of clusters

Ideally a human could just look at the data and read off the number of clusters, but that is rarely possible, so a quantitative method is wanted.

The sklearn cheat sheet recommends an approach for this.

The elbow method is also useful, but in my experience a clean elbow (a point where the curve bends sharply) rarely appeared, and I was often left unsure how many clusters to choose.
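As a sketch of the elbow method, the snippet below runs a minimal hand-rolled Lloyd's k-means (a toy implementation written here for self-containedness, not sklearn's) over several values of k and collects the inertia; plotting inertia against k and looking for the bend gives the elbow:

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=50):
    """Minimal Lloyd's algorithm with a simple deterministic init;
    returns the sum of squared distances to the assigned centroids."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

# Toy data: three well-separated 2-D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

# Inertia keeps shrinking as k grows; the "elbow" is where the
# decrease suddenly flattens -- here around k = 3.
inertias = [kmeans_inertia(X, k) for k in range(1, 7)]
```

On real data the bend is often much less pronounced than on toy blobs like these, which is exactly the ambiguity mentioned above.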

x-means is a method that determines the number of clusters fully automatically while clustering.

Below is how to use pyclustering, a library that contains various clustering methods, x-means among them.

How to use pyclustering

pyclustering is a library of clustering algorithms implemented in both Python and C++.


Dependencies: scipy, matplotlib, numpy, PIL

pip install pyclustering

x-means usage example

Source code

In addition to the EM steps of k-means, x-means adds a new step: for each cluster it judges whether the cluster is better represented by one normal distribution or by two, and if two are more appropriate, it splits the cluster in two.
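The split test can be illustrated in one dimension: fit a single Gaussian to a cluster, fit one Gaussian to each half of a crude two-way split, and compare BIC values. Note that this is a simplified sketch of the idea only, not pyclustering's actual implementation: it uses a hard-assignment likelihood and ignores the mixture weight.

```python
import numpy as np

def gauss_loglik(x):
    # Log-likelihood of a 1-D sample under its own fitted normal
    n, var = len(x), x.var()
    return -0.5 * n * (np.log(2 * np.pi * var) + 1)

def bic_one_vs_two(x):
    """BIC for 'one Gaussian' vs 'split into two Gaussians'.
    The split is a crude 1-D two-way cut at the midpoint of the
    extreme points. Lower BIC = preferred model."""
    n = len(x)
    bic1 = 2 * np.log(n) - 2 * gauss_loglik(x)  # params: mean + variance
    t = (x.min() + x.max()) / 2
    left, right = x[x < t], x[x >= t]
    loglik2 = gauss_loglik(left) + gauss_loglik(right)
    bic2 = 4 * np.log(n) - 2 * loglik2          # params: two means + two variances
    return bic1, bic2

rng = np.random.default_rng(0)
bimodal = np.concatenate([rng.normal(0, 1, 100), rng.normal(10, 1, 100)])
b1, b2 = bic_one_vs_two(bimodal)
# For clearly bimodal data the two-Gaussian model wins (b2 < b1),
# so an x-means-style test would split this cluster.
```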

The code below is run in a Jupyter notebook.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import cluster, preprocessing

# Wine dataset
df_wine_all = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
# Use the variety (column 0, values 1-3), color intensity (column 10) and proline (column 13)
df_wine = df_wine_all[[0, 10, 13]]
df_wine.columns = [u'class', u'color', u'proline']
# Data shaping: standardize the two features used for clustering
X = df_wine[[u'color', u'proline']]
X_norm = preprocessing.scale(X)
x = X_norm[:, 0]
y = X_norm[:, 1]
z = df_wine[u'class']
%matplotlib inline
plt.subplot(4, 1, 1)
plt.scatter(x, y, c=z)

# x-means
from pyclustering.cluster.xmeans import xmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer

# Initialize 2 centers with k-means++, then let x-means split up to kmax clusters
xm_c = kmeans_plusplus_initializer(X_norm, 2).initialize()
xm_i = xmeans(data=X_norm, initial_centers=xm_c, kmax=20, ccore=True)
xm_i.process()  # run x-means

# Plot the results
z_xm = np.ones(X_norm.shape[0])
for k in range(len(xm_i._xmeans__clusters)):
    z_xm[xm_i._xmeans__clusters[k]] = k + 1

plt.subplot(4, 1, 2)
plt.scatter(x, y, c=z_xm)
centers = np.array(xm_i._xmeans__centers)
plt.scatter(centers[:, 0], centers[:, 1], s=250, marker='*', c='red')


The top panel is colored by the original class labels, and the bottom panel shows the clustering result from x-means. The ★ marks are the centroids of the clusters.

In the line xm_c = kmeans_plusplus_initializer(X_norm, 2).initialize(), the initial number of clusters is set to 2, but x-means still clusters properly into 3.

x-means itself is run by the call xm_i.process().

For the x-means instance (xm_i in the code above), inspecting the instance variables before and after learning shows what the learning result looks like. They can be listed with

xm_i.__dict__.keys()

which returns

dict_keys(['_xmeans__pointer_data', '_xmeans__clusters', '_xmeans__centers', '_xmeans__kmax', '_xmeans__tolerance', '_xmeans__criterion', '_xmeans__ccore'])

Each of these is worth a look.


_xmeans__pointer_data: A copy of the data to be clustered.

_xmeans__clusters: A list showing which rows of the original data (_xmeans__pointer_data) belong to each cluster. It has as many elements as there are clusters, and each element is itself a list holding the row indices that belong to that cluster.

_xmeans__centers: A list of the centroid coordinates (each itself a list) of each cluster.

_xmeans__kmax: The maximum number of clusters (the value you set).

_xmeans__tolerance: A constant defining the stopping condition for the x-means iterations. The algorithm terminates when the maximum change in the cluster centroids falls below this constant.

_xmeans__criterion: The criterion used to judge cluster splits. Default: BIC.

_xmeans__ccore: Whether to use the C++ implementation instead of the Python one.
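As an illustration of the tolerance-based stopping rule described above, a hypothetical helper (written for this explanation, not pyclustering's code) might look like:

```python
import numpy as np

def has_converged(old_centers, new_centers, tolerance=0.001):
    # Stop when the largest centroid movement between two
    # consecutive iterations drops below `tolerance`
    old = np.asarray(old_centers, dtype=float)
    new = np.asarray(new_centers, dtype=float)
    max_shift = np.max(np.linalg.norm(new - old, axis=1))
    return max_shift < tolerance

# A centroid that barely moved vs one that jumped
print(has_converged([[0.0, 0.0]], [[0.0, 0.0005]]))  # → True
print(has_converged([[0.0, 0.0]], [[1.0, 1.0]]))     # → False
```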
