[PYTHON] Try cluster analysis using the K-means method

What is cluster analysis? What is the K-means method?

It's like this.

[Image: Screenshot 2014-03-19 23.59.06.png]

Cluster analysis: An analysis that groups similar data together.

K-means clustering: One of the cluster analysis methods.

There are many fields of application, but the easiest example to understand is a program that groups points scattered on a coordinate plane. That is what I will make this time.

How does it work?

According to Wikipedia, this is what it does.

The K-means method is generally implemented in the following flow. Let n be the number of data and K be the number of clusters.

  1. Randomly assign a cluster to each data point x_i (i = 1, ..., n).
  2. Calculate the center V_j (j = 1, ..., K) of each cluster from the data assigned to it. The calculation usually uses the arithmetic mean of each element of the assigned data.
  3. Find the distance between each x_i and each V_j, and reassign x_i to the cluster with the closest center.
  4. If no x_i changes cluster in the above step, or if the amount of change falls below a preset threshold, the process is considered to have converged and ends. Otherwise, recalculate V_j from the newly assigned clusters and repeat the above process.

What it does is simple, but the description sounds difficult. In short:

  1. Scatter the dots randomly and divide them into groups arbitrarily.
  2. Find the center of each group.
  3. The previous grouping was arbitrary, so regroup each dot into the group whose center is closest.
  4. Repeat steps 2 and 3 until nothing changes.

I think that is the idea.
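The four steps above can be sketched without any GUI. This is only a minimal illustration, not the program from this article: the function name k_means, the max_iter and seed parameters, and the rule for empty clusters (borrow a random point as the center) are assumptions made for the example.

```python
import random


def k_means(points, k, max_iter=100, seed=None):
    """Cluster 2-D points given as (x, y) tuples; returns (labels, centers)."""
    rng = random.Random(seed)
    # Step 1: assign every point to a random cluster.
    labels = [rng.randrange(k) for _ in points]
    centers = [None] * k
    for _ in range(max_iter):
        # Step 2: the center of each cluster is the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
            elif centers[j] is None:
                # Assumption: an empty cluster borrows a random point as its center.
                centers[j] = rng.choice(points)
        # Step 3: reassign each point to the cluster with the nearest center.
        new_labels = [
            min(range(k),
                key=lambda j: (x - centers[j][0]) ** 2 + (y - centers[j][1]) ** 2)
            for x, y in points
        ]
        # Step 4: stop as soon as no point changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


points = [(10, 10), (12, 14), (11, 9), (200, 210), (205, 198), (198, 202)]
labels, centers = k_means(points, k=2, seed=0)
print(labels)
print(centers)
```

Comparing squared distances is enough here, because taking the square root does not change which center is closest.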

I tried making it.

Here it is.

[Image: Screenshot 2014-03-19 23.59.06.png]

This is the image from the beginning. It is what I made this time.

Source

99 lines in total. It fits within 100 lines, which makes me happy. I'll explain it afterwards.

k-means.py


# -*- coding: utf-8 -*-
import wx
import random
import math


class MyMainWindow(wx.Frame):
    def __init__(self, parent=None, id=-1, title=None):
        #Set panel on frame
        wx.Frame.__init__(self, parent, id, title)
        self.panel = wx.Panel(self, size=(300, 300))
        self.panel.SetBackgroundColour('000000')
        self.Fit()
        #Event
        self.panel.Bind(wx.EVT_PAINT, self.on_paint)
        self.panel.Bind(wx.EVT_LEFT_DOWN, self.on_left_click)
        self.panel.Bind(wx.EVT_RIGHT_DOWN, self.on_right_click)
        #Variable initialization
        self.dots = []
        self.dc = None
        #There are three types of clusters: red, green, and blue
        self.cluster_types = ('#FF0000', '#00FF00', '#0000FF')
        self.clusters = [{'x': 0, 'y': 0} for _ in self.cluster_types]
        #Initial set of dots
        self.shuffle_dots()

    def on_paint(self, event):
        u"""Drawing event"""
        self.dc = wx.PaintDC(self.panel)
        #Draw the grid
        self.dc.SetPen(wx.Pen('#CCCCCC', 1))
        for x in range(0, 300, 10):
            self.dc.DrawLine(x, 0, x, 300)
        for y in range(0, 300, 10):
            self.dc.DrawLine(0, y, 300, y)
        #Draw the dots
        for dot in self.dots:
            self.dc.SetPen(wx.Pen(self.cluster_types[dot['cluster']], 5))
            self.dc.DrawPoint(dot['x'], dot['y'])
        #Draw the center of gravity of each cluster.
        self.draw_cluster()

    def on_left_click(self, evt):
        u"""Left click to recalculate cluster"""
        self.change_cluster()
        self.Refresh()

    def on_right_click(self, evt):
        u"""Right click to reset dot"""
        self.shuffle_dots()
        self.Refresh()

    def shuffle_dots(self):
        u"""Arrange dots randomly."""
        self.dots = []
        for i in range(30):
            self.dots.append({'x': random.randint(0, 300),
                              'y': random.randint(0, 300),
                              'cluster': random.randint(0, len(self.cluster_types) - 1)})

    def draw_cluster(self):
        u"""Draw a cluster."""
        self.clusters = []
        for c in range(len(self.cluster_types)):
            #Center of gravity of cluster = average of coordinates of dots belonging to cluster
            self.dc.SetPen(wx.Pen(self.cluster_types[c], 1))
            count = sum(1 for dot in self.dots if dot['cluster'] == c)
            x = sum(dot['x'] for dot in self.dots if dot['cluster'] == c) // count if count != 0 else 150
            y = sum(dot['y'] for dot in self.dots if dot['cluster'] == c) // count if count != 0 else 150
            self.clusters.append({'x': x, 'y': y})
            #Draw the cluster with a cross
            self.dc.DrawLine(x - 5, y - 5, x + 5, y + 5)
            self.dc.DrawLine(x + 5, y - 5, x - 5, y + 5)
            #Draw a line from the cluster to each dot.
            self.dc.SetPen(wx.Pen(self.cluster_types[c], 1))
            for dot in self.dots:
                if dot['cluster'] == c:
                    self.dc.DrawLine(x, y, dot['x'], dot['y'])

    def change_cluster(self):
        u"""Change the affiliation of each dot to the nearest cluster."""
        for dot in self.dots:
            near_dist = float('inf')
            #Distance between two points = √( (X1-X2)^2+(Y1-Y2)^2 )
            for c, center in enumerate(self.clusters):
                dist = math.sqrt(
                    (dot['x'] - center['x']) ** 2 + (dot['y'] - center['y']) ** 2)
                #Change to the nearest cluster
                if near_dist >= dist:
                    dot['cluster'] = c
                    near_dist = dist


if __name__ == '__main__':
    app = wx.App(False)
    w = MyMainWindow(title='K-Means Test')
    w.Center()
    w.Show()
    app.MainLoop()

Commentary

wxPython

This time I used a GUI library called wxPython. It runs on Mac, Windows, and Linux, and renders the window in each platform's native look.

class MyMainWindow(wx.Frame):

Create a class that inherits from wx.Frame like this, and then

if __name__ == '__main__':
    app = wx.App(False)
    w = MyMainWindow(title='K-Means Test')
    w.Center()
    w.Show()
    app.MainLoop()

create and show the window from the main block. That is the overall flow.

OK. From here on I'll explain the actual logic.

Scatter the dots randomly and group them arbitrarily

shuffle_dots()

This function scatters the points and stores them in a list called dots. Each point is assigned to a random group at the same time.

Find the center of each group

draw_cluster()

The average of the coordinates of the points belonging to a group is calculated and used as the center of that group. While at it, a line is drawn from the center to each point to make the grouping easier to see.
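The center calculation can be looked at in isolation, without any drawing. The sketch below uses the same dot dictionaries as the source above. The helper name cluster_center and its fallback parameter are assumptions for this example; the fallback of (150, 150) for an empty group mirrors what draw_cluster() does.

```python
def cluster_center(dots, cluster, fallback=(150, 150)):
    """Average position of the dots in `cluster`, or `fallback` if it is empty."""
    members = [dot for dot in dots if dot['cluster'] == cluster]
    if not members:
        return {'x': fallback[0], 'y': fallback[1]}
    # Integer division keeps the result on whole pixel coordinates.
    return {'x': sum(dot['x'] for dot in members) // len(members),
            'y': sum(dot['y'] for dot in members) // len(members)}


dots = [{'x': 10, 'y': 20, 'cluster': 0},
        {'x': 30, 'y': 40, 'cluster': 0},
        {'x': 200, 'y': 100, 'cluster': 1}]
print(cluster_center(dots, 0))  # {'x': 20, 'y': 30}
print(cluster_center(dots, 2))  # {'x': 150, 'y': 150} (empty group)
```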

Switch to the nearest group

change_cluster()

Distance between two points = √((X1 - X2)^2 + (Y1 - Y2)^2)

Use this formula to find the distance between each point and the center of each group, and switch the point to the closest group. I remembered that there was such a formula. This is the first time it has been useful since junior high school.
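As a small worked example of the formula, the sketch below picks the nearest center for a single dot. The helper name nearest_cluster is an assumption for this example. It uses math.hypot from the standard library, which computes the same square root of the sum of squares. One difference from change_cluster() in the source: on an exact tie this sketch picks the lowest group index, while the source, because of its >= comparison, picks the highest.

```python
import math


def nearest_cluster(dot, centers):
    """Index of the center closest to `dot` (ties go to the lowest index)."""
    distances = [math.hypot(dot['x'] - c['x'], dot['y'] - c['y'])
                 for c in centers]
    return distances.index(min(distances))


centers = [{'x': 0, 'y': 0}, {'x': 100, 'y': 0}, {'x': 0, 'y': 100}]
dot = {'x': 3, 'y': 4}
print(math.hypot(3 - 0, 4 - 0))       # 5.0
print(nearest_cluster(dot, centers))  # 0
```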

Draw on screen

on_paint()

This function is called whenever the panel needs to be drawn, because it was bound with self.panel.Bind(wx.EVT_PAINT, self.on_paint) at the beginning. The order of the explanation differs from the order of processing: draw_cluster() described above is called from this function, and change_cluster() is called from the left-click handler just before the redraw.

Repeat

Every time you left-click, the steps so far are recalculated. After a few clicks, the grouping settles into its final form. Right-click to scatter the dots again.

Summary

It's an analysis method that seems to have a high barrier to entry, but when you actually try it, it's relatively simple.

However, the K-means method tends to give biased results, so there seem to be improved analysis methods as well.

Just plotting points on coordinates and dividing them into groups feels a bit empty. It might be interesting to analyze how similar people's animal preferences are by assigning the X-axis to how much they like cats and the Y-axis to how much they like dogs.

I would also like to add two more axes and do a four-dimensional cluster analysis. It has a futuristic feel.
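Going from two to four dimensions changes very little in the calculation. The sketch below shows only the two building blocks, for coordinate tuples of any length. The function names centroid and nearest are assumptions for this example, and math.dist requires Python 3.8 or later.

```python
import math


def centroid(points):
    """Component-wise mean of equally sized coordinate tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def nearest(point, centers):
    """Index of the closest center under Euclidean distance."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


group = [(1, 2, 3, 4), (3, 4, 5, 6)]
print(centroid(group))  # (2.0, 3.0, 4.0, 5.0)
print(nearest((0, 0, 0, 1), [(0, 0, 0, 0), (5, 5, 5, 5)]))  # 0
```

The only part that does not carry over is the drawing, since a 300 x 300 panel can show just two of the axes at a time.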
