Introduction

This article uses Python 2.7, numpy 1.11, scipy 0.17, scikit-learn 0.18, matplotlib 1.5, seaborn 0.7, and pandas 0.17. It has been confirmed to work in Jupyter Notebook. (Adjust %matplotlib inline as needed for your environment.) We use seaborn's boxplot and violinplot.

Data generation
Boxplot
Violinplot
Finally
Reference

1. Data generation

If you have your own data, skip this section.

Use scikit-learn's make_classification to create 1000 samples of two-dimensional, two-class data. Name the two numerical features A and B, and name the class label sex. In addition, use numpy.random.binomial() to randomly generate the values 0, 1, and 2, and concatenate them as a fourth column named types.

`make_classification.py`


import numpy as np
from sklearn.datasets import make_classification
import pandas as pd

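# Two informative features (x) and a binary class label (y), 1000 samples.
x, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=2, n_classes=2)
# Append y and a random column of 0/1/2 values as the third and fourth columns.
data = np.c_[x, y, np.random.binomial(2, .5, len(x))]
data = pd.DataFrame(data, columns=['A', 'B', 'sex', 'types'])
# np.c_ promotes every column to float, so cast the two label columns back to int.
data[['sex', 'types']] = data[['sex', 'types']].astype(int)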

The first 10 rows of data look like this:

          A         B  sex  types
0  2.131411 -1.754907    0      1
1 -0.046614 -1.009540    0      2
2  0.136387 -0.236662    1      1
3 -3.515190  2.117925    1      1
4 -2.099287  1.647548    1      1
5 -0.536360 -0.920529    0      0
6  0.281726 -0.572448    1      2
7  2.202351 -3.214435    0      1
8 -0.825666  0.847394    1      0
9 -1.602873  1.338847    1      2

We now have two numerical variables (A and B) together with two categorical variables (sex and types).
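One detail about the types column is worth noting: np.random.binomial(2, .5, n) does not draw 0, 1, and 2 with equal probability. Each value is the number of successes in two fair coin flips, so the expected proportions are 0.25, 0.5, and 0.25. The following is a minimal sketch that checks this empirically. The seed and the sample size are arbitrary choices for illustration and are not part of the original code.

```python
import numpy as np

# Assumption for illustration: a fixed seed and a large sample size.
rng = np.random.RandomState(0)
types = rng.binomial(2, 0.5, 100000)

# Count how often each value (0, 1, 2) occurs and convert to proportions.
counts = np.bincount(types, minlength=3)
proportions = counts / float(len(types))

print(proportions)  # roughly [0.25, 0.5, 0.25]
```

In the plots that follow, this means the group with types equal to 1 contains about twice as many samples as each of the other two groups.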

Boxplot

A box plot is well suited to visualizing the spread of numerical data grouped by two categorical variables. Use seaborn's boxplot.

`boxplot.py`


import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

sns.boxplot(x='types', y="A", hue='sex', data=data, palette="PRGn")
sns.despine(offset=10, trim=True)

[Figure: box plot of A by types, split by sex]

I was able to draw a box plot for each combination of sex and type. As described on Wikipedia, the line in the middle of the box is the median, and the bottom and top of the box are the 1st and 3rd quartiles, respectively. The ends of the whiskers are the minimum and maximum values, excluding outliers. The individual points above and below the whiskers are "outliers": values that lie more than 1.5 times the interquartile range beyond the 1st or 3rd quartile (seaborn's default). Let's visualize the other numerical variable, B, as well.

[Figure: box plot of B by types, split by sex]

B seems to reflect the difference between the sexes clearly. (The data differ each time they are generated, so your plot may look different.) As for types, it would be hard to classify from these data alone, which is expected because types was generated independently of A and B. In this way, boxplot shows the spread of data grouped by two categorical variables in an easy-to-understand manner.
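The quantities that a box plot draws can also be computed directly, which helps when you want the numbers behind the picture. The sketch below uses a hypothetical sample (not the article's data) and assumes the 1.5 times IQR rule, which is the default in matplotlib and seaborn (whis=1.5).

```python
import numpy as np

# Hypothetical sample for illustration: 200 normal values plus two extremes.
rng = np.random.RandomState(0)
values = np.concatenate([rng.normal(0.0, 1.0, 200), [6.0, -7.0]])

# Box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box (assumed default, whis=1.5).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual outlier point.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1=%.2f median=%.2f Q3=%.2f" % (q1, median, q3))
print("whiskers: %.2f to %.2f" % (whisker_low, whisker_high))
print("number of outliers: %d" % len(outliers))
```

The two extreme values added by hand fall outside the fences, so they are reported as outliers and do not stretch the whiskers.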

Violinplot

Like the box plot, the violin plot visualizes differences in numerical data grouped by two categorical variables. Here, each group of values is drawn as a distribution.

Data preparation

Reshape the DataFrame using the melt function of pandas.

`melt.py`


# Keep the category columns as identifiers and stack A and B into one value column.
data_batch = pd.melt(data, id_vars=['types', 'sex'], value_vars=data.columns[:-2].tolist())
print(data_batch[:10])

This "unpivots" the DataFrame from wide format to long format. Here is the output:

   types  sex variable     value
0      1    0        A  2.131411
1      2    0        A -0.046614
2      1    1        A  0.136387
3      1    1        A -3.515190
4      1    1        A -2.099287
5      0    0        A -0.536360
6      2    1        A  0.281726
7      1    0        A  2.202351
8      0    1        A -0.825666
9      2    1        A -1.602873

The original column name of each numerical value is stored in the variable column, and the number itself in the value column.
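To see what the unpivot does on a small scale, here is a minimal sketch with a hypothetical three-row table in the same layout as data. The names wide_df and long_df and the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical three-row table in the same layout as `data`.
wide_df = pd.DataFrame({
    'A': [2.1, -0.5, 0.3],
    'B': [-1.8, -0.9, -0.6],
    'sex': [0, 0, 1],
    'types': [1, 0, 2],
})

# Keep the category columns as identifiers; stack A and B into one value column.
long_df = pd.melt(wide_df, id_vars=['types', 'sex'], value_vars=['A', 'B'])

print(long_df)
```

Each row of the wide table yields one row per value column, so 3 rows with 2 value columns become 6 rows: first all the A values, then all the B values.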

Create violin plot

Visualize the unpivoted data using violinplot.

`violinplot.py`


data_batch_A = data_batch[data_batch.variable=='A']
sns.violinplot(x = 'types',  y = 'value', hue = 'sex', data = data_batch_A, split=True)
sns.despine(offset=10, trim=True)

[Figure: split violin plot of A by types, with sex on the left and right halves]

With split=True, the left and right halves of each violin show the two sex groups, which emphasizes the contrast between them. With the box plot I only looked at the median and quartiles, so I had the impression that each group was roughly normally distributed. The violin plot, on the other hand, draws an estimate of the distribution itself (a kernel density estimate), so multiple peaks (multimodality) can be observed within each type. Let's visualize the numerical variable B in the same way.

[Figure: split violin plot of B by types, with sex on the left and right halves]

As with the box plot, you can see that the distribution of B is clearly separated by sex. As for types, judging from the shapes of the distributions, I feel that classification is probably not possible.
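The outline of a violin is a kernel density estimate (KDE) of the data, which is why it can reveal peaks that quartiles hide. The following is a rough sketch of the idea, not seaborn's implementation: gaussian_kde_1d is a hypothetical hand-written helper, the bimodal sample is made up, and the fixed bandwidth of 0.4 is an assumed value (seaborn chooses its bandwidth automatically).

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.mean(axis=1) / bandwidth

# Hypothetical bimodal sample: two clusters centred at -2 and +2.
rng = np.random.RandomState(0)
samples = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(2.0, 0.5, 500)])

# The quartiles a box plot draws give no hint of the gap in the middle ...
q1, median, q3 = np.percentile(samples, [25, 50, 75])

# ... while the density estimate shows a valley between two peaks.
grid = np.linspace(-4.0, 4.0, 161)
density = gaussian_kde_1d(samples, grid, bandwidth=0.4)  # assumed bandwidth

valley = density[np.argmin(np.abs(grid - 0.0))]
left_peak = density[grid < 0].max()
right_peak = density[grid > 0].max()

print("quartiles: %.2f %.2f %.2f" % (q1, median, q3))
print("valley=%.3f left peak=%.3f right peak=%.3f" % (valley, left_peak, right_peak))
```

The density near zero is far lower than at either peak, even though the median sits close to zero. A smaller bandwidth makes the curve bumpier and a larger one smooths the peaks together, so the apparent number of peaks depends on this choice.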

Finally

I introduced boxplot and violinplot. Boxplot is useful when you want to focus on the quartiles and the median, and violinplot when you want to see the shape and multimodality of the distribution. Either way, both are convenient for visualizing each variable on its own, without considering the correlations between variables.

Reference

Summary of scikit-learn data sources that can be used when writing analysis articles
boxplot
violinplot

[PYTHON] About Boxplot and Violinplot that visualize the variability of independent data