This article uses Python 2.7, numpy 1.11, scipy 0.17, scikit-learn 0.18, matplotlib 1.5, seaborn 0.7, and pandas 0.17. It has been confirmed to work in Jupyter Notebook (adjust the %matplotlib inline line as appropriate for your environment). It uses seaborn's boxplot and violinplot.
If you already have your own data, skip this data-generation step.
Use scikit-learn's make_classification to create 1000 samples of two-dimensional, two-class data. Let A and B be the two numerical columns, and let sex be the class label. In addition, numpy.random.binomial() randomly generates the values 0, 1, and 2, which are concatenated to the data as a second categorical column, types.
make_classification.py
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# 1000 samples with 2 informative features and 2 classes
x, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=2, n_classes=2)
# Append the class label and a random category (0, 1, or 2) as extra columns
data = np.c_[np.c_[x, y], np.random.binomial(2, .5, len(x))]
data = pd.DataFrame(data).rename(columns={0: 'A', 1: 'B', 2: 'sex', 3: 'types'})
The contents of data look like this:
          A         B  sex  types
0  2.131411 -1.754907    0      1
1 -0.046614 -1.009540    0      2
2  0.136387 -0.236662    1      1
3 -3.515190  2.117925    1      1
4 -2.099287  1.647548    1      1
5 -0.536360 -0.920529    0      0
6  0.281726 -0.572448    1      2
7  2.202351 -3.214435    0      1
8 -0.825666  0.847394    1      0
9 -1.602873  1.338847    1      2
We now have a dataset with two numerical columns (A and B) and two categorical columns (sex and types).
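One detail worth checking: because np.c_ stacks everything into a single floating-point array, the sex and types columns are stored as floats. The following is a minimal sketch, assuming you would prefer integer categories. The random stand-in arrays and the fixed seed are my own assumptions so that the snippet runs without scikit-learn; they are not part of the original code.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)   # fixed seed: an assumption, for reproducibility
x = rng.randn(1000, 2)           # hypothetical stand-in for the two features
y = rng.randint(0, 2, 1000)      # hypothetical stand-in for the class labels

# Same construction as in make_classification.py
data = np.c_[np.c_[x, y], rng.binomial(2, .5, len(x))]
data = pd.DataFrame(data).rename(columns={0: 'A', 1: 'B', 2: 'sex', 3: 'types'})
print(data.dtypes)               # all four columns are float64 at this point

# Cast the two category columns back to integers, one column at a time
data['sex'] = data['sex'].astype(int)
data['types'] = data['types'].astype(int)
print(data.dtypes)
print(data['types'].value_counts().sort_index())
```

The plots work either way; the cast only changes how the category values are stored and labelled.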
boxplot.py
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
sns.boxplot(x='types', y="A", hue='sex', data=data, palette="PRGn")
sns.despine(offset=10, trim=True)
This draws a box plot for each combination of sex and types. As described on Wikipedia, the line in the middle of the box is the median, and the bottom and top of the box are the 1st and 3rd quartiles, respectively. The ends of the whiskers are the minimum and maximum values once outliers are excluded, and the points beyond the whiskers are "outliers", judged by their distance from the 1st and 3rd quartiles. Let's visualize the other numerical column, B, in the same way.
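The elements of a box plot (the box, the whiskers extending from it, and the outlier points) can be computed by hand. This is a minimal sketch, assuming the common rule that whiskers reach the most extreme points within 1.5 times the interquartile range of the box, which is the matplotlib default. The sample values are a hypothetical stand-in for one sex/types group of column A.

```python
import numpy as np

rng = np.random.RandomState(0)
values = rng.randn(200)          # hypothetical stand-in for one group of A

# Box: 1st quartile, median, 3rd quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (assumed default rule)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual outlier point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("box: %.3f to %.3f, median %.3f" % (q1, q3, median))
print("whiskers: %.3f to %.3f" % (whisker_low, whisker_high))
print("number of outliers: %d" % len(outliers))
```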
For B, the difference between the sexes appears clearly. (The data differ each time they are generated.) As for types, it looks difficult to classify from this data alone. In this way, boxplot expresses the spread of the data across two categorical variables in an easy-to-understand manner.
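The same comparison can be made numerically by tabulating the median of B for each types/sex cell, which is the quantity the middle line of each box shows. This is only a sketch on hypothetical data: the frame demo below is constructed so that B is shifted by sex and unrelated to types, as an assumption that mimics the situation described above.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 1000
sex = rng.randint(0, 2, n)
types = rng.binomial(2, .5, n)
# Hypothetical stand-in: B depends on sex but not on types
B = rng.randn(n) + np.where(sex == 1, 2.0, -2.0)
demo = pd.DataFrame({'B': B, 'sex': sex, 'types': types})

# Rows are types, columns are sex, cells are the median of B
medians = demo.groupby(['types', 'sex'])['B'].median().unstack('sex')
print(medians)
```

If the medians differ strongly between the columns but hardly at all between the rows, the table tells the same story as the box plot.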
Reshape the DataFrame using the melt function of pandas.
melt.py
data_batch = pd.melt(data, id_vars = ['types', 'sex'], value_vars = data.columns[:-2].tolist())
print(data_batch[:10])
This "unpivots" the DataFrame from wide to long format. Here is the result:
   types  sex variable     value
0      1    0        A  2.131411
1      2    0        A -0.046614
2      1    1        A  0.136387
3      1    1        A -3.515190
4      1    1        A -2.099287
5      0    0        A -0.536360
6      2    1        A  0.281726
7      1    0        A  2.202351
8      0    1        A -0.825666
9      2    1        A -1.602873
The variable column holds the name of the original numerical column (A or B), and the value column holds the corresponding number.
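To see exactly what melt does, it may help to run it on a frame small enough to read in full. This is a minimal sketch on a hypothetical two-row frame (the name wide and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical two-row frame with the same column layout as data
wide = pd.DataFrame({'A': [0.1, 0.2], 'B': [1.1, 1.2],
                     'sex': [0, 1], 'types': [2, 0]},
                    columns=['A', 'B', 'sex', 'types'])

# id_vars are repeated for every row; value_vars are stacked into
# a 'variable' column (the old column name) and a 'value' column
long_form = pd.melt(wide, id_vars=['types', 'sex'], value_vars=['A', 'B'])
print(long_form)
```

Two rows with two value columns become four rows: all the A values first, then all the B values, each keeping its types and sex.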
Visualize the unpivoted data using violinplot.
violinplot.py
data_batch_A = data_batch[data_batch.variable=='A']
sns.violinplot(x = 'types', y = 'value', hue = 'sex', data = data_batch_A, split=True)
sns.despine(offset=10, trim=True)
With split=True, the two sexes are drawn as the left and right halves of each violin, which emphasizes the contrast between them. With boxplot I only looked at the median and quartiles, so I had the impression that everything was roughly normally distributed. A violin plot, on the other hand, draws the shape of the distribution itself (a kernel density estimate), so multiple peaks (multimodality) can be observed within each type. Visualize the numerical data of B in the same way.
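The difference between the two views can be reproduced numerically. The sketch below builds a hypothetical two-cluster sample, then compares its quartiles with a kernel density estimate evaluated at three points. scipy's gaussian_kde is used here as an assumption for illustration; the bandwidth and other settings seaborn applies internally may differ.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.RandomState(0)
# Hypothetical bimodal sample: two clusters centred at -3 and +3
values = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])

# What a box plot summarizes: three numbers that say nothing about peaks
q1, median, q3 = np.percentile(values, [25, 50, 75])
print("quartiles: %.2f %.2f %.2f" % (q1, median, q3))

# What a violin plot draws: an estimated density over the whole range
kde = gaussian_kde(values)
left, middle, right = kde([-3.0, 0.0, 3.0])
print("density at -3: %.3f" % left)
print("density at  0: %.3f" % middle)
print("density at +3: %.3f" % right)
```

The median falls in the gap between the two clusters, where the estimated density is lowest, which is the kind of structure only the violin plot makes visible.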
As with boxplot, the distribution of B is clearly separated by sex. As for types, I feel that classification is not possible just from the shape of the distribution.
I introduced boxplot and violinplot. Boxplot is useful when you want to focus on the median and quartiles, and violinplot when you want to see the shape and multimodality of the distribution. Either way, they are convenient for visualizing each variable on its own, without considering correlations between variables.
References: "Summary of scikit-learn data sources that can be used when writing analysis articles", boxplot, violinplot