**Principal component analysis** is a technique for summarizing many variables into a small number of new variables. In other words, it compresses the information held by a set of variables, synthesizes new variables, and reconstructs the data so that both the overall tendency and individual characteristics become visible. These new variables are called **principal components**.
Suppose we have the test scores of 20 students in one class for 5 subjects: mathematics, science, social studies, English, and Japanese. When you want to know how strong a student's academic ability is, you usually look at the total score. Let's write the five subjects as variables $x_{1}, x_{2}, x_{3}, x_{4}, x_{5}$.
\text{Total score} = 1 \times x_{1} + 1 \times x_{2} + 1 \times x_{3} + 1 \times x_{4} + 1 \times x_{5}
I deliberately wrote the $1 \times$: the total score is simply the sum of the five subjects, each multiplied by the same weight of $1$. What is happening here is that information that originally had 5 subjects = 5 dimensions is summarized into a single dimension called "academic ability". Judging academic ability by the total score rests on the premise that a student with high academic ability will score well in any subject, in other words, that the subjects are correlated with each other.
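As a minimal sketch of this idea (using the first student's scores from the data introduced below), the total score is just a dot product between the score vector and a weight vector of all ones:

import numpy as np

#Scores of one student in Math, Science, Social studies, English, Japanese
x = np.array([71, 64, 83, 100, 71])
#The total score weights every subject equally with 1
w = np.ones(5)
print(np.dot(w, x))   #389.0, the same as x.sum()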
Now, let's prepare dummy data for the 20 students in one class and first look at the data as a whole.
#Library for numerical calculations and data frame operations
import numpy as np
import pandas as pd
#Library for drawing graphs
import matplotlib.pyplot as plt
%matplotlib inline
#Machine learning library
import sklearn
from sklearn.decomposition import PCA
#Test score data for 20 people
arr = np.array([[71,64,83,100,71], [34,48,67,57,68], [58,59,78,87,66], [41,51,70,60,72],
[69,56,74,81,66], [64,65,82,100,71], [16,45,63,7,59], [59,59,78,59,62],
[57,54,84,73,72], [46,54,71,43,62], [23,49,64,33,70], [39,48,71,29,66],
[46,55,68,42,61], [52,56,82,67,60], [39,53,78,52,72], [23,43,63,35,59],
[37,45,67,39,70], [52,51,74,65,69], [63,56,79,91,70], [39,49,73,64,60]])
#Convert to data frame
df = pd.DataFrame(data = arr, columns = ['Math', 'Science', 'Social studies', 'English', 'Japanese'])
#Import Pandas Plot method
from pandas import plotting
#Draw a scatter plot
plotting.scatter_matrix(df, alpha=0.5, figsize=(8, 8))
plt.show()
In `pandas.plotting.scatter_matrix(frame, alpha, figsize)`, which draws the scatter plot matrix, the argument `alpha` is the transparency of the plotted points (0 to 1) and `figsize` is the figure size (width, height) in inches.
Looking at the plot, you can see right-upward, roughly linear distributions, for example between science and mathematics: students with higher science scores tend to have higher math scores, meaning the variables are correlated. If so, the idea of **principal component analysis** is that the information can be compressed more compactly.
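To back up the claim that the subjects are correlated, one quick check (not part of the original code) is to print the correlation matrix of the same data frame; Math and Science should show a clearly positive coefficient, matching the right-rising pattern in the scatter plots.

#Pairwise correlation coefficients between the 5 subjects
print(df.corr())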
The aim of principal component analysis is to compress the data while losing as little information as possible. We want to minimize the loss of information caused by reducing the number of dimensions; in other words, we look for new variables that retain as much of the original information as possible. Now let's confirm that principal component analysis actually takes these steps.
First, set the vector of 5 subjects = 5 variables as follows.
x = \left(
\begin{array}{ccc}
x_{1} \\
x_{2} \\
x_{3} \\
x_{4} \\
x_{5}
\end{array}
\right)
**Calculation of the first principal component**

A principal component is defined as a weighted sum of the variables, with a coefficient attached to each variable:
z_{1} = w_{11}x_{1} + w_{12}x_{2} + w_{13}x_{3} + w_{14}x_{4} + w_{15}x_{5} = w_{1} \cdot x
The vector $w_{1}$ on the right-hand side is shown below; in the earlier total-score example, every entry of this vector was $1$.
w_{1} = \left(
\begin{array}{ccc}
w_{11} \\
w_{12} \\
w_{13} \\
w_{14} \\
w_{15}
\end{array}
\right)
So what kind of $w_{1}$ should we look for? Again, it is the $w_{1}$ that maximizes the amount of information. What is this "amount of information"? In principal component analysis, information = variance. Variance measures how spread out the data is, so why does a large spread mean a large amount of information? As an analogy, suppose everyone got a perfect score of 10 on a very simple quiz at the end of a lesson. The variance is $0$, and in this case there is no information that characterizes any individual. On the other hand, if the variance is reasonably large, as in the 5 subjects of our example, we can make judgments such as "90 points or more is good, 20 points is not good". That is why we can say maximum variance = maximum information.
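As a tiny illustration (with made-up numbers), a zero-variance set of scores tells us nothing about individuals, while spread-out scores do:

#A quiz where everyone scores 10: variance 0, nothing distinguishes the students
print(np.var([10, 10, 10, 10, 10]))
#Spread-out scores: positive variance, the scores carry information about individuals
print(np.var([90, 20, 55, 70, 35]))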
However, "maximum variance" on its own can be made as large as you like: if you multiply every entry of $w_{1}$ by 100, then $z_{1}$ is also multiplied by 100 and its variance by 10000. The point is not to make the variance endlessly large; what we want to know is the relative weight of each subject, that is, in what proportions the information is divided up. So we keep the size of the weight vector fixed:
\|w_{1}\| = 1
Under the constraint that the weight vector has length $1$, we find the $w_{1}$ that maximizes the amount of information $V[z_{1}]$. In other words, we find the direction, and the ratio of weights, in which the amount of information is largest; this is the **first principal component**. The same procedure is then repeated.
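Although scikit-learn is used later, the maximization itself can be sketched directly: the $w_{1}$ that maximizes the variance of $z_{1}$ under $\|w_{1}\| = 1$ is the eigenvector of the covariance matrix with the largest eigenvalue (a standard linear-algebra result; this sketch reuses the data frame `df` defined above, and the sign of the vector is arbitrary).

#Covariance matrix of the 5 subjects (rows of the input = variables)
cov = np.cov(df.values.T)
#Eigendecomposition; np.linalg.eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
#The eigenvector belonging to the largest eigenvalue is w1
w1 = eigvecs[:, -1]
print(np.linalg.norm(w1))                 #1.0, the unit-length constraint
z1 = df.values @ w1
print(np.var(z1, ddof=1), eigvals[-1])    #the maximized variance equals the top eigenvalue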
**Calculation of the second principal component**

We already know that $w_{1}$ gives the maximum amount of information, so this time we look for a weighting of a different kind from $w_{1}$ that again maximizes the amount of information. We want $w_{2}$ to point in a different direction from $w_{1}$, so one more condition is added:
\|w_{2}\| = 1, w_{2}\perp{w_{1}}
The added condition is that $w_{2}$ is orthogonal to $w_{1}$. With this, $w_{2}$ carries a different kind of information from $w_{1}$. For $w_{3}$, the condition is orthogonality to both $w_{1}$ and $w_{2}$; for $w_{4}$, orthogonality to the first three; and so on. Repeating this while adding conditions gives the following.
\left(
\begin{array}{ccc}
z_{1} \\
\vdots \\
z_{5}
\end{array}
\right)
= \left(
\begin{array}{cccc}
w_{11} & \ldots & w_{15} \\
\vdots & \ddots & \vdots \\
w_{51} & \ldots & w_{55}
\end{array}
\right)x
z=Wx
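Because each weight vector has length $1$ and is orthogonal to all the previous ones, the matrix $W$ with $w_{1}, \dots, w_{5}$ as its rows satisfies $WW^{\top} = I$. A quick numerical check, continuing the eigendecomposition sketch above:

#Stack all eigenvectors as rows, so each row is one weight vector w_k
W = eigvecs.T
#Orthonormality: W times its transpose should be the 5x5 identity matrix
print(np.allclose(W @ W.T, np.eye(5)))    #True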
**Principal component analysis** uses formulas like these to compress data with tens of thousands or hundreds of thousands of dimensions down to a few hundred dimensions while keeping as much of the original information as possible. In other words, the $k$-th principal component is the direction in which the variation of the data is the $k$-th largest. Incidentally, the weights $w_{1}, w_{2}, w_{3}, w_{4}, w_{5}$ attached to the variables are also called **principal component loadings**. Finally, let's use the machine learning library scikit-learn to find the **principal component $z_{1}$** and the **principal component loadings $w_{1}, w_{2}, w_{3}, w_{4}, w_{5}$**.
#Create an instance of the model
pca = PCA()
#Create a model based on the data
pca.fit(df)
#Apply data to the model
values = pca.transform(df)
Principal component analysis is called Principal Component Analysis in English, hence the abbreviation PCA.
There are three steps: ➀ create an instance of the model, ➁ pass the data to that instance with the `fit` method to build the model, and ➂ apply the data to the fitted model with `transform` to obtain the score of each principal component.
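As a side note, scikit-learn also lets you combine steps ➁ and ➂ into a single call, which gives the same scores here (the name `values_alt` is just an illustrative variable):

#fit and transform in one step (equivalent to pca.fit(df) followed by pca.transform(df))
values_alt = pca.fit_transform(df)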
The raw array is hard to read, so let's convert it to a data frame.
df_pca = pd.DataFrame(data = values,
columns = ["Main component{}".format(x+1) for x in range(len(df.columns))])
Each student now has 5 scores, one for each of the 1st through 5th principal components. These are called **principal component scores**. Since the original data has 5 subjects = 5 dimensions, there are at most 5 principal components.
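Under the hood, scikit-learn's `transform` centers the data at the column means and projects it onto the component vectors; the relationship can be verified directly (a verification sketch, not part of the original article):

#Principal component scores = (data - column means) @ components transposed
manual_scores = (df.values - df.values.mean(axis=0)) @ pca.components_.T
print(np.allclose(manual_scores, values))    #True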
#Calculate the contribution rate of each principal component
ev_ratio = pca.explained_variance_ratio_
#Convert contribution to data frame
df_evr = pd.DataFrame(data = ev_ratio,
columns = ['Contribution rate'],
index = ["Main component{}".format(x+1) for x in range(len(df.columns))])
The contribution rate is an index of the explanatory power of each principal component: it is the share of the total amount of information (variance) in the original data carried by that component, so it takes a value $0 \leq c \leq 1$. Since the first principal component carries the largest amount of information, the second the next largest, and so on, the contribution rate is largest for the first principal component and decreases from there. The contribution rates of all components sum to $1$. You can see this by drawing the **cumulative contribution rate**.
#Accumulate contribution rate
cc_ratio = np.cumsum(ev_ratio)
#Prepend 0 so that the curve starts from the origin
cc_ratio = np.hstack([0, cc_ratio])
#Draw graph
plt.plot(cc_ratio, "-o")
plt.xlabel("Main component")
plt.ylabel("Cumulative contribution rate")
plt.grid()
plt.show()
Since the contribution rate of the first principal component alone exceeds 90%, it may feel as though the second and later components are unnecessary, but that is not always the case.
#Calculate the variance of the principal components
eigen_value = pca.explained_variance_
pd.DataFrame(eigen_value,
columns = ["Distributed"],
index = ["Main component{}".format(x+1) for x in range(len(df.columns))])
You can see that the size of the variance reflects the contribution rate. The variance of the first principal component is overwhelmingly larger than that of the other principal components, and it has a large amount of information.
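The link between the two tables can be checked directly: each contribution rate is that component's variance divided by the sum of all the variances (this holds exactly here because all 5 components were kept):

#Contribution rate = variance of each component / total variance
print(np.allclose(ev_ratio, eigen_value / eigen_value.sum()))    #True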
#Calculate the principal component loadings
eigen_vector = pca.components_
#Convert to data frame
pd.DataFrame(eigen_vector,
columns = df.columns,
index = ["Main component {}".format(x+1) for x in range(len(df.columns))])
What do the principal components mean?

**First principal component**: All 5 subjects have negative coefficients. This is like the total score with the sign reversed; the point is that the data is most spread out in the direction of "all subjects high versus all subjects low". Because of the minus signs, the higher the total score, the smaller this principal component score. Among the subjects, the coefficient for English is especially large in absolute value, so even a small difference in the English score changes this component's score considerably.

**Second principal component**: English and Japanese have large negative coefficients, while mathematics and science have large positive ones. High math and science scores push this score up, and low English and Japanese scores also push it up. In other words, the data is spread out along a science-versus-humanities direction, and students who are strong in science subjects and weak in humanities subjects get higher scores on this component.

**Third principal component**: Japanese stands out with a large negative coefficient. The data is spread out in the direction of "good at Japanese or not", and the higher the Japanese score, the smaller this principal component score.
To summarize: the direction in which the data is most dispersed (**1st principal component**) is whether the total score is high or low; the direction with the next largest variation (**2nd principal component**) is whether a student leans toward the sciences or the humanities; and the direction with the next largest variation after that (**3rd principal component**) is whether or not the student is good at Japanese. This analysis shows that these three principal components let us compress the data into 3 dimensions while keeping most of the information.
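To see what compressing into 3 dimensions means in practice, one illustrative sketch (not from the original article) keeps only 3 components and maps the compressed scores back into the original 5-subject space with `inverse_transform`:

#Keep only the first 3 principal components
pca3 = PCA(n_components=3)
scores3 = pca3.fit_transform(df)
#Share of the original variance retained by the 3 components
print(pca3.explained_variance_ratio_.sum())
#Map the 3-dimensional scores back into the 5-subject space and compare one student
approx = pca3.inverse_transform(scores3)
print(df.values[0])
print(np.round(approx[0], 1))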
Finally, let's check how a student ranking based on the total score, the simple sum of the five subjects, differs from a ranking based on the first principal component score, which carries about 90% of the information in the five subjects.
##Create a ranking by total score
#Calculate the total score of each student (avoid shadowing the built-in sum)
total = np.sum(np.array(df), axis=1)
#Convert to a 20x1 two-dimensional array
total = total.reshape(len(total), 1)
#Convert to data frame
df_sum = pd.DataFrame(total,
columns = ['Total score'],
index = ["ID{}".format(x+1) for x in range(len(df.index))])
#Sort in descending order
df_sum_rank = df_sum.sort_values('Total score', ascending=False)
##Create a ranking by 1st principal component score
#Extract the 1st principal component
df_PC1 = df_pca["Main component 1"]
#Convert to array
pc1 = np.array(df_PC1)
#Convert to a 20x1 two-dimensional array
pc1 = pc1.reshape(len(pc1), 1)
#Attach student IDs (use a new name so df_pca is not overwritten)
df_pc1 = pd.DataFrame(pc1,
columns = ['Main component 1'],
index = ["ID{}".format(x+1) for x in range(len(df.index))])
#Sort in ascending order (the loadings are negative, so smaller scores mean higher ability)
df_pc1_rank = df_pc1.sort_values('Main component 1')
Since the **1st principal component** essentially measures whether the total score is high or low, the rankings of the top group and the bottom group are the same. The middle of the ranking differs a little, but on the whole, the intuition behind the total score, that a student with real academic ability can handle any subject, turns out to be broadly appropriate.
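One way to quantify how similar the two orderings are (an extra check, not in the original) is to compare the ranks directly; because the loadings of the first principal component are negative, the ranking by ascending PC1 score should almost match the ranking by descending total score:

#Rank by total score (descending) and by 1st principal component score (ascending)
rank_total = df_sum['Total score'].rank(ascending=False)
rank_pc1 = df_pc1['Main component 1'].rank(ascending=True)
#Put the two rankings side by side and measure how strongly they agree
comparison = pd.DataFrame({'Rank by total score': rank_total, 'Rank by PC1': rank_pc1})
print(comparison.sort_values('Rank by total score'))
print(rank_total.corr(rank_pc1))    #close to 1 if the orderings nearly agree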
In the next section, I would like to work through the calculation mechanism behind principal component analysis without using scikit-learn.