In the previous section, we learned the concept of principal component analysis using scikit-learn. Here, let's work through the mechanics of principal component analysis without scikit-learn.
We use the same example: test scores in five subjects (mathematics, science, social studies, English, and Japanese) for 20 students in one class. We want to compress the 5 subjects = 5 dimensions and evaluate each student's academic ability simply, in fewer dimensions.
Writing the 5 subjects = 5 variables as $x_{1}, x_{2}, x_{3}, x_{4}, x_{5}$, the principal component $z$ we are looking for is expressed as:
$$z = w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3} + w_{4}x_{4} + w_{5}x_{5}$$
The principal component score $z$ is the sum of the variables, each weighted by a coefficient $w_{1}, w_{2}, w_{3}, w_{4}, w_{5}$. What we want to know is how to apportion the coefficients $w_{1}, w_{2}, w_{3}, w_{4}, w_{5}$ so that the amount of information, i.e., the variance, is maximized. For that purpose, the following condition is imposed:
$${w_{1}}^2 + {w_{2}}^2 + {w_{3}}^2 + {w_{4}}^2 + {w_{5}}^2 = 1$$
We look for the coefficient ratios that maximize the information (variance) under the condition that the sum of the squared coefficients is $1$. **Lagrange's method of undetermined multipliers** is exactly the tool for finding the maximum and minimum values of a multivariable function under such a constraint.
When the variables $x, y$ move while satisfying the constraint $g(x, y) = 0$, the following holds at the point $x, y$ where the function $z = f(x, y)$ takes its maximum or minimum value. Defining

$$F(x, y, λ) = f(x, y) - λg(x, y)$$

we have

$$\frac{∂F}{∂x} = 0, \quad \frac{∂F}{∂y} = 0, \quad \frac{∂F}{∂λ} = 0$$
The above is Lagrange's method of undetermined multipliers in its simplest form: two variables and one constraint. The symbol $λ$ (lambda) can, for now, simply be understood as a constant. The symbol $∂$, read variously as round, partial, or del, denotes **partial differentiation**. For example, $\displaystyle \frac{∂F}{∂x} = 0$ means that the partial derivative of the function $F$ with respect to the variable $x$ equals $0$.
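As an illustrative sketch (not part of the original derivation), the method can be checked symbolically with SymPy. Here we maximize a hypothetical function $f(x, y) = xy$ under the constraint $g(x, y) = x + y - 1 = 0$:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)

# Maximize f(x, y) = x*y subject to g(x, y) = x + y - 1 = 0
f = x * y
g = x + y - 1

# Build F(x, y, lambda) = f - lambda*g and set all partial derivatives to 0
F = f - lam * g
equations = [sp.diff(F, v) for v in (x, y, lam)]
solution = sp.solve(equations, (x, y, lam), dict=True)
print(solution)
```

Solving the three equations recovers the stationary point $x = y = 1/2$ with $λ = 1/2$, which is indeed the maximum of $xy$ on the line $x + y = 1$.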
First, let's review ordinary differentiation. Draw the curve of a function $y = f(x)$ on the $x, y$ plane. At a point $x$ on the $x$-axis, the height of the curve is $f(x)$, since the point lies on $y = f(x)$. Now take a nearby point $x + Δx$, slightly offset from $x$; its height is $f(x + Δx)$. The slope between these two points is the change in $y$ divided by the change in $x$: the denominator is $Δx$, and the numerator is $f(x + Δx) - f(x)$.

Differentiation means letting this $Δx$ approach $0$. In the limit, as $x + Δx$ approaches $x$, we obtain the **tangent line** representing the instantaneous slope at the point $x$. This slope is what differentiation means:
$$\frac{dy}{dx}=\lim_{\Delta x \to 0} \frac{f(x+\Delta x)-f(x)}{\Delta x}$$
The derivative is the ratio of the change in $y$ to an infinitesimal change in $x$: how much $y$ changes when $x$ changes a little.
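A quick numerical sketch (with an assumed example function $f(x) = x^2$, whose true derivative is $2x$) shows the difference quotient approaching the true slope as $Δx$ shrinks:

```python
# Approximate the derivative of f(x) = x**2 at x = 3 by shrinking delta_x.
# The difference quotient should approach the true slope 2*x = 6.
def f(x):
    return x ** 2

x = 3.0
for delta_x in [1.0, 0.1, 0.01, 0.001]:
    slope = (f(x + delta_x) - f(x)) / delta_x
    print(delta_x, slope)  # slope tends toward 6 as delta_x shrinks
```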
So far we have dealt with the derivative of a one-variable function, where $y$ is determined once a single $x$ is given. But there are also two-variable functions $f(x, y)$, whose value is determined only after two values are given, three-variable functions $f(x, y, z)$, whose value requires three, and multivariable functions in general. Extending differentiation to multivariable functions is exactly what **partial differentiation** is. Picture the graph of a two-variable function: once $x$ and $y$ are fixed, the height $z$ is determined. There is a height $z = f(x, y)$ over every point of the plane below, which makes the graph a curved surface. The height $z$ changes differently depending on whether we step a little in the $x$-axis direction or a little in the $y$-axis direction; these two principal directions give the two **partial derivatives**: how the height changes along the $x$-axis, and how it changes along the $y$-axis. Fixing $y$ and moving $x$ slightly gives the slope of the tangent line in the $x$-axis direction, expressed as:
$$\frac{∂z}{∂x}=\lim_{\Delta x \to 0} \frac{f(x+\Delta x,\, y)-f(x, y)}{\Delta x}$$
Here $y$ is fixed: the ratio of the change in $z$ to a small change in $x$, i.e., how much the height $z$ moves for a small step in the $x$ direction, is the **partial derivative** with respect to $x$. Likewise, the partial derivative with respect to $y$ is:
$$\frac{∂z}{∂y}=\lim_{\Delta y \to 0} \frac{f(x,\, y + \Delta y)-f(x, y)}{\Delta y}$$
Here $x$ is fixed: this is the ratio of the change in $z$ to a small change in $y$, i.e., how much the height $z$ moves for a small step in the $y$ direction.
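The same idea works numerically. Here is a small sketch with an assumed function $f(x, y) = x^2 + 3y$, whose analytic partial derivatives at $(1, 2)$ are $∂z/∂x = 2x = 2$ and $∂z/∂y = 3$:

```python
# Numerically approximate the partial derivatives of z = f(x, y) = x**2 + 3*y
# at the point (1, 2). Analytically: dz/dx = 2x = 2, dz/dy = 3.
def f(x, y):
    return x ** 2 + 3 * y

h = 1e-6
x0, y0 = 1.0, 2.0
dz_dx = (f(x0 + h, y0) - f(x0, y0)) / h  # y fixed, move x slightly
dz_dy = (f(x0, y0 + h) - f(x0, y0)) / h  # x fixed, move y slightly
print(dz_dx, dz_dy)
```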
Now, what does it mean for a partial derivative to be $0$? For the function $z = f(x, y)$, it means that moving the variable $x$ slightly produces no change in $z$. If there is no change, so $z$ can become neither larger nor smaller, we are at a maximum or a minimum. Furthermore, as Lagrange's method of undetermined multipliers uses, when the partial derivatives with respect to $x$ and $y$ are both $0$ at the same time, we are at a true peak (maximum value) or a true valley bottom (minimum value) of the two-variable function's curved surface.
Let's return to the five-subject example.
Again, writing the five subject scores as the variables $x, y, u, v, w$, the principal component is:

$$p = ax + by + cu + dv + ew$$
The coefficients $a, b, c, d, e$ of the principal component $p$ are constants, determined so that the variance of $p$ is maximized. Since what we want to know is the balance of the ratios among $a, b, c, d, e$, they are subject to the condition:

$$a^2 + b^2 + c^2 + d^2 + e^2 = 1$$
Define the variance $ {s_{p}}^2 $ of the principal component, the quantity we want to maximize. With a population of $n$ students, where $p_{i}$ is student $i$'s principal component score and $\bar{p}$ is the mean, it is expressed as:

$$ {s_{p}}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( p_{i} - \bar{p} \right)^2 $$
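As a side check (a sketch with assumed toy data, not the exam scores above), the variance of such a weighted sum $p$ can be computed either directly from the scores or as the quadratic form $w^{\mathsf T} Σ w$ using the covariance matrix $Σ$ of the variables, and the two agree:

```python
import numpy as np

# Sketch: the variance of p = a*x + b*y equals w.T @ Sigma @ w,
# where Sigma is the covariance matrix of (x, y). Toy data assumed.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))        # 100 observations of (x, y)
w = np.array([0.6, 0.8])                # coefficients with a^2 + b^2 = 1

p = data @ w                            # principal-component-style score
var_direct = p.var(ddof=0)              # population variance (divide by n)
sigma = np.cov(data, rowvar=0, bias=1)  # population covariance matrix
var_quadratic = w @ sigma @ w
print(var_direct, var_quadratic)        # the two values agree
```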
Now we apply Lagrange's method of undetermined multipliers to define the function $L$:

$$L = {s_{p}}^2 - λ\left(a^2 + b^2 + c^2 + d^2 + e^2 - 1\right)$$
Setting the partial derivatives of $L$ with respect to $a, b, c, d, e$ (and $λ$) to $0$ gives a system of equations. Pay attention to the first of them, $\displaystyle \frac{∂L}{∂a} = 0$.
Substitute the formula for the function $L$ and partially differentiate it with respect to $a$. As noted earlier, everything other than $a$ is held fixed; concretely, $b, c, d, e$ are treated as constants and their terms drop out.
Substitute the formula for the principal component $p$, expand inside the braces, and transform the result using the definitions of variance and covariance.
Organizing further gives the following equation:

$${S_{x}}^2 a + S_{xy} b + S_{xu} c + S_{xv} d + S_{xw} e = λa$$
Similar equations follow from the remaining coefficients $b, c, d, e$.
Summarizing these five equations in matrix form gives:
$$
\begin{pmatrix}
{S_{x}}^2 & S_{xy} & S_{xu} & S_{xv} & S_{xw}\\
S_{xy} & {S_{y}}^2 & S_{yu} & S_{yv} & S_{yw}\\
S_{xu} & S_{yu} & {S_{u}}^2 & S_{uv} & S_{uw}\\
S_{xv} & S_{yv} & S_{uv} & {S_{v}}^2 & S_{vw}\\
S_{xw} & S_{yw} & S_{uw} & S_{vw} & {S_{w}}^2
\end{pmatrix}
\begin{pmatrix}
a\\
b\\
c\\
d\\
e
\end{pmatrix}
= λ
\begin{pmatrix}
a\\
b\\
c\\
d\\
e
\end{pmatrix}
$$
The square matrix on the left-hand side (a matrix with the same number of rows and columns) is the **variance-covariance matrix**. A column vector $(a, b, c, d, e)$ that satisfies this equation is called an **eigenvector**, and $λ$ is called an **eigenvalue**. This eigenvalue $λ$ is the variance of the principal component.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Test scores of 20 students in 5 subjects
arr = np.array([[71,64,83,100,71], [34,48,67,57,68], [58,59,78,87,66], [41,51,70,60,72],
                [69,56,74,81,66], [64,65,82,100,71], [16,45,63,7,59], [59,59,78,59,62],
                [57,54,84,73,72], [46,54,71,43,62], [23,49,64,33,70], [39,48,71,29,66],
                [46,55,68,42,61], [52,56,82,67,60], [39,53,78,52,72], [23,43,63,35,59],
                [37,45,67,39,70], [52,51,74,65,69], [63,56,79,91,70], [39,49,73,64,60]])

# Create a data frame, specifying the column names
df = pd.DataFrame(data=arr,
                  columns=['Math', 'Science', 'Society', 'English', 'Japanese'])

# Convert the data frame back to a NumPy array
arr = np.array(df)

# Find the covariance matrix
cov = np.cov(arr, rowvar=0, bias=0)
```
The variance-**cov**ariance matrix is computed with NumPy's `cov` function, in the form `numpy.cov(data, rowvar=0, bias=0)`. The argument `bias=0` specifies the unbiased variance (use `bias=1` for the sample variance). `rowvar=0` is for data arranged vertically, with each row an observation; specify `rowvar=1` when the data is arranged horizontally.
```python
# Find the eigenvalues and eigenvectors
eig_val, eig_vec = np.linalg.eig(cov)
print("Eigenvalues:\n", eig_val)
print("Eigenvectors:\n", eig_vec)
```
Given a matrix, NumPy's `linalg.eig()` function returns a tuple of two arrays: the eigenvalues and the eigenvectors. A tuple is a data type that groups multiple values together into an object that cannot be modified once created. Here the eigenvalues are stored as `eig_val` and the eigenvectors as `eig_vec`.
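As a sanity check, every eigenpair returned by `np.linalg.eig` should satisfy the eigenvalue equation $Σv = λv$. A minimal self-contained sketch (using a small stand-in symmetric matrix rather than the 5×5 covariance matrix above):

```python
import numpy as np

# Any symmetric matrix serves as a stand-in for the covariance matrix
cov = np.array([[4.0, 2.0],
                [2.0, 3.0]])
eig_val, eig_vec = np.linalg.eig(cov)

# Each column of eig_vec is an eigenvector: cov @ v equals lambda * v
for i in range(len(eig_val)):
    v = eig_vec[:, i]
    assert np.allclose(cov @ v, eig_val[i] * v)
print("all eigenpairs satisfy cov @ v = lambda * v")
```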
Note the order of the eigenvalues: unlike in scikit-learn, they are not sorted by contribution. So let's sort them and convert them into data frames to make them easier to read.
```python
# Concatenate the eigenvalues and eigenvectors into one array
eigen = np.vstack((eig_val, eig_vec))

# Sort the columns in descending order of the first row (the eigenvalues)
eigen = eigen[:, eigen[0, :].argsort()[::-1]]
```
NumPy's `vstack((a, b))` function concatenates vertically (`axis=0`). NumPy's `argsort` function returns the indices that would sort the array; here we sort the columns by the first row, and appending `[::-1]` reverses the order to descending.
```python
# Convert the eigenvalues to a data frame
eval_df = pd.DataFrame(data=eigen[0, :].reshape([1, 5]),
                       columns=["Principal component{}".format(x + 1) for x in range(len(df.columns))],
                       index=["Eigenvalue"])
eval_df
```

```python
# Convert the eigenvectors (principal component loadings) to a data frame
evec_df = pd.DataFrame(data=eigen[1:, :].reshape([5, 5]),
                       columns=["Principal component{}".format(x + 1) for x in range(len(df.columns))],
                       index=['Math', 'Science', 'Society', 'English', 'Japanese'])
evec_df
```
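As a cross-check (a sketch assuming scikit-learn is available, with a small stand-in dataset rather than the scores above), the sorted eigenvalues of the unbiased covariance matrix should match scikit-learn's `explained_variance_` attribute:

```python
import numpy as np
from sklearn.decomposition import PCA

# Small stand-in dataset (rows = observations, columns = variables)
X = np.array([[71, 64, 83], [34, 48, 67], [58, 59, 78],
              [41, 51, 70], [69, 56, 74], [64, 65, 82]], dtype=float)

# Hand-rolled route: eigenvalues of the unbiased covariance matrix, sorted
cov = np.cov(X, rowvar=0, bias=0)
eig_val = np.sort(np.linalg.eig(cov)[0])[::-1]

# scikit-learn route: explained_variance_ holds the same values
pca = PCA(n_components=3).fit(X)
print(eig_val)
print(pca.explained_variance_)
```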
We have arrived at the so-called eigenvalue problem, a classic problem of linear algebra. There is much more that could be said about the eigenvalue problem, but let's move on. In the next section, factor analysis will be performed using scikit-learn.