We implemented an autoencoder and principal component analysis as dimensionality reduction methods. As textbooks, ["Deep Learning"](https://www.amazon.co.jp/Deep-Learning-Machine-Learning-Professional-Series-Okatani-Takayuki/dp/4061529021) and ["First Pattern Recognition"](https://www.amazon.co.jp/First-pattern-recognition-Hirai-Yuzo/dp/4627849710) were used.
An **autoencoder** is a neural network that uses the inputs themselves as training data and learns features that represent the data well. It is classified as unsupervised learning, since no labels are used, and it is applied to pre-training neural networks and determining initial weights. Now let us move on to the explanation of the autoencoder. Consider a three-layer network as shown in the figure below. The activation function is the identity function $g(a) = a$.
The value of the unit in the middle layer $ z_j $ and the value of the unit in the output layer $ y_k $ are obtained by propagating the input forward.
\begin{align}
& z_j = \sum_i w_{ji} x_i \\
& \\
& y_k = \sum_j \tilde w_{kj} z_j
\end{align}
Let $\boldsymbol z$ be the vector of middle-layer unit values, $\boldsymbol y$ the vector of output-layer unit values, $\boldsymbol W$ the matrix with $w_{ji}$ as its $(j, i)$ component, and $\boldsymbol{\tilde W}$ the matrix with $\tilde w_{kj}$ as its $(k, j)$ component. Then the above can be written as follows.
\begin{align}
& \boldsymbol z = \boldsymbol W \boldsymbol x \\
& \\
& \boldsymbol y = \boldsymbol{\tilde W} \boldsymbol z
\end{align}
The transformation to the middle layer $\boldsymbol z$ is called **encoding**, and the transformation to the output layer $\boldsymbol y$ is called **decoding**. The autoencoder is trained so that the decoded $\boldsymbol y$ is close to the input $\boldsymbol x$. If the number of units in the middle layer is smaller than the input dimension and the input can still be reproduced, the dimensionality has been reduced. Please refer to "Neural network implementation with python" for the details of the training procedure. Since the activation function is the identity function, its derivative is always $1$, which keeps the computation simple.
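As a minimal sketch of what these two transformations and the training objective look like (toy sizes and random weights for illustration only, not the MNIST setup used later; bias terms are omitted to match the equations above):

```python
import numpy

# toy sizes for illustration only: 5-dimensional input, 2-dimensional code
D_x, D_z = 5, 2
x = numpy.random.randn(D_x)
W = numpy.random.randn(D_z, D_x)        # encoder weights
W_tilde = numpy.random.randn(D_x, D_z)  # decoder weights

z = W.dot(x)                            # encoding: z = W x
y = W_tilde.dot(z)                      # decoding: y = W~ z
error = 0.5 * numpy.sum((y - x) ** 2)   # squared reconstruction error to be minimized
print(error)
```

The implementation further below additionally uses a bias term, realized by prepending a constant $1$ to the input of each layer.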
**Principal component analysis (PCA)** is a method that finds the linear transformation of the training data $\boldsymbol x_i = (x_{i1}, ..., x_{id})^T$ in the directions of maximum variance. Let the data matrix consisting of $N$ samples be $\boldsymbol X = (\boldsymbol x_1, ..., \boldsymbol x_N)^T$ and the mean vector be $\boldsymbol{\bar x} = (\bar x_1, ..., \bar x_d)^T$. The centered data matrix $\boldsymbol{\bar X}$, obtained by subtracting the mean vector, and the covariance matrix $\boldsymbol \Sigma$ are as follows.
\begin{align}
& \boldsymbol{\bar X} = (\boldsymbol x_1 - \boldsymbol{\bar x}, ..., \boldsymbol x_N - \boldsymbol{\bar x})^T \\
& \\
& \boldsymbol \Sigma = Var\bigl\{\boldsymbol{\bar X}\bigr\} = \frac{1}{N} \boldsymbol{\bar X}^T \boldsymbol{\bar X}
\end{align}
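A small sketch of these definitions, with random data standing in for the training set:

```python
import numpy

N, d = 100, 3                   # illustrative sizes; random data stands in for the training set
X = numpy.random.randn(N, d)

x_bar = X.mean(axis=0)          # mean vector
X_bar = X - x_bar               # centered data matrix
Sigma = X_bar.T.dot(X_bar) / N  # covariance matrix (1/N) X_bar^T X_bar
```

Note that `numpy.cov`, which the implementation below relies on, divides by $N - 1$ rather than $N$; this only rescales the eigenvalues and leaves the eigenvectors unchanged.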
If the data matrix $\boldsymbol{\bar X}$ is linearly transformed by a coefficient vector $\boldsymbol a_j$ to obtain $\boldsymbol s_j = \boldsymbol{\bar X} \boldsymbol a_j$, the variance of the transformed data is as follows.
\begin{align}
& \boldsymbol s_j = (s_{1j}, ..., s_{Nj})^T \\
& \\
& Var\bigl\{\boldsymbol s_j \bigr\} \varpropto \boldsymbol s_j^T \boldsymbol s_j = \bigl(\boldsymbol{\bar X} \boldsymbol a_j \bigr)^T \boldsymbol{\bar X} \boldsymbol a_j = \boldsymbol a_j^T \boldsymbol{\bar X}^T \boldsymbol{\bar X} \boldsymbol a_j \varpropto \boldsymbol a_j^T Var\bigl\{\boldsymbol{\bar X}\bigr\} \boldsymbol a_j
\end{align}
The projection vector $\boldsymbol a_j$ that maximizes this variance is obtained by maximizing the Lagrange function with the norm constrained to $1$.
\begin{align}
& E(\boldsymbol a_j) = \boldsymbol a_j^T Var\bigl\{\boldsymbol{\bar X}\bigr\} \boldsymbol a_j - \lambda (\boldsymbol a_j^T \boldsymbol a_j - 1) \\
& \\
& \frac{\partial E(\boldsymbol a_j)}{\partial \boldsymbol a_j} = 2 Var\bigl\{\boldsymbol{\bar X}\bigr\} \boldsymbol a_j - 2 \lambda \boldsymbol a_j = 0 \\
& \\
& Var\bigl\{\boldsymbol{\bar X}\bigr\} \boldsymbol a_j = \lambda \boldsymbol a_j \tag{*}
\end{align}
Equation ($\ast$) shows that the projection vector $\boldsymbol a_j$ that maximizes the variance is obtained by solving an eigenvalue problem for the covariance matrix of the original data. Let the eigenvalues obtained by solving equation ($\ast$) be $\lambda_1 \geq ... \geq \lambda_d$, and the corresponding eigenvectors be $\boldsymbol a_1, ..., \boldsymbol a_d$. Since the covariance matrix is a real symmetric matrix, the eigenvectors are orthogonal to each other.
\boldsymbol a_i^T \boldsymbol a_j = \delta_{ij} = \begin{cases}
1 & (i = j) \\
0 & (i \ne j)
\end{cases}
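Continuing the sketch above, equation ($\ast$) and the orthogonality relation can be checked numerically. `numpy.linalg.eigh` is used here because $\boldsymbol \Sigma$ is real symmetric; it returns eigenvalues in ascending order, so they are reversed to obtain $\lambda_1 \geq ... \geq \lambda_d$:

```python
# eigendecomposition of the covariance matrix Sigma from the previous sketch
lam, A = numpy.linalg.eigh(Sigma)   # ascending eigenvalues for a symmetric matrix
lam, A = lam[::-1], A[:, ::-1]      # reorder so that lam[0] >= lam[1] >= ...

# each column A[:, j] satisfies Sigma a_j = lambda_j a_j
print(numpy.allclose(Sigma.dot(A[:, 0]), lam[0] * A[:, 0]))
# the eigenvectors are orthonormal: A^T A = I
print(numpy.allclose(A.T.dot(A), numpy.eye(A.shape[1])))
```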
The feature obtained by the linear transformation with the eigenvector corresponding to the largest eigenvalue is called the first principal component, and the feature obtained with the eigenvector corresponding to the $k$-th eigenvalue is called the $k$-th principal component. The total variance $V_{total}$, the contribution rate $c_k$ of the $k$-th principal component, and the cumulative contribution rate $r_k$ up to the $k$-th principal component are as follows.
\begin{align}
& V_{total} = \sum_{i = 1}^d \lambda_i \\
& \\
& c_k = \frac{\lambda_k}{V_{total}} \\
& \\
& r_k = \frac{\sum_{i = 1}^k \lambda_i}{V_{total}}
\end{align}
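With the reordered eigenvalues `lam` from the previous sketch, the contribution rates and cumulative contribution rates are simple ratios:

```python
V_total = lam.sum()              # total variance
c = lam / V_total                # contribution rate of each principal component
r = numpy.cumsum(lam) / V_total  # cumulative contribution rate up to each component
print(c[0], r[1])                # c_1 and r_2 in the notation above
```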
Select the $D_y$ eigenvectors of $\boldsymbol \Sigma$ with the largest eigenvalues, and let $\boldsymbol U_{D_y}$ be the matrix that stores them as row vectors. If the dimensionality of the input data is $D_x$, the dimensionality is reduced when $D_y < D_x$. $\boldsymbol U_{D_y}$ and $\boldsymbol{\bar x}$ are the solution $(\boldsymbol \Gamma, \boldsymbol \xi) = (\boldsymbol U_{D_y}, \boldsymbol{\bar x})$ of the following minimization problem.
\min_{\boldsymbol \Gamma, \boldsymbol \xi} \sum_{n = 1}^N \bigl\| (\boldsymbol x_n - \boldsymbol \xi) - \boldsymbol \Gamma^T \boldsymbol \Gamma (\boldsymbol x_n - \boldsymbol \xi) \bigr\|^2 \tag{**}
Principal component analysis can thus be interpreted as giving, for the data in the $D_x$-dimensional space, the $D_y$-dimensional subspace that best represents them in the sense that the squared distance is minimized.
If the weights and biases of the middle and output layers are set to $\boldsymbol W = \boldsymbol \Gamma$, $\boldsymbol b = -\boldsymbol \Gamma \boldsymbol \xi$, $\boldsymbol{\tilde W} = \boldsymbol \Gamma^T$, and $\boldsymbol{\tilde b} = \boldsymbol \xi$, the error function of the autoencoder coincides with the minimization problem $(\ast\ast)$; in other words, an autoencoder with an identity activation function performs the same dimensionality reduction as principal component analysis.
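As a rough numerical check of this correspondence, continuing the earlier sketches, the PCA solution can be plugged in as the weights and biases of a linear autoencoder, which then reproduces the reconstruction appearing in $(\ast\ast)$:

```python
D_y = 2
Gamma = A[:, :D_y].T             # the top D_y eigenvectors as row vectors (U_{D_y})
xi = x_bar                       # the mean vector

W = Gamma                        # middle-layer weights  W  = Gamma
b = -Gamma.dot(xi)               # middle-layer bias     b  = -Gamma xi
W_tilde = Gamma.T                # output-layer weights  W~ = Gamma^T
b_tilde = xi                     # output-layer bias     b~ = xi

x = X[0]
z = W.dot(x) + b                 # encode
y = W_tilde.dot(z) + b_tilde     # decode
# y equals the PCA reconstruction xi + Gamma^T Gamma (x - xi)
print(numpy.allclose(y, xi + Gamma.T.dot(Gamma).dot(x - xi)))
```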
I implemented it as follows. The autoencoder and principal component analysis are listed separately.
autoencoder
The compression dimension is set to $50$. With a learning rate of $\varepsilon = 0.00001$, training runs for $10000$ epochs.
auto_encoder.py

```python
import numpy


class AutoEncoder:
    def __init__(self, n_input, n_hidden):
        # each weight matrix includes the bias term in its first column
        self.hidden_weight = numpy.random.randn(n_hidden, n_input + 1)
        self.output_weight = numpy.random.randn(n_input, n_hidden + 1)

    def fit(self, X, epsilon, epoch):
        self.error = numpy.zeros(epoch)
        for epo in range(epoch):
            print 'epoch: %d' % epo
            for x in X:
                self.__update_weight(x, epsilon)
            self.error[epo] = self.__calc_error(X)

    def encode(self, X):
        Z = numpy.zeros((X.shape[0], self.hidden_weight.shape[0]))
        Y = numpy.zeros(X.shape)
        for (i, x) in enumerate(X):
            z, y = self.__forward(x)
            Z[i, :] = z
            Y[i, :] = y
        return (Z, Y)

    def __forward(self, x):
        # identity activation: each layer is a plain affine transformation
        z = self.hidden_weight.dot(numpy.hstack((1, x)))
        y = self.output_weight.dot(numpy.hstack((1, z)))
        return (z, y)

    def __update_weight(self, x, epsilon):
        z, y = self.__forward(x)

        # update output_weight (delta = y - x for squared error with identity activation)
        output_delta = y - x
        self.output_weight -= epsilon * output_delta.reshape(-1, 1) * numpy.hstack((1, z))

        # update hidden_weight (back-propagate the delta, excluding the bias column)
        hidden_delta = self.output_weight[:, 1:].T.dot(output_delta)
        self.hidden_weight -= epsilon * hidden_delta.reshape(-1, 1) * numpy.hstack((1, x))

    def __calc_error(self, X):
        # sum of squared reconstruction errors over the training set
        err = 0.0
        for x in X:
            _, y = self.__forward(x)
            err += (y - x).dot(y - x) / 2.0
        return err
```
main.py

```python
import os

import cv2
from sklearn.datasets import fetch_mldata

from auto_encoder import AutoEncoder

if __name__ == '__main__':
    print 'read data...'
    mnist = fetch_mldata('MNIST original', data_home='.')

    # scale the pixel values to [0, 1]; use all 784 pixels so that reshape(28, 28) works below
    _max = mnist.data.max()
    X = mnist.data * 1.0 / _max

    input_size = X.shape[1]
    hidden_size = 50
    epsilon = 0.00001
    epoch = 10000
    stride = 50

    print 'auto encoder init...'
    auto = AutoEncoder(input_size, hidden_size)

    print 'train...'
    auto.fit(X[::stride], epsilon, epoch)

    print 'encode...'
    Z, Y = auto.encode(X[::stride])

    # save the decoded images, rescaled back to [0, 255]
    if not os.path.exists('result'):
        os.mkdir('result')
    for (i, y) in enumerate(Y * _max):
        cv2.imwrite('result/%04d.png' % i, y.reshape(28, 28))
```
PCA
The compression dimension is set to $50$. The projection vectors and the decoded images are saved.
pca.py

```python
import numpy


class PCA:
    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        self.bar = self.__bar(X)
        self.cov = self.__cov()
        self.lamda, self.A, self.ccr = self.__solve_eigen_problem()
        self.S = self.__transformation()

    def decode(self):
        # reconstruct in the original space (the mean is not added back)
        return self.S.dot(self.A.T)

    def __bar(self, X):
        # subtract the mean of each feature
        return X - X.mean(axis=0)

    def __cov(self):
        return numpy.cov(self.bar, rowvar=0)

    def __solve_eigen_problem(self):
        # eigh: the covariance matrix is real symmetric;
        # eigenvalues come back in ascending order, so reverse them
        _lamda, _A = numpy.linalg.eigh(self.cov)
        _lamda, _A = _lamda[::-1], _A[:, ::-1]
        lamda = _lamda[:self.n_components]
        A = _A[:, :self.n_components]
        ccr = lamda.sum() / _lamda.sum()
        return (lamda, A, ccr)

    def __transformation(self):
        # project the centered data onto the principal components
        return self.bar.dot(self.A)
```
main.py

```python
import os

import cv2
from matplotlib import pyplot
from sklearn.datasets import fetch_mldata

from pca import PCA

if __name__ == '__main__':
    print 'read data...'
    mnist = fetch_mldata('MNIST original', data_home='.')
    X = mnist.data

    n_components = 50
    p_dir = 'projection/'
    d_dir = 'decoded/'

    print 'train...'
    pca = PCA(n_components)
    pca.fit(X)
    print 'cumulative contribution ratio: %f' % pca.ccr

    print 'decode...'
    Xhat = pca.decode()

    print 'save projection vector...'
    if not os.path.exists(p_dir):
        os.mkdir(p_dir)
    for (i, a) in enumerate(pca.A.T):
        # visualize each projection vector as a 28x28 heat map
        fig, ax = pyplot.subplots()
        heatmap = ax.pcolor(a.reshape(28, 28)[:, ::-1], cmap=pyplot.cm.RdYlBu)
        pyplot.savefig(p_dir + '%04d.png' % i, dpi=25)
        pyplot.clf()

    print 'save decoded images...'
    if not os.path.exists(d_dir):
        os.mkdir(d_dir)
    for (i, xhat) in enumerate(Xhat[::50]):
        cv2.imwrite(d_dir + '%04d.png' % i, xhat.reshape(28, 28))

    # second run with 30 components; A S^T transposed is the same reconstruction as decode()
    n_components = 30
    pca = PCA(n_components)
    pca.fit(X)
    Xhat = pca.A.dot(pca.S.T)
    for (i, xhat) in enumerate(Xhat.T):
        cv2.imwrite('result/%04d.png' % i, xhat.reshape(28, 28))
```
The following results were obtained.
autoencoder
Decoded images
Error
I chose good-looking samples for the decoded images. Features compressed to $50$ dimensions can be decoded back to $784$ dimensions. In the error graph, the horizontal axis is the number of epochs and the vertical axis is the error value. Because the error spans several orders of magnitude, it is plotted on a logarithmic scale. It is not the exact error, but you can see that it decreases monotonically.
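For reference, a minimal sketch of how such a curve can be drawn on a logarithmic scale, assuming an AutoEncoder instance `auto` fitted as in main.py:

```python
from matplotlib import pyplot

# auto.error holds one squared-error value per epoch (see AutoEncoder.fit)
pyplot.plot(auto.error)
pyplot.yscale('log')    # the error spans several orders of magnitude
pyplot.xlabel('epoch')
pyplot.ylabel('error')
pyplot.savefig('error.png')
```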
PCA
Visualized projection vectors
Decoded images
For the projection vectors, the ten vectors corresponding to the largest eigenvalues were selected. The projection vector at the upper left projects onto the first principal component, which has the largest variance. The decoded images give good results comparable to those of the autoencoder.
We were able to reduce the dimensionality of the data with both the autoencoder and principal component analysis.