We implemented an autoencoder and principal component analysis as dimensionality reduction methods. As textbooks, ["Deep Learning"](https://www.amazon.co.jp/Deep-Learning-Machine-Learning-Professional-Series-Okatani-Takayuki/dp/4061529021) and ["First Pattern Recognition"](https://www.amazon.co.jp/First-pattern-recognition-Hirai-Yuzo/dp/4627849710) were used.
An **autoencoder** is a neural network that uses the inputs themselves as training data and learns features that represent the data well. It is classified as unsupervised learning, since no labels are used, and it is applied to pre-training neural networks and determining initial weights. Now let us move on to the explanation of the autoencoder. Consider a three-layer network as shown in the figure below. The activation function is the identity function $g(a) = a$.
The value of the unit in the middle layer $ z_j $ and the value of the unit in the output layer $ y_k $ are obtained by propagating the input forward.
\begin{align}
& z_j = \sum_i w_{ji} x_i \\
& \\
& y_k = \sum_j \tilde w_{kj} z_j
\end{align}
Let $\boldsymbol z$ be the vector of middle-layer unit values, $\boldsymbol y$ the vector of output-layer unit values, $\boldsymbol W$ the matrix with $w_{ji}$ as its $(j, i)$ component, and $\boldsymbol{\tilde W}$ the matrix with $\tilde w_{kj}$ as its $(k, j)$ component. Then the above can be written as follows.
\begin{align}
& \boldsymbol z = \boldsymbol W \boldsymbol x \\
& \\
& \boldsymbol y = \boldsymbol{\tilde W} \boldsymbol z
\end{align}
The transformation to the middle layer $\boldsymbol z$ is called **encoding**, and the transformation to the output layer $\boldsymbol y$ is called **decoding**. The autoencoder is trained so that the decoded $\boldsymbol y$ is close to the input $\boldsymbol x$. If the number of units in the middle layer is smaller than the input dimension and the input can still be reproduced, the dimensionality has been reduced. Please refer to "Neural network implementation with python" for the details of the training procedure. Since the activation function is the identity function, its derivative is always $1$, which keeps the computation simple.
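As a minimal sketch of what these two transformations and the training objective look like (toy sizes and random weights for illustration only, not the MNIST setup used later; bias terms are omitted to match the equations above):

```python
import numpy

# toy sizes for illustration only: 5-dimensional input, 2-dimensional code
D_x, D_z = 5, 2
x = numpy.random.randn(D_x)
W = numpy.random.randn(D_z, D_x)        # encoder weights
W_tilde = numpy.random.randn(D_x, D_z)  # decoder weights

z = W.dot(x)                            # encoding: z = W x
y = W_tilde.dot(z)                      # decoding: y = W~ z
error = 0.5 * numpy.sum((y - x) ** 2)   # squared reconstruction error to be minimized
print(error)
```

The implementation further below additionally uses a bias term, realized by prepending a constant $1$ to the input of each layer.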
**Principal component analysis (PCA)** is a method that finds the linear transformation of the training data $\boldsymbol x_i = (x_{i1}, ..., x_{id})^T$ in the directions of maximum variance. Let the data matrix consisting of $N$ samples be $\boldsymbol X = (\boldsymbol x_1, ..., \boldsymbol x_N)^T$ and the mean vector be $\boldsymbol{\bar x} = (\bar x_1, ..., \bar x_d)^T$. The centered data matrix $\boldsymbol{\bar X}$, obtained by subtracting the mean vector, and the covariance matrix $\boldsymbol \Sigma$ are as follows.
\begin{align}
& \boldsymbol{\bar X} = (\boldsymbol x_1 - \boldsymbol{\bar x}, ..., \boldsymbol x_N - \boldsymbol{\bar x})^T \\
& \\
& \boldsymbol \Sigma = Var\bigl\{\boldsymbol{\bar X}\bigr\} = \frac{1}{N} \boldsymbol{\bar X}^T \boldsymbol{\bar X}
\end{align}
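A small sketch of these definitions, with random data standing in for the training set:

```python
import numpy

N, d = 100, 3                   # illustrative sizes; random data stands in for the training set
X = numpy.random.randn(N, d)

x_bar = X.mean(axis=0)          # mean vector
X_bar = X - x_bar               # centered data matrix
Sigma = X_bar.T.dot(X_bar) / N  # covariance matrix (1/N) X_bar^T X_bar
```

Note that `numpy.cov`, which the implementation below relies on, divides by $N - 1$ rather than $N$; this only rescales the eigenvalues and leaves the eigenvectors unchanged.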
If the data matrix $\boldsymbol{\bar X}$ is linearly transformed by a coefficient vector $\boldsymbol a_j$ to obtain $\boldsymbol s_j = \boldsymbol{\bar X} \boldsymbol a_j$, the variance of the transformed data is as follows.
\begin{align}
& \boldsymbol s_j = (s_{1j}, ..., s_{Nj})^T \\
& \\
& Var\bigl\{\boldsymbol s_j \bigr\} \varpropto \boldsymbol s_j^T \boldsymbol s_j = \bigl(\boldsymbol{\bar X} \boldsymbol a_j \bigr)^T \boldsymbol{\bar X} \boldsymbol a_j = \boldsymbol a_j^T \boldsymbol{\bar X}^T \boldsymbol{\bar X} \boldsymbol a_j \varpropto \boldsymbol a_j^T Var\bigl\{\boldsymbol{\bar X}\bigr\} \boldsymbol a_j
\end{align}
The projection vector $\boldsymbol a_j$ that maximizes this variance is obtained by maximizing the Lagrange function with the norm constrained to $1$.
\begin{align}
& E(\boldsymbol a_j) = \boldsymbol a_j^T Var\bigl\{\boldsymbol{\bar X}\bigr\} \boldsymbol a_j - \lambda (\boldsymbol a_j^T \boldsymbol a_j - 1) \\
& \\
& \frac{\partial E(\boldsymbol a_j)}{\partial \boldsymbol a_j} = 2 Var\bigl\{\boldsymbol{\bar X}\bigr\} \boldsymbol a_j - 2 \lambda \boldsymbol a_j = 0 \\
& \\
& Var\bigl\{\boldsymbol{\bar X}\bigr\} \boldsymbol a_j = \lambda \boldsymbol a_j \tag{*}
\end{align}
Equation ($\ast$) shows that the projection vector $\boldsymbol a_j$ that maximizes the variance is obtained by solving an eigenvalue problem for the covariance matrix of the original data. Let the eigenvalues obtained by solving equation ($\ast$) be $\lambda_1 \geq ... \geq \lambda_d$, and the corresponding eigenvectors be $\boldsymbol a_1, ..., \boldsymbol a_d$. Since the covariance matrix is a real symmetric matrix, the eigenvectors are orthogonal to each other.
\boldsymbol a_i^T \boldsymbol a_j = \delta_{ij} = \begin{cases}
1 & (i = j) \\
0 & (i \ne j)
\end{cases}
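Continuing the sketch above, equation ($\ast$) and the orthogonality relation can be checked numerically. `numpy.linalg.eigh` is used here because $\boldsymbol \Sigma$ is real symmetric; it returns eigenvalues in ascending order, so they are reversed to obtain $\lambda_1 \geq ... \geq \lambda_d$:

```python
# eigendecomposition of the covariance matrix Sigma from the previous sketch
lam, A = numpy.linalg.eigh(Sigma)   # ascending eigenvalues for a symmetric matrix
lam, A = lam[::-1], A[:, ::-1]      # reorder so that lam[0] >= lam[1] >= ...

# each column A[:, j] satisfies Sigma a_j = lambda_j a_j
print(numpy.allclose(Sigma.dot(A[:, 0]), lam[0] * A[:, 0]))
# the eigenvectors are orthonormal: A^T A = I
print(numpy.allclose(A.T.dot(A), numpy.eye(A.shape[1])))
```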
The feature obtained by the linear transformation with the eigenvector corresponding to the largest eigenvalue is called the first principal component, and the feature obtained with the eigenvector corresponding to the $k$-th eigenvalue is called the $k$-th principal component. The total variance $V_{total}$, the contribution rate $c_k$ of the $k$-th principal component, and the cumulative contribution rate $r_k$ up to the $k$-th principal component are as follows.
\begin{align}
& V_{total} = \sum_{i = 1}^d \lambda_i \\
& \\
& c_k = \frac{\lambda_k}{V_{total}} \\
& \\
& r_k = \frac{\sum_{i = 1}^k \lambda_i}{V_{total}}
\end{align}
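With the reordered eigenvalues `lam` from the previous sketch, the contribution rates and cumulative contribution rates are simple ratios:

```python
V_total = lam.sum()              # total variance
c = lam / V_total                # contribution rate of each principal component
r = numpy.cumsum(lam) / V_total  # cumulative contribution rate up to each component
print(c[0], r[1])                # c_1 and r_2 in the notation above
```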
Select the $D_y$ eigenvectors of $\boldsymbol \Sigma$ with the largest eigenvalues, and let $\boldsymbol U_{D_y}$ be the matrix that stores them as row vectors. If the dimensionality of the input data is $D_x$, the dimensionality is reduced when $D_y < D_x$. $\boldsymbol U_{D_y}$ and $\boldsymbol{\bar x}$ are the solution $(\boldsymbol \Gamma, \boldsymbol \xi) = (\boldsymbol U_{D_y}, \boldsymbol{\bar x})$ of the following minimization problem.
\min_{\boldsymbol \Gamma, \boldsymbol \xi} \sum_{n = 1}^N \bigl\| (\boldsymbol x_n - \boldsymbol \xi) - \boldsymbol \Gamma^T \boldsymbol \Gamma (\boldsymbol x_n - \boldsymbol \xi) \bigr\|^2 \tag{**}
Principal component analysis can thus be interpreted as giving, for the data in the $D_x$-dimensional space, the $D_y$-dimensional subspace that best represents them in the sense that the squared distance is minimized.
If the weights and biases of the middle and output layers are set to $\boldsymbol W = \boldsymbol \Gamma$, $\boldsymbol b = -\boldsymbol \Gamma \boldsymbol \xi$, $\boldsymbol{\tilde W} = \boldsymbol \Gamma^T$, and $\boldsymbol{\tilde b} = \boldsymbol \xi$, the error function of the autoencoder coincides with the minimization problem $(\ast\ast)$; in other words, an autoencoder with an identity activation function performs the same dimensionality reduction as principal component analysis.
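As a rough numerical check of this correspondence, continuing the earlier sketches, the PCA solution can be plugged in as the weights and biases of a linear autoencoder, which then reproduces the reconstruction appearing in $(\ast\ast)$:

```python
D_y = 2
Gamma = A[:, :D_y].T             # the top D_y eigenvectors as row vectors (U_{D_y})
xi = x_bar                       # the mean vector

W = Gamma                        # middle-layer weights  W  = Gamma
b = -Gamma.dot(xi)               # middle-layer bias     b  = -Gamma xi
W_tilde = Gamma.T                # output-layer weights  W~ = Gamma^T
b_tilde = xi                     # output-layer bias     b~ = xi

x = X[0]
z = W.dot(x) + b                 # encode
y = W_tilde.dot(z) + b_tilde     # decode
# y equals the PCA reconstruction xi + Gamma^T Gamma (x - xi)
print(numpy.allclose(y, xi + Gamma.T.dot(Gamma).dot(x - xi)))
```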
I implemented it as follows. The autoencoder and principal component analysis are listed separately.
autoencoder
The compression dimension is set to $50$. With a learning rate of $\varepsilon = 0.00001$, training runs for $10000$ epochs.
auto_encoder.py

```python
import numpy


class AutoEncoder:
    def __init__(self, n_input, n_hidden):
        # each weight matrix includes the bias term in its first column
        self.hidden_weight = numpy.random.randn(n_hidden, n_input + 1)
        self.output_weight = numpy.random.randn(n_input, n_hidden + 1)

    def fit(self, X, epsilon, epoch):
        self.error = numpy.zeros(epoch)
        for epo in range(epoch):
            print 'epoch: %d' % epo
            for x in X:
                self.__update_weight(x, epsilon)
            self.error[epo] = self.__calc_error(X)

    def encode(self, X):
        Z = numpy.zeros((X.shape[0], self.hidden_weight.shape[0]))
        Y = numpy.zeros(X.shape)
        for (i, x) in enumerate(X):
            z, y = self.__forward(x)
            Z[i, :] = z
            Y[i, :] = y
        return (Z, Y)

    def __forward(self, x):
        # identity activation: each layer is a plain affine transformation
        z = self.hidden_weight.dot(numpy.hstack((1, x)))
        y = self.output_weight.dot(numpy.hstack((1, z)))
        return (z, y)

    def __update_weight(self, x, epsilon):
        z, y = self.__forward(x)

        # update output_weight (delta = y - x for squared error with identity activation)
        output_delta = y - x
        self.output_weight -= epsilon * output_delta.reshape(-1, 1) * numpy.hstack((1, z))

        # update hidden_weight (back-propagate the delta, excluding the bias column)
        hidden_delta = self.output_weight[:, 1:].T.dot(output_delta)
        self.hidden_weight -= epsilon * hidden_delta.reshape(-1, 1) * numpy.hstack((1, x))

    def __calc_error(self, X):
        # sum of squared reconstruction errors over the training set
        err = 0.0
        for x in X:
            _, y = self.__forward(x)
            err += (y - x).dot(y - x) / 2.0
        return err
```
main.py

```python
import os

import cv2
from sklearn.datasets import fetch_mldata

from auto_encoder import AutoEncoder

if __name__ == '__main__':
    print 'read data...'
    mnist = fetch_mldata('MNIST original', data_home='.')

    # scale the pixel values to [0, 1]; use all 784 pixels so that reshape(28, 28) works below
    _max = mnist.data.max()
    X = mnist.data * 1.0 / _max

    input_size = X.shape[1]
    hidden_size = 50
    epsilon = 0.00001
    epoch = 10000
    stride = 50

    print 'auto encoder init...'
    auto = AutoEncoder(input_size, hidden_size)

    print 'train...'
    auto.fit(X[::stride], epsilon, epoch)

    print 'encode...'
    Z, Y = auto.encode(X[::stride])

    # save the decoded images, rescaled back to [0, 255]
    if not os.path.exists('result'):
        os.mkdir('result')
    for (i, y) in enumerate(Y * _max):
        cv2.imwrite('result/%04d.png' % i, y.reshape(28, 28))
```
PCA
The compression dimension is set to $50$. The projection vectors and the decoded images are saved.
pca.py

```python
import numpy


class PCA:
    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        self.bar = self.__bar(X)
        self.cov = self.__cov()
        self.lamda, self.A, self.ccr = self.__solve_eigen_problem()
        self.S = self.__transformation()

    def decode(self):
        # reconstruct in the original space (the mean is not added back)
        return self.S.dot(self.A.T)

    def __bar(self, X):
        # subtract the mean of each feature
        return X - X.mean(axis=0)

    def __cov(self):
        return numpy.cov(self.bar, rowvar=0)

    def __solve_eigen_problem(self):
        # eigh: the covariance matrix is real symmetric;
        # eigenvalues come back in ascending order, so reverse them
        _lamda, _A = numpy.linalg.eigh(self.cov)
        _lamda, _A = _lamda[::-1], _A[:, ::-1]
        lamda = _lamda[:self.n_components]
        A = _A[:, :self.n_components]
        ccr = lamda.sum() / _lamda.sum()
        return (lamda, A, ccr)

    def __transformation(self):
        # project the centered data onto the principal components
        return self.bar.dot(self.A)
```
main.py

```python
import os

import cv2
from matplotlib import pyplot
from sklearn.datasets import fetch_mldata

from pca import PCA

if __name__ == '__main__':
    print 'read data...'
    mnist = fetch_mldata('MNIST original', data_home='.')
    X = mnist.data

    n_components = 50
    p_dir = 'projection/'
    d_dir = 'decoded/'

    print 'train...'
    pca = PCA(n_components)
    pca.fit(X)
    print 'cumulative contribution ratio: %f' % pca.ccr

    print 'decode...'
    Xhat = pca.decode()

    print 'save projection vector...'
    if not os.path.exists(p_dir):
        os.mkdir(p_dir)
    for (i, a) in enumerate(pca.A.T):
        # visualize each projection vector as a 28x28 heat map
        fig, ax = pyplot.subplots()
        heatmap = ax.pcolor(a.reshape(28, 28)[:, ::-1], cmap=pyplot.cm.RdYlBu)
        pyplot.savefig(p_dir + '%04d.png' % i, dpi=25)
        pyplot.clf()

    print 'save decoded images...'
    if not os.path.exists(d_dir):
        os.mkdir(d_dir)
    for (i, xhat) in enumerate(Xhat[::50]):
        cv2.imwrite(d_dir + '%04d.png' % i, xhat.reshape(28, 28))

    # second run with 30 components; A S^T transposed is the same reconstruction as decode()
    n_components = 30
    pca = PCA(n_components)
    pca.fit(X)
    Xhat = pca.A.dot(pca.S.T)
    for (i, xhat) in enumerate(Xhat.T):
        cv2.imwrite('result/%04d.png' % i, xhat.reshape(28, 28))
```
The following results were obtained.
autoencoder
Decoded images
Error
I chose good-looking samples for the decoded images. Features compressed to $50$ dimensions can be decoded back to $784$ dimensions. In the error graph, the horizontal axis is the number of epochs and the vertical axis is the error value. Because the error spans several orders of magnitude, it is plotted on a logarithmic scale. It is not the exact error, but you can see that it decreases monotonically.
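For reference, a minimal sketch of how such a curve can be drawn on a logarithmic scale, assuming an AutoEncoder instance `auto` fitted as in main.py:

```python
from matplotlib import pyplot

# auto.error holds one squared-error value per epoch (see AutoEncoder.fit)
pyplot.plot(auto.error)
pyplot.yscale('log')    # the error spans several orders of magnitude
pyplot.xlabel('epoch')
pyplot.ylabel('error')
pyplot.savefig('error.png')
```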
PCA
Visualized projection vectors
Decoded images
For the projection vectors, the ten vectors corresponding to the largest eigenvalues were selected. The projection vector at the upper left projects onto the first principal component, which has the largest variance. The decoded images give good results comparable to those of the autoencoder.
We were able to reduce the dimensionality of the data with both the autoencoder and principal component analysis.