Using sklearn.RandomizedPCA to drastically reduce PCA running times.

Hello again, the last post (for now) about dimensionality reduction tackles the problem that, even if the trick that we talked about in the last post can reduce memory consumptions and execution times, sometimes it is still not enough.

We experience this when working on facial recognition here at Meerkat, in which we had a huge set of training data with points of large dimensionality. We tried a NIPALS implementation that reduces memory consumption, but it did not improve the performance (i.e. it was too slow). Sklearn came to the rescue! The lib has a nice and easy to use implementation of a randomized PCA technique.

We made a small class to apply this method, which reads and writes .mat files for MATLAB. This was useful for us because we often implement MATLAB prototypes before going to Python/C++. We thought this code could be helpful for a lot of people searching for this, so here it goes:

import numpy as np
import scipy.io as sio
from sklearn.decomposition import RandomizedPCA

class PcaReduction:
    def __init__(self, reduce_dim):
        self.reduce_dim = int(reduce_dim);
 
    def reduce(self, dataset_dir):
        self.rand_pca = RandomizedPCA(n_components=self.reduce_dim)
        print('Randomizing PCA extaction...')
        self.rand_pca.fit(self.data)
        print('done.')
 
    def load_np_data(self, filename):
        print('Reading numpy data from file...')
        self.data = np.load(filename)
        print('done. Matrix size ', self.data.shape)

    def load_mat_data(self, filename):
        print('Reading MATLAB data from file...')
        values = sio.loadmat(filename)
        self.data = values.X
        print('done. Matrix size ', self.data.shape)
 
    def save_mat_file(self, matlab_filename):
        mean_X = self.rand_pca.mean_
        pca_base = self.rand_pca.components_
        d_values = {'pca_base': pca_base,
                    'mean_X': mean_X}

        sio.savemat(matlab_filename, d_values)

The RandomizedPCA from sklearn is much faster than the original PCA even when the “transpose-matrix-trick” is implemented. To get an idea, with this code, we were able to reduce the execution time from around 6 hours to merely 15 minutes! Notice that the resulting PCA base of this method is not perfect, but for our facial recognition method we did not encounter any problems in the end results.

Enjoy!

References:

Doc for RandomizedPCA in sklearn

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s