Our new product: logo detection for brand analysis

We recently developed a new product that applies our object detection technology to brand analytics. You can learn all about this technology, including videos and an online demo, on our website.

Below, we reproduce a study we conducted using this technology, first published on our Medium.

We hope you like it.

Why does logo detection matter?

Social media analytics has grown enormously over the past decade, proving to be a great way for companies to get to know their audience since social networks took over our lives. These kinds of studies can provide clues on how the public perceives and interacts with a specific brand. Some companies provide platforms dedicated to social media analytics, such as Buffer, Hootsuite and Sprout Social, among others. In recent years, however, much of the social engagement has happened through pictures and videos. This became a trend not only on image-based social networks such as Instagram and Snapchat, but also on previously more text-based platforms such as Twitter and Facebook.

Nowadays it is estimated that more than 80% of social media posts contain images or videos. Analyzing only the text and metadata of such posts is clearly not enough.

Extracting meaningful information from these posts automatically can pose a great challenge for companies. But fear not: Computer Vision (CV) comes to the rescue. Analyzing images in order to extract meaningful information has been the objective of this field for decades, and many CV applications have gained commercial importance in recent years, such as systems able to detect and recognize faces or read documents.

Despite the success of these applications, until recently it was very difficult to detect logos in unconstrained (in-the-wild) images, such as the ones present in social media. One of the major issues is the variability of objects on which a brand can appear. Think of a Red Bull logo appearing on a can, a billboard, a t-shirt or an F1 car. The system must be able to model the huge variability with which a logo can be used.

Here at Meerkat we started developing a completely new system based on Deep Learning, which is revolutionizing several areas of Computer Vision, including object (and logo) detection. We developed a Deep Learning system that, given some real image examples of a logo (a learning phase), learns to identify and localize this logo in any image.

Once we have these images with their associated metadata from social media, we can generate useful information. In this post, we describe a sample study evaluating 6 beer brands on Twitter posts.

In order to briefly show the potential of the information that can be retrieved by detecting logos in social network images, we gathered tweets from the last 6 months that contained words commonly used in the context of beer, such as beer, cerveza, bar, barbecue, etc. We trained our system to detect the logos of the following beer brands: Budweiser, Bud Light, Corona, Guinness, Heineken and Stella Artois.

Within a period of 6 months we retrieved more than one million tweets with images. Only a small fraction of these images contained a logo from one of these brands. All the images in these posts were gathered in this automatic fashion, without any manual input.

Take a look at some of the images gathered by the system for the Heineken brand:

A powerful detector is a must in this type of system, because a logo can appear in different contexts and with different appearances.

We can clearly observe several different instances of the same logo. The fact that our detector can correctly model this huge variability is a direct result of the modern deep-learning-based systems proposed in the last couple of years.

Here are some more cool examples from other brands:


So, we found which tweets contained a picture of a given brand; now what do we do with this information? A no-brainer is to index the tweet metadata so we can search it and extract some statistics. We did exactly that using ElasticSearch and Kibana. More details on the whole pipeline will be provided in a different Medium post ;-). In the meantime, here is a video showing the system working:

Searching for a logo, and for a logo plus a female face, in the system. See the full video
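As a taste of the indexing step, each detection can be stored as a small document in ElasticSearch. Below is a minimal sketch with a hypothetical schema (the real pipeline is a story for that future post):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Hypothetical document for a single detection result
doc = {
    'tweet_id': '1234567890',
    'brand': 'heineken',
    'created_at': '2017-05-12T14:03:00',
    'retweet_count': 3,
    'user_location': 'Porto Alegre, Brazil',
    'faces': [{'gender': 'female', 'confidence': 0.91}],
}
es.index(index='brand-posts', doc_type='tweet', id=doc['tweet_id'], body=doc)

With documents like this, Kibana dashboards and the statistics below become simple aggregation queries.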

Twitter post stats

Disclaimer: Because the tweets were extracted from a relatively small period (six months), the following statistics are an illustration of a logo detection and brand analysis system and should be taken with a grain of salt since they may be prone to outliers.

So, here come the stats. Ready? First, let's take a look at the raw number of posts for each beer brand and how these numbers relate to the market share of each company:

Market share statistics taken from statista.com

Notice that the most present brand in tweets was Corona, followed by Heineken and Bud Light. Also, there is no correlation between the number of posts containing a logo and the market share associated with that brand. A clear example is Guinness, which has a small market share (~1%), yet has a big presence in social media (~11% in our dataset).

This type of information is obviously of great importance for a company, since it indicates a high or low level of connection with the brand, which can have different causes. From the work of Doorn and colleagues: "[…] customer engagement behavior can be associated with antecedents Customer-Based factors like customer satisfaction, brand commitment, trust, brand attachment, and brand performance perceptions. Generally speaking, very high or very low levels of these factors can lead to engagement."

Additionally, it would be nice to see how the brand presence varies across geographic locations. However, only 4% of the tweeted images had geo coordinates, which is not a significant amount. What we chose to do instead was to extract the location string that each user provides on their own profile page. By processing those entries with the Google Maps API, we were able to extract a location for around 73% of the dataset.
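We won't detail that step here, but a minimal sketch of geocoding a free-text profile location with the googlemaps Python client could look like this (the key and the helper function are hypothetical):

import googlemaps

gmaps = googlemaps.Client(key='YOUR_API_KEY')  # hypothetical API key

def resolve_location(profile_location):
    """Geocode the free-text location string from a user profile, if possible."""
    results = gmaps.geocode(profile_location)
    if not results:
        return None  # unresolvable strings end up here
    loc = results[0]['geometry']['location']
    return loc['lat'], loc['lng']

resolve_location('Porto Alegre, Brazil')  # -> a (lat, lng) tuple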

To evaluate this data, we plot the percentage of each beer brand for the top 5 countries in the dataset (graph on the left). It's interesting to see that Heineken and Guinness appear to have more geographically distributed posts, while Bud Light is heavily concentrated in North America.

More Computer Vision: face and gender analysis

Within our system we also detect and classify people according to their gender, which is right in Meerkat's know-how. This can yield interesting data as well. One aspect we can investigate is the probability of a share (in this case, a retweet = RT) given that there is a face in the image.

It's well known that faces can contribute greatly to the number of shares/likes/retweets a post gets. In the graph on the left we compute the ratio between the number of retweets and tweets (RT/T) for posts with no face versus at least one face present. This ratio is 56% larger when faces are present in our dataset.

Another interesting possibility is to look for correlations between gender and posts. In the graph we show the overall percentage of each detected gender across all posts.

It's clear that combining different sources of image information can further increase the relevance of the extracted data. Below are some pictures showing the detection of different faces/genders for different beers:

Indexed images with gender estimation


That's it! The takeaway here is that the recent leap in Computer Vision provided by Deep Learning is only beginning to be translated into useful applications. This study was just a small sample of what is possible with logo detection technology. In the future we plan to add more brands to our system and gather much more data from social networks. This will allow us to derive many different insights, such as relations across brands; for instance, we may discover a high correlation between Nike users and McDonald's. We can also use computer vision methods to detect inappropriate content linked to a brand. This technology is also being ported to mobile, where it can serve many AR applications.

So, how is your brand being advertised by your consumers? Are you following the references to your brand on social media? How is your brand positioned in relation to your competitors?

frAPI 5.0 is here: pushing the boundaries of Face Recognition systems

At Meerkat, we improved our facial recognition by 40% at 10k distractors, with real-time performance and an easy interface.

This article is also available in our Medium: https://medium.com/@meerkat.cv

Facial recognition (FR) technology has come a long way in recent years in terms of applicability. However, a standard FR deployment still presents several difficulties, ranging from accuracy and performance to the requirement of specific setups, ease of integration and mobile support. With the latest release of our facial recognition API (frAPI), version 5.0, we aim to address all those problems together, so that our clients can have their FR system up and running within fifteen minutes.

Accuracy and Setup Dependency

Facial recognition, as well as other Computer Vision areas, had a recent breakthrough with the use of Deep Learning. Deep networks allow the creation of highly accurate recognition systems; however, they usually require several million images to train and have a high computational cost, usually requiring a GPU for decent performance.

Our previous frAPI was based on a method that did not use Deep Learning, yet it had good accuracy and low computational requirements, which allowed us to process a video stream at up to 40 frames per second. For the last year or so, we have been diving into Deep Learning and CNNs (Convolutional Neural Networks), inspired by the explosive improvements they brought to problems such as object detection in the ImageNet and Pascal-VOC challenges. However, running a network on CPU while keeping real-time performance was quite a challenge.

By carefully choosing which parts of the system would be based on neural networks, we boosted our accuracy by ~40% on large databases while maintaining real-time performance on CPU.

In the image below you can see an example of the robustness of the new deep-learning version compared to the previous one. It is remarkable that, with only one image per person, the system was able to detect both Iron Man wearing shades and Captain America partially occluded by Scarlet Witch — none of the heroes wearing their suits, of course.

Face recognition using a single image as input

It's clear that the new CNN-based method improves recall and gives more assertive recognitions in terms of confidence values.

The importance of this recognition robustness is twofold: the system becomes much more reliable, and there is no need for a controlled setup, such as using several images to enroll a person or heavily constraining the camera placement.

For a quantitative evaluation we used the LFW (Labeled Faces in the Wild) dataset, which is a standard dataset for comparing facial recognition techniques. On the standard facial verification protocol we achieved 98.5% accuracy, which is quite awesome. We can see in the graphs below how we stack up against other facial recognition solutions.

We have great results on LFW; however, it is becoming a "saturated" dataset, i.e., it no longer presents a difficult enough test to rank the most recent and accurate facial recognition systems. Also, its evaluation protocol is more suited to the verification problem (one-to-one) than to the recognition problem (one-to-many).

In real-world problems, the N in one-to-many (1:N) can be quite big. So, in order to test a harder and more realistic scenario, we used the same protocol proposed by the MegaFace challenge: within N images, find the single image containing a given person. In the graph below we can see how our system does with N up to 10,000, i.e., among those 10,000 images there is only one of the person we are searching for, and the system must be able to find it.

Here, we are plotting the recognition accuracy with 10,000 distractors for different ranks. A rank of X indicates that the correct person was found within the first X results. For example, a correct recognition at rank 10 indicates that the correct person was among the first 10 people returned by the recognition process.

What the graph is showing us is that, given a single image input and 10k images from 10k people, we are able to identify the person 82% of the time.
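As an aside, rank-X accuracy is simple to compute from a matrix of match scores. A small illustration (not our evaluation code, just the idea):

import numpy as np

def rank_k_accuracy(scores, true_idx, k):
    """scores: (n_queries, n_gallery) similarity matrix;
    true_idx: index of the correct gallery entry for each query."""
    order = np.argsort(-scores, axis=1)                     # best match first
    hits = (order[:, :k] == true_idx[:, None]).any(axis=1)  # correct within top k?
    return hits.mean()

# Toy example: 3 queries against a 5-image gallery
scores = np.random.rand(3, 5)
print(rank_k_accuracy(scores, np.array([0, 1, 2]), k=3))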

Given all of the above, we have a pretty amazing technology for facial recognition. But, as we stated in the beginning, we are also interested in providing a complete system to our clients. That means several things, such as real-time performance on IP cameras, an easy and intuitive interface and porting our technology to mobile.

Enrollment

A common difficulty in implementing Computer Vision systems lies in the enrollment phase — and this is also the case for face recognition. Usually enrollment is done in an active manner, with the enrolled person in a controlled environment where some pictures can be taken. This is clearly a burden, especially if the person to be enrolled is a client. Setting up a database for hundreds of people can be almost impractical — yet the quality of acquisition directly influences the results at inference time.

To tackle this problem we created a "Smart Enroll" option that makes the enrollment process completely passive, with no need for special setups. You just need a camera that several people will pass in front of; the system extracts the faces of every person and automatically clusters them apart. This allows enrollment in real use-case scenarios, with several people in the video and without any special setup. Take a look at this process running in our system:

https://gfycat.com/ifr/AlarmedIllfatedHairstreak
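Under the hood, this kind of passive enrollment boils down to clustering face descriptors. Without going into our implementation, a conceptual sketch (assuming you already have one embedding per detected face) might look like:

import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in for real face embeddings: one vector per face detected in the video
embeddings = np.random.randn(200, 128)

# Density-based clustering groups recurring faces without knowing the
# number of identities beforehand; eps is data-dependent
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(embeddings)

# Each non-negative label is a candidate identity to enroll;
# -1 marks faces seen too rarely to form a cluster
identities = set(labels) - {-1}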

Clustering can be used for other purposes as well, such as assigning temporary labels to people who are not present in the database — our RESTful API has this option too.

Performance

In passive face recognition, speed matters a lot. Take the case of using a camera at a store's entrance to detect recurring customers. If the recognition system has low throughput, many people will not be detected, since the gap between two consecutive recognitions is too large to collect enough evidence from a person entering.

We already talked about changing our core technology to a deep-learning-based method, and a first concern one might have is its speed on CPU. We are glad to announce that our hand-crafted neural network is able to process a video stream at up to 25 frames per second on a common machine (i5 3.2GHz, 4GB RAM).

With this performance it is possible to run facial recognition directly on a video stream, such as from a common IP camera. In the near future we are planning to launch a GPU version, which should be much faster and able to process a large number of cameras in real time on a single machine.

https://gfycat.com/ifr/ComplicatedElaborateHermitcrab

Depending on the use case, one common problem is the IP cameras themselves. They are indeed useful for many applications; however, they usually require a power connection and are not portable. So we decided to use a high-quality camera that most of us already have in our pockets: the cell phone.

We developed a small (< 2MB) app called frAPI Eye that transforms your cellphone into an IP camera that connects directly to frAPI. With this app you can launch a recognition stream from the cellphone camera. And since we use H.264 encoding for the video transfer, the image quality is really high while using little bandwidth on your network. For the more tech-savvy: we implemented the data stream using websockets, and the code is publicly available at https://github.com/meerkat-cv/h264_decoder .

Finally, our new face verification system was ported to smartphones, where a biometric process can be performed without requiring an internet connection. Everything is done locally, and the Android/iOS SDK is much smaller than current competitors' (< 30MB). Notice that this port has exactly the same accuracy as the on-premise version of the system (i.e., 98.5% on LFW).

https://gfycat.com/ifr/KindDetailedBlackbear

This huge set of improvements places Meerkat as a top provider of face recognition, yet we are still looking for new ways to improve our customers' experience, so let us know in the comments if you have any ideas or requirements.

If this made you interested, contact us at [email protected] and we can arrange a trial in no time. Also, follow us on Medium; interesting things will be launched soon 🙂

Using SVM with HOG object detector in OpenCV

Hi everyone! In this post I will give you a quick and easy tip on how to use a trained SVM classifier with the HOG object detector from OpenCV. But first, a big shout-out to Dalal and Triggs for their great work on the HOG (Histogram of Oriented Gradients) descriptor! If you don't know about it yet, it is worth checking out.

But back to the subject: why am I writing about this, since OpenCV already has implementations of both SVM and HOG that are quite easy to use? Well, they may be easy to use, but they don't work very well together. The HOG object detector can be called with an SVM classifier, but not in the format that the SVM classifier from OpenCV produces. This means that if you train an SVM using HOG features, you cannot use it directly in the cv::HOGDescriptor::detect() function.

Fortunately, this is easy to solve: we just need to convert the trained SVM classifier to its primal form. This can be done by first creating a class PrimalSVM that inherits from the class cv::SVM:

class PrimalSVM: public cv::SVM {
    public:
    void getSupportVector(std::vector<float> &support_vector) const;
};

And now for the magical part:

void PrimalSVM::getSupportVector(std::vector<float> &support_vector) const {
    int sv_count = get_support_vector_count();
    const CvSVMDecisionFunc* df = decision_func;
    const double* alphas = df[0].alpha;
    double rho = df[0].rho;
    int var_count = get_var_count();
    support_vector.resize(var_count, 0);

    // Fold every support vector, weighted by its alpha, into a single
    // primal weight vector: w = -sum_r(alpha_r * sv_r)
    for (int r = 0; r < sv_count; r++) {
        double myalpha = alphas[r];
        const float* v = get_support_vector(r);
        for (int j = 0; j < var_count; j++, v++)
            support_vector[j] += (float)(-myalpha * (*v));
    }

    // HOGDescriptor::setSVMDetector expects the bias appended at the end
    support_vector.push_back((float)rho);
}

Now you can use PrimalSVM to train a classifier just like you would with cv::SVM, and then call getSupportVector, which will give you the support vector in the format that cv::HOGDescriptor::setSVMDetector expects. And there you go! Now you can easily create an object detector entirely in OpenCV, using only a few lines of code :D! You may be surprised by the results you can achieve when training with only a handful of images. Actually, I may go into more detail on the process of creating an object detector in a future post…
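As a side note, if you are on the newer OpenCV 3 API (where cv::SVM became cv::ml::SVM), the equivalent conversion is simpler, because a linear SVM is stored as a single compressed support vector. Here is a Python sketch along the lines of OpenCV's train_HOG sample, with dummy data standing in for real HOG features:

import cv2
import numpy as np

# Dummy HOG-like features just to make the sketch self-contained
feats = np.random.randn(40, 3780).astype(np.float32)
labels = np.array([1] * 20 + [-1] * 20, dtype=np.int32)

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setKernel(cv2.ml.SVM_LINEAR)
svm.train(feats, cv2.ml.ROW_SAMPLE, labels)

# For a linear kernel the "primal form" is the single stored support
# vector with the (negated) bias appended at the end; the sign
# convention here follows OpenCV's train_HOG sample
sv = svm.getSupportVectors()            # shape: (1, n_features)
rho, _, _ = svm.getDecisionFunction(0)
detector = np.append(sv.ravel(), -rho).astype(np.float32)

hog = cv2.HOGDescriptor()               # default 64x128 window -> 3780 dims
hog.setSVMDetector(detector)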

And last but not least, another shout-out goes to DXM from Stack Overflow, who was, as far as I know, the first to propose this solution.

 

PS: Those of you with more attention to detail will notice that the signs of rho and the alphas do not match. This may be due to some characteristics of the (older) libSVM, which was the base of the OpenCV SVM code. I don't quite understand this particular implementation detail, but I don't lose sleep over it :P.

Writing Python wrappers for C++ (OpenCV) code, part I.

As I mentioned a couple of posts ago, we love to use C++ to make our methods run fast. And, like many Vision engineers, we use OpenCV. However, a couple of things can be tricky in C++, such as web services. So, for an upcoming API that we are developing (!), we decided to go with Python for the web part while keeping the performance gains of C++. Hence the need to build a Python wrapper for the C++ code.

There are a couple of library options for that, but for our needs Boost.Python is by far the best fit. The major part of writing these wrappers is converting data between Python and C++. The first conversion required is between matrix types: cv::Mat from OpenCV to/from numpy ndarrays.

Fortunately, there is a very useful lib that implements this converter, called numpy-opencv-converter. However, a lot of other conversions still need to be implemented manually, as it is impossible to predict every combination of data types one can write in one's code.

We will begin with a simple example that uses the Boost library to convert Python parameters (tuples) for a C++ function taking a cv::Rect.

The definition of the method is this:

cv::Mat FaceDetection::getPose(const cv::Mat &image, const cv::Rect &rect);

Notice that, using the converter lib mentioned above, we will have no problems with the return value and the image parameter, as they are cv::Mat. However, the Python side has no idea what a cv::Rect is, so we need a helper function to call this method.

Usually, you can export a method using Boost.Python as simply as this:

py::class_<FaceDetection>("FaceDetection")
       .def("getPose", &FaceDetection::getPose)

But again, if we do this without a converter (to be discussed in Part II), there will be a problem when calling this method. Instead, we can create a helper function that performs the conversion to cv::Rect when the method is called. In the Python version of OpenCV, a rectangle is defined by tuples, so the helper function becomes:

cv::Mat FaceDetection_getPose(FaceDetection& self, const cv::Mat &image, py::tuple rect) {
    // The rectangle arrives as nested tuples: ((tl_x, tl_y), (br_x, br_y))
    py::tuple tl = py::extract<py::tuple>(rect[0])();
    py::tuple br = py::extract<py::tuple>(rect[1])();

    int tl_x = py::extract<int>(tl[0])();
    int tl_y = py::extract<int>(tl[1])();
    int br_x = py::extract<int>(br[0])();
    int br_y = py::extract<int>(br[1])();

    cv::Rect cv_rect(cv::Point(tl_x, tl_y), cv::Point(br_x, br_y));

    return self.getPose(image, cv_rect);
}

Now, if we call this function with a tuple, the system knows what to do. The only thing left is to bind this helper function to the method of our class:

py::class_<FaceDetection>("FaceDetection")
       .def("getPose", &FaceDetection_getPose)

And voilà: we can call the method FaceDetection.getPose(…) from Python (once the module is imported, of course) without any problem. This is nice and all, but you may be wondering whether you have to write this kind of function every time your data is not natively supported by Boost.Python. The answer is no; it is fairly simple to create converters for your own datatypes. We'll show that in a future post, Part II.
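On the Python side, calling the wrapped method then looks like this (the module and file names are hypothetical):

import cv2
import face_detection  # the compiled Boost.Python module (hypothetical name)

detector = face_detection.FaceDetection()
image = cv2.imread('face.jpg')

# The rectangle is passed as nested tuples: ((tl_x, tl_y), (br_x, br_y))
pose = detector.getPose(image, ((120, 80), (260, 240)))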

Using sklearn.RandomizedPCA to drastically reduce PCA running times.

Hello again! The last post (for now) about dimensionality reduction tackles the following problem: even though the trick we talked about in the last post can reduce memory consumption and execution times, sometimes it is still not enough.

We experienced this when working on facial recognition here at Meerkat, where we had a huge set of training data with points of large dimensionality. We tried a NIPALS implementation, which reduces memory consumption, but it did not improve the performance (i.e., it was still too slow). Sklearn came to the rescue! The lib has a nice and easy-to-use implementation of a randomized PCA technique.

We made a small class to apply this method, which reads and writes .mat files for MATLAB. This was useful for us because we often implement MATLAB prototypes before going to Python/C++. We thought this code could be helpful for a lot of people searching for this, so here it goes:

import numpy as np
import scipy.io as sio
from sklearn.decomposition import RandomizedPCA
# Note: in newer sklearn versions, use PCA(svd_solver='randomized') instead

class PcaReduction:
    def __init__(self, reduce_dim):
        self.reduce_dim = int(reduce_dim)

    def reduce(self):
        self.rand_pca = RandomizedPCA(n_components=self.reduce_dim)
        print('Randomized PCA extraction...')
        self.rand_pca.fit(self.data)
        print('done.')

    def load_np_data(self, filename):
        print('Reading numpy data from file...')
        self.data = np.load(filename)
        print('done. Matrix size ', self.data.shape)

    def load_mat_data(self, filename):
        print('Reading MATLAB data from file...')
        values = sio.loadmat(filename)
        self.data = values['X']  # expects a MATLAB variable named X
        print('done. Matrix size ', self.data.shape)

    def save_mat_file(self, matlab_filename):
        mean_X = self.rand_pca.mean_
        pca_base = self.rand_pca.components_
        d_values = {'pca_base': pca_base,
                    'mean_X': mean_X}

        sio.savemat(matlab_filename, d_values)
The RandomizedPCA from sklearn is much faster than the original PCA, even when the "transpose-matrix trick" is implemented. To give an idea: with this code, we were able to reduce the execution time from around 6 hours to merely 15 minutes! Notice that the resulting PCA base of this method is not exact, but for our facial recognition method we did not encounter any problems in the end results.

Enjoy!

References:

Doc for RandomizedPCA in sklearn

Small trick to compute PCA for large training datasets

In the last post, we talked about the choice of method for dimensionality reduction and how PCA is sometimes overused due to its popularity and its available implementations in several libs. Having said that, we are still going to talk about PCA, because we use it a lot! 🙂 If you do not know how PCA works, a very good introduction is the one by Lindsay Smith: check it out. PCA is simply defined by the eigenvectors of the data covariance matrix. When you want to reduce dimensionality, only the eigenvectors associated with the N highest eigenvalues are kept, forming a basis for the reduction. If this sounds weird, don't worry: the important part here is the covariance matrix, and it's the main cause of headaches for large datasets.

Imagine that we have data points with a very large dimensionality, such as 50k, for instance (this is very common in Computer Vision!). The covariance matrix of any number of such points is going to be a 50k x 50k matrix, which is huge. To get an idea of how huge: using the standard 8 bytes per float, this matrix alone takes around 18GB of memory! This problem is well known, and one nice (and quite old) trick to compute PCA is described in the seminal Eigenfaces paper [1]. So, assuming that we arrange our P mean-centered points of dimensionality M in an M by P matrix A, the algebraic way of computing the covariance matrix is:

C = \frac{1}{P} A A^{\top} \qquad (M \times M)

If M = 50k, as in our example, this matrix will be huge, as we said. But if we are only interested in extracting the eigenvalues/eigenvectors, we can compute them from the following, much smaller, matrix:

T = A^{\top} A \qquad (P \times P)

It turns out that we can recover the original eigenvalues/eigenvectors from this matrix, excluding the eigenvalues (and associated vectors) that are equal to zero: if u is an eigenvector of A^T A with A^T A u = \lambda u, then multiplying both sides by A gives A A^T (A u) = \lambda (A u), so v = A u is an eigenvector of the original covariance with the same eigenvalue. This is such a nice trick because usually P << M, i.e., the number of examples in our training dataset is much smaller than the number of dimensions. Both MATLAB and OpenCV have this small tweak for PCA reduction implemented:

coeff = pca(X, 'Economy', true); % for MATLAB
cv::calcCovarMatrix(..., cv::COVAR_SCRAMBLED); // in OpenCV
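If you want to verify the trick yourself, here is a small numpy sketch of the same idea:

import numpy as np

M, P = 5000, 100                      # dimensionality >> number of samples
A = np.random.randn(M, P)
A -= A.mean(axis=1, keepdims=True)    # mean-center each dimension

# Eigen-decompose the small P x P matrix instead of the huge M x M one
T = A.T.dot(A)
eigvals, U = np.linalg.eigh(T)

# Map back to the original space: v = A u, then normalize each column
V = A.dot(U)
V /= np.linalg.norm(V, axis=0)
# V now holds the eigenvectors of A A^T for the nonzero eigenvalues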

There are still some cases where P is also very large, and then the main approach is to extract an approximation of the PCA. We'll be talking about it in the next post.

References:

[1] Turk, Matthew, and Alex P. Pentland. "Face recognition using eigenfaces." Computer Vision and Pattern Recognition, 1991. Proceedings CVPR '91., IEEE Computer Society Conference on. IEEE, 1991.

Dimensionality reduction in large datasets

Hello guys, this is the first post of this tech blog about our discoveries and tips while working on Meerkat's products. Today we are going to talk about a problem we faced a couple of weeks ago: how to reduce high-dimensional data, and which dimensionality reduction technique to use.

First, let's add some context. We were working on facial recognition, extracting features from faces to be later used for classification. When faced with this problem in Computer Vision, you usually have two alternatives: 1) design your features to be very discriminant and compact, or 2) "to hell with small features", use dimensionality reduction afterwards to make it work. I might be exaggerating alternative 2, but for facial recognition there is a nice paper about high-dimensional features [1] (with a very good name) which shows that features in higher dimensions can be used in practice. For that, you need to reduce them to a sub-space that maintains the discriminant aspect of the features while making them easy to feed to a classifier.

Now we are faced with the choice of a dimensionality reduction technique. There are many methods, but Principal Component Analysis (PCA) is by far the most used. Let's make some considerations about it. What PCA tries to do is find a subspace that maximizes the variance of the projected data. This is super cool, since in classification we want feature points that are far apart in the original space to remain far apart in the sub-space. However, in a supervised setting it is smarter to use the label information to create the basis for the sub-space, since we can build a subspace in which the classes are distant from each other. That is what methods like LDA and Partial Least Squares (PLS) do: they aim to maximize inter-class variance while minimizing intra-class variance (isn't that clever?).
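To make the distinction concrete, here is a minimal sklearn sketch on toy data (not from our pipeline): PCA ignores the labels, while LDA uses them to choose the projection.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)              # unsupervised: max variance
X_lda = LinearDiscriminantAnalysis().fit_transform(X, y)  # supervised: class separation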

For instance, take a look at these plots, extracted from the work of Schwartz [2], which uses PLS dimensionality reduction for the problem of pedestrian detection:

PCA vs. PLS projections, from Schwartz [2]

There are also some important works that use LDA, PLS and derived techniques for face identification, but that will be a subject for another post. One thing that is not well addressed in the literature (to my knowledge) is what it implies about your features if dimensionality reduction can be applied aggressively without any loss in algorithm performance. What I mean is: if you can reduce a 100-d feature to a 2-d feature without losing much information, that might mean your original feature is not very discriminant at all! Why else would there be so much redundancy in the dimensions?

References:

[1] – Chen, Dong, et al. “Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification.” Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013.

[2] – Schwartz, William Robson, et al. "Human detection using partial least squares analysis." Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009.