Testing Results From Apptrace's Sentiment Analyzers

Perhaps it’s understandable that not everything I found intriguing made it into the recent media coverage of Apptrace’s new Sentiment Analysis feature. Some of it is simply too technical.

This post is about the cross-validation results from the performance testing of our review-sentiment classifiers.

The background

In short: we extract sentiments from English-language app reviews, around 24 million of them. Each review is classified as carrying or not carrying each of four sentiments: Addictiveness, Crash, Positivity and Negativity.

Binary classification errors

Binary classification is when a machine (a classifier) has to pick one of two possible categories for an object. In our case the object is a review, and for each of the sentiment types above the two categories are: the review exhibits it, or it doesn’t.

There are two ways binary classification can go wrong:

  • Alpha errors (Type I errors, false positives): saying that a review carries a certain sentiment when in fact it doesn’t.

  • Beta errors (Type II errors, false negatives): saying that a review doesn’t carry a sentiment when in fact it does.
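To make the two error types concrete, here is a minimal sketch that tallies the four possible outcomes for one sentiment. The function name and the tiny hand-made lists are purely illustrative assumptions; they are not Apptrace code or data.

```python
from collections import Counter


def count_outcomes(truth, predicted):
    """Tally the four outcomes of a binary classifier for one sentiment.

    truth[i]     -- True if review i really carries the sentiment
    predicted[i] -- True if the classifier says it does
    """
    counts = Counter()
    for actual, guess in zip(truth, predicted):
        if guess and actual:
            counts["tp"] += 1  # correctly flagged
        elif guess and not actual:
            counts["fp"] += 1  # Alpha / Type I error (false positive)
        elif actual:
            counts["fn"] += 1  # Beta / Type II error (false negative)
        else:
            counts["tn"] += 1  # correctly left unflagged
    return counts


# Hypothetical labels for six reviews (illustration only).
truth = [True, True, False, False, True, False]
predicted = [True, False, True, False, True, False]

outcomes = count_outcomes(truth, predicted)
print(dict(outcomes))
```

In this toy sample there is exactly one Alpha error (the third review) and one Beta error (the second review).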

Two metrics derived from the Alpha and Beta error probabilities characterize the performance of a binary classifier: Sensitivity and Specificity.

The Wikipedia article on sensitivity and specificity has the details if you’re interested, but here quickly: Sensitivity is how much the classifier recognizes of what there is to be recognized. Specificity, as used in this post, is how much of what it recognized is actually true. (Strictly speaking, that second quantity is usually called precision; textbook specificity is the share of reviews without a sentiment that the classifier correctly leaves unflagged.)

Naturally, a classifier is only good if both measures are high.
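For readers who prefer formulas to words, the standard textbook definitions can be sketched in a few lines. The function name and the counts below are made up for illustration and are not Apptrace’s actual numbers. The sketch also computes precision (the share of flagged reviews that truly carry the sentiment), since it is easy to confuse with specificity.

```python
def classifier_rates(tp, fp, fn, tn):
    """Return the usual rates computed from the four outcome counts."""
    return {
        # Share of reviews carrying the sentiment that were flagged.
        "sensitivity": tp / (tp + fn),
        # Share of reviews NOT carrying the sentiment that were left unflagged.
        "specificity": tn / (tn + fp),
        # Share of flagged reviews that really carry the sentiment.
        "precision": tp / (tp + fp),
        # Alpha error rate; always equals 1 - specificity.
        "false_positive_rate": fp / (fp + tn),
    }


# Hypothetical counts for 1000 labeled reviews (illustration only).
rates = classifier_rates(tp=90, fp=30, fn=10, tn=870)

for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```

With these invented counts the false-positive rate is about 3.3% while precision is 75%: a classifier can make few Alpha errors relative to all negative reviews and still have a quarter of its flags be wrong when the sentiment itself is rare.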

Our results

I must say that I find going through random app reviews very far from entertaining. However, to collect a sample of reliably labeled reviews for estimating the Alpha and Beta error probabilities of the classifiers, I had to read and label about 600 random reviews.

(Thankfully, I first took the time to build proper tools that made the process less painful for me.)

So below, the highlight of this post: a table that might help people put the Sentiment Analysis data from apptrace.com in numerical context.

             Addictiveness   Crash    Positivity   Negativity
Sensitivity  97.35%          96.31%   96.49%       97.69%
Specificity  78.50%          76.32%   74.59%       75.64%

Reading the results

We managed to tune the classifiers so that false-positive rates are kept well below the 5% level, while still maintaining high sensitivity.

Because the Specificity/Sensitivity ratios are nearly equal across the four classifiers, we are confident that the data we show on Apptrace reflects the true trends of users’ feedback as expressed in app reviews. Furthermore, no overestimation is to be expected in the derived app sentiment scores.
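The ratios can be checked directly from the figures in the table. This is only a quick sketch; the dictionary and variable names are illustrative.

```python
# Figures copied from the results table, in percent.
sensitivity = {
    "Addictiveness": 97.35,
    "Crash": 96.31,
    "Positivity": 96.49,
    "Negativity": 97.69,
}
specificity = {
    "Addictiveness": 78.50,
    "Crash": 76.32,
    "Positivity": 74.59,
    "Negativity": 75.64,
}

# Specificity/Sensitivity ratio per classifier.
ratios = {name: specificity[name] / sensitivity[name] for name in sensitivity}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")

# Gap between the largest and the smallest ratio.
spread = max(ratios.values()) - min(ratios.values())
print(f"spread: {spread:.3f}")
```

The four ratios come out at roughly 0.81, 0.79, 0.77 and 0.77, so they differ by less than 0.04.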


While that was a short post, stay tuned for what comes next from the exciting sentiment analysis project on Apptrace, and surely for my following posts on the text-mining residuals that PR people find less important.