Parallel R Daemons Behind Apptrace's Global Rank

Introducing GNU R in a blog post is difficult, and what makes it so is the many faces it has.

Introducing GNU R

Statisticians love it. It has the right tools for pretty much any statistical task you might be facing. Be it Frequentist or Bayesian (!) inference, parametrics, nonparametrics, bootstrapping, MCMC, simulations, data (and missing data) analyses, or simply basic stuff: explorations, tests, model-fitting, etc. It’s all there - open source and cross-platform.

Machine Learning / Data Mining people love it. R is written in C (and R), and existing C and Fortran libraries are easy to interface with. It has excellent packaging facilities (several thousand well-tested and maintained packages in the CRAN repository). Reading and writing data (CSV, SQL, key-value or document stores) as well as quickly reshaping, grouping, aggregating and joining large datasets is well provided for (notice the fabulous data.table).

People who do plots love it. Visual data art can be drawn and automatically saved in bulk in any format, including PNG, PDF, SVG, etc. Interactive SVG for the web is also in business.

Programmers… well, it takes getting used to a few of its peculiarities before it becomes your top buddy for doing stats or plots. It's an object-oriented scripting language and, once we got used to it at adeven, we have been using it for nice, automated batch crunching of things.

Global Rank of iTunes Store apps

Each day Apptrace.com processes roughly 6 million app position entries from various countries, categories and devices to bring order to the global app world. I’m talking about our app and artist global rank algorithms that dynamically weigh each of those positions to calculate overall ranks for apps and artists.

To quickly illustrate, here’s a skeleton of how this is done for apps:

(1) Read in all positions for a given day;
(2) Calculate country, category, device, etc. weights for that day;
(3) Group by app, apply a formula and obtain an app score;
(4) Order;
(5) Write!
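The steps above can be sketched in R. This is only an illustration: the function and column names (read_positions, compute_weights, app_score, write_ranks, app_id) are hypothetical stand-ins, not our actual code.

```r
library(data.table)

# Hypothetical skeleton of the daily app ranking job.
rank_apps <- function(day) {
  positions <- read_positions(day)        # (1) all positions for the day
  weights   <- compute_weights(positions) # (2) country/category/device weights
  scores    <- positions[
    , .(score = app_score(.SD, weights))  # (3) group by app, apply formula
    , by = app_id]
  setorder(scores, -score)                # (4) order by score, descending
  write_ranks(scores, day)                # (5) persist the global rank
}
```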

We stopped optimizing when the computation finished in ~90 seconds for every app ranking in the world (over 6 million records), including database IO as well as the additional steps for artist ranks.

Rscript and R executables

We have this in place as a daily cron job thanks to another face of R: the Rscript binary. Our UNIX executable looks roughly like this:
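A minimal sketch of such an executable follows; run_global_rank() is a placeholder for whatever the job actually calls, since ours wraps an internal framework.

```r
#!/usr/bin/env Rscript
# Parse an optional date argument; default to yesterday.
args <- commandArgs(trailingOnly = TRUE)
day <- if (length(args) > 0) as.Date(args[1]) else Sys.Date() - 1

run_global_rank(day)  # hypothetical entry point into the ranking job
```

Make it executable with chmod +x and point a crontab entry at it to get the daily run.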

(I wrapped up a small framework to power all our R jobs and development, and I even foster hopes that it might become the topic of a blog post of its own one day.)

Parallel processing in vanilla R

So far so good. For a day, that is. Once a new version of our algorithm is deployed, however, a history run over all earlier days must follow in order to keep our global ranks consistent.

For the 5 months of operation so far, we are already close to 1 billion app positions altogether. Processing those in memory at once is not an option, so we need to read, crunch and write in batches.
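Such a batched history run might look like the following sketch, where a hypothetical rank_apps(day) processes a single day's positions end to end:

```r
# Process the history one daily batch at a time, so only a single
# day's positions ever sit in memory.
history_run <- function(from, to) {
  days <- seq(as.Date(from), as.Date(to), by = "day")
  for (day in days) {
    # 'for' coerces Dates to numeric, so convert back.
    rank_apps(as.Date(day, origin = "1970-01-01"))
  }
}
```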

So let’s talk a bit about parallel computing in R.

Independent parallel computations

Pretend we want to map a list of arguments to the results of a function performing long, independent computations on each of them. From a master process we'll create a cluster of workers that will do the whole job in parallel. Here's all it takes:
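A minimal version, using the parallel package that ships with R, might look like so:

```r
library(parallel)

# 'Compute' for a random Uniform(0,1) draw times arg seconds,
# then return that time.
compute <- function(arg) {
  time <- runif(1) * arg
  Sys.sleep(time)
  time
}

cl <- makeCluster(8)  # one worker per core

elapsed <- system.time(
  result <- parLapply(cl, 1:8, compute)  # map 1..8 across the workers
)

stopCluster(cl)

print(result)   # the mapping of 1..8 to random computation seconds
print(elapsed)  # overall execution time for all 8 computations
```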

compute(arg) will be 'computing' for a random Uniform(0,1) draw multiplied by arg seconds and return that time. Run this in your copy of R and it will print a mapping of the range 1..8 to random computation seconds, as well as the overall execution time for all 8 computations.

Maximum sleeping lasted 2.218 seconds, while the full computation took 2.223 seconds on my MacBook's 8 cores. Thus all 8 tasks in parallel took barely longer than the longest single task.

Note that this has no other dependencies than R itself.

Distributing dependent computations

Above is the skeleton of a basic but very common scenario. R's take on topics like cluster load balancing, variable sharing, multiple clusters, multiple workstations, etc. is something we'll certainly be gaining more shareable experience with soon, so stay tuned.

Finally

R's popularity among data scientists is not without reason. It has about 20 years behind it, powerful Mac OS and Windows GUIs, a wonderful UNIX interpreter, IDEs and tools (notably ESS and RStudio), its own periodical and an ever-growing fanbase.

And as for this blog post: if you haven't hit close by now, then please do get in touch in the comments below.
