Rport - Business Intelligence Apps With R

Rport is an R package that greatly facilitates common tasks found in many R Business Intelligence apps. It bridges R and SQL analytics similarly to how Rails bridges Ruby and web development.

Introduction

From our analytics work on data from Adjust and Apptrace, we’ve identified several tasks needed for nearly any new R project.

  • Handling multiple database connections within one R session

  • Caching results from long SQL statements in development

  • Parallel jobs processing

  • Building UNIX executables

  • Organizing the growing codebase

While R supports all of these, there isn't a framework to contain the code you'd otherwise repeat when building a multi-project analytics app.

For our needs we built such a framework, and this article presents it. We named it Rport and open-sourced it as an R extension.

Quick Start

Rport is distributed as an R package and you can get the most up-to-date version from GitHub, directly from within an R session:

> library(devtools)
> install_github('rport', 'adeven')

Then if you want to set up a fresh Rport app do:

> library(rport)

> rport.app.new('BI kingdom', root.dir='~/my_apps/')
# help(rport.app.new) for full list of options

This will create the file structure of a clean Rport app. Next, you'll probably want to set up individual projects within your new Rport app. To do this:

> rport.project.new('weekly kpi report', root.dir='~/my_apps/bi_kingdom')
# again help(rport.project.new) for more

Now you've bootstrapped your first Rport app with one project in it for reporting weekly KPIs. The generators have already created some code for you, so you can see some output right away:

$ cd ~/my_apps/bi_kingdom
$ ./bin/weekly_kpi_report.R

Go ahead and browse the generated files; they're well commented and you'll find some next steps in there too.

Rport Apps

Now that you’ve set up an Rport app, let’s take a more detailed look at it. An Rport app would likely contain multiple projects, often serving different purposes. You might have:

  • Cron jobs for analytics, reports, calculations and other tasks.
  • A lot of one-off exploration scripts
  • Asynchronous processing of tasks from web apps
  • Standalone web services
  • Others

An Rport app with two projects might have the following folder structure:

.
├── bin
│   ├── apptrace_newsletter
│   └── new_apps_report
├── config
│   ├── database.yml
│   └── settings.R
├── doc
├── lib
│   ├── functions
│   │   ├── apptrace_newsletter
│   │   │   └── main.R
│   │   └── new_apps_report
│   │       └── main.R
│   └── opts
│       ├── apptrace_newsletter.R
│       └── new_apps_report.R
├── log
├── script
└── spec

For an illustration of Rport's features we created a Demo Rport App. Make sure you refer to it while reading this post.

Rport Features

Rport takes care of the tedious background tasks so that you can focus on actually exploring and analyzing your data. Let's introduce the package's core features.

Database Connectivity

If you're interfacing directly with R's SQL drivers, you'll likely find yourself calling the dbGetQuery(connection, query) routine a lot, meaning that you have to carry your connection object around every time you issue a database query.
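
For illustration, here's a minimal sketch of that pattern using the DBI interface with the RPostgreSQL driver (connection details and table names are made up):

    # without Rport: the connection object travels with every query
    library(RPostgreSQL)

    con <- dbConnect(PostgreSQL(), dbname='bi_production', host='localhost')

    dat  <- dbGetQuery(con, 'select app_id, rank from application_ranks')
    apps <- dbGetQuery(con, 'select id, name from apps')  # `con`, again

    dbDisconnect(con)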

Multiple Connections Handling

Using Rport in the scenario above, you’ll define all your connections in a config/database.yml file, similarly to what you’d do in other frameworks (e.g. Rails). The difference here is that with Rport you can not only define multiple environments, but also multiple connections within each environment.
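
For instance, a hypothetical config/database.yml for the snippet below might look roughly like this (the exact keys are an assumption on our part; the file generated by rport.app.new() is the authoritative reference):

    # hypothetical config/database.yml
    production:
      read:
        host: replica.example.com
        database: bi_production
        user: rport
      backup:
        host: backup.example.com
        database: bi_backup
        user: rport
    development:
      read:
        host: localhost
        database: bi_development
        user: rport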

    # bootstrap production environment
    rport('production')

    # use the handy accessor method for the `read` connection, generated by Rport
    dat <- rport.read('select me from you')

    # access another database and get more data:
    old.dat <- rport.backup('select me from old_you')

    # `dat` and `old.dat` are now `data.table` objects with results from the
    # `production->read` and `production->backup` connections respectively


A few things are worth mentioning in this snippet:

  • Rport created the rport.read and rport.backup methods magically based on the read and backup database configurations in the config/database.yml. Check out the Example app to see more of this.

  • You can have as many database configurations as you like and combine results from all of them into a single R session. We use this feature a lot when doing adjust.io <-> apptrace stuff or when we offload heavy reads to a replication server.

  • Note that different configurations could also mean entirely different database servers. Nothing stops you from having PostgreSQL and MySQL results brought together in R by Rport. And you wouldn’t even care about the underlying connection mechanics, because Rport will do that for you.

  • Rport works with the data.table package, and all results returned from database queries are data.table objects. If you're wondering why we introduced this dependency, check out this fantastic package from the link above, and see the quick example below.
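
As a quick taste of what data.table buys you, aggregating a query result is a one-liner (toy data made up for the example):

    library(data.table)

    dat <- data.table(app_id=c(1, 1, 2), rank=c(10, 20, 5))

    # average rank per app, grouped in a single expression
    dat[, list(avg.rank=mean(rank)), by=app_id]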

Caching Query Results

When working on scripts, you might want to load larger chunks of raw data into memory and crunch them in R (ideally with data.table) to produce your results. While you're just 'playing' with or exploring the raw data, you won't want to wait for an unchanged SQL query to run again every time you rerun a script to test your R code.

Rport’s connection accessors allow you to cache results using R’s load and save routines.

    rport('development')

    # Read the data in memory only if not found in the cache.
    dat <- rport.read('select app_id, rank from application_ranks', cache=TRUE)

    # do crazy crunching on `dat`

This query will now only run once and subsequent executions of the script will read and return the cached R object from the file system.

To ensure this caching never bites you, it only works in the development environment and it logs clearly whenever results are read from cache. Furthermore, it works on a per-connection basis, so the same queries under different connections are cached separately, as the snippet below illustrates.
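
For example, assuming the read and backup connections from above, the same SQL produces two separate cache entries:

    rport('development')

    # identical SQL, but each connection gets its own cache entry
    ranks     <- rport.read('select app_id, rank from application_ranks', cache=TRUE)
    old.ranks <- rport.backup('select app_id, rank from application_ranks', cache=TRUE)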

Parallel Report Compilation

Since R 2.15 the parallel package is part of R Core. Rport provides some wrappers around that package to address specific use cases.

For example, a newsletter or report will likely consist of several independent items that can be generated in parallel to gain performance.

    rport('production')

    users.stats <- function(opts) {
      sql <- "
        SELECT count(*)
        FROM users
        WHERE created_at >= '%s'
      "

      rport.apptrace(sprintf(sql, opts$start_date))
    }

    products.stats <- function(opts) {
      sql <- "
        SELECT count(*)
        FROM products
        WHERE created_at >= '%s'
      "

      rport.apptrace(sprintf(sql, opts$start_date))
    }

    rport.bootstrap('parallel', cluster.size=8)

    # run the components in parallel; each component is given as
    # list(function, arguments)
    result <- rport.parallel(
      users    = list(users.stats, opts),
      products = list(products.stats, opts)
    )

    # `result` is now a list with the results, like:
    # list(users=data.table(..), products=data.table(..))

Working with Executables

Executables are a common interface to our Rport apps. We use them to schedule cron jobs (e.g. report generation, aggregations, etc.) or to run other analytics tasks. Rport uses Rscript for creating cross-platform executables.
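
At its core, such an executable is just an Rscript file with a shebang line, made executable with chmod +x. A minimal hand-written sketch might look like this (the scripts generated by rport.project.new() are the authoritative template):

    #!/usr/bin/env Rscript
    # sketch of a minimal bin/ executable

    library(rport)

    rport('production')  # bootstrap connections, logging, etc.

    dat <- rport.read('select count(*) from users')
    print(dat)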

CLI Options

Rport manages CLI options using the optparse package and the convention of placing opts files under lib/opts/my_script_name.R. Check what rport.project.new() generated for you above, or the sample app, for an illustration.
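
As a sketch, an opts file for the weekly KPI report from the Quick Start might define a --start-date option like this (the exact convention is best read from the generated file):

    # hypothetical lib/opts/weekly_kpi_report.R
    library(optparse)

    option.list <- list(
      make_option('--start-date', type='character', dest='start_date',
                  default='2013-01-01',
                  help='include records created at or after this date')
    )

    # parse_args() returns a list, so `opts$start_date` becomes available,
    # as used in the parallel example above
    opts <- parse_args(OptionParser(option_list=option.list))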

Logging

Rport writes a lot about what it’s doing either interactively or in log files. You can use these logs to get an idea about query and script execution times as well as debugging.

The convention for executables is that all output (including output from parallel workers) is redirected with R's sink() to log/my_script_name.log.
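
Under the hood this kind of redirection relies on base R's sink() mechanism; roughly (a sketch, not Rport's actual code):

    # send both regular output and messages to the script's log file
    log.con <- file('log/my_script_name.log', open='at')
    sink(log.con)
    sink(log.con, type='message')

    message('this line ends up in the log, not on the console')

    # restore console output when done
    sink(type='message')
    sink()
    close(log.con)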

Summary

Rport is an ambitious project under ongoing development. Be sure to follow the GitHub repository for all updates.
