Rport is an R package that greatly facilitates common tasks found in many R Business Intelligence apps. It bridges R and SQL analytics similarly to how Rails bridges Ruby and Web Development.
Handling multiple database connections within one R session
Caching results from long SQL statements in development
Parallel jobs processing
Building UNIX executables
Organizing the growing codebase
While R supports all these, there isn’t a framework to contain the repeated code that you’ll end up with when building a multi-projects analytics app.
For our needs we built such a framework and this article presents it. We named it Rport and it’s open-sourced as an R extension.
Rport is distributed as an R package and you can get the most up-to-date version from GitHub, directly from within an R session:
> library(devtools) > install_github('rport', 'adeven')
Then if you want to set up a fresh Rport app do:
> library(rport) > rport.app.new('BI kingdom', root.dir='~/my_apps/') # help(rport.app.new) for full list of options
This will create the file structure of a clean Rport app. Next you’ll probably want to start setting up different projects for this and that in your new Rport app. To do this:
> rport.project.new('weekly kpi report', root.dir='~/my_apps/bi_kingdom') # again help(rport.project.new) for more
Now you’ve bootstrapped your first Rport app with one project in it for reporting Weekly KPIs. The generators created some code for you already so you can already see some output by going:
$ cd ~/my_apps/bi_kingdom $ ./bin/weekly_kpi_report.R
Go ahead and browse those generated files, they’re well commented and you will find some next steps there too.
Now that you’ve set up an Rport app, let’s take a more detailed look at it. An Rport app would likely contain multiple projects, often serving different purposes. You might have:
- Cron jobs for analytics, reports, calculation and other tasks.
- A lot of one-off exploration scripts
- Asynchronous processing of tasks from web apps
- Standalone web services
An Rport app with two projects might have the following folder structure:
. ├── bin │ ├── apptrace_newsletter │ └── new_apps_report ├── config │ └── database.yml │ └── settings.R ├── doc ├── lib │ ├── functions │ │ ├── apptrace_newsletter │ │ │ └── main.R │ │ └── new_apps_report │ │ └── main.R │ └── opts │ ├── apptrace_newsletter.R │ └── new_apps_report.R ├── log ├── script └── spec
For an illustration of Rport’s features we created a Demo Rport App. Make sure you refer to it along with reading this post.
Rport will take care of tedious background tasks, while you can focus on the actual explorations and analytics of your data. Let’s introduce the core features of the package below.
If you’re directly interfacing R’s SQL drivers, you’ll likely find yourself often
dbGetQuery(connection, query) routine, meaning that you need to
carry your connection object around every time you issue a database query.
Multiple Connections Handling
Using Rport in the scenario above, you’ll define all your connections in a
config/database.yml file, similarly to what you’d do in other frameworks (e.g.
Rails). The difference here is that with Rport you can not only define multiple
environments, but also multiple connections within each environment.
1 2 3 4 5 6 7 8 9 10 11 12 13
Few things are worth mentioning in this snippet:
Rport created the
rport.backupmethods magically based on the
backupdatabase configurations in the
config/database.yml. Check out the Example app to see more of this.
You can have as many database configurations as you like and combine results from all of them into a single R session. We use this feature a lot when doing adjust.io <-> apptrace stuff or when we offload heavy reads to a replication server.
Note that different configurations could also mean entirely different database servers. Nothing stops you from having PostgreSQL and MySQL results brought together in R by Rport. And you wouldn’t even care about the underlying connection mechanics, because Rport will do that for you.
Rport works with the Data Table package and all the results returned from database queries are
data.tableobjects. If you’re wondering why we introduced this dependency, just check out this fantastic package from the link above.
Caching Query Results
When working on scripts, you might want to load larger chunks of raw data into memory and crunch them in R (ideally with Data Table) to produce your results. When just ‘playing’ or exploring the raw data, you won’t really want to wait for an unchanged SQL query to run again when rerunning a script to test your R code.
Rport’s connection accessors allow you to cache results using R’s
1 2 3 4 5 6 7
This query will now only run once and subsequent executions of the script will read and return the cached R object from the file system.
To ensure that nothing bad ever happens to you using this caching, it only works in development and it logs clearly when reading from cache. Furthermore, it works on a per-connection basis, so the same queries under different connections will be cached separately.
Parallel report compilation
Since R 2.15 the
parallel package is part of R Core. Rport provides some
wrappers around that package to address specific use cases.
For example a newsletter or report will likely be constituted of several independent items that could probably be generated independently to gain performance.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
Working with Executables
Executables are a common interface to our Rport apps. We use them to schedule cron jobs (e.g. reports generation, aggregations, etc.) or to run other analytics tasks. Rport uses Rscript for creation of cross-platform executables.
Rport manages CLI options using the R standard lib
optparse package and the
convention of placing opts files under
lib/opts/my_script_name.R. Check what
rport.project.new() generated for you above or the sample app
for an illustration.
Rport writes a lot about what it’s doing either interactively or in log files. You can use these logs to get an idea about query and script execution times as well as debugging.
The convention for executables is that all output is
sinked (including output
from parallel workers) to
Rport is an ambitious project under ongoing development. Be sure to follow the GitHub repository for all updates.