Crunch CSVs With APSW - or Pivot Tables on Crack

As part of our export functions for adjust.com, we generate CSV reports for our clients, which are typically used to create pivot tables in Office. Sometimes those reports can be 100’s of megabytes in size and Office will not be able to open them. And even those files that can be opened often take minutes to load.

Being asked one too many times to “quickly” check one of those reports I came across a very helpful little tool to transform CSVs into SQL without all the hustle of creating complex table structures.

So my basic requirement was to dynamically create the table for a CSV file and then import it into an SQL database. The “dynamically” part is rather important as our CSVs can easily have 50 or more columns and typing all that into a CREATE TABLE statement isn’t exactly fun.

After some searching I found APSW. It does exactly what I was looking for and is super simple to use.

Installation

Since we work with OS X at adjust I’ll only describe how to get going there, but APSW is available for almost any OS incl. Windows.

  • Go to the download page and pick your OS appropriate file - OS X users pick the source .zip

  • Install dependencies via brew

1
brew install pkg-config sqlite xz fuse4x
  • Unzip the file, and run setup.py - with the arguments pointing to the SQLite brew
1
python2.7 setup.py build_ext -R /usr/local/opt/sqlite/lib -I /usr/local/opt/sqlite/include --enable-all-extensions install --user
  • Confirm your setup worked by typing
1
python -c "import apsw;apsw.main()"

into your console, this should take you into a SQLite shell.

1
2
3
4
SQLite version 3.8.4.3 (APSW 3.8.2-r1)
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>

Getting started

Once you got the setup out of the way APSW is quite easy to use.

Let’s say we have a report like this.

1
2
3
Date         | Country | Clicks | Installs | Sessions | DAUs | MAUs
===================================================================
2014-08-01   |      de |     10 |        5 |     100  |    50|  200

Open an ASPW console and use the autoimport command to read the CSV.

1
2
3
4
5
6
7
8
9
10
11
12
python -c "import apsw;apsw.main()"
SQLite version 3.8.4.3 (APSW 3.8.2-r1)
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .autoimport "report.csv"
Detected Format excel  Columns 7  Rows 1240
Auto-import into table "report" complete
sqlite> .tables
report
sqlite> .schema report
CREATE TABLE report(Date, Country, Clicks, Installs, Sessions, "Daily Active Users", "Monthly Active Users");
sqlite>

As you can see ASPW correctly detected the file format and the columns. Now you can use simple SQL instead of wrangling with pivot tables.

1
2
3
4
5
sqlite> SELECT country, COUNT(sessions) AS count FROM report GROUP BY country ORDER BY count DESC LIMIT 4;
de|29
dk|10
ae|8
ar|8

The best part is, that SQLite is a lot faster than Excel and you can handle millions of lines without any problem.

Quirks and recommendations

Working with APSW a bit it turned out to be quite robust. The only real issue I encountered is that it tries to parse dates as US format yyyy-dd-mm and thus has problems figuring out the correct format if you have standard ISO format yyyy-mm-dd with no days > 12 like in the data range 2014-08-[01-12]. My solution was to simply add a line with one day that clarifies it for APSW, so for example one with the date 2013-12-31.

Now if you are reading this and are not familiar enough with SQL to get started right away I can recommend a couple of good places to start:

  • http://sqlzoo.net/
  • http://www.1keydata.com
  • http://www.postgresql.org

Have fun.

Comments