Istore: PostgreSQL Documents for Analytical Workloads

Inspired by the PostgreSQL key/value data type hstore, we developed the istore extension, which supports operators like + and aggregates like SUM on semi-structured, integer-based data.

While hstore allows arbitrary textual data as its keys and values, in an istore document both keys and values are represented and stored as integers. This makes istore a natural fit for analytical workloads. User journeys, cohort or funnel data, distributional data, and many other scenarios can be efficiently modeled and stored in PostgreSQL using istore.

The extension comes with two data types: istore and bigistore. Keys are int in both; istore stores int values, while bigistore stores bigint values. This article demonstrates the efficiency of istore and some of its applications through two examples: aggregating logs and analyzing event funnels.
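As a quick taste of the syntax: istore uses an hstore-like literal format, and the + operator adds the values of matching keys when merging two documents. A minimal sketch (expected output shown as comments; verify the exact formatting against your installed version):

-- Merge two istore documents; values of matching keys are added:
SELECT istore '1=>10, 2=>20' + istore '1=>5, 3=>7';
-- "1"=>"15", "2"=>"20", "3"=>"7"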

Log Aggregation

Say you have an event_log table with the following structure:

CREATE TABLE event_log AS
SELECT
  d::date AS date,
  j AS segment,
  i AS id,
  (random()*1000)::int AS count,
  (random()*100000)::int AS revenue
FROM
  generate_series(1,50) i,
  generate_series(1,1000) j,
  generate_series(current_date - 99, current_date, '1 day') d;

This creates a sample table with 5M rows. Each row records that on day X in segment Y, event Z was hit count times and generated the given revenue.

Let’s now say you want to look at the distributions of hit counts per event ID and revenue per event ID for each (date, segment) pair.

You could define a table with two istore fields per (date, segment). The first field would have event IDs as keys and hit-counts as values, and the other field would have event IDs as keys and revenue as values:

CREATE EXTENSION istore;
CREATE TABLE istore_event_log AS
SELECT
  date,
  segment,
  istore(array_agg(id), array_agg(count)) AS counts,
  istore(array_agg(id), array_agg(revenue)) AS revenue
FROM event_log
GROUP BY date, segment;

This will create a table with 100K rows, one per (date, segment) pair.

Now that we have istore and non-istore models of the data, let’s do some analytics and compare performance.

Fetching an istore value for a given key

As with hstore, the -> operator retrieves the value of an istore for a given key.

For example, to get the total event hits for a specific event ID over all segments and all time, you would write:

istore_test=# SELECT SUM(counts->35) from istore_event_log;
   sum
----------
 50213687
(1 row)

Time: 29,032 ms

In the non-istore example, the same would mean:

istore_test=# SELECT SUM(count) from event_log where id = 35;
   sum
----------
 50213687
(1 row)

Time: 374,806 ms

Here we already see a more-than-tenfold performance benefit, mostly due to reduced I/O:

istore_test=# SELECT
  pg_size_pretty(pg_table_size('event_log')) as "without istore",
  pg_size_pretty(pg_table_size('istore_event_log')) as "with istore";

 without istore | with istore
----------------+-------------
 249 MB         | 87 MB

SQL aggregation and division of istore documents

Typically, you’d be interested in aggregated distributions over all event IDs rather than just a single one. Let’s say you want the revenue per event hit for each event ID. With the non-istore setup, you could write:

istore_test=# SELECT id, SUM(revenue)/SUM(count) FROM event_log GROUP BY 1;
Time: 1099,832 ms

And using istore:

istore_test=# SELECT SUM(revenue)/SUM(counts) FROM istore_event_log;
Time: 163,134 ms

This illustrates how you can use the SQL SUM aggregate function to aggregate istore data. Applying SUM to the revenue and counts columns yields istore documents with event IDs as keys and the summed revenues and counts as values. The istore / istore division operator then produces an istore with event IDs as keys and the desired ratios as values.

If you prefer the result as a set of rows instead of an istore, you can simply apply each(istore) to the result of the division, the same way you would with an hstore.
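On small literals, the two steps look like this (a minimal sketch: the division is key-wise and integer-valued, and we assume here that both documents contain the same keys):

-- Key-wise integer division of two istore documents:
SELECT istore '1=>100, 2=>300' / istore '1=>10, 2=>20';
-- "1"=>"10", "2"=>"15"

-- Expand the resulting document into rows:
SELECT * FROM each(istore '1=>100, 2=>300' / istore '1=>10, 2=>20');
--  key | value
-- -----+-------
--    1 |    10
--    2 |    15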

Note again the improved efficiency of the istore data model versus the non-istore one.

Filtering istore documents

Suppose you want a report of all segments that triggered event ID 5 at least once, but never triggered event ID 100.

Without istore, you would probably write something like:

SELECT segment FROM event_log WHERE id = 5 AND count > 0
EXCEPT
SELECT segment FROM event_log WHERE id = 100 AND count > 0;

Time: 1015,032 ms

With istore, you can write this as:

SELECT segment FROM istore_event_log
GROUP BY segment
HAVING compact(SUM(counts)) ? 5 AND NOT compact(SUM(counts)) ? 100

Time: 110,286 ms

which is almost 10 times faster.

The istore ? integer operator checks whether a given key exists in an istore and should feel familiar from PostgreSQL’s hstore and json types. The compact(istore) function returns an istore with all zero-valued pairs removed.
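A minimal sketch on literals shows why compact matters here: after zero-valued pairs are dropped, the ? check means "key present with a non-zero value":

SELECT compact(istore '5=>3, 100=>0');
-- "5"=>"3"

SELECT compact(istore '5=>3, 100=>0') ? 5;   -- true
SELECT compact(istore '5=>3, 100=>0') ? 100; -- false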

Summing up istore values

Suppose you now need the total count of event hits across all segments and all time.

istore_test=# SELECT SUM(sum_up(counts)) FROM istore_event_log;
    sum
------------
 2500002991
(1 row)

Time: 66,740 ms

The sum_up(istore) function adds up all the values within an istore document. It’s more than 5 times faster than:

istore_test=# SELECT SUM(count) FROM event_log;
    sum
------------
 2500002991
(1 row)

Time: 365,279 ms
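To see sum_up in isolation, here is a minimal sketch on a literal:

-- Add up all values within a single istore document:
SELECT sum_up(istore '1=>2, 3=>4');
-- 6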

Event Funnels

The second use case that we’ll demonstrate here is building event funnels with istore to analyze app usage.

Let’s say you want to analyze a game. Each time a user reaches a certain level, you get an event entry.

CREATE TABLE user_events AS
SELECT
  '2015-09-01 12:10:00'::timestamp - random()*10*24*60*60 * INTERVAL '1 second' AS install_time,
  '2015-09-01 12:10:00'::timestamp + random()*10*24*60*60 * INTERVAL '1 second' AS time,
  j AS user_id,
  (random()*5)::int + 1 AS level
FROM
  generate_series(1,100000) j,
  generate_series(1, 50) e;

Here we create 5M random events for 100K users, each at a random time.

From here, you can build a table showing, per user, the time needed to reach each level for the first and the last time.

CREATE TABLE user_levels AS
SELECT
  user_id,
  istore(array_agg(level), array_agg(min)) as first_time,
  istore(array_agg(level), array_agg(max)) as last_time
FROM (
  SELECT
    user_id,
    level,
    MIN(EXTRACT(EPOCH FROM time - install_time)::int) AS min,
    MAX(EXTRACT(EPOCH FROM time - install_time)::int) AS max
  FROM user_events
  GROUP BY user_id, level
) t
GROUP BY user_id;

Now, based on your data, you can estimate the probability that an average user completes level 3 in less than 3 days:

istore_test=# SELECT COUNT(*) FILTER (WHERE first_time->3 <= 3 * 86400) / COUNT(*)::float
              FROM user_levels;
 ?column?
----------
   0.3615
(1 row)

Time: 19,337 ms

Or the conditional probability of a user taking more than 3 days to complete level 2, given that they completed level 1 within 1 day:

istore_test=# SELECT
                COUNT(*) FILTER (WHERE first_time->1 <= 1 * 86400 AND first_time->2 > 3 * 86400 ) /
                (COUNT(*) FILTER (WHERE first_time->1 <= 1 * 86400))::float
              FROM user_levels;
     ?column?
-------------------
 0.647931303669009
(1 row)

Time: 25,611 ms

If you want to feed such a table incrementally, there are several useful functions available.

For example, you can merge newly arrived events into user_levels with:

UPDATE user_levels a
SET
  first_time = istore_val_smaller(a.first_time, b.first_time),
  last_time = istore_val_larger(a.last_time, b.last_time)
FROM (
  SELECT
    user_id,
    istore(array_agg(level), array_agg(min)) AS first_time,
    istore(array_agg(level), array_agg(max)) AS last_time
  FROM (
    SELECT
      user_id,
      level,
      MIN(EXTRACT(EPOCH FROM time - install_time)::int) AS min,
      MAX(EXTRACT(EPOCH FROM time - install_time)::int) AS max
    FROM user_events
    GROUP BY user_id, level
  ) t GROUP BY user_id
) b
WHERE a.user_id = b.user_id;

istore_val_smaller and istore_val_larger merge two istore documents by taking the smaller (respectively, larger) value for matching keys.
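A minimal sketch on literals (note: the treatment of keys present in only one document, carried over as-is below, is an assumption; check the documentation for the exact semantics):

-- Keep the smaller value per matching key:
SELECT istore_val_smaller(istore '1=>5, 2=>10', istore '1=>3, 3=>7');
-- expected: "1"=>"3", "2"=>"10", "3"=>"7"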

Check out the full documentation for more examples and functions.

We are curious to hear about the analytical challenges you solve using istore.
