Inspired by the PostgreSQL key/value data type hstore, we developed the istore extension, which supports operators like + and aggregates like SUM for semi-structured integer-based data. While hstore allows arbitrary textual data as its keys and values, in an istore document both keys and values are represented and stored as integers. This makes istore a natural fit for analytical workloads: user journeys, cohort or funnel data, distributional data, and many other scenarios can be efficiently modeled and stored in PostgreSQL using istore.
The extension comes with two data types: istore and bigistore. The former has int values and the latter bigint values; keys are int for both. This article demonstrates the efficiency of istore and some of its applications through two examples: aggregating logs and analyzing event funnels.
Log Aggregation
Say you have an event_log table with the following structure:
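A minimal sketch of what this could look like (the column names here are assumptions based on the description below):

```sql
-- Hypothetical structure for the event_log table described in the text.
CREATE TABLE event_log (
    date     date,
    segment  int,
    event_id int,
    count    int,
    revenue  int
);

-- Populate with 5M sample rows: 100 days x 1000 segments x 50 events.
INSERT INTO event_log
SELECT '2017-01-01'::date + (i / 50000),
       (i / 50) % 1000,
       i % 50,
       1 + (random() * 100)::int,
       (random() * 1000)::int
FROM generate_series(0, 4999999) AS i;
```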
This creates a sample table with 5M rows. Each row stores the information that on day X, in segment Y, the event Z was hit count times, bringing in revenue amount of revenue.
Let’s now say you want to look at the distributions of hit counts per event ID and of revenue per event ID for each (date, segment) pair.
You could define a table with two istore fields per (date, segment). The first field would have event IDs as keys and hit counts as values, and the other field would have event IDs as keys and revenue as values:
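One way to build this from the event_log table (the table name event_log_istore is illustrative, and we assume the extension’s istore(keys, values) constructor that builds a document from two parallel arrays):

```sql
-- Hypothetical istore model: one row per (date, segment),
-- with event IDs as keys and counts/revenues as values.
CREATE TABLE event_log_istore AS
SELECT date,
       segment,
       istore(array_agg(event_id), array_agg(count))   AS counts,
       istore(array_agg(event_id), array_agg(revenue)) AS revenues
FROM event_log
GROUP BY date, segment;
```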
This will create a table with 100K rows.
Now that we have istore and non-istore models of the data, let’s do some analytics and compare performance.
Fetching an istore value for a given key
As with hstore, the -> operator retrieves the value of an istore for a given key.
For example, to get the total event hits for a specific event ID over all segments and all time, you would write:
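Assuming the istore model lives in a table event_log_istore with an istore column counts (names are illustrative), this could look like:

```sql
-- Total hits for event ID 5 across all (date, segment) rows.
SELECT SUM(counts -> 5) AS event_5_hits
FROM event_log_istore;
```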
In the non-istore example, the equivalent query would be:
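A plausible shape for the row-based query:

```sql
-- Same total from the row-per-event model: scans far more rows.
SELECT SUM(count) AS event_5_hits
FROM event_log
WHERE event_id = 5;
```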
Here we already see more than 10 times the performance benefit, mostly due to the reduced I/O.
SQL aggregation and division of istore documents
Typically, you’d be interested in aggregated distributions over all event IDs instead of just a single one. Let’s say you want the revenue per single event hit for each event ID. With the non-istore setup, you could write:
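One plausible version over the flat table:

```sql
-- Revenue per single event hit, per event ID, from the flat table.
SELECT event_id, SUM(revenue) / SUM(count) AS revenue_per_hit
FROM event_log
GROUP BY event_id;
```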
And using istore:
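A sketch with istore, assuming the two istore columns are named revenues and counts:

```sql
-- SUM aggregates the documents key-wise; "/" divides values of
-- matching keys. Values are integers, so the ratios are truncated.
SELECT SUM(revenues) / SUM(counts) AS revenue_per_hit
FROM event_log_istore;
```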
This illustrates how you can use the SQL SUM aggregate function to perform aggregations on istore data. Each SUM yields an istore with event IDs as keys and the summed revenues and counts, respectively, as values. The istore / istore division operator then produces an istore with event IDs as keys and the desired ratios as values.
If you prefer the result as a set instead of an istore, you can simply apply each(istore) to the result of the division, the same way you would with an hstore.
Note again the improved efficiency of the istore vs. the non-istore data model.
Filtering istore documents
Suppose you want a report of all segments that triggered event ID 5 at least once, but never triggered event ID 100.
Without istore, you would probably write something like:
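For instance, with the flat event_log table:

```sql
-- Segments with at least one hit of event 5 and none of event 100.
SELECT segment
FROM event_log
GROUP BY segment
HAVING SUM(CASE WHEN event_id = 5   THEN count ELSE 0 END) > 0
   AND SUM(CASE WHEN event_id = 100 THEN count ELSE 0 END) = 0;
```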
With istore, you can write this as:
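A sketch using the ? operator and compact, assuming a table event_log_istore with an istore column counts:

```sql
-- compact() drops zero-valued pairs, so "?" tests for a real hit.
SELECT segment
FROM (SELECT segment, SUM(counts) AS counts
      FROM event_log_istore
      GROUP BY segment) AS per_segment
WHERE compact(counts) ? 5
  AND NOT (compact(counts) ? 100);
```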
which is almost 10 times faster.
Using the istore ? integer operator to check whether a given key exists in an istore should feel familiar from PostgreSQL’s hstore or json types. The compact(istore) function returns an istore with all zero-valued pairs removed.
Summing up istore values
Suppose you now need the total count of event hits across all segments and all time.
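With istore, a sketch could be:

```sql
-- sum_up() totals the values within each document;
-- SUM then totals across rows.
SELECT SUM(sum_up(counts)) AS total_hits
FROM event_log_istore;
```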
The sum_up(istore) function adds up all the values of an istore document. This is more than 5 times faster than the equivalent query against the row-based model.
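A sketch of that row-based query over the original event_log table:

```sql
-- Same total from the row-per-event model.
SELECT SUM(count) AS total_hits
FROM event_log;
```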
Event Funnels
The second use case that we’ll demonstrate here is building event funnels with istore to analyze app usage.
Let’s say you want to analyze a game. Each time a user reaches a certain level you get an event entry.
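A sketch of such an events table with generated sample data (the names are illustrative):

```sql
-- Hypothetical log of "user reached level" events.
CREATE TABLE game_events (
    user_id int,
    level   int,
    at      timestamptz
);

-- 5M random events for 100K users over 100 days.
INSERT INTO game_events
SELECT (random() * 99999)::int,
       1 + (random() * 9)::int,
       timestamptz '2017-01-01' + random() * interval '100 days'
FROM generate_series(1, 5000000);
```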
Here we create 5M random events for 100K users at random times. From here, you can build a table showing, per user, the time needed to reach each level for the first and the last time.
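One way to build it, keyed by level, with values in seconds since the user’s first recorded event (the table name user_levels and its columns are assumptions):

```sql
-- first_at/last_at: istores mapping level -> seconds until the level
-- was first/last reached, measured from the user's earliest event.
CREATE TABLE user_levels AS
SELECT user_id,
       istore(array_agg(level), array_agg(first_s)) AS first_at,
       istore(array_agg(level), array_agg(last_s))  AS last_at
FROM (
    SELECT user_id,
           level,
           extract(epoch FROM MIN(at)
                   - MIN(MIN(at)) OVER (PARTITION BY user_id))::int AS first_s,
           extract(epoch FROM MAX(at)
                   - MIN(MIN(at)) OVER (PARTITION BY user_id))::int AS last_s
    FROM game_events
    GROUP BY user_id, level
) AS per_level
GROUP BY user_id;
```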
Now, based on your data, you can estimate the probability for an average user to complete level 3 in less than 3 days:
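Assuming a user_levels table as sketched above, with istore values in seconds:

```sql
-- Share of users whose first level-3 event came within 3 days.
SELECT count(*) FILTER (WHERE (first_at -> 3) < 3 * 86400)::float
       / count(*) AS p_level3_in_3_days
FROM user_levels;
```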
Or the conditional probability of a user taking more than 3 days to complete level 2, given that they completed level 1 within 1 day:
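Under the same assumptions:

```sql
-- Among users who completed level 1 within a day,
-- the share that took more than 3 days for level 2.
SELECT count(*) FILTER (WHERE (first_at -> 2) > 3 * 86400)::float
       / count(*) AS p
FROM user_levels
WHERE (first_at -> 1) < 86400;
```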
If you want to feed such a table incrementally, there are several useful functions available. For example, you can merge in new events with:
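A sketch of such an incremental merge, assuming a unique index on user_levels(user_id), a staging table new_game_events whose rows carry seconds_since_start (seconds between the user’s first event and this one), and the names used above:

```sql
-- Upsert a batch of new events into the per-user funnel table.
INSERT INTO user_levels (user_id, first_at, last_at)
SELECT user_id,
       istore(array_agg(level), array_agg(first_s)) AS first_at,
       istore(array_agg(level), array_agg(last_s))  AS last_at
FROM (
    SELECT user_id,
           level,
           MIN(seconds_since_start) AS first_s,
           MAX(seconds_since_start) AS last_s
    FROM new_game_events
    GROUP BY user_id, level
) AS batch
GROUP BY user_id
ON CONFLICT (user_id) DO UPDATE
-- Keep the earliest first-reach and the latest last-reach per level.
SET first_at = istore_val_smaller(user_levels.first_at, EXCLUDED.first_at),
    last_at  = istore_val_larger (user_levels.last_at,  EXCLUDED.last_at);
```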
istore_val_smaller and istore_val_larger merge two istore documents by using the smaller (respectively, larger) value for matching keys.
Check out the full documentation for more examples and functions. We are curious to hear about the analytical challenges you solve using istore.