Aggregations With Pg Hstore

If you’re thinking about using a key value store for your technology stack, you shouldn’t just consider all the new kids on the block. Postgres has a some very cool tricks up its sleeve. It has its own integrated key value data type. And today we’ll go through how to write your own aggregations for postgres hstore.

The use case

At adeven our technology stack is heavily based on postgres. Our latest product adjust.io tracks downloads, clicks, impressions and many more. To see the distribution of the parameters we track, we need to implement our own aggregations.

Let’s say you have a simple test table which stores creation time along with country:

1
2
3
4
5
6
7
8
9
10
11
12
CREATE TABLE test
(
  country character varying(12),
  created_at timestamp without time zone
);

INSERT INTO test (created_at, country)
  SELECT time, (Array['de', 'en', 'fr', 'us', 'it'])[(random() * 4 + 1)::int]
  FROM
  generate_series('2012-10-20 00:00'::timestamp, '2012-10-20 23:59', '1 second') time;

CREATE EXTENSION IF NOT EXISTS hstore;

Count countries

If you want to count the number of times different countries are represented in the dataset, you can do it using the postgres hstore function.

1
2
3
4
5
6
7
8
hstores=# SELECT (
  SELECT hstore(array_agg(v), array_agg(c::text))
  FROM ( SELECT v, COUNT(*) as c from unnest(array_agg(country)) v GROUP BY v) t
  ) countries FROM test;
                                 countries
---------------------------------------------------------------------------
 "de"=>"10928", "en"=>"21762", "fr"=>"21324", "it"=>"10783", "us"=>"21544"
(1 row)

You’re returned an hstore document with country as key and the count as value. Note that both key and value in hstores must be of type text.

This basically counts the occurrence of values in an array since array_agg(country) returns an array of countries. And in order to not confuse your coworker, you can put that in a handy function:

1
2
3
4
5
6
7
8
9
10
11
CREATE OR REPLACE FUNCTION array_count(arr anyarray) RETURNS hstore AS $$
  BEGIN
    RETURN
      hstore(array_agg(COALESCE(v::text, 'null')), array_agg(c::text)) FROM (
        SELECT v, COUNT(*) as c FROM unnest(arr) v GROUP BY v
      ) t ;
  END;
 $$ LANGUAGE 'plpgsql' IMMUTABLE STRICT;

-- Select array_count(Array['foo','bar','foo','baz'])
-- => "bar"=>"1", "baz"=>"1", "foo"=>"2"

So your query becomes simply:

1
2
3
4
5
hstores=# SELECT array_count(array_agg(country)) FROM test;
                                array_count
---------------------------------------------------------------------------
 "de"=>"10928", "en"=>"21762", "fr"=>"21324", "it"=>"10783", "us"=>"21544"
(1 row)

Sum hstore

Let’s say you want to store pre-aggregated results in some other table.

1
2
3
4
5
6
7
8
9
CREATE TABLE test_by_hour
(
  countries hstore,
  created_at timestamp without time zone
);

INSERT INTO test_by_hour (countries, created_at)
SELECT array_count(array_agg(country)), date_trunc('hour', created_at) AS hour
FROM test GROUP BY date_trunc('hour', created_at);

To get the distribution of the whole table you now need to aggregate (sum) hstore documents. Luckily you can define your own aggregations in postgres; to do so, you need a function that knows how to add two hstores.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
CREATE OR REPLACE FUNCTION hstore_add(a hstore, b hstore) RETURNS hstore AS $$
  BEGIN
    RETURN
          hstore(
          array_agg(key),
              array_agg(
              (
                  COALESCE(r.value::integer, 0) +
                  COALESCE(l.value::integer, 0)
              )::text
          )
      )
      FROM each(a) l
      FULL OUTER JOIN each(b) r
      USING (key);
  END;
$$ LANGUAGE 'plpgsql' IMMUTABLE;

-- Select hstore_add('a=>1,b=>2'::hstore ,'a=>1,c=>2'::hstore)
-- => "a"=>"2", "b"=>"2", "c"=>"2"

With this you can tell postgres how to sum hstores.

1
2
3
4
5
6
CREATE AGGREGATE SUM (
  sfunc = hstore_add,
  basetype = hstore,
  stype = hstore,
  initcond = ''
);

Note that you can overload functions in postgres - so this sum doesn’t destroy the inbuilt postgres sum for numbers.

If you now query:

1
2
3
4
5
hstores=# SELECT SUM(countries) from test_by_hour;
                                    sum
---------------------------------------------------------------------------
 "de"=>"10928", "en"=>"21762", "fr"=>"21324", "it"=>"10783", "us"=>"21544"
(1 row)

You get what you expect.

Conclusion

Before you think about adding a new fancy key value store with complicated map reduce queries to your toolchain try postgres hstore.

Comments