Table Returning Functions

When writing functions in postgres that involve querying the database you might come into the Why is my function slow? issue. However there are other pitfalls when returning (and using) multiple output values or using RETURNS TABLE syntax. Read why and how to avoid them.

TL;DR;

When calling functions in postgres that return multiple output values avoid calling

1
SELECT (my_func()).*

and write

1
SELECT * FROM my_func()

instead.

An example

Consider the following test table.

1
CREATE TABLE stuff (number integer, created_at timestamp);

Let’s create 10 million test records:

1
2
INSERT INTO stuff(number, created_at)
SELECT n, current_timestamp - (n||'minutes')::interval FROM generate_series(1, 1E7::integer, 1) n;

Querying this table without an index would be rather slow as a whole table scan is needed.

1
2
3
4
5
6
7
testdb=# SELECT * FROM stuff WHERE DATE(created_at) >= current_date - 500 AND number = 500;
 number |         created_at
--------+----------------------------
    500 | 2013-07-28 06:43:35.110845
(1 row)

Time: 1493,431 ms

To avoid this we can easily add an index

1
CREATE INDEX index_stuff_on_date_time ON stuff USING btree (DATE(created_at));

And the same query will run a lot faster.

1
2
3
4
5
6
7
testdb=# SELECT * FROM stuff WHERE DATE(created_at) >= current_date - 500 AND number = 500;
 number |         created_at
--------+----------------------------
    500 | 2013-07-28 06:43:35.110845
(1 row)

Time: 120,357 ms

So far so easy.

Now lets assume that for some reason you want to create a function that takes two parameters and basically does the same query.

1
2
3
4
5
6
7
8
CREATE OR REPLACE FUNCTION get_stuff(num int, date_stamp date)
RETURNS TABLE(number int, created_at timestamp) AS $$
BEGIN
  RETURN QUERY
    SELECT t.number, t.created_at FROM stuff t
      WHERE DATE(t.created_at) >= date_stamp AND t.number = num;
END
$$ LANGUAGE 'plpgsql';

We can now do the same thing using our function.

1
2
3
4
5
6
7
testdb=# SELECT (get_stuff(500, current_date - 500)).*;
 number |         created_at
--------+----------------------------
    500 | 2013-07-28 06:43:35.110845
(1 row)

Time: 229,657 ms

But the result now takes almost twice as long.

However we would get the exactly same result querying

1
2
3
4
5
6
7
testdb=# SELECT * FROM get_stuff(500, current_date - 500);
 number |         created_at
--------+----------------------------
    500 | 2013-07-28 06:43:35.110845
(1 row)

Time: 119,043 ms

So what makes the difference here?

Let’s modify the function to return another constant value. Note: As we change the output format of the function we need to drop the function before we can change it.

1
2
3
4
5
6
7
8
9
DROP FUNCTION get_stuff(num int, date_stamp date);
CREATE OR REPLACE FUNCTION get_stuff(num int, date_stamp date)
RETURNS TABLE(number int, created_at timestamp, foo text) AS $$
BEGIN
  RETURN QUERY
    SELECT t.number, t.created_at, 'foo'::text FROM stuff t
      WHERE DATE(t.created_at) >= date_stamp AND t.number = num;
END
$$ LANGUAGE 'plpgsql';

All we did is just adding the text foo to the output table.

1
2
3
4
5
6
7
testdb=# SELECT (get_stuff(500, current_date - 500)).*;
 number |         created_at         | foo
--------+----------------------------+-----
    500 | 2013-07-28 06:43:35.110845 | foo
(1 row)

Time: 340,359 ms

And suddenly the query takes even longer. While the other version is still as fast as before.

1
2
3
4
5
6
7
testdb=# SELECT * FROM get_stuff(500, current_date - 500);
 number |         created_at         | foo
--------+----------------------------+-----
    500 | 2013-07-28 06:43:35.110845 | foo
(1 row)

Time: 117,085 ms

What effectively happens is that the query internally get’s modified into

1
2
3
SELECT (get_stuff(500, current_date - 500)).number,
       (get_stuff(500, current_date - 500)).created_at,
       (get_stuff(500, current_date - 500)).foo

So the function is called three times and thus taking three times longer. The only way to avoid this behavior is to modify the statement in the above mentioned way.

Comments