Postgres has a ton of features and offers a wide range of data types, functions, operators, and aggregates. But sometimes it’s just not enough for your use case. Luckily, it’s easy to extend Postgres’ functionality through extension. So why not write your own?
This is the first in a series of articles about extending Postgres through extensions. You can follow the code examples here on branch part_i
base36
You might already know the trick used by url shorteners. Use some unique random characters
such as http://goo.gl/EAZSKW to point to something else. You have to remember what points to where, of course, so you need to store it in a database.
But instead of
saving 6 characters using varchar(6)
(and thus wasting 7 bytes) why not use an integer
with 4 bytes and represent it as base36?
The Extension Skeleton
To be able to run the CREATE EXTENSION command in your database, your extension needs at least two files: a control file in the format extension_name.control
, which tells Postgres some basics about your extension, and a extension’s SQL script file in the format extension--version.sql
.
So let’s add them into our project directory.
A good starting point for our control file might be:
1 2 3 4 |
|
As of now, our extension has no functionality. Let’s add some in an SQL script file:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
|
The second line ensures that the file won’t be loaded into the database directly,
but only via CREATE EXTENSION
.
The simple plpgsql function allows us to encode any integer into its base36
representation.
If we copied these two files into postgres SHAREDIR/extension
directory, then we could start using the extension with CREATE EXTENSION
.
But we won’t bother users with figuring out where to put these files and how to
copy them manually – that’s what Makefiles are made for. So, let’s add one to our project.
Makefile
Every PostgreSQL installation from 9.1 onwards provides a build infrastructure for extensions
called PGXS, allowing extensions to be easily built against an already-installed
server. Most of the environment variables needed to build an extension are
setup in pg_config
and can simply be reused.
For our example this Makefile fits our needs.
1 2 3 4 5 6 7 |
|
Now we can start using the extension. Run
1
|
|
from your project directory and
1 2 3 4 5 6 7 8 9 10 |
|
in your database. Awesome!
Write tests
These days, every serious developer writes tests. And as database developer who deals with data (probably the most valuable thing in your company) you should as well.
You can easily add some regression tests to your project that can be invoked
by make installcheck
after doing make install
. For this to work you can put
test script files in a subdirectory named sql/
. For each test file there should
also be a file containing the expected output in a subdirectory named expected/
with the same name and the extension .out
. The make installcheck
command executes
each test script with psql, and compares the resulting output to the matching expected file.
Any differences will be written to the file regression.diffs.
Let’s do so:
1 2 3 4 5 6 7 |
|
We also need to tell our Makefile
about the tests (Line 3):
1 2 3 4 5 6 7 8 |
|
If we now run make install && make installcheck
, then our tests would fail. This is because we didn’t specify the expected output. However, we’d find the new directory results
, which would contain base36_test.out
and base36_test.out.diff
.
The former contains the actual output from our test script file. Let’s move
it into the desired directory.
1 2 |
|
If we now rerun our test, we’d see something like:
1 2 3 4 5 6 |
|
Nice! But hey, we cheated a little bit. If we take a look at our expectations, we’d notice that this isn’t what we should expect.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 |
|
You’ll notice that in line 6, base36_encode(0)
returns an empty string where we’d expect 0
. If we fix our
expectation, our test would fail again.
1 2 3 4 5 6 7 8 9 10 11 12 |
|
And we can easily inspect the failing test by looking at the mentioned regression.diffs
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
You can read it as “expected 0 got “.
Now let’s implement the fix in the encoding function to make the tests pass again (Line 12-14):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
|
Optimize for speed, write some C
While shipping related functionality in an extension is a convenient way to share code, the real fun starts when you implement stuff in C. Let’s get the first 1M base36 numbers.
1 2 |
|
11s? That’s …well, not so fast.
Let’s see if we can do better in C. Writing C-Language Functions isn’t that hard.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
|
You might have noticed that the actual algorithm is the one Wikipedia provides. Let’s see what we added to make it work with Postgres.
#include "postgres.h"
includes most of the basic stuff needed for interfacing
with Postgres. This line needs to be included in every C-File that declares Postgres functions.
#include "fmgr.h"
needs to be included to make use of PG_GETARG_XXX
and PG_RETURN_XXX
macros.
#include "utils/builtins.h"
defines some operations on Postgres’ built-in datatypes (cstring_to_text used later
)
PG_MODULE_MAGIC
is the “magic block” needed as of PostgreSQL 8.2 in one (and only one) of the module source files after including the header fmgr.h
.
PG_FUNCTION_INFO_V1(base36_encode);
introduces the function to Postges as Version 1 Calling Convention, and is only needed if
you want the function to interface with Postgres.
Datum
is the return type of every C-language Postgres function and can be any data type. You can think of it as something similar to a void *
.
base36_encode(PG_FUNCTION_ARGS)
our function is named base36_encode
PG_FUNCTION_ARGS
and can take any number and any type of arguments.
int32 arg = PG_GETARG_INT32(0);
get the first argument. The arguments are numbered starting from 0
. You must use the PG_GETARG_XXX
macros defined in fmgr.h to get the actual argument value.
char *buffer = palloc(7 * sizeof(char));
to prevent memory leaks when allocating memory, always use the PostgreSQL functions palloc and pfree instead of the corresponding C library functions malloc and free.
Memory allocated by palloc will be freed automatically at the end of each transaction. You can also use palloc0
to ensure the bytes are zeroed.
PG_RETURN_TEXT_P(cstring_to_text(&buffer[offset]));
to return a value to Postgres you always have to use
one of the PG_RETURN_XXX macros. cstring_to_text
converts the cstring to Postgres text type before.
Once we’re finished with the C-part, we need to modify our SQL function.
1 2 3 4 5 |
|
To be able to use the function we also need to modify the Makefile
(Line 4)
1 2 3 4 5 6 7 8 9 |
|
Luckily, we already have test and can try it out with make install && make installcheck
.
Opening a database console also proves it to be a lot (30 times) faster:
1 2 |
|
Returning errors
You might have noticed that our simple implementation would not work with negative
numbers. Just as it did before with 0
, it would return an empty string. We might want to add
a -
sign for negative values or simply error out. Let’s go for the latter. (Line 12-20)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 |
|
Which would result in
1 2 3 4 |
|
Postgres has some nice error reporting build in. While for this use case a simple errmsg might have been enough, you can (but don’t need to) add details, hints and more.
For simple debugging, it’s also convenient to use a
1
|
|
The INFO
level error would only result into a log message and not immediately stop
the function call. Severity levels range from DEBUG
to PANIC
.
More to come…
Now that we know the basics for writing extensions and C-Language functions, in the next post we’ll take the next step and implement a complete new datatype.