In the last post about Writing Postgres Extensions we created a new data type base36
from ground up. However we left with a serious bug causing our server to crash.
Now let’s hunt that bug down with a debugger and complete the testsuite.
We created a dedicated github repo following the content from these series on writing PostgreSQL extensions. The code from the last article could be found on branch part_ii and today’s changes are on branch part_iii.
The Bug
First let’s reproduce the bug.
1 2 3 4 5 6 7 8 9 |
|
We definitely don’t want this to happen on our production database, so lets find out where the problem is. We only wrote two relatively simple C-functions base36_out
and base36_in
. If we assume that we are not smarter than the folks from the PostgreSQL-Core team - which is at least for me personally a reasonable assumption - then the bug must be in one of these.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 |
|
Set up debugging environment
In order to use a debugger such as LLDB you’ll need to compile PostgreSQL with debug symbols. The following short guidance through debugging works for me on MacOS having PostgreSQL installed with homebrew and using LLDB with Xcode.
Firstly, let’s shut down any running Postgres instances - you don’t want to mess up your existing DB or work :)
1 2 3 4 5 6 7 |
|
Next we’ll download the PostgreSQL source code by executing this script.
1 2 3 |
|
And build with debugging options enabled.
1 2 3 |
|
We’ll skip the adduser
command that the Postgres docs recommend. Instead, I’ll just run Postgres using my own user account to make debugging easier.
1
|
|
Then init the data directory
1
|
|
And start the server
1
|
|
Add pgsql/bin
path from the new installation to the PATH
environment variable
1
|
|
Install the extension (due to the export above this time pgxn
from the new installation is used).
1
|
|
Now we can create a test db
1
|
|
and connect to it
1
|
|
Check if it works – well or not
1 2 3 4 5 6 7 8 9 10 11 12 |
|
Debugging
Now that we have our debugging environment setup, let’s start the actual chasing of the problem. Firstly, let’s look at the log file. That’s the file we specified with the -l
flag to pg_ctl
. In our case /usr/local/pgsql/data/postmaster.log
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
Reconnect to the database and find out the pid of your current db session
1 2 3 4 5 |
|
Connect LLDB with the pid (in another terminal)
1
|
|
Run the failing command in the psql session
1
|
|
Continue LLDB
1 2 3 4 5 6 7 8 9 10 |
|
Get a Backtrace from LLDB
1 2 3 4 5 6 7 8 9 10 11 12 |
|
Ok what do we have? The exception is thrown in pfree
which is defined in mcxt.c:699
. pfree
is called from get_const_expr
in ruleutils.c:8002
and so forth. If we go four times up the call stack. We’d end up here:
1 2 3 4 5 6 7 8 9 |
|
Let’s look at the source code in
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
|
Postgres uses pfree
to release memory from the current memory context. Somehow we messed up our memory.
Let’s take a look at the pointers content
1 2 |
|
It’s indeed our search condition 3c
. So what did we do wrong here? As mentioned in the first article pfree
and palloc
are Postgres counterparts of free
and malloc
to safely allocate and free memory in the current memory context. Somehow we messed it up. In base36_out
we used
1
|
|
to allocate 7 bytes of memory. Finally we return a pointer
1
|
|
at offset 4 in this case. The assertion in mcxt.c:699
1
|
|
Makes sure that the data to be released are correctly aligned. The condition here is:
1
|
|
To be read as does the pointer start at a multiple of 8 bytes? As we don’t return the same address as the one we allocated from, it causes pfree
to complain that the pointer is not aligned.
Let’s fix that!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
|
Now we allocate the buffer from the stack (Line 18) and finally us pstrdup
to copy the string freshly allocated memory (Line 26). This implementation is closer – almost equivalent to Wikipedias.
You might have guessed that pstrdup
is Postgres counterpart of strdup
. It safely takes memory from the current memory context via palloc
and frees automatically at the end of a transaction.
TYPE CASTING
Now that we can input and output data for our type. It would be nice to also cast from and to other types.
1 2 3 4 |
|
Wow that is relatively easy. As integer
and base36
are binary coercible (that is the binary internal representations are the same) the conversion can be done for free (WITHOUT FUNCTION
). We also marked this cast as IMPLICIT
thus telling postgres that it can perform the cast automatically whenever suitable. For example consider this query:
1 2 3 4 5 |
|
There is no integer + base36
operator defined but by implicit casting base36
to integer
Postgres can use the integer + integer
operator and give us the result as integer. However implicit casts should be defined with care as the result of certain operations might be suspicious. For the above operation a user wouldn’t know if the result is integer or base36 and thus might misinterpret it. Queries will totally break if we later decide to add an operator integer + base36
which returns base36
.
Even more confusing might be this query result:
1 2 3 4 5 |
|
Although we disallowed negative values we get one here how is that possible? Internally Postgres does this operation:
1 2 3 4 |
|
We can and should avoid such a confusing behavior. One option would be to add a prefix to base36 output (like it is common for hex or octal numbers) or by giving the responsibility to the user and only allow explicit casts.
Another option to clarify things would be to mark the cast AS ASSIGNMENT
. With that casting would only be automatically performed if you assign an integer to a base36 type and vice versa. This is typically suitable for INSERT or UPDATE statements. Let’s try this:
1 2 3 4 |
|
and fill our table:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
More to come…
You have seen how important it is to test everything, not only to find bugs that in the worst case might crash the server, but also to specify the expected output from certain operations such as casts. In the next post we’ll elaborate on that creating a full-coverage test suite.