In the last post about Writing Postgres Extensions, we covered the basics of extending PostgresSQL with extension. Now it’s time for the fun part – developing our own type.
A small disclaimer
It’s in your best interest to resist the urge to copy and paste the code found within this article. There are some serious bugs along the lines, which were intentionally left in for illustrative purposes. If you’re looking for a production-ready base36
type definition, then take a look at here.
A refresher on base36
What we’re after is the solid implementation of a base36
data type to use for storing and retrieving base36 numbers. We already created the basic skeleton for our extension, including base36.control
and Makefile
, which you can find in the GitHub repo dedicated to this series of blog posts. You can check out what we ended up with in Part 1 and the code from this post can be found on the part_ii branch.
1 2 3 4 |
|
1 2 3 4 5 6 7 8 9 |
|
Custom data type in Postgres
Let’s rewrite the SQL script file to show our own data type:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
This is the minimum required to create a base type in Postgres: We need the two functions input
and output
that tell Postgres how to convert the input text to the internal representation (base36_in
) and back from the internal representation to text (base36_out
). We also need to tell Postgres to treat our type like integer
. This can also be achieved by specifying these additional parameters in the type definition as in the example below.
1 2 3 4 |
|
Now let’s do the C-Part:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 |
|
We basically just reused our base36_encode
function to be our OUTPUT
and added an INPUT
decoding function - easy.
Now we can store and retrieve base36
numbers in our database. Let’s build and test it.
1
|
|
1 2 3 4 5 6 7 8 9 10 11 12 |
|
Works so far. Let’s order the output.
1 2 3 4 5 |
|
Hmmm… looks like we missed something.
Operators
Keep in mind that we’re dealing with a completely bare data type. In order to do any sorting, we need to define what it means for an instance of the data type to be less than another instance, for it to be greater than another instance or for two instances to be equal.
This shouldn’t be too strange – in fact, it resembles how you would include the Enumerable
mixin in a Ruby class or implement the sort.Interface
in a Golang type to introduce the ordering rules for your objects.
Let’s add the comparison functions and operators to our SQL script.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 |
|
Wow…that’s a lot. To break it down: First, we defined a comparison function to power each comparison operator (<
, <=
, =
, >=
and >
). We then put them together in an operator class that will enable us to create indexes on our new data type.
For the functions themselves we could simply reuse the corresponding, built-in functions for the integer type: int4eq
, int4ne
, int4lt
, int4le
, int4gt
, int4ge
, btint4cmp
and hashint4
.
Now let’s take a look at the operator definitions.
Each operator has a left argument (LEFTARG
), a right argument (RIGHTARG
), and a function (PROCEDURE
).
So, if we write:
1 2 3 4 5 |
|
Postgres will use the base36_lt
function and do a base36_lt('larg','rarg')
.
COMMUTATOR and NEGATOR
Each operator also has a COMMUTATOR
and a NEGATOR
(see Line 52-53). These are used by the query planer to do optimizations. A commutator is the operator that should be used to denote the same result, but with the arguments flipped. Thus, since (x < y)
equals (y > x)
for all possible values x
and y
, the operator >
is the commutator of the operator <
. For the same reason <
is the commutator of >
.
The negator is the operator that would negate the boolean result of an operator. That is, (x < y)
equals NOT(x >= y)
for all possible values x
and y
.
So why is that important? Suppose you’ve indexed the column val
:
1 2 3 4 5 6 |
|
As you can see, Postgres has to rewrite the query from 'c1'::base36 > val
to val < 'c1'::base36
in order to be able to use the index.
The same is true for the negator.
1 2 3 4 5 6 |
|
Here NOT val > 'c1'::base36
is rewritten to val <= 'c1'::base36
.
And finally you can see that it would rewrite NOT 'c1'::base36 < val
to val <= 'c1'::base36
.
1 2 3 4 5 6 |
|
So while COMMUTATOR
and NEGATOR
clauses are not strictly required in a custom Postgres type definition, without them the above rewrites won’t be possible. Therefore, the respective queries won’t use the index and in most situations lose performance.
RESTRICT and JOIN
Luckily, we don’t need to write our own RESTRICT
function (see Line 54-55) and can use simply use this:
1 2 3 4 |
|
These are restriction selectivity estimation functions which give Postgres a hint on how many rows will satisfy a WHERE-clause given a constant as the right argument. If the constant is the left argument, we can flip it to the right using the commutator.
You may already know that Postgres collects some statistics of each table when you or the autovacuum daemon run an ANALYZE
. You can also take a look at these statistics on the pg_stats view.
1
|
|
All the estimation function does is to give a value between 0 and 1, indicating the estimated fraction of rows based on these statistics. This is quite important to know as typically the =
operator satisfies fewer rows than the <>
operator. Since you are relatively free in naming and defining your operators, you need to tell how they work.
If you really want to know what the estimation functions look like, take a look at the source code in src/backend/utils/adt/selfuncs.c. Disclaimer: your eyes might start bleeding.
So, it’s pretty great that we don’t need to write our own JOIN
selectivity estimation function. This one is for queries where an operator is used to join tables in the form table1.column1 OP table2.column2
, but it has essentially the same idea: it estimates how many rows will be returned by the operation to finally decide which of the possible plans (i.e. which join order) to use.
So if you have something like:
1 2 3 |
|
Here table3 has only a few rows, while table1 and table2 are really big. So it makes sense to first join table3, amass a few rows and then join the other tables.
HASHES and MERGES
For the equality operator, we also define the parameters HASHES
and MERGES
(Line 35). When we do this, we’re telling Postgres that it’s suitable to use this function for hash to respectively merge join operations. To make the hash join actually work, we also need to define a hash function and put both together in an operator class. You can read further in the PostgreSQL Documentation about the different Operator Optimization clauses.
More to come…
So far you’ve seen how to implement a basic data type using INPUT
and OUTPUT
functions. On top of this we added comparison operators by reusing Postgres internals. This allows us to order tables and use indexes.
However, if you followed the implementation on your computer step-by-step, you might find that the above mentioned EXPLAIN
command doesn’t really work.
1 2 3 4 5 6 7 |
|
That’s because right we just did the worst possible thing: in some situations, our code makes the whole server crash.
In the next post we’ll see how we can debug our code using LLDB, and how to avoid these errors with the proper testing.