PostgreSQL provides operators for comparing geometric data types, for example for computing whether two boxes overlap or whether one box contains another. Such operators are quick compared to similar calculations implemented using normal comparison operators, which I’ll demonstrate below. Here I show use of such geometric data types and operators for determining whether one segment of genomic sequence is contained within another.
Suppose we have a 391 row table called “oligo” containing the following information:
“Sequence” and “chromosome” are strings, but “coordinates” is of PostgreSQL geometric data type “box”. The coordinates show the (x, y) position of one corner and its opposite corner. Notice that I set the y values to zero, since we are only dealing with nucleotide positions on the chromosome. (For some reason the procedure I’m demonstrating here does not work with the “line segment” data type).
Similarly, suppose we have a 4,536,771 row table called “chromatin” containing:
We want to construct a list of oligo table entries whose coordinates are contained within a chromatin table entry’s coordinates, by chromosome. To do this we use PostgreSQL’s “@>” (contains) operator as follows:
SELECT o.sequence, chr.chromosome, CL.cell_line, m.method FROM oligo o, chromatin c, chromosome chr, cell_line CL, method m WHERE chr.id = c.chromosome_id AND o.chromosome_id = c.chromosome_id AND CL.id = c.cell_line_id AND m.id = c.method_id AND c.coordinates @> o.coordinates;
Assuming that we properly indexed the tables (see attached SQL code), this procedure takes about 11.5 seconds on an Amazon t2.micro instance to get through all the rows under consideration.
Now suppose instead that we used ANSI standard SQL’s comparison operators. We would need modified tables:
After indexing, we compare the rows using:
SELECT o.sequence, chr.chromosome, CL.cell_line, m.method FROM oligo_start_stop o, chromatin_start_stop c, chromosome chr, cell_line CL, method m WHERE chr.id = c.chromosome_id AND o.chromosome_id = c.chromosome_id AND CL.id = c.cell_line_id AND m.id = c.method_id AND c.start <= o.start AND c.stop >= o.stop;
This procedure now takes about 15.3 seconds.
If we run both methods 100 times and collect the query time statistics, we obtain the following boxplot:
The medians for the groups are a full 3.8 seconds apart, indicating that the PostgreSQL geometric “contains” operator performs significantly faster than the ANSI SQL comparison operators (p-value = 1.28e-34, Mann-Whitney U test).
Note on Indexing
To run the indexing commands in the attached code for a combination of an integer (chromosome ID) and a geometric box (genomic coordinates), you need to install the PostgreSQL extension “btree_gist”.