fast genomic coordinate comparison using PostgreSQL’s geometric operators

PostgreSQL provides operators for comparing geometric data types, for example for computing whether two boxes overlap or whether one box contains another. Such operators are quick compared to similar calculations implemented using normal comparison operators, which I’ll demonstrate below. Here I show use of such geometric data types and operators for determining whether one segment of genomic sequence is contained within another.

Suppose we have a 391 row table called “oligo” containing the following information:

oligo_coordinates

“Sequence” and “chromosome” are strings, but “coordinates” is of PostgreSQL geometric data type “box”. The coordinates show the (x, y) position of one corner and its opposite corner. Notice that I set the y values to zero, since we are only dealing with nucleotide positions on the chromosome. (For some reason the procedure I’m demonstrating here does not work with the “line segment” data type).

Similarly, suppose we have a 4,536,771 row table called “chromatin” containing:

chromatin_coordinates

We want to construct a list of oligo table entries whose coordinates are contained within a chromatin table entry’s coordinates, by chromosome. To do this we use PostgreSQL’s “@>” (contains) operator as follows:

SELECT o.sequence, chr.chromosome, CL.cell_line, 
m.method FROM oligo o, chromatin c, 
chromosome chr, cell_line CL, method m 
WHERE chr.id = c.chromosome_id AND 
o.chromosome_id = c.chromosome_id AND 
CL.id = c.cell_line_id AND 
m.id = c.method_id 
AND c.coordinates @> o.coordinates;

Assuming that we properly indexed the tables (see attached SQL code), this procedure takes about 11.5 seconds on an Amazon t2.micro instance to get through all the rows under consideration.

Now suppose instead that we used ANSI standard SQL’s comparison operators. We would need modified tables:

oligo_ss

chromatin_ss

After indexing, we compare the rows using:

SELECT o.sequence, chr.chromosome, CL.cell_line, 
m.method FROM oligo_start_stop o, 
chromatin_start_stop c, chromosome chr, cell_line CL, 
method m WHERE chr.id = c.chromosome_id AND 
o.chromosome_id = c.chromosome_id AND 
CL.id = c.cell_line_id AND 
m.id = c.method_id AND 
c.start <= o.start AND c.stop >= o.stop;

This procedure now takes about 15.3 seconds.

If we run both methods 100 times and collect the query time statistics, we obtain the following boxplot:

boxplot

The medians for the groups are a full 3.8 seconds apart, indicating that the PostgreSQL geometric “contains” operator performs significantly faster than the ANSI SQL comparison operators (p-value = 1.28e-34, Mann-Whitney U test).

Note on Indexing

To run the indexing commands in the attached code for a combination of an integer (chromosome ID) and a geometric box (genomic coordinates), you need to install the PostgreSQL extension “btree_gist”.

Code

postgres_box.tar

Post Author: badassdatascience

4 thoughts on “fast genomic coordinate comparison using PostgreSQL’s geometric operators

    Dolly

    (January 6, 2017 - 2:12 pm)

    Ei Luiz,valeu pela visita e comentário.Copiar, colar, retwittar sem dúvida também acrescenta no processo, faz a diferença e pode sim trazer significados novos à ocasião, entretanto acredito que nisso deva ter também doses de pr§oÃçu£o.PrdduÃoão pode ser também a opinião da pessoal, a análise crítica dela, os pontos de vistas feitos pela leitura dela acompanhada do copiado/citado.Um abraço

    This is gorgeous and i so love the name, i hope that you feel better soon, it can really get you down, sending you big hugs ((( )))) Pops x xx x

    pou hack for android download

    (February 14, 2017 - 3:35 pm)

    Hello there. I wanted to drop you a fast note in order to express my personal thanks. I have as well been following your website for a month or so and have picked up a lot of vast information along with facts as well as appreciated the way you’ve organised your own site. I am attempting to operate my personal web page but I consider it is way very standard along with also I desire to concentrate further on smaller topics. Being all things to all people is not all that its cracked up to be.

    bau kredit rechner nutzen

    (February 16, 2017 - 2:30 pm)

    MP: Could this suggest that there is "pervasive unexamined gender bias” against women in computer science?This issue is under constant examination, I assure you. Nowhere more so than across town at Kettering.-Adam

Leave a Reply to Dolly Cancel reply

Your email address will not be published.