fast genomic coordinate comparison using PostgreSQL’s geometric operators

PostgreSQL provides operators for comparing geometric data types, for example for computing whether two boxes overlap or whether one box contains another. Such operators are quick compared to similar calculations implemented using normal comparison operators, which I’ll demonstrate below. Here I show use of such geometric data types and operators for determining whether one segment of genomic sequence is contained within another.

Suppose we have a 391 row table called “oligo” containing the following information:

oligo_coordinates

“Sequence” and “chromosome” are strings, but “coordinates” is of PostgreSQL geometric data type “box”. The coordinates show the (x, y) position of one corner and its opposite corner. Notice that I set the y values to zero, since we are only dealing with nucleotide positions on the chromosome. (For some reason the procedure I’m demonstrating here does not work with the “line segment” data type).

Similarly, suppose we have a 4,536,771 row table called “chromatin” containing:

chromatin_coordinates

We want to construct a list of oligo table entries whose coordinates are contained within a chromatin table entry’s coordinates, by chromosome. To do this we use PostgreSQL’s “@>” (contains) operator as follows:

SELECT o.sequence, chr.chromosome, CL.cell_line, 
m.method FROM oligo o, chromatin c, 
chromosome chr, cell_line CL, method m 
WHERE chr.id = c.chromosome_id AND 
o.chromosome_id = c.chromosome_id AND 
CL.id = c.cell_line_id AND 
m.id = c.method_id 
AND c.coordinates @> o.coordinates;

Assuming that we properly indexed the tables (see attached SQL code), this procedure takes about 11.5 seconds on an Amazon t2.micro instance to get through all the rows under consideration.

Now suppose instead that we used ANSI standard SQL’s comparison operators. We would need modified tables:

oligo_ss

chromatin_ss

After indexing, we compare the rows using:

SELECT o.sequence, chr.chromosome, CL.cell_line, 
m.method FROM oligo_start_stop o, 
chromatin_start_stop c, chromosome chr, cell_line CL, 
method m WHERE chr.id = c.chromosome_id AND 
o.chromosome_id = c.chromosome_id AND 
CL.id = c.cell_line_id AND 
m.id = c.method_id AND 
c.start <= o.start AND c.stop >= o.stop;

This procedure now takes about 15.3 seconds.

If we run both methods 100 times and collect the query time statistics, we obtain the following boxplot:

boxplot

The medians for the groups are a full 3.8 seconds apart, indicating that the PostgreSQL geometric “contains” operator performs significantly faster than the ANSI SQL comparison operators (p-value = 1.28e-34, Mann-Whitney U test).

Note on Indexing

To run the indexing commands in the attached code for a combination of an integer (chromosome ID) and a geometric box (genomic coordinates), you need to install the PostgreSQL extension “btree_gist”.

Code

postgres_box.tar

4 thoughts on “fast genomic coordinate comparison using PostgreSQL’s geometric operators

  1. Ei Luiz,valeu pela visita e comentário.Copiar, colar, retwittar sem dúvida também acrescenta no processo, faz a diferença e pode sim trazer significados novos à ocasião, entretanto acredito que nisso deva ter também doses de pr§oÃçu£o.PrdduÃoão pode ser também a opinião da pessoal, a análise crítica dela, os pontos de vistas feitos pela leitura dela acompanhada do copiado/citado.Um abraço

  2. Hello there. I wanted to drop you a fast note in order to express my personal thanks. I have as well been following your website for a month or so and have picked up a lot of vast information along with facts as well as appreciated the way you’ve organised your own site. I am attempting to operate my personal web page but I consider it is way very standard along with also I desire to concentrate further on smaller topics. Being all things to all people is not all that its cracked up to be.

Leave a Reply

Your email address will not be published.