graph database for gene annotation

Lately I’ve been experimenting with graph databases using Neo4j and the Cypher query language. To get a feel for these tools, I created the following gene annotation network. The Cypher commands I used are discussed in this post, followed by a demonstration of querying the database.

Creating the Graph Database

We are creating the following graph:

[Image gd08: diagram of the gene annotation graph]

Here the genes with IDs 5663 (PSEN1) and 675 (BRCA2) connect to the species Homo sapiens and to the gene alias FAD, which both genes share. RefSeq “NM” transcripts connect to their respective genes.

In Cypher, we first create two nodes representing genes PSEN1 and BRCA2, along with a node representing the human species:

CREATE (gid5663:Gene {symbol: "PSEN1", id: 5663, full_name: "presenilin 1"})
CREATE (gid675:Gene {symbol: "BRCA2", id: 675, full_name: "breast cancer 2, early onset"})
CREATE (human:Species {taxonomy_id: 9606, id: "Homo sapiens"})

The Neo4j console replies that three nodes were created.

[Image gd01: Neo4j console output]

We then link each of the gene nodes we created to the species node. Each link takes two steps: a MATCH clause selects the two nodes to connect, and a CREATE clause then adds a relationship between them.

MATCH (a:Gene), (b:Species)
WHERE a.id = 5663 AND b.taxonomy_id = 9606
CREATE (a)-[r:SPECIES]->(b)
RETURN r

MATCH (a:Gene), (b:Species)
WHERE a.id = 675 AND b.taxonomy_id = 9606
CREATE (a)-[r:SPECIES]->(b)
RETURN r

The console shows each MATCH/CREATE query result:

[Image gd02: console output for the MATCH/CREATE queries]
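
Re-running either CREATE statement would add a second, duplicate SPECIES relationship. As a minimal sketch of one alternative (not the statement used for this post), the same link can be written with MERGE, which creates the relationship only if it does not already exist, and with the property filters moved inline into the MATCH pattern:

```cypher
// Illustrative, idempotent variant of the PSEN1 -> human link.
// Safe to run more than once: MERGE reuses an existing relationship.
MATCH (a:Gene {id: 5663}), (b:Species {taxonomy_id: 9606})
MERGE (a)-[r:SPECIES]->(b)
RETURN r
```

The inline `{id: 5663}` map is equivalent to the `WHERE a.id = 5663` filter used above; which form reads better is largely a matter of taste.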

We then create a gene alias node, which the gene nodes will connect to:

CREATE (FAD:GeneSymbolAlias {id: "FAD"})

[Image: gd03]

As with the species node, we now connect the genes to the gene alias node:

MATCH (a:Gene), (b:GeneSymbolAlias)
WHERE a.id = 5663 AND b.id = "FAD"
CREATE (a)-[r:ALIAS]->(b)
RETURN r

MATCH (a:Gene), (b:GeneSymbolAlias)
WHERE a.id = 675 AND b.id = "FAD"
CREATE (a)-[r:ALIAS]->(b)
RETURN r

The console again shows each MATCH/CREATE query result:

[Image gd04: console output for the MATCH/CREATE queries]
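
Because both genes link to the same alias node, the two statements can also be collapsed into one. The following is a sketch of that idea, not what was run here: an IN list selects both genes, and MERGE is used so that running it after the statements above does not create duplicate relationships.

```cypher
// Link every gene in the list to the FAD alias in a single statement.
MATCH (a:Gene), (b:GeneSymbolAlias {id: "FAD"})
WHERE a.id IN [5663, 675]
MERGE (a)-[r:ALIAS]->(b)
RETURN a.symbol, type(r), b.id
```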

We next create three nodes to represent the three RefSeq transcripts associated with our two genes:

CREATE (NM_000059:RefSeqTranscript {id: "NM_000059", version: 3})
CREATE (NM_000021:RefSeqTranscript {id: "NM_000021", version: 3})
CREATE (NM_007318:RefSeqTranscript {id: "NM_007318", version: 2})

[Image: gd05]

Finally, we connect these transcripts to their respective genes:

MATCH (a:Gene), (b:RefSeqTranscript)
WHERE a.id = 675 AND b.id = "NM_000059"
CREATE (a)-[r:TRANSCRIPT]->(b)
RETURN r

MATCH (a:Gene), (b:RefSeqTranscript)
WHERE a.id = 5663 AND b.id = "NM_000021"
CREATE (a)-[r:TRANSCRIPT]->(b)
RETURN r

MATCH (a:Gene), (b:RefSeqTranscript)
WHERE a.id = 5663 AND b.id = "NM_007318"
CREATE (a)-[r:TRANSCRIPT]->(b)
RETURN r

[Image: gd06]
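
The three statements differ only in the gene and transcript IDs, so the pattern can be driven from a list instead. This is a hedged sketch that assumes a Neo4j version whose Cypher supports UNWIND; the `pair` variable and the `gene` and `transcript` map keys are names made up for this example.

```cypher
// Each map pairs a gene id with one of its transcript ids.
UNWIND [
  {gene: 675,  transcript: "NM_000059"},
  {gene: 5663, transcript: "NM_000021"},
  {gene: 5663, transcript: "NM_007318"}
] AS pair
MATCH (a:Gene {id: pair.gene}), (b:RefSeqTranscript {id: pair.transcript})
MERGE (a)-[r:TRANSCRIPT]->(b)   // no duplicate if the link already exists
RETURN a.symbol, b.id
```

The same shape scales to longer lists, which is useful once the gene-to-transcript pairs come from a file rather than being typed by hand.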

We can view the whole network with:

MATCH (n)-[r]-()
RETURN n, r

[Image gd07: the whole network]
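
Note that this pattern only returns nodes that have at least one relationship, and because the relationship is undirected, each one is matched from both of its ends. That is fine for this graph, where every node is connected. As a sketch of an alternative for graphs that may contain isolated nodes, OPTIONAL MATCH keeps every node and lists each relationship once, from its start node:

```cypher
// Return every node, with its outgoing relationships where it has any.
// For nodes without outgoing relationships, r is null.
MATCH (n)
OPTIONAL MATCH (n)-[r]->()
RETURN n, r
```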

Queries

We can start with a gene and trace its transcripts, returning the transcript nodes:

MATCH (:Gene {id: 5663})-[:TRANSCRIPT]->(t) RETURN t

[Image: 09]
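
Returning whole nodes suits the graph view in the console. For tabular output, a variant of the same traversal can return selected properties instead. This is a sketch; the column aliases `gene`, `transcript` and `version` are chosen for this example only.

```cypher
// Same traversal as above, but returning properties as named columns.
MATCH (g:Gene {id: 5663})-[:TRANSCRIPT]->(t:RefSeqTranscript)
RETURN g.symbol AS gene, t.id AS transcript, t.version AS version
ORDER BY transcript
```

With the data created earlier, this should list the two PSEN1 transcripts, NM_000021 and NM_007318.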

Similarly, we can start with a transcript and trace through its gene to the transcript’s species:

MATCH (:RefSeqTranscript {id: "NM_000021"})-[:TRANSCRIPT]-(gene)-[:SPECIES]-(species) RETURN species

[Image: 10]
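
The alias node supports the same kind of traversal in the other direction. As a further sketch using only the nodes and relationships created above, we can start from the alias FAD and follow the ALIAS relationships backwards to find every gene that shares it:

```cypher
// Start at the alias and walk the ALIAS relationships back to the genes.
MATCH (:GeneSymbolAlias {id: "FAD"})<-[:ALIAS]-(g:Gene)
RETURN g.symbol AS symbol, g.id AS gene_id
```

For this graph, the result should contain both PSEN1 and BRCA2, since both genes were linked to the FAD alias.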
