graph database for heterogeneous biological data

To assist with a project I’m working on, I recently implemented a substantial portion of DisGeNET as a graph database. Furthermore, I added MeSH, OMIM, Entrez, and GO into the database to facilitate linking of data between these sources. Here I briefly describe these data sources, describe graph databases, and then show how use of a graph database to link these disparate data sources proves more intuitive than use of a relational database.

Biological Data Sources

DisGeNET [1] is an effort to relate diseases to genes and clusters of genes. Its creators mined the literature to provide evidence-based gene/disease associations, and scored each association by the strength of its evidence. It is probably the best source of disease-to-gene associations available to date. DisGeNET stores its data in a relational SQLite database. Diseases are identified according to the Unified Medical Language System (UMLS) [2], and many are classified by MeSH term.

MeSH (Medical Subject Headings) [3] is a vocabulary for classifying PubMed articles. It is useful for associating a PubMed article with a disease. For example, articles classified under MeSH term “C10” correspond to nervous system diseases.

OMIM (Online Mendelian Inheritance in Man) [4] is a compilation of genetic disorders and related genes. It has some crossover with DisGeNET and references MeSH terms.

Entrez Gene [5] is a comprehensive gene description source. Member genes’ unique identifiers are referenced by DisGeNET.

GO (Gene Ontology) [6] describes molecular functions, biological processes, and cellular components, and relates its terms to one another in a hierarchy (a directed acyclic graph), making it very amenable to implementation in a graph database. GO terms and Entrez gene identifiers are mapped many-to-many: a term can annotate several genes, and a gene can carry several terms.

By linking all these sources together into one database, one can answer questions like: “What nervous system diseases involve the ‘flavin adenine dinucleotide binding’ molecular function?”

Graph Databases

Graph databases store data as nodes and edges, and are one class of NoSQL database. Nodes can represent different kinds of objects, such as genes and MeSH terms, and can carry properties such as a gene name. They can be queried and indexed by object type or property value. Edges connect nodes, for example to record an association between a disease node and a gene node. Edges may be labeled to distinguish edge types, and can carry properties of their own, such as an evidence score for an association between a disease and a gene. As with nodes, queries can filter on edge labels and properties. A query is expressed as a path of nodes and edges to follow.
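To make the node/edge/property vocabulary concrete, here is a minimal in-memory sketch in Python. It is not how Neo4j stores anything; the class names, the edge labels ("CLASSIFIES", "ASSOCIATED_WITH"), the "score" property name, and the example rows are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                  # object type, e.g. "GENE" or "UMLS"
    props: dict = field(default_factory=dict)   # e.g. {"symbol": "GENE_A"}


@dataclass
class Edge:
    source: Node
    target: Node
    label: str                                  # edge type (hypothetical names below)
    props: dict = field(default_factory=dict)   # e.g. {"score": 0.8}


class PropertyGraph:
    def __init__(self):
        self.nodes = []
        self.edges = []

    def add_node(self, label, **props):
        node = Node(label, props)
        self.nodes.append(node)
        return node

    def add_edge(self, source, target, label, **props):
        edge = Edge(source, target, label, props)
        self.edges.append(edge)
        return edge

    def find(self, label, **props):
        """Return nodes of one type whose properties match all given values."""
        return [n for n in self.nodes
                if n.label == label
                and all(n.props.get(k) == v for k, v in props.items())]

    def neighbors(self, node, edge_label, min_score=0.0):
        """Follow edges of one type from a node, in either direction.

        Edges whose "score" property is below min_score are skipped;
        an edge without a score is treated as having score 0.0.
        """
        result = []
        for e in self.edges:
            if e.label != edge_label or e.props.get("score", 0.0) < min_score:
                continue
            if e.source is node:
                result.append(e.target)
            elif e.target is node:
                result.append(e.source)
        return result


# Placeholder data, not real annotations.
g = PropertyGraph()
mesh = g.add_node("MeSH", id="C10")
disease = g.add_node("UMLS", name="example disease")
gene_a = g.add_node("GENE", symbol="GENE_A")
gene_b = g.add_node("GENE", symbol="GENE_B")
g.add_edge(mesh, disease, "CLASSIFIES")
g.add_edge(disease, gene_a, "ASSOCIATED_WITH", score=0.8)
g.add_edge(disease, gene_b, "ASSOCIATED_WITH", score=0.1)

# Path query: MeSH term -> diseases -> genes with a strong association.
strong_genes = [
    gene.props["symbol"]
    for m in g.find("MeSH", id="C10")
    for d in g.neighbors(m, "CLASSIFIES")
    for gene in g.neighbors(d, "ASSOCIATED_WITH", min_score=0.5)
]
print(strong_genes)
```

The final comprehension is the whole point: a query is a walk along typed nodes and edges, with filters applied to properties at each step.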

The image below shows an example of a graph database structure. It shows the diseases (in blue) classified according to the MeSH term “occupational diseases” (magenta) and their associated genes (green):

[Image: DisGeNet_graph]

A tabular representation of this query is also available:

[Image: DisGeNet_table]

The tabular representation shows all the properties assigned to each queried node and edge.

As another example, here are the GO terms, OMIM term, species, and DisGeNET pathways for gene “AP2A1”:

[Image: GO_2]

Neo4j

Neo4j [7] is one of the most widely used graph databases. It is available as an enterprise edition and a free community edition. One can query the database through a web interface (shown above) or through a REST API that returns results as JSON. There are also libraries for languages such as Python that provide programmatic access to the platform. (For example, I used py2neo extensively when creating the database described in this post.)

Neo4j provides the “Cypher” language for querying the database. An example of a Cypher query tracing diseases in MeSH category “C10”, their genes, and their GO terms is:

MATCH (m:MeSH_Tree_Number {id: 'C10'})-[mu]-(u:UMLS)-[ug]-(g:GENE)-[ggo]-(go:GO) RETURN m, u, g, go, mu, ug, ggo LIMIT 100

The output of this query, limited to the first 100 result rows (LIMIT counts rows, i.e. matched paths, not individual nodes), is:

[Image: Cypher]
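The same Cypher query can be issued from Python. The sketch below is a hedged example rather than the code used to build this database: it assumes a py2neo version whose Graph object exposes run(query, **parameters) returning a cursor with a data() method, and a Neo4j version that accepts the $name parameter syntax. The function name, parameter names, and connection details are placeholders.

```python
# Parameterized form of the Cypher query; $mesh_id and $limit are supplied at run time.
QUERY = """
MATCH (m:MeSH_Tree_Number {id: $mesh_id})-[mu]-(u:UMLS)-[ug]-(g:GENE)-[ggo]-(go:GO)
RETURN m, u, g, go, mu, ug, ggo
LIMIT $limit
"""


def trace_mesh_category(graph, mesh_id, limit=100):
    """Run the path query and return one dict per result row.

    `graph` is assumed to behave like py2neo's Graph: run(query, **params)
    returns a cursor whose data() method gives a list of dicts keyed by
    the names in the RETURN clause.
    """
    return graph.run(QUERY, mesh_id=mesh_id, limit=limit).data()


# Usage against a live server (URI and credentials are placeholders):
#   from py2neo import Graph
#   graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
#   rows = trace_mesh_category(graph, "C10")
```

Passing the MeSH tree number as a parameter instead of pasting it into the query string lets one query text serve every category.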

Why Not a Regular Relational Database?

I originally built a relational database containing about half of the information described above, and found it difficult to manage the complex joins needed to bridge the disparate data sources. Tracing nodes and edges in a graph query proved a much easier way to obtain the same information. To illustrate the complexity of the relational implementation, here is an image of the relational data model I abandoned. (Admittedly, it is a bit overdesigned, but I like the high degree of normalization):

[Image: disease]
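For a sense of the join burden, here is a small sketch using Python's built-in sqlite3 module. The schema is hypothetical: it is neither DisGeNET's actual SQLite schema nor the model pictured above, and the rows are placeholders. It answers the earlier example question (diseases under MeSH "C10" that involve the "flavin adenine dinucleotide binding" function) relationally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mesh_term    (mesh_id TEXT PRIMARY KEY);
CREATE TABLE disease      (disease_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE disease_mesh (disease_id INTEGER, mesh_id TEXT);
CREATE TABLE gene         (gene_id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE disease_gene (disease_id INTEGER, gene_id INTEGER, score REAL);
CREATE TABLE go_term      (go_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene_go      (gene_id INTEGER, go_id INTEGER);

-- Placeholder rows, not real annotations
INSERT INTO mesh_term VALUES ('C10'), ('C04');
INSERT INTO disease VALUES (1, 'example disease 1'), (2, 'example disease 2');
INSERT INTO disease_mesh VALUES (1, 'C10'), (2, 'C04');
INSERT INTO gene VALUES (10, 'GENE_A'), (11, 'GENE_B');
INSERT INTO disease_gene VALUES (1, 10, 0.8), (2, 11, 0.3);
INSERT INTO go_term VALUES (100, 'flavin adenine dinucleotide binding');
INSERT INTO gene_go VALUES (10, 100), (11, 100);
""")

# Every hop in the graph path becomes one or two joins through a link table.
rows = conn.execute("""
SELECT DISTINCT d.name
FROM mesh_term m
JOIN disease_mesh dm ON dm.mesh_id = m.mesh_id
JOIN disease d       ON d.disease_id = dm.disease_id
JOIN disease_gene dg ON dg.disease_id = d.disease_id
JOIN gene g          ON g.gene_id = dg.gene_id
JOIN gene_go gg      ON gg.gene_id = g.gene_id
JOIN go_term t       ON t.go_id = gg.go_id
WHERE m.mesh_id = ? AND t.name = ?
""", ("C10", "flavin adenine dinucleotide binding")).fetchall()
print(rows)
```

Six joins across seven tables express what a single MATCH path expresses in Cypher, and each additional data source adds more link tables to keep straight.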

Conclusion

The graph database model is an excellent way of storing and querying diverse biological annotations.

References

  1. http://www.disgenet.org/web/DisGeNET/menu
  2. http://www.ncbi.nlm.nih.gov/pubmed/14681409
  3. http://www.ncbi.nlm.nih.gov/mesh/
  4. http://www.ncbi.nlm.nih.gov/omim
  5. http://www.ncbi.nlm.nih.gov/gene/
  6. http://geneontology.org/
  7. http://neo4j.com/

 
