gene annotation database with MongoDB

After reading Datanami’s recent post “9 Must-Have Skills to Land Top Big Data Jobs in 2015” [1], I decided to round out my NoSQL knowledge by learning MongoDB. I have previously reported NoSQL work with Neo4j on this blog, where I discussed building a gene annotation graph database [2]. Here I build a similar gene annotation database, but using the MongoDB platform and its design patterns.

MongoDB

MongoDB is a database system that stores documents made up of key-value pairs. The values may themselves be other documents of key-value pairs, or arrays, resulting in an overall document format similar to JSON. These JSON-like documents are then compiled into sets called “collections”, which resemble tables in relational databases (typically documents describing similar things are grouped within the same collection). Unlike relational databases, however, is the fact that the documents in a collection do not have to have exactly the same keys (fields). Like a relational database table, a collection may be indexed by one or more of its keys, or by a key of embedded documents within the documents of the collection. Additionally, two collections may be matched on mutual identifiers, similar to a join in a relational database, but the matching is done on the client side rather than on the server.

These concepts will be illustrated by the example case shown below: a gene annotation database containing gene information for 15079 species, along with RefSeq information for each gene.

Gene Annotation Database

I first downloaded the “gene_info” and “gene2refseq” files from NCBI’s FTP site [3] and inserted them into MongoDB using the Python code appended to the bottom of this post. This created two collections, “gene_collection” which stores gene information–one gene per document–and “refseq_collection” which stores protein/transcript information–one RefSeq entry per document. Each document in the RefSeq collection points to the gene document that the RefSeq sequences correspond to.

Searching the gene collection by NCBI Entrez gene ID 675 yields a full gene document:

675_full_description

Here we can see that MongoDB assigned an internal ID “_id” to the document, which we will use later to reference it from the RefSeq documents. We see standard key-value pairs such as “GeneID” which contains an integer value specifying the Entrez gene ID and “type_of_gene” which contains a string describing whether the gene is known to code for a protein. For the key “dbXrefs”, notice that its value is an embedded document that contains its own key-value pairs linking the gene to other database identifiers. Keys “Synonyms” and “Other_designations” have arrays as values.

Searching the same collection by NCBI Entrez gene ID 544203 yields a gene document with less keys:

544203_short_description

This shows how documents within a collection need not share the same keys. It is helpful to have some of the same keys to have a meaningful data model design, but by not forcing every document to have the same schema one avoids the heavy use of null values often found in relational database tables.

One can search a collection by a top-level key as demonstrated above. You can also search for documents with an array containing an element, e.g.,

FAD

In this case we searched for human genes with “FAD” contained in their documents’ “Synonyms” array and listed only the IDs and symbols from the resulting documents.

Finally, we can search by the key of an embedded document:

ENSG00000080815

Here we searched for the Ensembl ID “ENSG00000080815” in the documents embedded under the key “dbXrefs”.

Indexing

Like a relational database system, MongoDB enables users to index collections to improve query performance. However, unlike a relational database, the indices need not only be on the top-level keys, they can also be built from keys inside embedded documents or built from array contents.

The following table shows the performance improvement (in query times) between an indexed and non-indexed version of this gene annotation database:

query_times_POST_CROPPED

Joins

MongoDB does not allow server-side joins between collections like those found in relational databases. Instead, collections may be matched on the client side by an application through use of a common reference. For example, in the image below we see that gene 5663 has an internal “_id” of “54b067d44f2ff124afc15cbf” in the gene collection. When we loaded the RefSeq data into the RefSeq collection, we matched each RefSeq entry to their respective genes through the “gene_document_id” key, which points to the corresponding “_id” in the gene collection. The ID and reference are highlighted in green in the image below:

references_POST_CROPPED

The following Python code shows how to implement a client-side match by these identifiers:

# useful libraries
from pymongo import MongoClient

# connect to MongoDB
client = MongoClient('localhost', 27017)
db = client.gene
refseq_collection = db.refseq_collection
gene_collection = db.gene_collection

# find gene 5663
gene = gene_collection.find_one({'GeneID' : 5663})
gene_document_id = gene['_id']

print
print 'Showing RefSeq sequences for gene 5663:'
print
print '\t', 'RNA', '\t', '\t', 'Genomic', '\t', 'RefSeq Collection Document ID'

# find RefSeq data for gene (by reference)
refseq = refseq_collection.find({'gene_document_id' : gene_document_id})
for r in refseq:
    if r.has_key('RNA_nucleotide_accession_version') and r.has_key('genomic_nucleotide_accession_version'):
        print '\t', r['RNA_nucleotide_accession_version'], '\t', r['genomic_nucleotide_accession_version'], '\t', r['_id']

print

Running this code shows all the RefSeq RNA and genomic nucleotide sequences associated with gene 5663:

manual_references

Conclusion

I have not shown all the features of MongoDB using this gene annotation database example. For instance I left out discussion of aggregation, scripting, and scalability across multiple servers. However, I believe I have shown enough to demonstrate knowledge of MongoDB and to demonstrate that it is a viable alternative to traditional relational databases for Big Data applications.

Code

Python code for loading NCBI’s “gene_info” table [3] into MongoDB is:

# useful libraries
import datetime
from pymongo import MongoClient

# connect to MongoDB
client = MongoClient('localhost', 27017)
db = client.gene
collection = db.gene_collection

# get header
f = open('data/gene_info')
for line in f:
    header = line.strip().split(' ')[1:16]
    break
f.close()

# load data
f = open('data/gene_info')
for line in f:
    line = line.strip()

    # remove commented out lines
    if line[0] == '#':  continue

    # deal with missing values
    split_line_with_dashes = line.split('\t')
    split_line = []
    for s in split_line_with_dashes:
        if s == '-':
            split_line.append(None)
        else:
            split_line.append(s)

    # specify data types of values
    split_line[0] = int(split_line[0])
    split_line[1] = int(split_line[1])
    year = int(split_line[14][0:4])
    month = int(split_line[14][4:6])
    day = int(split_line[14][6:8])
    split_line[14] = datetime.datetime(year, month, day, 0, 0)

    # split the aliases
    if split_line[4] != None:
        split_line[4] = split_line[4].replace(' ', '').split('|')

    # split the other designations
    if split_line[13] != None:
        split_line[13] = split_line[13].split('|')

    # split the DB cross references
    if split_line[5] != None:
        db_cross_refs = split_line[5].replace(' ', '').split('|')
        cross_refs = {}
        for dcr in db_cross_refs:
            db = dcr.split(':')[0]
            ref = dcr.split(':')[1]
            cross_refs[db] = ref
        split_line[5] = cross_refs

    # build document
    document = {}
    for h, s in zip(header, split_line):
        if s != None:
            document[h] = s

    # insert document into MongoDB
    collection.insert(document)

f.close()

Python code for loading NCBI’s “gene2refseq” table [3] into MongoDB is:

# useful libraries
from pymongo import MongoClient

# connect to MongoDB
client = MongoClient('localhost', 27017)
db = client.gene
collection = db.refseq_collection
gene_collection = db.gene_collection

# get header
f = open('data/gene2refseq')
for line in f:
    header = line.strip().split(' ')[1:17]
    header = [h.replace('.', '_') for h in header]
    break
f.close()

# load data
count = 0
f = open('data/gene2refseq')
for line in f:
    line = line.strip()

    # remove commented out lines
    if line[0] == '#':  continue

    # deal with missing values
    split_line_with_dashes = line.split('\t')
    split_line = []
    for i, s in enumerate(split_line_with_dashes):
        if s == '-' and header[i] != 'orientation':
            split_line.append(None)
        else:
            split_line.append(s)

    # convert data types
    if split_line[0] != None:  split_line[0] = int(split_line[0])
    if split_line[1] != None:  split_line[1] = int(split_line[1])
    if split_line[4] != None:  split_line[4] = int(split_line[4])
    if split_line[6] != None:  split_line[6] = int(split_line[6])
    if split_line[8] != None:  split_line[8] = int(split_line[8])
    if split_line[9] != None:  split_line[9] = int(split_line[9])
    if split_line[10] != None:  split_line[10] = int(split_line[10])
    if split_line[14] != None:  split_line[14] = int(split_line[14])

    # build document
    document = {}
    for h, s in zip(header, split_line):
        if s != None:
            document[h] = s

    # get ID of corresponding gene document
    gene_id = split_line[1]
    gene = gene_collection.find_one({'GeneID' : gene_id})
    document['gene_document_id'] = gene['_id']

    # remove 'GeneID' from document
    del document['GeneID']
    
    # insert document into MongoDB
    collection.insert(document)

f.close()

References

  1. http://www.datanami.com/2015/01/07/9-must-skills-land-top-big-data-jobs-2015/
  2. http://badassdatascience.com/2014/07/11/graph-database-for-gene-annotation/
  3. ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/

2 thoughts on “gene annotation database with MongoDB

  1. Hi
    Fantastic read but I guess there is a bug in your Python code for loading NCBI’s “gene_info” table
    Replace header = line.strip().split(‘ ‘)[1:16]
    with header = line.strip().split(‘\t’)[0:15]

Leave a Reply

Your email address will not be published.