simulating RNA-seq read counts

The Challenge: I want to explore the statistics of RNA sequencing (RNA-seq) on next-generation sequencing (NGS) platforms in greater detail, so I thought I’d start by simulating read counts to experiment with. This post details how I constructed a simulated set of read counts, examines its concordance with the expected negative binomial distribution of the […]

synthetic biology: an emerging engineering discipline

In the last decade a new engineering discipline called “synthetic biology” has emerged. It differs from the science of biology in that it applies engineering strategies to the creation of cells that perform a desired task, such as the production of drugs or biofuels. It also differs from previous genetic engineering approaches by stressing the […]

rapidly extracting a subsequence from chromosome sequence data in Java

The Challenge: We have a text file containing the nucleotides of a chromosome, say human chromosome 11, and need to quickly extract a subsequence from the chromosome text given a nucleotide position and the number of subsequent nucleotides to include. The problem is that chromosome files are huge, e.g. 135 megabytes for chromosome […]

graph database for gene annotation

Lately I’ve been experimenting with graph databases using Neo4j and the Cypher query language. To get a feel for these tools, I created the following gene annotation network. The Cypher commands I used are discussed in this post, followed by a demonstration of querying the database. Creating the Graph Database: We are creating the following […]

examining mRNA complexity by annotation region using MapReduce

I became interested in how annotated mRNA regions (e.g., 5′ UTR, coding, and 3′ UTR) vary in information content, speculating that coding regions (CDS) of transcripts will be generally more complex than other regions due to their role in specifying protein recipes. Measuring sequence complexity using Shannon entropy validated this hypothesis, at least with regard […]
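As a rough illustration of the complexity measure mentioned above, here is a minimal Python sketch of per-symbol Shannon entropy for a nucleotide string. It is not the MapReduce code from the post; the function name `shannon_entropy` and the example sequences are assumptions made up for this sketch.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Return the Shannon entropy of seq in bits per symbol.

    Hypothetical helper for illustration; not the post's implementation.
    """
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    total = len(seq)
    # H = -sum(p * log2(p)) over the observed symbols
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Made-up example sequences: a low-complexity run versus a mixed sequence.
low_complexity = "AAAAAAAAAAAA"
mixed = "ACGTACGTACGT"
print(shannon_entropy(low_complexity), shannon_entropy(mixed))
```

For a four-letter nucleotide alphabet the value ranges from 0 bits (a single repeated base) to 2 bits (all four bases equally frequent), so regions can be compared on the same scale regardless of length.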

test driving the Seven Bridges Genomics bioinformatics platform

I recently examined the Seven Bridges Genomics (SBG) platform, building and running a short-read alignment pipeline. Overall, I am impressed by the software. Here I describe my test of the program and then report on my investigation of how it works. Test Drive: The test pipeline I devised consisted of two steps, FastQC analysis of […]

Excel mangles NCBI gene symbols

Using Microsoft’s Excel for bioinformatics work sucks, but sometimes a spreadsheet is the best format for communicating results to other scientists. The program’s default behavior mangles some NCBI gene symbols when you import them from a text file. Here is how to deal with it. Suppose you have the following list of gene symbols, and […]

RNAfold and sequence length

I’ve been looking for a way to compare RNAfold [1] dG results for two RNA sequences, where the two sequences differ in length. My initial thought was to simply divide the computed dG’s by sequence length (i.e., normalize by sequence length) and then compare the results. The analysis presented below shows why this won’t work. […]

comparing BLAST results by bit score ratio

I recently read that two separate BLAST alignments to the same reference sequence can be compared to each other by normalizing the alignments by the maximum bit score of the reference sequence BLASTed against itself [1]. In this procedure, the user first aligns the reference sequence to itself to find the maximum possible bit score, […]
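The normalization described above comes down to a single division, sketched here in Python. The function name `bit_score_ratio` and the bit scores in the example are hypothetical values chosen for illustration; they are not BLAST output from the post.

```python
def bit_score_ratio(alignment_bits: float, self_bits: float) -> float:
    """Normalize an alignment's bit score by the reference's self-alignment bit score.

    Hypothetical helper for illustration. self_bits is the bit score obtained by
    aligning the reference sequence to itself, i.e. the maximum possible score.
    """
    if self_bits <= 0:
        raise ValueError("self-alignment bit score must be positive")
    return alignment_bits / self_bits


# Made-up bit scores for two query alignments against the same reference.
reference_self_bits = 200.0
ratio_a = bit_score_ratio(150.0, reference_self_bits)
ratio_b = bit_score_ratio(90.0, reference_self_bits)
print(ratio_a, ratio_b)
```

Because both ratios share the same denominator, they can be compared directly, with values nearer 1 indicating an alignment closer to the reference's best possible score.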

werewolf transcriptome conjecture

Lycanthropy—the sudden transformation of individuals into wolf-human chimeras during full moon periods—remains one of the least understood medical conditions persisting today. Researchers find investigation of the phenomenon doubly confounded by social stigma (who wants to tell a scientist that they are a werewolf?) and sampling difficulty (how many werewolves will actually sit still for a […]