examining mRNA complexity by annotation region using MapReduce

I became interested in how annotated mRNA regions (e.g., 5′ UTR, coding, and 3′ UTR) vary in information content, speculating that coding regions (CDS) of transcripts would generally be more complex than other regions due to their role in specifying protein recipes. Measuring sequence complexity using Shannon entropy validated this hypothesis, at least with regard […]

test driving the Seven Bridges Genomics bioinformatics platform

I recently examined the Seven Bridges Genomics (SBG) platform, building and running a short-read alignment pipeline. Overall, I am impressed by the software. Here I describe my test of the program and then report on my investigation of how it works. Test Drive: The test pipeline I devised consisted of two steps, FastQC analysis of […]

listing an Amazon S3 directory’s contents in Java

After much struggle, I have figured out how to list an Amazon S3 directory’s contents in Java using the AWS SDK. Here is how to do it: First, you need to import the following libraries: Then, in your main function (or elsewhere in your code) you need: Be sure to change the “prefix” variable to […]

test driving Amazon Web Services’ Elastic MapReduce

Hadoop provides software infrastructure for running MapReduce tasks, but taking full advantage of it requires substantial setup time and access to a compute cluster. Amazon’s Elastic MapReduce (EMR) solves these problems by delivering pre-configured Hadoop virtual machines running in the cloud for only the time they are required, and billing only for the computation minutes […]

the first Big Data recession

The “Great Recession” of 2007–2009 may be the first “Big Data” recession, i.e., the first recession that we can examine using the vast information delivered by the advent of Big Data. Certainly the next recession will be studied through that lens. To test whether new data is available that can be cast in an economic […]

comprehensive, anticipatory design in the age of Big Data

Buckminster (Bucky) Fuller wrote in the 1950s that a strategy of “comprehensive anticipatory design science” [1] was required to create technology and systems suitable for sustainable living and sustainable business. This post examines what Bucky meant by comprehensive anticipatory design and then explores how Big Data can play a role in its deployment. Bucky’s vision […]

using Hadoop to examine county-level industrial diversity

In a previous post, I computed each U.S. county’s industrial diversity from the 2009 County Business Patterns data published by the U.S. Census Bureau. The diversity calculation made use of Shannon’s information entropy equation, which is similarly used by ecologists to calculate species diversity for a region. Here I perform the same calculation using Hadoop, […]
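As a rough illustration of the kind of calculation described above, here is a minimal Python sketch of Shannon entropy computed per county in a map/reduce style. It is not the code from the post: the record layout `(county, industry, establishments)`, the function names, the sample counties, and the choice of base-2 logarithms are all assumptions made for the example, and the grouping step only imitates in memory what Hadoop's shuffle would do across a cluster.

```python
import math
from collections import defaultdict


def shannon_entropy(counts):
    """Shannon entropy, in bits, of a collection of non-negative counts."""
    counts = [c for c in counts if c > 0]  # zero counts contribute nothing
    total = sum(counts)
    if total == 0:
        return 0.0
    # H = sum(p * log2(1 / p)) with p = c / total
    return sum((c / total) * math.log2(total / c) for c in counts)


def map_record(record):
    """Map step: emit (county, count) for one hypothetical input record."""
    county, _industry, establishments = record
    yield county, establishments


def reduce_county(county, values):
    """Reduce step: turn all counts for one county into a diversity score."""
    return county, shannon_entropy(values)


def county_diversity(records):
    """Run map, group by key, and reduce in memory.

    Assumes each (county, industry) pair appears in at most one record.
    """
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            grouped[key].append(value)
    return dict(reduce_county(k, v) for k, v in grouped.items())


# Hypothetical data: one evenly mixed county and one single-industry county.
records = [
    ("County A", "retail", 25),
    ("County A", "manufacturing", 25),
    ("County A", "health care", 25),
    ("County A", "farming", 25),
    ("County B", "mining", 100),
]

diversity = county_diversity(records)
```

With this made-up input, the evenly mixed county scores 2.0 bits (four equally likely industries) and the single-industry county scores 0.0. Ecologists often use natural logarithms instead, which rescales every score by the same constant factor and leaves the ranking of counties unchanged.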

Data scientist makes peace with web programming

True to my hacker roots, I prefer command line interfaces (CLIs) to graphical user interfaces (GUIs). That sentiment compounds when the GUI is delivered through a web browser. However, I recently—finally—accepted the fact that the web browser is the most important user interface out there, and the only user interface that most scientists will bother […]