sending an attachment with Amazon SES using Java

Sometimes data scientists need to write software that sends out automated e-mails, and sometimes those e-mails must carry attachments. Here is how to do so in Java using Amazon’s Simple Email Service (SES), which is an inexpensive outbound email service built on Amazon’s cloud infrastructure. Note: Be sure to log into Amazon SES and verify […]

rapidly extracting a subsequence from chromosome sequence data in Java

The Challenge: We have a text file containing the nucleotides of a chromosome, say human chromosome 11, and need to be able to quickly extract a subsequence from the chromosome text given a nucleotide position and number of subsequent nucleotides to include. The problem is that chromosome files are huge, e.g. 135 megabytes for chromosome […]
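The excerpt above is cut off before the post's own solution, so the following is only a minimal sketch of one common approach, not necessarily the one the post describes. It assumes (hypothetically) that the file holds nothing but single-byte nucleotide characters, with no header line and no line breaks, so that a 1-based nucleotide position maps directly to a byte offset. The class name `SubsequenceReader` and the method `extract` are illustrative names, not taken from the post.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

final class SubsequenceReader {

    /**
     * Reads up to `length` nucleotides starting at the 1-based `position`.
     * Assumption: one byte per nucleotide, no header, no line breaks.
     */
    static String extract(String path, long position, int length) throws IOException {
        if (position < 1 || length < 0) {
            throw new IllegalArgumentException("position must be >= 1 and length >= 0");
        }
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long offset = position - 1;           // convert to a 0-based byte offset
            if (offset >= file.length()) {
                return "";                        // requested start is past the end of the file
            }
            // Clamp the read so it never runs past the end of the file.
            int available = (int) Math.min(length, file.length() - offset);
            byte[] buffer = new byte[available];
            file.seek(offset);                    // jump straight to the offset without reading what precedes it
            file.readFully(buffer);
            return new String(buffer, StandardCharsets.US_ASCII);
        }
    }
}
```

Because `seek` moves directly to the requested offset, the cost of a lookup depends on the length of the subsequence rather than on the size of the file. A file with line breaks or a FASTA-style header would need the offset arithmetic adjusted accordingly.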

examining mRNA complexity by annotation region using MapReduce

I became interested in how annotated mRNA regions (e.g., 5′ UTR, coding, and 3′ UTR) vary in information content, speculating that coding regions (CDS) of transcripts will be generally more complex than other regions due to their role in specifying protein recipes. Measuring sequence complexity using Shannon entropy validated this hypothesis, at least with regard […]
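As a rough illustration of the measure mentioned above, here is a minimal sketch of Shannon entropy computed over single-character frequencies of a sequence. The excerpt does not say how the post defines its symbols (single nucleotides, k-mers, or something else), so the per-character choice is an assumption, and `SequenceEntropy` and `shannonEntropy` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

final class SequenceEntropy {

    /**
     * Shannon entropy in bits per symbol, treating each character as one symbol.
     * Returns 0.0 for an empty sequence.
     */
    static double shannonEntropy(String sequence) {
        if (sequence.isEmpty()) {
            return 0.0;
        }
        // Count how often each character occurs.
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : sequence.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        // H = -sum(p * log2(p)) over the observed characters.
        double entropy = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / sequence.length();
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }
}
```

Under this definition a sequence made of one repeated nucleotide scores 0 bits, while a sequence using all four nucleotides equally often scores 2 bits, the maximum for a four-letter alphabet.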

command line Hadoop with a “live” Elastic MapReduce cluster

There are two ways to run Hadoop from the command line on an Elastic MapReduce (EMR) cluster that is active in “waiting” mode. First, the hard way. Running Hadoop Directly by Logging into the Cluster’s Head Node: The following commands show how you can log into the cluster’s head node and run Hadoop from the […]

listing an Amazon S3 directory’s contents in Java

After much struggle, I have figured out how to list an Amazon S3 directory’s contents in Java using the AWS SDK. Here is how to do it: First, you need to import the following libraries: Then, in your main function (or elsewhere in your code) you need: Be sure to change the “prefix” variable to […]

chaining map operations in Hadoop

Suppose we have a list of RNA sequences (pictured below), and we want to calculate both the “GC” nucleotide content and the RNA folding energy for each sequence using Hadoop 2.2.0. Furthermore, we want to chain the two operations so that each GC content result is fed to the corresponding sequence’s dG calculation. We also […]
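For readers unfamiliar with the first of the two quantities, here is a minimal sketch of a per-sequence GC content calculation in plain Java. It shows only the arithmetic a mapper might perform on each sequence, not the Hadoop chaining itself, and the names `GcContent` and `gcFraction` are hypothetical rather than taken from the post.

```java
final class GcContent {

    /**
     * Fraction of characters in the sequence that are G or C (case-insensitive).
     * Returns 0.0 for an empty sequence.
     */
    static double gcFraction(String sequence) {
        if (sequence.isEmpty()) {
            return 0.0;
        }
        int gc = 0;
        for (int i = 0; i < sequence.length(); i++) {
            char base = Character.toUpperCase(sequence.charAt(i));
            if (base == 'G' || base == 'C') {
                gc++;
            }
        }
        return (double) gc / sequence.length();
    }
}
```

For example, `GcContent.gcFraction("AUGC")` returns 0.5, since two of the four bases are G or C.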

test driving Amazon Web Services’ Elastic MapReduce

Hadoop provides software infrastructure for running MapReduce tasks, but taking full advantage of it requires substantial setup time and access to a compute cluster. Amazon’s Elastic MapReduce (EMR) solves these problems, delivering pre-configured Hadoop virtual machines running on the cloud for only the time they are required, and billing only for the computation minutes […]

using Hadoop to examine county-level industrial diversity

In a previous post, I computed each U.S. county’s industrial diversity from the 2009 County Business Patterns data published by the U.S. Census Bureau. The diversity calculation made use of Shannon’s information entropy equation, which is similarly used by ecologists to calculate species diversity for a region. Here I perform the same calculation using Hadoop, […]