I became interested in how annotated mRNA regions (e.g., 5′ UTR, coding, and 3′ UTR) vary in information content, speculating that coding regions (CDS) of transcripts will be generally more complex than other regions due to their role in specifying protein recipes. Measuring sequence complexity using Shannon entropy validated this hypothesis, at least with regard to the calculation method described herein.

# Method

I iterated through all human “NM” transcripts in RefSeq 64 in 21-mer segments, calculating the Shannon entropy (discussed below) for each segment. For each of these segments, I calculated the nucleotide distribution required by Shannon’s equation solely from that of the bases contained in the segment. Segments containing an “N” as one of the bases were ignored (there were only 21 of these cases). I also recorded which mRNA region (CDS, 3′ UTR, and 5′ UTR) the segment originated from, and discarded segments that crossed a boundary between regions.

# Shannon Entropy

Shannon’s entropy equation computes the minimum number of bits necessary to encode a signal (e.g., a sequence of nucleotides) without information loss. Information-rich sequences require more bits to encode than information-poor sequences. The equation reads:

where X is a random variable with values A, C, G, or U/T. We calculate the distribution p of the nucleotides separately for each 21-mer segment, rather than use the overall distribution of nucleotides in the source transcript or transcriptome. For example, segment UACGUAGAUUUGGAAUUCCAG has a nucleotide distribution of p = {A: 0.29, C: 0.14, U: 0.33, G: 0.24}, resulting in a Shannon entropy of 1.94 bits.

# MapReduce with Hadoop

All 21-mer segments from RefSeq’s human NM transcripts add up to a lot of data. I therefore tackled the calculation using Hadoop via Amazon’s Elastic MapReduce (EMR). The EMR cluster I employed used three “m2.4xlarge” instances, since the final reduce step was very memory intensive. I started with an input sequence containing RefSeq accession number, coding region start and stop, and transcript sequence on each line:

The first mapping operation iterated through each line in the input file and mapped each 21-mer with the region name appended to a value of one:

The first reduce operation counted the instances of identical 21-mers/region names, thereby producing a list of unique 21-mer/region combinations. This cancels out bias that might be produced by the fact that some genes have more splice variants than others.

The second map operation mapped each 21-mer’s region to its Shannon entropy:

The second reduce operation computed the Shannon entropy boxplot statistics (median, IRQ, etc.) per region:

In this image the outliers are visible as pipe-delimited numbers.

# T-Test for Equality

I tested the CDS mean against the 3′ UTR mean for equality using the t-test, realizing that this is not the best test for this data given the skew visible in the boxplot above. However, we have large sample sizes, which helps the t-test work effectively. I was not able to assemble a non-parametric test due to the vastness of the data (and, more accurately, due to the fact that I’ve run out of time alloted to this project). The t-test produced a very low p-value, leading to rejection of the null hypothesis that the two means are equal.

# Source Code