simulating RNA-seq read counts

The Challenge

I want to explore the statistics of RNA sequencing (RNA-seq) on next-generation sequencing (NGS) platforms in greater detail, so I thought I’d start by simulating read counts to experiment with. This post details how I constructed a simulated set of read counts, examines its concordance with the expected negative binomial distribution of the counts across replicates, and then tests DESeq2’s analysis of the simulated read counts to determine how well it matches the expected outcome. Code implementing the calculations follows the description.

Constructing Simulated Read Counts

I first randomly sampled 1000 transcripts from RefSeq to obtain a list of representative transcript lengths, ensuring that no gene is represented twice in the sample (I made sure that no gene has two or more isoforms in the set). The resulting list and distribution of lengths looks like:



I then simulated a relative abundance metric for the 1000 transcripts using a gamma distribution. This relative abundance metric is completely unrelated to the true expression levels of the genes I drew length information from, it is simply a randomly selected value assigned to each transcript length. (This is why I label the transcripts “0” to “999” instead of by their original RefSeq accession numbers, since at this point in the creation of simulated reads we are departing from prior biological information). The distribution of the relative abundance metric is shown below. The values on the X axis are only meaningful in relation to each other–my use of “relative” here is with regard to each gene’s abundance compared with each other, not “relative” across biological samples or technical replicates.


At this point I copied the data into two parts: length/abundance information for condition “A” and length/abundance information for condition “B”. Condition “B” is an exact copy of condition “A” except for approximately 10% of its transcripts in which the abundance metric has been cut in half, and another approximately 10% of its transcripts in which the abundance metric has been doubled. These conditions represent two simulated biological samples we are comparing. For each condition I then multiplied the abundance metric by 20 to increase the dynamic range, and converted the value to an integer.

For each transcript and each condition, I then multiplied the transcript length by the increased abundance metric to get a relative amount of sequence that a read might come from. For each transcript, I added the transcript to an array for the given condition this amount of times, which we later sample from to count reads. Call these arrays “available hits A” and “available hits B” for each respective condition.

I wanted to create three replicates of each biological condition, labeled “A1”, “A2”, “A3”, “B1”, “B2”, and “B3”. For each replicate, I randomly selected a sample size (total reads) between 12 and 15 million. I then sampled from “available hits A” three times and “available hits B” three times using the randomly selected sample sizes, and counted the number of times a transcript was selected for each. The generated counts data looks like:


To confirm that the simulation worked correctly, I plotted the log2 of counts A1 versus the log2 of counts B1, by transcript, producing the following graph. It is clear from the plot that roughly 10% of the B1 transcripts are one log2 value behind the corresponding A1 value, and that roughly 10% of the B1 transcripts are one log2 value ahead of the corresponding A1 value, as expected.


Python code implementing the computations described above is appended to the end of this blog post.

Distribution of Transcript Counts Across Replicates

For each transcript where the nominal expression level was the same across both biological conditions, i.e., no doubling or halving of the expression level took place, I calculated the maximum likelihood estimators of the mean and variance of the simulated replicate counts according to a negative binomial distribution. I then plotted the estimated mean versus the estimated variance as follows:


From the image we see that the estimated variance grows faster than the estimated mean–the system is overdispersed–as expected. This indicates that a negative binomial distribution is probably a good model for the replicates. I plan to try to come up with a more formal goodness of fit test to confirm this in the near future, but that will have to wait until a later blog post.

DESeq2 Analysis

I next ran DESeq2 analysis on the two simulated biological samples’ sets of read counts, producing the following output:


The first block in the above image shows the simulated read counts for transcripts zero through four. We see that transcripts zero and one have “B” expression that is roughly twice that of their “A” expression. We also see that transcript four has “B” expression that is approximately half that of its “A” expression.

The second block in the image above shows that a log2 fold change of one is detected by DESeq2 for transcripts zero and one, and that a log2 fold change of negative one is detected for transcript four. This matches our expectation given the read counts listed in the first block. The p-values for these assessments of fold change are very small, indicating significance.

The third block in the image above shows a summary of all 1,000 transcripts’ results. We see that approximately 10% are upwardly differentially expressed between “A” and “B” and that approximately 10% are downwardly differentially expressed between “A” and “B”. This matches our original simulation design.

R code implementing these calculations follows.


Python code used to produce the simulated counts:

# load useful libraries
from scipy.stats import gamma
import matplotlib.pyplot as plt
import random
import numpy as np

# user settings
number_of_transcripts = 1000
gamma_alpha = 2.7

# set the random seeds for repeatability

# generate gamma-distributed transcript "expression" levels (relative expression, not absolute)
r = gamma.rvs(gamma_alpha, size=number_of_transcripts)
r1_list = r.tolist()
plt.hist(r1_list, color="lightblue")
plt.title("Simulated Distribution of Relative Abundance for 1000 Transcripts")
plt.xlabel("Simulated Relative Abundance")

# double 10% and halve 10% of the relative "expression" levels to simulate a different "biological" circumstance or cell
r2_list = []
for r1 in r1_list:
    r2 = r1
    unif = random.uniform(0, 1)
    if unif >= 0.8:
        r2 = r1 / 2.
    if unif >= 0.9:
        r2 = r1 * 2.

# convert the "expression" levels to integers and multiply by 20 to increase dynamic range
r1_int = []
r2_int = []
for r1, r2 in zip(r1_list, r2_list):
    r1_int.append(int(20. * r1))
    r2_int.append(int(20. * r2))

# write the "known" expression levels to disk so that we may compare them to DESeq2 calculations
f = open('data/transcript_lengths.csv')
f_out = open('output/expression.csv', 'w')
length = []
transcript = []
for i, line in enumerate(f):
    line = line.strip()
    if line.find('transcript') >= 0:  continue
    f_out.write(line + ',' + str(r1_int[i-1]) + ',' + str(r2_int[i-1]) + '\n')
plt.hist(length, color="lightblue")
plt.title("Distribution of Lengths for 1000 Selected Transcripts")
plt.xlabel("Length (Nucleotides)")

# create populations to sample from that reflect transcript size and relative abundance
r1 = []
for i, j, k in zip(transcript, length, r1_int):
    for n in range(0, j * k):
r2 = []
for i, j, k in zip(transcript, length, r2_int):
    for n in range(0, j * k):

# simulate read hits to transcripts:  generate three replicates for each of the two "biological" samples
samples = {1 : {}, 2: {}}
samples[1]['A'] = random.sample(r1, random.randint(12000000, 15000000))
samples[1]['B'] = random.sample(r1, random.randint(12000000, 15000000))
samples[1]['C'] = random.sample(r1, random.randint(12000000, 15000000))
samples[2]['A'] = random.sample(r2, random.randint(12000000, 15000000))
samples[2]['B'] = random.sample(r2, random.randint(12000000, 15000000))
samples[2]['C'] = random.sample(r2, random.randint(12000000, 15000000))

# count hits to transcripts
counts = {}
for bio in samples.keys():
    if not counts.has_key(bio):  counts[bio] = {}
    for replicate in samples[bio].keys():
        if not counts[bio].has_key(replicate):  counts[bio][replicate] = {}
        for t in samples[bio][replicate]:
            if not counts[bio][replicate].has_key(t):
                counts[bio][replicate][t] = 0
            counts[bio][replicate][t] += 1

# write simulated counts to file
f_out = open('output/counts.csv', 'w')
for i, ln in enumerate(length):
    f_out.write(str(transcript[i]) + ',' + str(ln) + ',' + str(counts[1]['A'][i]) + ',' + str(counts[1]['B'][i]) + ',' + str(counts[1]['C'][i]) + ',' + str(counts[2]['A'][i]) + ',' + str(counts[2]['B'][i]) + ',' + str(counts[2]['C'][i]) + '\n')

R code used to conduct differential expression analysis:

# the following relies heavily on the text at:

# clear memory

# load useful libraries

# load counts data into momory and generate matrix <- read.csv("counts.csv")
raw.counts.matrix <- as.matrix([,c(3:8)])
rownames(raw.counts.matrix) <-$transcript

# specify how the columns in the matrix are related
coldata <- data.frame(experiment=c("A", "A", "A", "B", "B", "B"))
rownames(coldata) <- c("A1", "A2", "A3", "B1", "B2", "B3")

# create a DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = raw.counts.matrix,
                                 colData = coldata,
					design = ~ experiment)

# normalize
dds <- estimateSizeFactors(dds)

# plot counts per transcript
plot(log2(1 + counts(dds, normalized=TRUE)[,c(1,4)]), pch=16, cex=0.4, col="blue",
	xlab="Log2 Counts for Sample A1",
	ylab="Log2 Counts for Sample B1",
	main="Counts per Transcript:  Sample A1 vs. Sample B1"

# generate the regularized-logarithm transformation
rld <- rlog(dds)

# similarity by clustering
sampleDists <- dist(t(assay(rld)))
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( rld$dex, rld$cell, sep="-" )
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
hc <- hclust(sampleDists)
heatmap.2( sampleDistMatrix, Rowv=as.dendrogram(hc),
          symm=TRUE, trace="none", col=colors,
          margins=c(2,10), labCol=FALSE )

# run differential expression analysis
dds <- DESeq(dds)

# view results
res <- results(dds)

# test negative Binomial mean vs. variance
MU <- c()
VAR <- c()
for (i in 1:1000) {
    X <-[i,3:8]
    ratio <- X[1] / X[4]
    if (0.70 <= ratio && ratio <= 1.3) {
        fit <- fitdistr(as.numeric(X), "negative binomial")
        size <- fit$estimate[1]
        mu <- fit$estimate[2]
        variance <- mu + (mu^2) / size
        MU <- c(MU, mu)
        VAR <- c(VAR, variance)
plot(MU, VAR, main="Mean vs. Variance of Negative Binomial Fit of Read Counts, per Transcript", xlab="Mean", ylab="Variance", pch=16, cex=0.4, col="blue")

Supplementary Data: RefSeq Version 64 RNA Transcript Lengths for Homo sapiens

Here are the RefSeq version 64 transcript lengths for human per gene. Note that multiple transcripts exist per gene, and that you should only sample one of these per gene. Furthermore, I recommend only using “NM” transcripts:



3 thoughts on “simulating RNA-seq read counts

  1. Hey,
    thanks a lot for this interesting post! I try to reproduce you script in Matlab and I wonder how you sampled the transcript length from RefSeq. Could you include this step into your description or send it to me via email? Thank you!

    1. Max,

      I had the data in a database at work, not a propitiatory database since it only contained RefSeq version 64, but one organized for easy bioinformatics use by our team. To save you the trouble of compiling this data yourself, I have posted a dump of this data for Homo sapiens to the blog post above after the code. It is in CSV format and should be easy to work with. Since there are multiple transcripts per gene, you’ll want to select only one per gene when performing your sampling procedure to limit bias. Furthermore, I recommend using only “NM” transcripts in you simulation.

      – Emily

Leave a Reply

Your email address will not be published.