HRC Corporate Equality Index correlates with Fortune’s 50 most admired companies

Introduction

The Human Rights Campaign, one of America’s largest civil rights groups, scores companies in its annual Corporate Equality Index (CEI) according to their treatment of lesbian, gay, bisexual, and transgender employees [1]. The companies automatically evaluated are the Fortune 1000 and American Lawyer’s top 200. Additionally, any sufficiently large private-sector organization can request inclusion in the CEI [2].

Similarly, Fortune Magazine publishes an annual list of 50 of the world’s most admired companies [3]. Companies are rated by financial health, stock performance, leadership effectiveness, customer sentiment, scandals, and social responsibility.

I became curious whether CEI scores correlate with membership in Fortune’s most admired list, so I matched the two datasets and analyzed the outcome. The results (below) are striking. Code implementing the calculations, with the source data, is attached.

Results

Plotting the CEI score distributions by whether a company was included in Fortune’s list produced:

boxplot

From this difference in distributions it is clear that “most admired” status correlates with a high CEI score, though there are a few outliers. In the distribution on the left, over 50% of the companies in Fortune’s list held the top CEI score of 100, whereas only 25% of the companies not in Fortune’s list held the top score. The median score for the most admired group was 100, while for the companies not included in Fortune’s list it was about 80. Over 80% of the most admired companies scored 90 or above. The variance is much wider for the companies not included on the list. Statistical analysis comparing the two groups, detailed below, confirms the correlation.

While correlation does not imply causality, this analysis suggests two things: First, the type of leadership necessary to achieve a high CEI score is the same type of leadership that leads to inclusion in Fortune’s most admired companies group. Second, any company aspiring to membership in the most admired group might consider improving its CEI score.

There is one possible source of bias, but I don’t expect it to be large: “social responsibility” factors into Fortune’s rankings, and that may include CEI scores (I don’t know). However, Fortune’s emphasis on financial health and stock performance probably outweighs any contribution the CEI alone would make. Furthermore, the CEI score distribution for the most admired companies contains outliers with extremely low scores, which suggests that the CEI played little if any role in the selection of most admired companies.

Method

I manually copied and pasted the company names and scores from the CEI online database [1]. Then I cleaned up the results to create a manageable CSV file. Similarly, I copied and pasted the Fortune 50 most admired company list [3] into another CSV file. After that, I matched the two datasets by hand. Perhaps I could have performed the match algorithmically, but I would have had to worry about different representations of company names between the two datasets, e.g. “3M Co.” vs. “3M”. There were only 50 cases, so the manual match did not take long.

Two cases in Fortune’s list had to be excluded, BMW and Singapore Airlines, because they were not included in the CEI, possibly because they are based outside the USA. In the case of two other non-US companies in Fortune’s list, Toyota and Volkswagen, I matched to Toyota Motor Sales USA and Volkswagen Group of America, respectively.

Finally, I plotted the CEI score distributions shown above and performed the statistical analysis reported below using the attached Python code.

Statistical Analysis

The extreme difference in variance between the two groups makes it impossible to compare medians using a non-parametric test, and the distribution of the CEI scores does not lend itself to a clean regression analysis. Therefore I built the following contingency table from the data:

contingency_table

The p-value for this table obtained from Fisher’s exact test is 4.53e-08, indicating that the proportions are significantly different.
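For reference, here is a minimal sketch of running such a test with scipy; the counts below are hypothetical placeholders, not the values from the contingency table above.

# minimal sketch of Fisher's exact test in scipy; these counts are hypothetical
# placeholders, NOT the values from the contingency table above
import scipy.stats

# rows: most admired vs. not; columns: top CEI score (100) vs. below 100
table = [[30, 18],
         [250, 750]]

odds_ratio, p_value = scipy.stats.fisher_exact(table)
print p_value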

References

  1. http://www.hrc.org/campaigns/corporate-equality-index
  2. http://www.hrc.org/resources/entry/corporate-equality-index-what-businesses-are-rated-and-how-to-participate
  3. http://fortune.com/worlds-most-admired-companies/

Code and Data

HRC_Fortune_data_and_code

 


fast genomic coordinate comparison using PostgreSQL’s geometric operators

PostgreSQL provides operators for comparing geometric data types, for example for computing whether two boxes overlap or whether one box contains another. Such operators are fast compared to equivalent calculations implemented with ordinary comparison operators, as I’ll demonstrate below. Here I show how these geometric data types and operators can be used to determine whether one segment of genomic sequence is contained within another.

Suppose we have a 391 row table called “oligo” containing the following information:

oligo_coordinates

“Sequence” and “chromosome” are strings, but “coordinates” is of the PostgreSQL geometric data type “box”. A box stores the (x, y) positions of one corner and its opposite corner. Notice that I set the y values to zero, since we are only dealing with nucleotide positions on the chromosome. (For some reason the procedure I’m demonstrating here does not work with the “line segment” data type.)
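As an illustration, here is a hedged Python sketch (using psycopg2, with hypothetical connection settings, column values, and coordinates) of how a start/stop pair can be stored as such a degenerate box; the attached SQL may set up the table differently.

# hypothetical sketch: storing start/stop positions as a PostgreSQL box with y = 0
import psycopg2

conn = psycopg2.connect(dbname='genomics')   # hypothetical database name
cur = conn.cursor()

start, stop = 127471196, 127471220           # hypothetical nucleotide positions
cur.execute("INSERT INTO oligo (sequence, chromosome_id, coordinates) "
            "VALUES (%s, %s, %s::box)",
            ('ACGTACGT', 1, '((%d,0),(%d,0))' % (start, stop)))

conn.commit()
cur.close()
conn.close()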

Similarly, suppose we have a 4,536,771 row table called “chromatin” containing:

chromatin_coordinates

We want to construct a list of oligo table entries whose coordinates are contained within a chromatin table entry’s coordinates, by chromosome. To do this we use PostgreSQL’s “@>” (contains) operator as follows:

SELECT o.sequence, chr.chromosome, CL.cell_line, m.method
FROM oligo o, chromatin c, chromosome chr, cell_line CL, method m
WHERE chr.id = c.chromosome_id
  AND o.chromosome_id = c.chromosome_id
  AND CL.id = c.cell_line_id
  AND m.id = c.method_id
  AND c.coordinates @> o.coordinates;

Assuming that we properly indexed the tables (see attached SQL code), this procedure takes about 11.5 seconds on an Amazon t2.micro instance to get through all the rows under consideration.

Now suppose instead that we used ANSI standard SQL’s comparison operators. We would need modified tables:

oligo_ss

chromatin_ss

After indexing, we compare the rows using:

SELECT o.sequence, chr.chromosome, CL.cell_line, m.method
FROM oligo_start_stop o, chromatin_start_stop c, chromosome chr, cell_line CL, method m
WHERE chr.id = c.chromosome_id
  AND o.chromosome_id = c.chromosome_id
  AND CL.id = c.cell_line_id
  AND m.id = c.method_id
  AND c.start <= o.start
  AND c.stop >= o.stop;

This procedure now takes about 15.3 seconds.

If we run both methods 100 times and collect the query time statistics, we obtain the following boxplot:

boxplot

The medians for the groups are a full 3.8 seconds apart, indicating that the PostgreSQL geometric “contains” operator performs significantly faster than the ANSI SQL comparison operators (p-value = 1.28e-34, Mann-Whitney U test).

Note on Indexing

To run the indexing commands in the attached code for a combination of an integer (chromosome ID) and a geometric box (genomic coordinates), you need to install the PostgreSQL extension “btree_gist”.
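For illustration, a hedged sketch of that setup follows (run here through psycopg2, with hypothetical connection settings and index names; the attached SQL may define the indices differently):

# hypothetical sketch: a combined integer + box GiST index requires btree_gist
import psycopg2

conn = psycopg2.connect(dbname='genomics')   # hypothetical database name
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS btree_gist")
cur.execute("CREATE INDEX chromatin_chrom_coords_idx ON chromatin "
            "USING gist (chromosome_id, coordinates)")

conn.commit()
cur.close()
conn.close()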

Code

postgres_box.tar


Bayesian network modeling stock price change

Taking a cue from the systems biology folks, I decided to model stock price change interactions using a dynamic Bayesian network. For this analysis I focused on the members of the Dow Jones Industrial Average (DJIA) that are listed on the New York Stock Exchange (NYSE).

Bayesian Networks

A Bayesian network is an acyclic directed graph where the nodes are random variables and the edges indicate conditional probabilistic relationships between the nodes. For example, a given node might represent the percent change in a stock’s price, and edges entering the node might represent factors that influence the node’s value.

The graph’s structure may be specified by a knowledgeable researcher, or it may be learned from data using a search algorithm; we demonstrate the latter case below. When learning from time series data, the graph need no longer be acyclic, and the result is called a dynamic Bayesian network. The acyclic requirement is dropped because one node might influence a second node at one time point while the second node influences the first at another time point, a common occurrence in biological circuits, which contain feedback loops.

Results

The following is a dynamic Bayesian network learned from a year of NYSE DJIA stocks’ daily changes in closing prices. The method used to learn the network from the data is described below.

final_graph_08_10000__COMPRESSED_2

From this we see that the probability associated with stock JNJ’s daily change in closing price is conditional on stock V’s daily change in closing price, and in turn the probability associated with stock XOM’s daily change in closing price is conditional on JNJ’s. Because this is a dynamic Bayesian network, we also see mutually dependent pairs such as XOM and UTX; this results from the fact that the first stock in the pair might influence the second on one date, while the second influences the first on a different date.

These results are pure unsupervised machine learning; I have not yet analyzed whether the proposed interactions make sense economically. Such analysis will be required to validate the method.

The graph shown above was generated by GraphViz.

Method

The following describes the procedure used to learn the above network from the data. Python code implementing the procedure is attached to the end of this post.

I started with a list of daily stock prices for all the stocks listed in the NYSE, e.g.,

source_data_POST_CROP1

I then filtered out all symbols not found in the DJIA and all dates outside of the range 29 October 2013 through 28 October 2014. I also discarded the open, high, low, and adjusted prices from the above data structure, as well as the volume, since we are only modeling change in closing price.

Next, I sorted each symbol’s closing price by date and computed a “percent change in closing price” time series for each symbol from the sorted closing price time series.

I then “discretized” the data as follows: For each trading day in the time series, I computed the mean percent change in closing price across all stocks. Using this I created a new time series for each stock, assigning a zero to the date if the percent change was below the daily mean, and a one to the date if the percent change was equal to or above the daily mean. This produced a series of ones and zeros for each stock.
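A minimal numpy sketch of this discretization step (on hypothetical data, not the attached code) looks like:

# hypothetical data: 27 stocks x 252 trading days of percent changes in closing price
import numpy as np

np.random.seed(0)
pct_change = np.random.randn(27, 252)

# mean percent change across all stocks, per day
daily_mean = pct_change.mean(axis=0)

# 1 if a stock's percent change is at or above the daily mean, 0 otherwise
discretized = (pct_change >= daily_mean).astype(int)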

I used this discretization to keep my first attempt at Bayesian network learning simple, borrowing the idea from gene expression modeling, where whether a gene is considered expressed may depend on its expression level relative to a control gene’s expression level. A more sophisticated model could have used a ternary discretization, or no discretization at all.

I next “transposed” the discretized time series as follows: I placed each stock’s time series as a row in a matrix, then transposed the matrix so that each stock is a column and each row is a “system state” for the stocks on a given day. (The “system state” is made up of zeros and ones indicating movement of the percent change in price, per stock.) I assigned a probability (hereafter called the “base state probability”) to each of these system states equal to one divided by the total number of system states, and then checked for duplicate system states, adjusting the base state probabilities accordingly.
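Continuing the sketch above, the system states and their base state probabilities could be tallied like this (again a simplified illustration, not the attached code):

# each column of "discretized" (one trading day) becomes a system-state tuple;
# duplicate states pool their base state probabilities
from collections import Counter

states = [tuple(column) for column in discretized.T]
base_probability = 1.0 / len(states)

state_probability = Counter()
for state in states:
    state_probability[state] += base_probability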

I then created an initial Bayesian network graph, assigning a node to each NYSE DJIA stock and no edges, as shown by the following networkx image:

initial_to_use

The learning procedure applies a greedy algorithm to search for a good graph: it mutates the graph (randomly adding, reversing, or removing an edge) and scores the result against the data. If the mutated graph’s score exceeds the previous graph’s score, the new graph is retained, and mutation continues until a set number of iterations has run. A schematic sketch of this loop appears below; the scoring procedure it relies on is described afterward.
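The sketch below is illustrative rather than the attached code: it assumes a list of stock symbols (“nodes”), the discretized data, and a score(graph, data) function implementing the log-posterior score described next.

# schematic greedy search over dynamic Bayesian network structures (not the attached code)
import copy
import random
import networkx as nx

def mutate(graph, nodes):
    """Return a copy of the graph with one edge randomly added, removed, or reversed."""
    g = copy.deepcopy(graph)
    u, v = random.sample(nodes, 2)
    operation = random.choice(['add', 'remove', 'reverse'])
    if operation == 'add':
        g.add_edge(u, v)
    elif operation == 'remove' and g.has_edge(u, v):
        g.remove_edge(u, v)
    elif operation == 'reverse' and g.has_edge(u, v):
        g.remove_edge(u, v)
        g.add_edge(v, u)
    return g

def greedy_search(nodes, data, score, iterations=10000):
    best = nx.DiGraph()
    best.add_nodes_from(nodes)       # start with no edges
    best_score = score(best, data)
    for i in range(iterations):
        candidate = mutate(best, nodes)
        candidate_score = score(candidate, data)
        if candidate_score > best_score:   # keep only improving mutations
            best, best_score = candidate, candidate_score
    return best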

Bayes’ theorem relates the probability of observing the data given a graph structure to the probability of the graph given the data

bayes_theorem

In this equation, P(Data) is a constant that does not matter because we are simply trying to maximize P(Graph|Data). The prior P(Graph) is a function of the graph structure that we use to penalize graphs with many edges over graphs with fewer edges, seeking the graph that provides the simplest explanation of the data. Taking the log of each side gives us the equation we use for scoring a graph given the data

score_equation

We are left with the need to compute P(Data|Graph), given by the following equation

p_data_given_graph

Here t = 1 through T are the individual time points, and i = 1 through n are the nodes (stocks). For each time point t, we compute the probability of Xi given the state of its parent nodes in the graph. This is computed using the base state probabilities defined above and the system state at the time point.

I mutated the graph as described above 10,000 times, scoring each iteration and comparing it to the previous high score; this generated the network shown in the Results section. However, for the final graph I computed P(Data|Graph) at each time point and found that the resulting probabilities were extremely small:

probabilities

These low probabilities make me suspicious of the resulting network. Again, analysis of how the particular stocks that the algorithm joined together relate economically is necessary to evaluate the method.

Code

djia_bayesian_network.py


reflections on living in an RV for six months

I’ve been living in my RV for six months now, and overall I like doing so. Recently, one of my friends expressed interest in full time RV living, so I wanted to write down all the challenges so that he knows what to expect before taking the plunge. Before I do that though, I’m going to list the good stuff:

Benefits of Full Time RV Living

Cheap Rent

The primary benefit is cheap rent. Monthly rates at RV parks are very inexpensive compared to apartment rentals in my area (San Diego county). When you add in the cost of financing the RV, the total monthly cost resembles that of a studio apartment, but you get to keep the RV. This assumes you live in an RV park. I suppose it would be possible to boondock frequently for even less money, but since I have a job I need a stable address. (See below for a listing of the RV park-related challenges that must be considered).

Less Stuff

Fitting into an RV means having less stuff, unless you rent storage space which I didn’t. I find owning few things liberating. I still have room in my RV for musical instruments (sans amplifiers), a few books, and a computer, so I have no trouble staying entertained. Not having many material things to manage has provided substantial freedom from intellectual distraction.

Easy to Move

If I get laid off from my job, which I don’t expect to happen, moving will be easy. I’ll just hitch my trailer to my truck and be off to wherever new economic opportunity resides. Alternatively, if I choose to pursue a PhD after completing my Master’s degree I’ll simply tow my rig to wherever I attend school.

Ecological Footprint (The Benefit)

An RV impacts the land it is set upon less than an apartment or house does, leaving less of an ecological footprint.

Feeling Less Screwed

While I still have to pay rent to the RV park, I don’t feel I’m paying as much to “the man” as I would in an apartment setting, which suits my slightly anarchistic world view. Saving more money increases my overall freedom.

Challenges of Full Time RV Living

Cat

My cat does not have as much room in my RV as he had in my last apartment. However, he is an old cat who doesn’t move around much. And he has lots of toys to play with so I don’t worry much about him. The biggest challenge was figuring out where to fit in litter boxes, and how to keep litter from getting all over the RV. (Litter gets into the floor-mounted heating ducts for instance). I partially solved the problem by laying plastic sheeting in a corner and placing the litter boxes on top. This is not a perfect solution but it works well enough:

cat_litter_compressed

Insulation and Propane Use

While I live in mild San Diego county, I live in the foothills of the mountains and therefore it can get cold at night during the winter. During these nights I go through a lot of propane since my RV’s insulation is poor. I estimate that during December I spent $100 on propane, which I used only for heating (not cooking or water heating). Fortunately, my RV park has a propane filling service where you leave an empty bottle on the driveway in the morning and they return it full in the afternoon, billing you at the end of the month.

The poor insulation was a problem in the summer too where I’d have to run my air conditioner during the day to keep my cat cool. I estimate the monthly electricity bill was $100.

I hear that the trick to getting an RV with good insulation is to buy one made for the Canadian market. I’ve thought of attaching the “reflective bubble wrap” insulation you can buy at Home Depot to the walls to improve the situation, but I haven’t found a way to attach it that wouldn’t leave marks when the insulation is removed.

Internet Access

My RV park has very poor WiFi, which also costs $30 a month. This did not meet my schooling needs, so I bought mobile hotspot service from my phone provider. This provides only 4 GB of data per month at 4G LTE speed, which is enough for browsing but not enough for much video (I had to get rid of Netflix streaming service). I am fortunate that my RV park, which is rather remote, is still within 4G range. To adapt to the limited monthly download volume, I download my class lectures at work and save them to a flash drive, and I download large data files to an Amazon EC2 instance and manipulate the files there (more on Amazon EC2 below). I also take my laptop to the library often to use the WiFi there.

Computing Power

I do not have room for a large desktop computer, which limits the computing power I have available locally. Readers of this blog will be aware that I perform computationally demanding tasks regularly. To get around the problem I rent computing power from Amazon when I need it. Using Amazon, I probably break even with the overall cost of buying a powerful desktop computer, so nothing is really lost. All I need to make this arrangement work is the 4G internet access discussed above. This works because all data is processed and stored on the Amazon EC2 instances, so only results and commands need to pass between my laptop and the EC2 instances, thereby minimizing how much of my monthly data allowance I consume.

Lab Space

I am both a scientist and an engineer, and therefore am constantly tinkering with something. Usually this tinkering involves only a computer so there is no need for lab space. However, there are projects I would like to take on that require space, such as learning to use 3D printers, that I simply do not have. I only have one table and that holds my laptop and rather large monitor. The table provides enough room to mess with things like Raspberry Pis, but again, a 3D printer or an autonomous robot might be too big for the space I have.

Small Space (Need for Storage Efficiency)

Living in an RV requires storage efficiency. For example, I use my bunks as storage racks:

bunk_storage_compressed

Similarly, under the bed I store books:

under_bed_compressed

Note that when planning storage, you must not exceed your towable weight limit.

Having to Own a Truck

My RV is a travel trailer, so I have to own a truck to tow it. Since it makes little sense for me to own two vehicles, I’m not buying a more economical hybrid vehicle that would make my commute cheaper and more ecologically friendly. So I feel I am stuck with my gas-hogging truck for as long as I own the travel trailer.

I could have bought an RV with its own engine (at substantially greater cost) and then bought a vehicle suitable for towing behind it. However, this arrangement was out of my price range at the time I bought the travel trailer, since I already owned the truck with no debt.

Issues With RV Parks

I like the RV park I am in, having no complaints other than the great distance I have to travel to get to work. However, there are three issues one needs to be aware of when staying at a park:

1. Your RV must be newer than a cutoff year set by the park, and in good condition, or they won’t let you stay there. This precludes buying a very cheap, very used RV and fixing it up.

2. The RV parks in the cities around me (San Diego county) have limits on how long you can stay. Usually, they require you to leave for two days following a six-month stay. I think these are city laws but am not sure. Out in the county, I’ve not seen the same constraint, so I live in the county far from my workplace.

3. The RV parks close to where I work cost substantially more, up to twice as much per month, because they are closer to the beach. There is a trade-off here: living at one of these would reduce my commute, but I’d have to deal with issue #2 above.

Long Commute

As mentioned above, I live in a relatively remote part of San Diego county compared to where I work, so my commute is typically 45 minutes one way. Fortunately, the county has a good public radio station.

Cooking

I haven’t mastered cooking in such a small environment yet, and therefore use a microwave for most of my meals. My RV has a propane stove top with three burners but no oven. There is also no dishwasher, making cleanup a bit more annoying. I also haven’t mastered food storage in the small space yet. The small refrigerator has proven big enough for my needs though.

Ecological Footprint (The Challenge)

My long commute, having to own a truck, and the poor insulation all contribute to the ecological footprint of my living in an RV.

Lack of Study Space

Similar to the lack of lab space is a lack of good study space. Again, my only table is taken up by my computer and monitor, leaving no room to spread out books and lecture notes. I find studying on my bed instead makes me tired so I usually do my homework at work (after hours).

Not Sure What To Do When Lifestyle Changes

I expect I won’t live in an RV forever, e.g., when I start a family. Therefore I’ll eventually need to go through the hassle of retiring the RV, which will mean either storing it or selling it. Storing it will be expensive, and selling it means absorbing substantial depreciation. Moreover, I might still owe on the RV loan after such a change in lifestyle. My response to this eventuality has been to pay off the loan aggressively.

Durability

My RV was not designed for full time living (some are). For example, latches are weak and cabinets are easily damaged. Moreover, full time living might void the warranty according to the users’ manual. I did not realize this until I had bought the unit. If there is a next time, I’ll buy a unit made for full time living.

Solar Power

Using solar power to charge batteries makes sense for boondocking, but mixing solar power with park-based living may not provide much benefit. I haven’t seen RV solar installations that feed power back into the grid once the batteries are charged (though admittedly I haven’t looked as hard as I could). This is something I’ll explore more in the future, but for now I’m suggesting that investing in solar power for park-based RV living needs to be carefully evaluated before taking the plunge.

Tank Sensors Giving Me Trouble

From the moment I started using my RV, the sensor on my black-water holding tank has read three-quarters full, no matter what chemicals I dump into it. Therefore I’m not always sure when to dump the tank. Even when living in a park, it is best to let the holding tanks fill before emptying them, rather than simply leaving the valve open all the time.

12 V

As expressed in my post “engineer moves into an RV”, there are no 12 VDC receptacles for running 12 V appliances in my RV. This is no problem for park living, but if I boondock using the batteries for power I’d like to be able to use 12 V devices. Furthermore, there is no easy place to connect an inverter to my 12 V power supply. These are things I wish I had checked for when shopping for RVs.

Plumbing

I’ve found plumbing to be a pain in my RV. There are two problems: First, you can’t use chemicals like Draino to clear blockages since they will damage your holding tanks, so you have to manually clear pipes. Second, the tight spaces make getting to the pipes a challenge.

Shower

If you let the water heater run for 45 minutes, you can take a warm, roughly 10 minute shower. However, you’ll lose warm water after that. In drought-stricken southern California, one should not take a long shower anyway. You have to remember to turn off your water heater when you are not using it to save propane.

Related Posts

selecting travel trailers by regression
engineer moves into an RV


gene annotation database with MongoDB

After reading Datanami’s recent post “9 Must-Have Skills to Land Top Big Data Jobs in 2015” [1], I decided to round out my NoSQL knowledge by learning MongoDB. I have previously reported NoSQL work with Neo4j on this blog, where I discussed building a gene annotation graph database [2]. Here I build a similar gene annotation database, but using the MongoDB platform and its design patterns.

MongoDB

MongoDB is a database system that stores documents made up of key-value pairs. The values may themselves be other documents of key-value pairs, or arrays, resulting in an overall document format similar to JSON. These JSON-like documents are grouped into sets called “collections”, which resemble tables in relational databases (typically, documents describing similar things are grouped within the same collection). Unlike in a relational database, however, the documents in a collection do not have to have exactly the same keys (fields). Like a relational database table, a collection may be indexed by one or more of its keys, or by a key of documents embedded within the documents of the collection. Additionally, two collections may be matched on mutual identifiers, similar to a join in a relational database, but the matching is done on the client side rather than on the server.

These concepts will be illustrated by the example case shown below: a gene annotation database containing gene information for 15079 species, along with RefSeq information for each gene.

Gene Annotation Database

I first downloaded the “gene_info” and “gene2refseq” files from NCBI’s FTP site [3] and inserted them into MongoDB using the Python code appended to the bottom of this post. This created two collections, “gene_collection” which stores gene information–one gene per document–and “refseq_collection” which stores protein/transcript information–one RefSeq entry per document. Each document in the RefSeq collection points to the gene document that the RefSeq sequences correspond to.

Searching the gene collection by NCBI Entrez gene ID 675 yields a full gene document:

675_full_description

Here we can see that MongoDB assigned an internal ID “_id” to the document, which we will use later to reference it from the RefSeq documents. We see standard key-value pairs such as “GeneID” which contains an integer value specifying the Entrez gene ID and “type_of_gene” which contains a string describing whether the gene is known to code for a protein. For the key “dbXrefs”, notice that its value is an embedded document that contains its own key-value pairs linking the gene to other database identifiers. Keys “Synonyms” and “Other_designations” have arrays as values.

Searching the same collection by NCBI Entrez gene ID 544203 yields a gene document with fewer keys:

544203_short_description

This shows how documents within a collection need not share the same keys. It is helpful for documents to share some keys so the data model remains meaningful, but by not forcing every document to follow the same schema one avoids the heavy use of null values often found in relational database tables.

You can search a collection by a top-level key, as demonstrated above. You can also search for documents whose arrays contain a given element, e.g.,

FAD

In this case we searched for human genes with “FAD” contained in their documents’ “Synonyms” array and listed only the IDs and symbols from the resulting documents.

Finally, we can search by the key of an embedded document:

ENSG00000080815

Here we searched for the Ensembl ID “ENSG00000080815” in the documents embedded under the key “dbXrefs”.
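For reference, the three kinds of queries above look roughly like this in pymongo (field names follow the loading code at the end of this post; the exact filters used for the screenshots may differ):

# sketch of the three query types, using the collections built by the loading code below
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
gene_collection = client.gene.gene_collection

# 1. search by a top-level key
gene = gene_collection.find_one({'GeneID' : 675})

# 2. search for documents whose "Synonyms" array contains an element (9606 = human)
for doc in gene_collection.find({'tax_id' : 9606, 'Synonyms' : 'FAD'}):
    print doc['GeneID'], doc['Symbol']

# 3. search by a key inside an embedded document, using dot notation
doc = gene_collection.find_one({'dbXrefs.Ensembl' : 'ENSG00000080815'})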

Indexing

Like a relational database system, MongoDB enables users to index collections to improve query performance. Unlike in a relational database, however, the indices need not be restricted to top-level keys; they can also be built from keys inside embedded documents or from array contents.
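A brief pymongo sketch of such indices follows (the key names are assumptions based on the loading code at the end of this post, and the index definitions in my actual database may differ):

# sketch: indices on a top-level key, an embedded-document key, and an array
from pymongo import MongoClient

gene_collection = MongoClient('localhost', 27017).gene.gene_collection

gene_collection.create_index('GeneID')            # top-level key
gene_collection.create_index('dbXrefs.Ensembl')   # key inside an embedded document
gene_collection.create_index('Synonyms')          # multikey index over array contents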

The following table shows the performance improvement (in query times) between an indexed and non-indexed version of this gene annotation database:

query_times_POST_CROPPED

Joins

MongoDB does not allow server-side joins between collections like those found in relational databases. Instead, collections may be matched on the client side by an application through use of a common reference. For example, in the image below we see that gene 5663 has an internal “_id” of “54b067d44f2ff124afc15cbf” in the gene collection. When we loaded the RefSeq data into the RefSeq collection, we matched each RefSeq entry to its respective gene through the “gene_document_id” key, which points to the corresponding “_id” in the gene collection. The ID and reference are highlighted in green in the image below:

references_POST_CROPPED

The following Python code shows how to implement a client-side match by these identifiers:

# useful libraries
from pymongo import MongoClient

# connect to MongoDB
client = MongoClient('localhost', 27017)
db = client.gene
refseq_collection = db.refseq_collection
gene_collection = db.gene_collection

# find gene 5663
gene = gene_collection.find_one({'GeneID' : 5663})
gene_document_id = gene['_id']

print
print 'Showing RefSeq sequences for gene 5663:'
print
print '\t', 'RNA', '\t', '\t', 'Genomic', '\t', 'RefSeq Collection Document ID'

# find RefSeq data for gene (by reference)
refseq = refseq_collection.find({'gene_document_id' : gene_document_id})
for r in refseq:
    if r.has_key('RNA_nucleotide_accession_version') and r.has_key('genomic_nucleotide_accession_version'):
        print '\t', r['RNA_nucleotide_accession_version'], '\t', r['genomic_nucleotide_accession_version'], '\t', r['_id']

print

Running this code shows all the RefSeq RNA and genomic nucleotide sequences associated with gene 5663:

manual_references

Conclusion

I have not shown all of MongoDB’s features using this gene annotation database example; for instance, I left out discussion of aggregation, scripting, and scalability across multiple servers. However, I believe I have shown enough to demonstrate familiarity with MongoDB and to show that it is a viable alternative to traditional relational databases for Big Data applications.

Code

Python code for loading NCBI’s “gene_info” table [3] into MongoDB is:

# useful libraries
import datetime
from pymongo import MongoClient

# connect to MongoDB
client = MongoClient('localhost', 27017)
db = client.gene
collection = db.gene_collection

# get header
f = open('data/gene_info')
for line in f:
    header = line.strip().split(' ')[1:16]
    break
f.close()

# load data
f = open('data/gene_info')
for line in f:
    line = line.strip()

    # remove commented out lines
    if line[0] == '#':  continue

    # deal with missing values
    split_line_with_dashes = line.split('\t')
    split_line = []
    for s in split_line_with_dashes:
        if s == '-':
            split_line.append(None)
        else:
            split_line.append(s)

    # specify data types of values
    split_line[0] = int(split_line[0])
    split_line[1] = int(split_line[1])
    year = int(split_line[14][0:4])
    month = int(split_line[14][4:6])
    day = int(split_line[14][6:8])
    split_line[14] = datetime.datetime(year, month, day, 0, 0)

    # split the aliases
    if split_line[4] != None:
        split_line[4] = split_line[4].replace(' ', '').split('|')

    # split the other designations
    if split_line[13] != None:
        split_line[13] = split_line[13].split('|')

    # split the DB cross references
    if split_line[5] != None:
        db_cross_refs = split_line[5].replace(' ', '').split('|')
        cross_refs = {}
        for dcr in db_cross_refs:
            db = dcr.split(':')[0]
            ref = dcr.split(':')[1]
            cross_refs[db] = ref
        split_line[5] = cross_refs

    # build document
    document = {}
    for h, s in zip(header, split_line):
        if s != None:
            document[h] = s

    # insert document into MongoDB
    collection.insert(document)

f.close()

Python code for loading NCBI’s “gene2refseq” table [3] into MongoDB is:

# useful libraries
from pymongo import MongoClient

# connect to MongoDB
client = MongoClient('localhost', 27017)
db = client.gene
collection = db.refseq_collection
gene_collection = db.gene_collection

# get header
f = open('data/gene2refseq')
for line in f:
    header = line.strip().split(' ')[1:17]
    header = [h.replace('.', '_') for h in header]
    break
f.close()

# load data
count = 0
f = open('data/gene2refseq')
for line in f:
    line = line.strip()

    # remove commented out lines
    if line[0] == '#':  continue

    # deal with missing values
    split_line_with_dashes = line.split('\t')
    split_line = []
    for i, s in enumerate(split_line_with_dashes):
        if s == '-' and header[i] != 'orientation':
            split_line.append(None)
        else:
            split_line.append(s)

    # convert data types
    if split_line[0] != None:  split_line[0] = int(split_line[0])
    if split_line[1] != None:  split_line[1] = int(split_line[1])
    if split_line[4] != None:  split_line[4] = int(split_line[4])
    if split_line[6] != None:  split_line[6] = int(split_line[6])
    if split_line[8] != None:  split_line[8] = int(split_line[8])
    if split_line[9] != None:  split_line[9] = int(split_line[9])
    if split_line[10] != None:  split_line[10] = int(split_line[10])
    if split_line[14] != None:  split_line[14] = int(split_line[14])

    # build document
    document = {}
    for h, s in zip(header, split_line):
        if s != None:
            document[h] = s

    # get ID of corresponding gene document
    gene_id = split_line[1]
    gene = gene_collection.find_one({'GeneID' : gene_id})
    document['gene_document_id'] = gene['_id']

    # remove 'GeneID' from document
    del document['GeneID']
    
    # insert document into MongoDB
    collection.insert(document)

f.close()

References

  1. http://www.datanami.com/2015/01/07/9-must-skills-land-top-big-data-jobs-2015/
  2. http://badassdatascience.com/2014/07/11/graph-database-for-gene-annotation/
  3. ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/

new year’s revolutions

One has to constantly remake oneself to adapt to the market and to the times. This practice pays off economically and keeps one from becoming bored. It requires a constant state of personal “revolution”.

Here I describe my revolutionary plans for 2015; namely which technologies I intend to learn and which actions I will consider for the year.

Synthetic Biology

Synthetic biology has the potential to revolutionize manufacturing, first the manufacturing of drugs and biofuels, then the manufacturing of more exotic materials. To do this we (researchers and engineers) will have to learn how to model biological circuits effectively (see my recent post “iBioSim: a CAD package for genetic circuits“). My intention this year is to strengthen my knowledge of synthetic biology’s computing side, leveraging my experience as an engineer and a bioinformatician in the process. Particularly I intend to combine this effort with my work to improve my statistical and machine learning knowledge, as both can be applied to the identification of genetic circuit parameters.

(Also see my post “synthetic biology: an emerging engineering discipline” for a description of where the subject stands at the moment.)

Unstructured Data

Unstructured data is the frontier in data science. Currently such data (images, prose) is called “dark data” because it is difficult for a computer algorithm to automatically derive actionable information from it. However, new tools are becoming available for dealing with unstructured data. I therefore intend to learn how to deal with the kind of data found in prose: For example, extracting useful biological information from a bulk analysis of PubMed abstracts or beneficial stock tips from financial writing.

Internet of Things

I have no plans to deploy any Internet of Things objects this year, but am certainly interested in working with the data made available by such devices. If someone or some company wants to partner with me on this, I’m game.

Machine Learning

Machine learning is becoming easier due to new tools and software. I want to get better at choosing among and applying machine learning techniques in the next year. I have already started this effort with two blog posts on clustered stock market analysis (“clustering stocks by price correlation, part 1” where I explored use of hierarchical clustering and “clustering stocks by price correlation, part 2” where I applied k-means clustering).

Particularly, I want to learn the intersection between Big Data and machine learning, e.g., how to combine basic Apache Spark with Spark’s MLlib toolkit.

I see opportunity to combine machine learning efforts with all the above-stated subjects, since my treatment of those subjects will be highly data driven.

Thinking Big: Warp Equations

I’m no physicist and certainly won’t become one in a year, but I want to start investigating the state of space-folding equations and related ideas. In particular I want to examine how close the ideas are to trial by commercial venture.


iBioSim: a CAD package for genetic circuits

iBioSim is a CAD package for the design, analysis, and simulation of genetic circuits. It can also be used for modeling metabolic networks, pathways, and other biological/chemical processes [1]. The tool provides a graphical user interface (GUI) for specifying circuit design and parameters, and a GUI for running simulations on the resulting models and viewing results. iBioSim also provides a means to “learn” system parameters from measured lab data, since it is often unclear at the model design stage what the values of all the required equilibrium constants and reaction rates are.

Below I’ll demonstrate the specification and simulation of a genetic circuit in iBioSim. I have demonstrated the simulation of genetic circuits using CAD on this blog before [2], but at that time I used a generic system dynamics package called Vensim. iBioSim is more specialized than Vensim, being designed explicitly for modeling the particulars of genetic circuits (such as handling cell division) rather than arbitrary abstract dynamic systems. Under the hood, though, the simulation engine of both tools is the same: ODE simulation using Runge-Kutta methods. I call such tools “drag-and-drop differential equation tools” because they let you specify ODE models with a mouse pointer.

iBioSim is written in Java and is therefore available on any modern computing platform. It saves models in Systems Biology Markup Language (SBML) to facilitate communication with other tools. SBML is an XML-based standard developed by the research community to enable portable descriptions of biological processes [3]. iBioSim is produced by the Myers Research Group at the University of Utah and is freely available.

Model Specification

Exercise 1.3 of Chris Myers’ Engineering Genetic Circuits text [4] provides the following genetic circuit:

genetic_circuit_CROPPED

In this model, protein CI represses promoter P1, which produces protein LacI. Similarly, LacI represses promoter P2 which produces protein TetR. Finally, TetR represses promoter P3 which generates protein CI. In the exercise, the author specifies that CI must dimerize before acting as a repressor to P1.

The author of the book’s exercise also specifies the following model constants:

constants
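As a point of reference only, here is a minimal standalone ODE sketch of this three-repressor loop (this is not iBioSim, and all rate constants below are hypothetical placeholders rather than the exercise’s values):

# hypothetical ODE sketch of the CI -| P1->LacI, LacI -| P2->TetR, TetR -| P3->CI loop
import numpy as np
from scipy.integrate import odeint

k_production = 1.0     # hypothetical maximum production rate
k_degradation = 0.1    # hypothetical degradation rate
K = 10.0               # hypothetical repression threshold
n = 2.0                # Hill coefficient (CI acts as a dimer)

def circuit(y, t):
    lacI, tetR, cI = y
    d_lacI = k_production / (1.0 + (cI / K)**n) - k_degradation * lacI    # P1 repressed by CI
    d_tetR = k_production / (1.0 + (lacI / K)**n) - k_degradation * tetR  # P2 repressed by LacI
    d_cI = k_production / (1.0 + (tetR / K)**n) - k_degradation * cI      # P3 repressed by TetR
    return [d_lacI, d_tetR, d_cI]

t = np.linspace(0, 500, 2000)
trajectory = odeint(circuit, [0.0, 5.0, 10.0], t)   # columns: LacI, TetR, CI over time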

Implemented in iBioSim, this circuit looks like:

iBioSim_model

Model constants are entered during the creation of each model element. For example, for the repressor relationship between CI2 and P1:

influence_editor

Simulation

Once the model is specified, it can be simulated. First users must select the simulation parameters:

iBioSim_simulation_options

Then users can run the simulation and obtain time series results indicating the expected amounts of each protein:

iBioSim_simulation_results

From this we see that the genetic circuit behaves similarly to the oscillator described in [2], but that due to protein degradation the oscillation quickly dampens and the system reaches steady-state operation.

References

  1. http://www.async.ece.utah.edu/iBioSim/docs/iBioSim_Tutorial.html
  2. http://badassdatascience.com/2012/10/30/simulating-a-synthetic-biology-circuit/
  3. http://en.wikipedia.org/wiki/SBML
  4. C. J. Myers, Engineering Genetic Circuits, Chapman & Hall/CRC Press, July 2009.

clustering stocks by price correlation (part 2)

In my last post, “clustering stocks by price correlation (part 1)“, I performed hierarchical clustering of NYSE stocks by correlation in weekly closing price. I expected the stocks to cluster by industry, and found that they did not. I proposed several explanations for this observation, including that perhaps I chose a poor distance metric for the clustering procedure, that perhaps daily rather than weekly sampling would have generated better results, or that stocks simply don’t cluster by industry very much. I also proposed generating k-means clusters to see if that revealed different results. This post details that k-means clustering effort.

K-Means Clustering

K-means clustering starts with a matrix of observation and feature values, resembling the data structure one might use in a regression [1]. This differs from hierarchical clustering where the data matrix is an N-by-N pairwise comparison of each of the observations by a similarity (or distance) metric. An example of an N-by-3 matrix for k-means clustering is:

feature_matrix_POST_CROP

This matrix is then normalized (or “whitened” in the language of k-means clustering).

Each observation in the “whitened” matrix is then randomly assigned to one of k clusters [2], and the following steps are repeated until a stop condition (usually minimal change in cluster assignment) is reached [2]:

  1. The algorithm computes the average value of each current cluster.
  2. The algorithm then assigns each observation to the nearest cluster (by the calculated average value).
  3. Finally, the algorithm tests for the stop condition and repeats if it is not reached.

The result is a set of k clusters grouped by similarity of feature values.

We will implement our calculation using Python’s scipy.cluster.vq [3] module.
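As a quick illustration of that API on random toy data (not the stock data), the basic whiten/kmeans/vq sequence is:

# toy example of the scipy.cluster.vq workflow used in the code below
import numpy as np
import scipy.cluster.vq

np.random.seed(0)
features = np.random.rand(100, 3)                    # 100 observations, 3 features

whitened = scipy.cluster.vq.whiten(features)                        # scale each feature by its std dev
centroids, distortion = scipy.cluster.vq.kmeans(whitened, 5)        # find 5 cluster centroids
assignments, distances = scipy.cluster.vq.vq(whitened, centroids)   # assign observations to centroids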

Method

I started with a list of daily closing stock prices for all the stocks listed in the New York Stock Exchange (NYSE), e.g.,

source_data_POST_CROP

I then extracted the closing price for every stock for every Tuesday between October 22, 2013 and October 28, 2014, so that we are sampling weekly and avoiding holidays. Stocks that were removed from the NYSE or that entered the NYSE during this period were discarded, thereby ensuring that each stock in the resulting data set had the same number of time points. [4]

For every stock in the dataset, I next calculated the Pearson correlation coefficient between that stock’s price time series and the price time series of each DJIA stock (NYSE only). This treats each correlation with a DJIA stock as a feature, as required by the data matrix used in k-means clustering. The resulting feature matrix had dimensions of 2601-by-27, i.e., 2601 stocks each compared to the 27 DJIA stocks listed on the NYSE, e.g.,:

DJIA_features_POST_CROP_2

Finally, I “whitened” the feature matrix and ran the Python k-means module on the whitened feature matrix.

Python code implementing these calculations is appended below.

Results

Like the hierarchical clustering results shown in [4], the groups do not seem to be organized by industrial sector:

sectors_by_group

This suggests a few possibilities: First, correlation in stock prices may simply not cluster by industry very much. Second, my use of k=50 (50 groups) may have been inappropriate; perhaps a different number of clusters would work better. Third, maybe sampling prices daily rather than weekly would produce better results. Finally, perhaps correlation is not the best feature to use.

I also tried using the weekly change in prices rather than the actual prices, but that attempt did not show any better grouping by industry.

Code

# import useful libraries
import scipy.cluster.vq
import scipy.stats
import numpy as np
import datetime

# user settings
number_of_groups_to_use = 50

# set seed for repeatability
np.random.seed(1000)

# load the Dow Jones Industrial Average symbols
djia = []
f = open('DJIA.tsv')
for line in f:
    line = line.strip()
    company = line.split('\t')[0]
    exchange = line.split('\t')[1]
    symbol = line.split('\t')[2]
    if exchange == 'NYSE':
        djia.append(symbol)
f.close()
djia = sorted(djia)

# specify the time points we want to sample and sort them
end_date = datetime.datetime(2014, 10, 28, 0, 0)
dates_to_use = [end_date]
dt = datetime.timedelta(days=-7)
for i in range(0, 53):
    dates_to_use.append(dates_to_use[-1] + dt)
dates_to_use = sorted(dates_to_use)
new_dates = []
for x in dates_to_use:
    year = str(x.year)
    month = str(x.month)
    day = str(x.day)
    if len(month) == 1:  month = '0' + month
    if len(day) == 1:  day = '0' + day
    new_dates.append(year + '-' + month + '-' + day)
dates_to_use = new_dates

# load the closing prices for the given dates into memory
prices = {}
f = open('../../combined_data.csv')
for line in f:
    line = line.strip()
    date_string = line.split(',')[1]
    if not date_string in dates_to_use:  continue
    symbol = line.split(',')[0]
    close = float(line.split(',')[5])
    if not prices.has_key(symbol):
        prices[symbol] = {}
    prices[symbol][date_string] = close
f.close()

# delete cases without all of the dates in dates_to_use
symbol_list = prices.keys()
for symbol in symbol_list:
    date_list = prices[symbol].keys()
    if len(date_list) != len(dates_to_use):
        del(prices[symbol])

# generate price time series
price_time_series = {}
for symbol in prices.keys():

    # actual prices
    price_time_series[symbol] = [prices[symbol][d] for d in dates_to_use]

    # change in prices
    #price_time_series[symbol] = [(prices[symbol][dates_to_use[n]] - prices[symbol][dates_to_use[n-1]]) / prices[symbol][dates_to_use[n-1]] for n in range(1, len(dates_to_use))]

# calculate R
new_symbol_list = sorted(price_time_series.keys())
R_dict = {}
for i in range(0, len(new_symbol_list)):
    if not R_dict.has_key(i):
        R_dict[i] = {}
    for j in range(0, len(djia)):
        symbol_i = new_symbol_list[i]
        symbol_j = djia[j]

        # Pearson's R
        R = 1.0
        if symbol_i != symbol_j:
            R = scipy.stats.pearsonr(price_time_series[symbol_i], price_time_series[symbol_j])[0]

        # record R
        R_dict[i][j] = R

# create the feature matrix
feature_matrix = np.zeros([len(new_symbol_list), len(djia)])
for i in R_dict.keys():
    for j in R_dict[i].keys():
        feature = R_dict[i][j] + 1.   # range 0 to 2
        feature_matrix[i][j] = feature

# cluster using k-means
W = scipy.cluster.vq.whiten(feature_matrix)
K = scipy.cluster.vq.kmeans(W, number_of_groups_to_use)
VQ = scipy.cluster.vq.vq(W, K[0])
groups = VQ[0]

# report results
f = open('output/groups.csv', 'w')
f.write('symbol,group\n')
for g, s in zip(groups, new_symbol_list):
    f.write(s + ',' + str(g) + '\n')
f.close()

# write the feature matrix to disk
f = open('output/features.csv', 'w')
f.write('symbol,' + ','.join(djia) + '\n')
for i in R_dict.keys():
    symbol_i = new_symbol_list[i]
    f.write(symbol_i + ',')
    values = []
    for j in sorted(R_dict[i].keys()):   # sorted so the columns match the DJIA header order
        values.append(str(R_dict[i][j]))
    f.write(','.join(values) + '\n')
f.close()

References

  1. http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.vq.whiten.html#scipy.cluster.vq.whiten
  2. http://homes.di.unimi.it/~valenti/SlideCorsi/MB0910/IntroClustering.pdf
  3. http://docs.scipy.org/doc/scipy-0.14.0/reference/cluster.vq.html
  4. http://badassdatascience.com/2014/12/31/clustering-stocks-1/

clustering stocks by price correlation (part 1)

I’ve been building my knowledge of clustering techniques to apply to genetic circuit engineering, and decided to try the same tools for stock price analysis. In this post I describe building a hierarchical cluster of stocks by pairwise correlation in weekly price, to see how well the stocks cluster by industry, and compare the derived groupings to an outside categorization by industry. In a later post I will try k-means clustering of the same source data.

Hierarchical Clustering

Agglomerative hierarchical clustering begins with a pool of clusters (single nodes) that have not yet been added to any hierarchy. These clusters are then iteratively merged by similarity into new clusters, with the relationships established at each iteration retained to form the hierarchy. When only one cluster remains, the procedure stops [1].

Reference [2] shows a useful image outlining when data may benefit from hierarchical clustering:

structure

Here we see that cases where a hierarchy is present in the data or where the data has largely differing densities are candidates for the use of hierarchical clustering.
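As a quick illustration on random toy data (not the stock data), the scipy linkage/fcluster workflow used in the Method below looks like:

# toy example of building and flattening a hierarchical clustering with scipy
import numpy as np
import scipy.cluster.hierarchy
import scipy.spatial.distance

np.random.seed(0)
points = np.random.rand(20, 2)                                    # 20 toy observations

condensed = scipy.spatial.distance.pdist(points)                  # pairwise distances
Z = scipy.cluster.hierarchy.linkage(condensed, method='average')  # build the hierarchy
groups = scipy.cluster.hierarchy.fcluster(Z, 4, criterion='maxclust')  # flatten to 4 groups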

Method

I started with a list of daily closing stock prices for all the stocks listed in the New York Stock Exchange (NYSE), e.g.,

source_data_POST_CROP

I then extracted the closing price for every stock for every Tuesday between October 22, 2013 and October 28, 2014, so that we are sampling weekly and avoiding holidays. Stocks that were removed from the NYSE or that entered the NYSE during this period were discarded, thereby ensuring that each stock in the resulting data set had the same number of time points.

For every pair of stocks in the dataset, I then computed the Pearson correlation coefficient between the two weekly closing price time series, compiling these values into a similarity matrix describing the similarity between each pair of stocks. I then converted the similarity matrix into a distance matrix, as required by my clustering algorithm. (I am not sure how valid the correlation coefficient is for autocorrelated time series; I need to check this later.)

Finally, I ran a hierarchical clustering algorithm on the distance matrix, and then flattened the results into 50 groups, producing the outcome discussed below.

Python code implementing these calculations follows this text.

Results

The resulting dendrogram looks like:

dendrogram_actual_pearson_FINAL

I flattened the results into 50 groups, some containing many stocks and some containing only two. A histogram of the number of stocks per group is shown below. The median group membership was 21 stocks with a maximum of 779. This makes me suspect (though I’m not certain yet) that the data has widely varying densities and sizes as per the density image shown above under the “Hierarchical Clustering” heading.

distribution_50_actual_pearson

A mapping of the stocks by industrial sector [3] showed that my groupings did not cluster well by sector, as I expected them to. An example for the 20 stocks placed into group six by the algorithm demonstrates this result:

sectors_assigned_to_groups

Possible Improvements

The lack of successful clustering by industrial sector suggests a few possibilities: First, use of the Pearson correlation coefficient as a distance metric might not be an appropriate measure of dissimilarity between stocks. Second, perhaps daily sampling instead of weekly sampling would have produced better results. Third, hierarchical clustering might not be the best means of clustering for this data; I’m going to try k-means clustering next. Finally, it is possible that stocks simply don’t naturally cluster by industrial sector as I expected them to.

Code

# import useful libraries
import pprint as pp
import scipy.cluster.hierarchy
import scipy.stats
import numpy as np
import matplotlib.pyplot as plt
import datetime

# specify the time points we want to sample and sort them
end_date = datetime.datetime(2014, 10, 28, 0, 0)
dates_to_use = [end_date]
dt = datetime.timedelta(days=-7)
for i in range(0, 53):
    dates_to_use.append(dates_to_use[-1] + dt)
dates_to_use = sorted(dates_to_use)
new_dates = []
for x in dates_to_use:
    year = str(x.year)
    month = str(x.month)
    day = str(x.day)
    if len(month) == 1:  month = '0' + month
    if len(day) == 1:  day = '0' + day
    new_dates.append(year + '-' + month + '-' + day)
dates_to_use = new_dates

# load the closing prices for the given dates into memory
prices = {}
f = open('../combined_data.csv')
for line in f:
    line = line.strip()
    date_string = line.split(',')[1]
    if not date_string in dates_to_use:  continue
    symbol = line.split(',')[0]
    close = float(line.split(',')[5])
    if not prices.has_key(symbol):
        prices[symbol] = {}
    prices[symbol][date_string] = close
f.close()

# delete cases without all of the dates in dates_to_use
symbol_list = prices.keys()
for symbol in symbol_list:
    date_list = prices[symbol].keys()
    if len(date_list) != len(dates_to_use):
        del(prices[symbol])

# generate price time series
price_time_series = {}
for symbol in prices.keys():
    price_time_series[symbol] = [prices[symbol][d] for d in dates_to_use]

# calculate R
new_symbol_list = sorted(price_time_series.keys())
R_dict = {}
for i in range(0, len(new_symbol_list)):
    if not R_dict.has_key(i):
        R_dict[i] = {}
    for j in range(i, len(new_symbol_list)):
        symbol_i = new_symbol_list[i]
        symbol_j = new_symbol_list[j]
        R = 1.0
        if symbol_i != symbol_j:
            R = scipy.stats.pearsonr(price_time_series[symbol_i], price_time_series[symbol_j])[0]
        R_dict[i][j] = R

# create the distance matrix
distance_matrix = np.zeros([len(new_symbol_list), len(new_symbol_list)])
for i in R_dict.keys():
    for j in R_dict[i].keys():
        similarity = R_dict[i][j] + 1.   # range 0 to 2
        distance = (-1.0 * (similarity - 2.0)) / 2.0
        distance_matrix[i][j] = distance
        distance_matrix[j][i] = distance

# create the linkage
Z = scipy.cluster.hierarchy.linkage(distance_matrix, method='average')

# function to set blank dendrogram labels
def llf(id):
    return ''

# plot
scipy.cluster.hierarchy.dendrogram(Z, leaf_label_func=llf)
plt.title('Hierarchical Clustering of Stocks by Closing Price Correlation')
plt.show()

# flatten
number_of_groups_to_use = [50]
for n in number_of_groups_to_use:
    f = open('output/' + str(n) + '.csv', 'w')
    f.write('symbol,group\n')
    T = scipy.cluster.hierarchy.fcluster(Z, n, criterion='maxclust')
    counts = {}
    for i, t in enumerate(T):
        if not counts.has_key(t):
            counts[t] = 0
        counts[t] += 1
        f.write(new_symbol_list[i] + ',' + str(t) + '\n')
    counts_list = []
    for c in counts.keys():
        counts_list.append(counts[c])
    f.close()

    # output summary statistics for the group sizes
    print
    print 'number of groups: ' + str(n)
    print 'minimum group size: ' + str(np.min(counts_list))
    print 'median group size: ' + str(np.median(counts_list))
    print 'maximum group size: ' + str(np.max(counts_list))
    plt.hist(counts_list)
    plt.title('Distribution of Number of Stocks per Group')
    plt.xlabel('Number of Stocks per Group')
    plt.ylabel('Frequency')
    plt.show()

References

  1. http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage
  2. http://homes.di.unimi.it/~valenti/SlideCorsi/MB0910/IntroClustering.pdf
  3. http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NYSE&render=download

simulating RNA-seq read counts

The Challenge

I want to explore the statistics of RNA sequencing (RNA-seq) on next-generation sequencing (NGS) platforms in greater detail, so I thought I’d start by simulating read counts to experiment with. This post details how I constructed a simulated set of read counts, examines its concordance with the expected negative binomial distribution of the counts across replicates, and then tests DESeq2’s analysis of the simulated read counts to determine how well it matches the expected outcome. Code implementing the calculations follows the description.

Constructing Simulated Read Counts

I first randomly sampled 1000 transcripts from RefSeq to obtain a list of representative transcript lengths, ensuring that no gene was represented more than once (i.e., no gene contributes two or more isoforms to the set). The resulting list and distribution of lengths look like:

transcript_counts

length_distribution
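The transcript selection itself is not included in the code appended below, which starts from a prepared data/transcript_lengths.csv file. The following is a minimal sketch of how such a file could be built; the input file refseq_transcripts.txt and its gene/accession/length columns are assumptions standing in for whatever RefSeq export is actually used.

# sketch: pick 1000 transcripts, at most one isoform per gene (input file name and columns are assumptions)
import random

random.seed(1)

# keep one randomly chosen isoform per gene
transcripts_by_gene = {}
f = open('refseq_transcripts.txt')
for line in f:
    fields = line.strip().split('\t')
    if len(fields) < 3 or fields[0] == 'gene':  continue   # skip header / malformed lines
    gene, accession, length = fields[0], fields[1], int(fields[2])
    transcripts_by_gene.setdefault(gene, []).append((accession, length))
f.close()
one_per_gene = [random.choice(isoforms) for isoforms in transcripts_by_gene.values()]

# sample 1000 of the deduplicated transcripts and write their lengths, relabeled 0..999
sample = random.sample(one_per_gene, 1000)
out = open('data/transcript_lengths.csv', 'w')
out.write('transcript,length\n')
for i, (accession, length) in enumerate(sample):
    out.write(str(i) + ',' + str(length) + '\n')
out.close()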

I then simulated a relative abundance metric for the 1000 transcripts using a gamma distribution. This relative abundance metric is completely unrelated to the true expression levels of the genes from which I drew length information; it is simply a randomly generated value assigned to each transcript length. (This is why I label the transcripts “0” through “999” rather than by their original RefSeq accession numbers: at this point in the construction of the simulated reads we depart from prior biological information.) The distribution of the relative abundance metric is shown below. The values on the X axis are meaningful only in relation to one another; “relative” here refers to each transcript’s abundance compared with the others, not to comparisons across biological samples or technical replicates.

gamma_distribution

At this point I duplicated the data into two parts: length/abundance information for condition “A” and length/abundance information for condition “B”. Condition “B” is an exact copy of condition “A”, except that in roughly 10% of its transcripts the abundance metric was halved and in another roughly 10% it was doubled. These two conditions represent the simulated biological samples we will compare. For each condition I then multiplied the abundance metric by 20 to increase the dynamic range and converted the result to an integer.

For each transcript and each condition, I then multiplied the transcript length by the scaled abundance metric to obtain the relative amount of sequence from which a read could originate. I added each transcript to a per-condition array that many times; we later sample from these arrays to generate read counts. Call these arrays “available hits A” and “available hits B” for the respective conditions.

I wanted to create three replicates of each biological condition, labeled “A1”, “A2”, “A3”, “B1”, “B2”, and “B3”. For each replicate, I randomly selected a sample size (total reads) between 12 and 15 million. I then sampled from “available hits A” three times and from “available hits B” three times using the randomly selected sample sizes, and counted the number of times each transcript was selected. The generated count data look like:

simulated_counts

To confirm that the simulation worked correctly, I plotted the log2 counts of A1 against the log2 counts of B1, by transcript, producing the following graph. As expected, roughly 10% of the B1 transcripts fall one log2 unit below their corresponding A1 values, and roughly 10% fall one log2 unit above.

log2_counts_per_transcript

Python code implementing the computations described above is appended to the end of this blog post.

Distribution of Transcript Counts Across Replicates

For each transcript whose nominal expression level was the same in both biological conditions (i.e., no doubling or halving took place), I calculated maximum likelihood estimates of the mean and variance of the simulated replicate counts under a negative binomial model. I then plotted the estimated mean against the estimated variance:

mean_vs_variance_neg_binom

From the image we see that the estimated variance grows faster than the estimated mean, i.e., the counts are overdispersed, as expected. This indicates that a negative binomial distribution is probably a good model for the replicate counts. A more formal goodness-of-fit test will have to wait for a later blog post.
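The fits themselves were done in R with fitdistr (see the code below). For completeness, here is a minimal Python sketch of the same idea under the same parameterization, fitting size and mu by minimizing the negative log-likelihood and then computing the variance as mu + mu^2/size; the replicate counts in the example are made up.

# sketch: maximum likelihood fit of a negative binomial (size, mu) to one transcript's replicate counts
import numpy as np
from scipy.stats import nbinom
from scipy.optimize import minimize

def negative_log_likelihood(params, counts):
    log_size, log_mu = params                     # optimize on the log scale to keep size, mu > 0
    size, mu = np.exp(log_size), np.exp(log_mu)
    p = size / (size + mu)                        # scipy parameterization: mean = size*(1-p)/p = mu
    return -np.sum(nbinom.logpmf(counts, size, p))

counts = np.array([2100, 2304, 1998, 2250, 2010, 2150])   # made-up replicate counts for one transcript
result = minimize(negative_log_likelihood, x0=[np.log(10.0), np.log(counts.mean())],
                  args=(counts,), method='Nelder-Mead')
size, mu = np.exp(result.x)
variance = mu + mu ** 2 / size                    # the overdispersed mean-variance relation plotted above
print('mu = ' + str(mu) + ', variance = ' + str(variance))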

DESeq2 Analysis

I next ran DESeq2 analysis on the two simulated biological samples’ sets of read counts, producing the following output:

R_output

The first block in the above image shows the simulated read counts for transcripts zero through four. We see that transcripts zero and one have “B” expression that is roughly twice that of their “A” expression. We also see that transcript four has “B” expression that is approximately half that of its “A” expression.

The second block in the image above shows that a log2 fold change of one is detected by DESeq2 for transcripts zero and one, and that a log2 fold change of negative one is detected for transcript four. This matches our expectation given the read counts listed in the first block. The p-values for these assessments of fold change are very small, indicating significance.

The third block in the image above shows a summary of all 1,000 transcripts’ results. We see that approximately 10% are upwardly differentially expressed between “A” and “B” and that approximately 10% are downwardly differentially expressed between “A” and “B”. This matches our original simulation design.
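One way to make this comparison explicit would be to check the DESeq2 calls against the known simulation design transcript by transcript. The sketch below assumes the output/expression.csv file written by the Python code below and a results.csv export of the DESeq2 results table (for example via write.csv(as.data.frame(res), "results.csv"), which is not part of the code below).

# sketch: compare DESeq2 calls against the simulated truth
# assumes output/expression.csv (written by the Python code below) and a hypothetical
# results.csv export of the DESeq2 results table
import csv

# truth: which transcripts were simulated as changed between conditions A and B
truth = {}
with open('output/expression.csv') as f:
    for row in csv.DictReader(f):
        a, b = int(row['expression_A']), int(row['expression_B'])
        truth[row['transcript']] = (a != b)

# calls: which transcripts DESeq2 flags at an adjusted p-value below 0.1
calls = {}
with open('results.csv') as f:
    for row in csv.DictReader(f):
        padj = row['padj']
        calls[row['']] = (padj not in ('NA', '') and float(padj) < 0.1)   # unnamed first column holds the transcript id

# tally agreement between the simulation design and the DESeq2 calls
agree = sum(1 for t in truth if calls.get(t, False) == truth[t])
print(str(agree) + ' of ' + str(len(truth)) + ' transcripts classified consistently with the simulation')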

The R code that produced the DESeq2 results and the negative binomial fits appears after the Python code in the Code section below.

Code

Python code used to produce the simulated counts:

# load useful libraries
from scipy.stats import gamma
import matplotlib.pyplot as plt
import random
import numpy as np

# user settings
number_of_transcripts = 1000
gamma_alpha = 2.7

# set the random seeds for repeatability
random.seed(2)
np.random.seed(seed=2)

# generate gamma-distributed transcript "expression" levels (relative expression, not absolute)
r = gamma.rvs(gamma_alpha, size=number_of_transcripts)
r1_list = r.tolist()
plt.hist(r1_list, color="lightblue")
plt.title("Simulated Distribution of Relative Abundance for 1000 Transcripts")
plt.xlabel("Simulated Relative Abundance")
plt.ylabel("Frequency")
plt.show()

# double 10% and halve 10% of the relative "expression" levels to simulate a different "biological" circumstance or cell
r2_list = []
for r1 in r1_list:
    r2 = r1
    unif = random.uniform(0, 1)
    if unif >= 0.8:
        r2 = r1 / 2.
    if unif >= 0.9:
        r2 = r1 * 2.
    r2_list.append(r2)

# convert the "expression" levels to integers and multiply by 20 to increase dynamic range
r1_int = []
r2_int = []
for r1, r2 in zip(r1_list, r2_list):
    r1_int.append(int(20. * r1))
    r2_int.append(int(20. * r2))
# quick, untitled sanity-check histograms of the integer abundance metrics for conditions A and B
plt.hist(r1_int)
plt.show()
plt.hist(r2_int)
plt.show()

# write the "known" expression levels to disk so that we may compare them to DESeq2 calculations
f = open('data/transcript_lengths.csv')
f_out = open('output/expression.csv', 'w')
f_out.write('transcript,length,expression_A,expression_B\n')
length = []
transcript = []
for i, line in enumerate(f):
    line = line.strip()
    if line.find('transcript') >= 0:  continue
    f_out.write(line + ',' + str(r1_int[i-1]) + ',' + str(r2_int[i-1]) + '\n')
    length.append(int(line.split(',')[1]))
    transcript.append(int(line.split(',')[0]))
f_out.close()
f.close()
plt.hist(length, color="lightblue")
plt.title("Distribution of Lengths for 1000 Selected Transcripts")
plt.xlabel("Length (Nucleotides)")
plt.ylabel("Frequency")
plt.show()

# create populations to sample from that reflect transcript size and relative abundance
r1 = []
for i, j, k in zip(transcript, length, r1_int):
    for n in range(0, j * k):
        r1.append(i)
r2 = []
for i, j, k in zip(transcript, length, r2_int):
    for n in range(0, j * k):
        r2.append(i)

# simulate read hits to transcripts:  generate three replicates for each of the two "biological" samples
samples = {1 : {}, 2: {}}
samples[1]['A'] = random.sample(r1, random.randint(12000000, 15000000))
samples[1]['B'] = random.sample(r1, random.randint(12000000, 15000000))
samples[1]['C'] = random.sample(r1, random.randint(12000000, 15000000))
samples[2]['A'] = random.sample(r2, random.randint(12000000, 15000000))
samples[2]['B'] = random.sample(r2, random.randint(12000000, 15000000))
samples[2]['C'] = random.sample(r2, random.randint(12000000, 15000000))

# count hits to transcripts
counts = {}
for bio in samples.keys():
    if not counts.has_key(bio):  counts[bio] = {}
    for replicate in samples[bio].keys():
        if not counts[bio].has_key(replicate):  counts[bio][replicate] = {}
        for t in samples[bio][replicate]:
            if not counts[bio][replicate].has_key(t):
                counts[bio][replicate][t] = 0
            counts[bio][replicate][t] += 1

# write simulated counts to file
f_out = open('output/counts.csv', 'w')
f_out.write('transcript,length,A1,A2,A3,B1,B2,B3\n')
for i, ln in enumerate(length):
    row = [str(transcript[i]), str(ln)]
    for bio, replicate in [(1, 'A'), (1, 'B'), (1, 'C'), (2, 'A'), (2, 'B'), (2, 'C')]:
        row.append(str(counts[bio][replicate].get(i, 0)))   # transcripts never sampled get a count of zero
    f_out.write(','.join(row) + '\n')
f_out.close()

R code used to conduct differential expression analysis:

# the following relies heavily on the text at:
# http://www.bioconductor.org/help/workflows/rnaseqGene/

# clear memory
rm(list=ls())

# load useful libraries
library(DESeq2)
library(gplots)
library(RColorBrewer)
library(MASS)

# load counts data into memory and generate matrix
raw.counts.data <- read.csv("counts.csv")
str(raw.counts.data)
raw.counts.matrix <- as.matrix(raw.counts.data[,c(3:8)])
rownames(raw.counts.matrix) <- raw.counts.data$transcript

# specify how the columns in the matrix are related
coldata <- data.frame(experiment=c("A", "A", "A", "B", "B", "B"))
rownames(coldata) <- c("A1", "A2", "A3", "B1", "B2", "B3")

# create a DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = raw.counts.matrix,
                              colData = coldata,
                              design = ~ experiment)

# normalize
dds <- estimateSizeFactors(dds)

# plot counts per transcript
plot(log2(1 + counts(dds, normalized=TRUE)[,c(1,4)]), pch=16, cex=0.4, col="blue",
	xlab="Log2 Counts for Sample A1",
	ylab="Log2 Counts for Sample B1",
	main="Counts per Transcript:  Sample A1 vs. Sample B1"
)

# generate the regularized-logarithm transformation
rld <- rlog(dds)

# similarity by clustering
sampleDists <- dist(t(assay(rld)))
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( rld$experiment, colnames(rld), sep="-" )   # label rows by condition and sample (this dataset has no dex/cell columns)
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
hc <- hclust(sampleDists)
heatmap.2( sampleDistMatrix, Rowv=as.dendrogram(hc),
          symm=TRUE, trace="none", col=colors,
          margins=c(2,10), labCol=FALSE )

# run differential expression analysis
dds <- DESeq(dds)

# view results
res <- results(dds)
res
summary(res)

# test negative Binomial mean vs. variance
MU <- c()
VAR <- c()
for (i in 1:1000) {
    X <- as.numeric(raw.counts.data[i,3:8])   # counts A1,A2,A3,B1,B2,B3 for transcript i
    ratio <- X[1] / X[4]                      # A1 / B1: keep transcripts with no simulated fold change
    if (0.70 <= ratio && ratio <= 1.3) {
        fit <- fitdistr(X, "negative binomial")
        size <- fit$estimate[1]
        mu <- fit$estimate[2]
        variance <- mu + (mu^2) / size
        MU <- c(MU, mu)
        VAR <- c(VAR, variance)
    }
}
plot(MU, VAR, main="Mean vs. Variance of Negative Binomial Fit of Read Counts, per Transcript", xlab="Mean", ylab="Variance", pch=16, cex=0.4, col="blue")