the science of gender identity (part 2: brain anatomy)


This is the second post in a mult-part series surveying the current science of gender identity, particularly with regard to the transgendered population. In my previous post I discussed the proposed genetic associations and corresponding research. A future post, if I can find sufficient data, will address neuropsychology research related to the transgender experience.

Here I discuss studies exploring differences in brain anatomy between transsexuals and cisgendered controls. The analysis is biased toward the male-to-female transsexual population versus the female-to-male population due to the availability of research data, which is unfortunate. As more research becomes available I will remedy this imbalance.

The research I describe below clearly points to structural differences in the brain between transsexual and cisgendered individuals. This is observational data—It still doesn’t answer why the structural differences emerge in the first place, though a predominant theory suggests variation in sex hormone uptake in the brain during fetal development is a cause [5]. My previous post discusses how this hypothesis connects to genetic features.

A Bit About the Words I’m Using

I’d prefer to use the umbrella term “transgender” to label the study participants described below. However, “transgender” is too broad, as the research I describe focused on those who particularly want to or have modified their bodies to become a member of a different sex, which not all transgendered individuals want to do. Therefore I use the medical term for this population: “transsexuals”.

Increased Putamen Volume

A comparison of MRIs from 24 male-to-female transsexuals, along MRIs from with 30 male and 30 female age-matched controls is reported in [1]. The transsexual study participants had not started hormone replacement therapy, therefore the research avoided a major possible confounding effect due to the possibility of hormone treatment altering brain anatomy [4]. However, the study did not control for the sexual orientation of subjects and controls, which has also been associated with brain anatomy [2, 3] and could confound the analysis. This decision was made because the sexual orientation of the controls was unknown since their MRIs came from a database that did not record that information.

The research found that male control and transsexual gray matter volumes were similar for all regions of the brain under investigation except the putamen, which was significantly larger for the transsexual group. The transsexual left and right putamen volume distributions were closer to that of the female controls, as shown in the following boxplots taken directly from the paper [1]:


A different view, also taken directly from the paper, displays the regions of the brain having significant volume difference between the transsexual subjects and male controls at p < 0.001 (FDR-corrected). Only the right putamen (in red) appears at this significance level.


These results suggest that male-to-female transsexuals carry a “feminized” putamen, which may help explain their differing gender identity compared to the male controls.

Increased Cortical Thickness

The authors of the study I described in the last section performed an additional study looking at cortical thickness differences between male-to-female transsexuals and cisgendered male controls [6]. As in the previous study, the subjects had not started hormone replacement therapy, and the researchers did not control for sexual orientation.

The study examined MRIs from 24 transsexuals and 24 age matched controls, comparing cortex thickness at thousands of points along the cortical surfaces. The statistical tests for difference at each point were corrected for multiple comparisons using false discovery rate. A map showing the significantly different regions, taken directly from the paper, is shown below [6]:


The following cortical regions were identified as different between the two groups: frontal cortex, orbito-frontal cortex, central sulcus, perisylvian regions, paracentral gyrus, pre/post-central gyrus, parietal cortex, temporal cortex, precuneus, fusiform, lingual, and orbito-frontal gyrus.

These findings strengthen the argument that brain anatomical differences exist between male-to-female transsexuals and cisgendered males.

Sex-Atypical Hypothalamus Activation

The research described in this section evaluated the hypothalamus activation of 12 male-to-female transsexuals when smelling two steroids known to elicit sex-differentiated responses: 4,16-androstadien-3-one (AND) and estra-1,3,5(10),16-tetraen-3-ol (EST) [7]. Data from a similar study by the researchers with an unspecified number of heterosexual male and female controls was available for comparison. The male-to-female transsexuals in the study were all heterosexual with regard to their birth sex; this eliminates the confounding effect of sexual orientation.

Both the transsexuals and the female controls experienced hypothalamus activation by AND, while the male controls experienced hypothalamus activation by EST. The activation heatmap image below, taken directly from the paper [7], illustrates the similar response to AND for male-to-female transsexuals (MFTR) and female controls (HeW), versus the response by the male controls (HeM). Furthermore, this image shows how the response to EST is distinct between transsexuals and control males.


This analysis suggests that transsexual hypothalamus activation by these steroids is birth-sex atypical.

White Matter Microstructure

[8] is the first analysis I found concerning female-to-male transsexualism. Unfortunately, I could only read the abstract since I couldn’t find the full article for free, but here is a summary of the findings:

Fractional anisotropy (FA) was performed on 18 female-to-male transsexuals, 24 male controls, and 19 female controls. The controls were heterosexual. The transsexual subjects had yet not started hormone replacement therapy. The FA procedure evaluated the white matter fibers of the whole brain.

FA values for the right superior longitudinal fasciculus, the forceps minor, and the corticospinal tract were compared between the groups. The values for the female-to-male transsexuals more closely resembled those of the control males than the control females. This lends support for the existence of brain structure differences between female-to-male transsexuals and cisgendered females.

Brain Anatomy More Congruent With Gender Identity Than Biological Sex

Another study [9] considered seven female-to-male transsexuals and ten male-to-female transsexuals simultaneously, along with age matched controls (eleven cis-females and seven cis-males). This seems small for a study, but the authors cite a limited subject pool. The subjects were given MRIs and brain regions distinguishable by the gender identity variable and the interaction between the gender identity variable and biological sex variable were identified.

The researchers identified four brain regions where gray matter volume of the transsexual subjects were identical to that of the controls sharing the subjects’ gender identity, but different from the controls sharing the subjects’ biological sex. The gray matter volume was higher in the right middle and inferior occipital gyri, the fusiform, the lingual gyri, and the right inferior temporal gyrus for male-to-female transsexuals and cis-female controls. In contrast, the gray matter volume was greater in the left pre-and postcentral gyri, the left posterior cingulate, the calcarine gyrus, and the precuneus in female-to-male transsexuals and cis-male controls.

To limit confounding effects, all transsexual recruits for the study were homosexual (with regard to birth sex) and had not started hormone replacement therapy. Nothing is stated in the paper about the sexual orientation of the controls.

Post Mortem Studies

Two post mortem studies are of note, though they are limited by small sample sizes and the fact that the male-to-female transsexual subjects involved had been treated by estrogen, which may impact brain plasticity [10].

The first study [11] observed that the size of male-to-female transsexuals’ bed nucleus of the stria terminalis was more typical of cis-female size. Similarly, [12] reported that male-to-female transsexuals’ volume and neuronal density of the interstial nucleus of the anterior hypothalamus was more cis-female typical.

Related Posts

the science of gender identity (part 1: genetics)


Posted in science, statistics | Tagged , , , , , , , , , | 1 Comment

the science of gender identity (part 1: genetics)


This is the first in a multi-part series surveying the current science of gender identity, particularly with regard to the transgendered population. I intend to discuss the genetic, brain anatomic, and neuropsychological findings of recent studies on the matter. As always, I will incorporate my own statistical analysis of raw study data wherever possible.

Here I start by discussing four studies involving genetic variations thought to be correlated with transsexualism. Some of these studies show promising leads toward increasing our understanding, others report limited or no findings. Limited or no findings does not imply that no genetic factors relate to transsexualism, just that none were found for the particular gene variant examined by the study.

My only beef with these studies is that they consider only one or a few genetic variations at a time. This is a limitation of the technology used. As the cost of whole-genome sequencing decreases, we’ll be able to look for simultaneous genetic variations that play a role in concert with each other.

Code and data for the analyses presented below is attached.

A Bit About the Words I’m Using

Two words I use in this post bother me, so I thought I’d explain my choice to use them.

First, I’d prefer to use the umbrella term “transgender” to label the study participants described below. However, “transgender” is too broad, as the research I describe focused on those who particularly modify their bodies to become a member of a different sex, which not all transgendered individuals want to do. Therefore I use the medical term for this population: “transsexuals”.

Second, “nucleotide variation”, which I associate below through analysis with transsexualism, implies there is a “normal” non-variation. The word is used to indicate that the particular DNA sequence involved is not present in most individuals’ genome. More common DNA variations are those that result in blue eyes vs. the more frequent brown, and certainly nothing is pathological about have blue eyes. In the same vein, I assert that nothing is pathological about transsexualism; its hypothesized genetic component is simply part of our genetic diversity.

Gene Promoter Variation rs549669867

A nucleotide variation (rs549669867) in the promoter for the gene CYP17A1 associates with female-to-male transsexualism according to a study outlined in [1]. CYP17A1 is a key gene involved in steroid metabolism, and this particular variation causes carriers to possess higher concentrations of both testosterone and estrodiol in their bodies [1]. These findings are consistent with a prevailing theory that extra testosterone causes masculinization of the female brain during fetal development, thereby contributing to development of gender dysphoria.

Here I present independent statistical reasoning based on data obtained from the study paper, which supports the researchers’ conclusions. These conclusions do not fully explain the origins of female-to-male transsexualism, as there were non-transsexuals included in the study who had the nucleotide variation, and there were transsexuals in the study who did not. However, the difference in frequencies of the variation’s occurrence between the transsexual and non-transsexual study participant groups is statistically significant.

First I’ll discuss the nucleotide variation itself. The following screenshot from the UCSC Genome Browser [2] shows 50 nucleotides upstream and downstream from the start of gene CYP17A1 on chromosome 10 of the human genome:


The variation we are examining is shown in the lower left, 34 nucleotides before the start of CYP17A1 (this is inside the “promoter” region of the gene). For the genomic strand sequenced in the study (any of two could have been chosen), the normal nucleotide at this position is a “T” and the variation is a “C”. From analysis of 1000 Genomes Project data, this variation is expected to occur on one of an individual’s two copies of chromosome 10 with a frequency of 0.02% [3].

Now the statistical analysis:

The study recruited 49 female-to-male transsexuals and 913 female controls, then sequenced their DNA in the promoter region of gene CYP17A1 to determine their genotype. The genotype could be one of three outcomes: “TT”, indicating lack of the nucleotide variation on both copies of chromosome 10; “CT”, indicating the variation occurs on only one of the chromosome 10 copies; and “CC”, indicating the variation is present on both copies of chromosome 10. The genotypes and their frequencies by group are listed in the following table:


We make two comparisons: The number of recessive genotypes vs. non-recessive genotypes (CC vs. CT + TT), and the number of dominant genotypes vs. non-dominant genotypes (TT vs. CT + CC). A variation often has to be recessive (present on both copies of its chromosome) to be biologically active, though this is not always the case.

Testing recessive vs. non-recessive genotype counts by study group using Fisher’s exact test yields a p-value of 0.03081, indicating a statistically significant difference exists between the transsexual and non-transsexual groups with regard to presence or absence of the recessive genotype.

Testing dominant vs. non-dominant genotype counts by study group using Fisher’s exact test yields a p-value of 0.05533, which is just over the commonly used threshold for declaring statistical significance.

It follows from this data and analysis that we can conclude that the recessive genotype is associated with female-to-male transsexualism. Again, this association does not explain all cases, e.g., why some non-transsexuals also have the recessive genotype, but it contributes to scientific efforts to understand transsexualism’s origins.

Gene Variation rs743572

Nucleotide variation rs743572 also impacts gene CYP17A1. Rather than residing in the promoter region of the gene as did rs549669867, this variation lies within the gene itself.

In the my analysis of this variation’s study data discussed below [4], the association between the variation and transsexualism (comparing transsexuals vs. controls) is not significant. However, the difference in the frequency of the variation between female-to-male transsexuals and male-to-female transsexuals is significant according to one of two statistical tests I conducted. (The study authors concluded the same thing, just with different p-values). Therefore I’m reporting this variation as notable with regard to our efforts to understand the genetic underpinnings of transsexualism. The difference between this variation’s frequency in female-to-male transsexuals vs. male-to-female transsexuals may lead to insight into the origin of each outcome separately (per nominal biological sex), rather than help provide a “one size fits all” explanation for transsexualism.

rs743572 resides 139 nucleotide positions from the start of gene CYP17A1. It occurs on one of individuals’ two copies of chromosome 10 with a frequency of 41% [5]. The fact that this variation is much more common than rs549669867 probably explains why the transsexualism vs. control association for the variation I investigate below does not prove statistically significant. The following screenshot from the UCSC Genome Browser [2] shows the variation on gene CYP17A1 within chromosome 10 of the human genome:


The study [4] whose data I analyze here recruited 151 male-to-female and 142 female-to-male transsexuals. The researchers also recruited 167 male and 168 female non-transsexuals. All were Spaniards with no possibly confounding health issues. Of these subjects, 36% of the male-to-female and 45% of the female-to-male transsexuals carried the variation. 39% of the male and 38% of the female non-transsexuals also carried the variation. Presence or absence of the variation was determined through DNA sequencing. From this data I constructed the following contingency table, rounding to get whole numbers:


Performing pairwise comparisons of the count proportions using both Fisher’s exact test and the Chi-squared goodness of fit test yields the following p-values:


As mentioned above, the only significant difference in variation proportions is in the comparison of female-to-male vs. male-to-female transsexuals, and this is only according to one of the two tests I used. Therefore this variation does not by itself seem a strong contributor to our effort to explain the differences in terms of genetics. However, a whole-genome comparison study on similar test subjects could elucidate whether this variation interacts with other variations to form a combined association with transsexualism.

Androgen Receptor Repeat Length Variation rs193922933

A study [6] correlated the androgen receptor (AR) gene’s CAG repeat length variation (rs193922933) with male-to-female transsexualism. I feel the researchers did not perform their statistical analysis correctly, and have remedied the situation below. However my conclusion was the same.

The AR gene’s CAG repeat length is highly variable between individuals. Each occurrence of the repeat appends an extra amino acid to the androgen receptor protein, as shown below. No information about the frequency distribution of this variation was readily available [7].


Longer CAG repeat lengths are known to diminish testosterone signaling, which impacts masculinization of the brain during development [6].

The study authors sequenced the CAG repeat region of 112 male-to-female transsexuals and 258 male controls. They report the length data in the following plot (but not their raw data) [6]:


Using the GNU Image Manipulation Program, I measured each bar to determine the percentages and reconstructed the source data, re-plotted as follows:


Here we see that the CAG repeat length medians between the transsexual subjects and the controls differ by one (with the transsexual group’s median being longer), and that the interquartile limits are identical. The control group has a heavier lower tail.

The researchers compared the means using a t-test, which I am uncomfortable with due to the skew in the male controls’ distribution. Therefore I performed a quasi-Poisson regression since this is underdispersed count data. That analysis reported a statistically significant difference between the two groups.

I could not find data on the practical significance of a median difference of one CAG repeat length.

Negative Results

Another study [8], found no association between CAG repeat length variation in the AR gene and transsexualism. Furthermore, it found no association between transsexualism and variations in four other sex hormone-related genes: estrogen receptors alpha and beta, aromatase CYP19, and progesterone receptor PGR.

However, I was not able to review the study’s paper due to its cost, to make sure the statistical analysis was valid (it often isn’t in medical journals). Furthermore, I don’t know whether the study considered interactions between the genes, a missed opportunity if it hadn’t. Eventually I’ll buy the paper and report on these open questions.

More Research Needed

A search of DisGeNet (a database of disease-gene annotations) [9] for the term “transsexualism” shows only five genes and five PubMed publications covering the subject. This reveals the dearth of research on the matter. The image below showing the genes and PubMed articles extracted from the search comes from my own implementation of DisGeNet’s data within a graph database, which I will discuss in a future blog post.



I of course object to DisGeNet’s labeling of “transsexualism” as a disease, and to its connection with the MeSH term “mental disorders”. I’ll contact DisGeNet about this issue shortly.

Related Posts

the science of gender identity (part 2: brain anatomy)

Code and Data





Posted in bioinformatics, science, statistics | Tagged , , , , , , | 1 Comment

HRC Corporate Equality Index correlates with Fortune’s 50 most admired companies


The Human Right’s Campaign, one of America’s largest civil rights groups, scores companies in its yearly Corporate Equality Index (CEI) according to their treatment of lesbian, gay, bisexual, and transgender employees [1]. The companies automatically evaluated are the Fortune 1000 and American Lawyer’s top 200. Additionally, any sufficiently large private sector organization can request inclusion in the CEI [2].

Similarly, Fortune Magazine publishes an annual list of 50 of the world’s most admired companies [3]. Companies are rated by financial health, stock performance, leadership effectiveness, customer sentiment, scandals, and social responsibility.

I became curious whether CEI scores correlate with membership in Fortune’s most admired list, so I matched the two datasets and analyzed the outcome. The results (below) are striking. Code implementing the calculations, with the source data, is attached.


Plotting the CEI score distributions by whether a company was included in Fortune’s list produced:


From this difference in distributions it is clear that the status of being “most admired” correlates with a high CEI score, though there are a few outliers. In the distribution on the left, we see that over 50% of the companies in Fortune’s list held the top CEI score of 100, whereas only 25% of the companies not contained in Fortune’s held the top score. The median score for the most admired group was 100 while for the companies not included in Fortune’s list it is about 80. Over 80% of the most admired companies scored 90 or above. The variance is much wider for the companies not included on the list. Statistical analysis comparing the two groups, detailed below, confirms the correlation.

While correlation does not imply causality, this analysis suggests two things: First, the type of leadership necessary to achieve a high CEI score is the same type of leadership that leads to inclusion in Fortune’s most admired companies group. Second, any company aspiring to membership in the most admired group might consider developing its CEI score.

There is one possible source of bias, but I don’t expect that it is large: “Social responsibility” is used in Fortune’s rankings, which may include CEI scores (I don’t know). However, Fortune’s emphasis on financial health and stock price probably trumps any contribution that the CEI would generate alone. Furthermore, in the CEI score distribution for the most admired companies, there are outliers containing extremely low scores. This suggests that the CEI played little if any role in the selection of most admired companies.


I manually copied and pasted the company names and scores from the CEI online database [1]. Then I cleaned up the results to create a manageable CSV file. Similarly, I copied and pasted the Fortune 50 most admired company list [3] into another CSV file. After that, I matched the two datasets by hand. Perhaps I could have performed the match algorithmically, but I would have had to worry about different representations of company names between the two datasets, e.g. “3M Co.” vs. “3M”. There was only 50 cases so the manual match did not take long.

Two cases in Fortune’s list had to be excluded, BMW and Singapore Airlines, because they were not included in the CEI, possibly because they are based outside the USA. In the case of two other non-US companies in Fortune’s list, Toyota and Volkswagen, I matched to Toyota Motor Sales USA and Volkswagen Group of America, respectively.

Finally, I plotted the CEI score distributions shown above and performed the statistical analysis reported below using the attached Python code.

Statistical Analysis

The extreme difference in variance between the two groups makes it impossible to compare medians using a non-parametric test, and the distribution of the CEI scores does not lend itself to a clean regression analysis. Therefore I built the following contingency table from the data:


The p-value for this table obtained from Fisher’s exact test is 4.53e-08, indicating that the proportions are significantly different.



Code and Data



Posted in data science, econometrics, marketing, statistics | Tagged , , , , , , , , , , | Leave a comment

fast genomic coordinate comparison using PostgreSQL’s geometric operators

PostgreSQL provides operators for comparing geometric data types, for example for computing whether two boxes overlap or whether one box contains another. Such operators are quick compared to similar calculations implemented using normal comparison operators, which I’ll demonstrate below. Here I show use of such geometric data types and operators for determining whether one segment of genomic sequence is contained within another.

Suppose we have a 391 row table called “oligo” containing the following information:


“Sequence” and “chromosome” are strings, but “coordinates” is of PostgreSQL geometric data type “box”. The coordinates show the (x, y) position of one corner and its opposite corner. Notice that I set the y values to zero, since we are only dealing with nucleotide positions on the chromosome. (For some reason the procedure I’m demonstrating here does not work with the “line segment” data type).

Similarly, suppose we have a 4,536,771 row table called “chromatin” containing:


We want to construct a list of oligo table entries whose coordinates are contained within a chromatin table entry’s coordinates, by chromosome. To do this we use PostgreSQL’s “@>” (contains) operator as follows:

SELECT o.sequence, chr.chromosome, CL.cell_line, 
m.method FROM oligo o, chromatin c, 
chromosome chr, cell_line CL, method m 
WHERE = c.chromosome_id AND 
o.chromosome_id = c.chromosome_id AND = c.cell_line_id AND = c.method_id 
AND c.coordinates @> o.coordinates;

Assuming that we properly indexed the tables (see attached SQL code), this procedure takes about 11.5 seconds on an Amazon t2.micro instance to get through all the rows under consideration.

Now suppose instead that we used ANSI standard SQL’s comparison operators. We would need modified tables:



After indexing, we compare the rows using:

SELECT o.sequence, chr.chromosome, CL.cell_line, 
m.method FROM oligo_start_stop o, 
chromatin_start_stop c, chromosome chr, cell_line CL, 
method m WHERE = c.chromosome_id AND 
o.chromosome_id = c.chromosome_id AND = c.cell_line_id AND = c.method_id AND 
c.start <= o.start AND c.stop >= o.stop;

This procedure now takes about 15.3 seconds.

If we run both methods 100 times and collect the query time statistics, we obtain the following boxplot:


The medians for the groups are a full 3.8 seconds apart, indicating that the PostgreSQL geometric “contains” operator performs significantly faster than the ANSI SQL comparison operators (p-value = 1.28e-34, Mann-Whitney U test).

Note on Indexing

To run the indexing commands in the attached code for a combination of an integer (chromosome ID) and a geometric box (genomic coordinates), you need to install the PostgreSQL extension “btree_gist”.



Posted in big data, bioinformatics, data science, engineering, science | Tagged , , , | Leave a comment

Bayesian network modeling stock price change

Taking a cue from the systems biology folks, I decided to model stock price change interactions using a dynamic Bayesian network. For this analysis I focused on the members of the Dow Jones Industrial Average (DJIA) that are listed on the New York Stock Exchange (NYSE).

Bayesian Networks

A Bayesian network is an acyclic directed graph where the nodes are random variables and the edges indicate conditional probabilistic relationships between the nodes. For example, a given node might represent the percent change in a stock’s price, and edges entering the node might represent factors that influence the node’s value.

The graph’s structure may be specified by a knowledgeable researcher, or may be learned from data using a search algorithm. We will demonstrate the latter case below. When learning from time series data, the graph need no longer be acyclic and the resulting graph is called a dynamic Bayesian network. The reason the acyclic requirement is dropped is that one node might influence a second node at one time point, while the second node influences the first node at another time point. This is a common occurrence in biological circuits which contain feedback loops.


The following is a dynamic Bayesian network learned from a year of NYSE DJIA stocks’ daily changes in closing prices. The method used to learn the network from the data is described below.


From this we see that the probability associated with stock JNJ’s daily change in closing price is conditional on stock V’s daily change in closing price, and subsequently the probability associated with stock XOM’s daily change in closing price is conditional on JNJ’s daily change in closing price. Because this is a dynamic Bayesian network, we also see pairs such as XOM and UTX which are mutually dependent on each other. This results from the fact that the first stock in the pair might influence the second at one date, and the second might influence the first on a different date.

These results are pure unsupervised machine learning; I have not yet analyzed whether the proposed interactions make sense economically. Such analysis will be required to validate the method.

The graph shown above was generated by GraphViz.


The following describes the procedure used to learn the above network from the data. Python code implementing the procedure is attached to the end of this post.

I started with a list of daily stock prices for all the stocks listed in the NYSE, e.g.,


I then filtered out all symbols not found in the DJIA and all dates outside of the range 29 October 2013 through 28 October 2014. I also discarded the open, high, low, and adjusted prices from the above data structure, as well as the volume, since we are only modeling change in closing price.

Next, I sorted each symbol’s closing price by date and computed a “percent change in closing price” time series for each symbol from the sorted closing price time series.

I then “discretized” the data as follows: For each trading day in the time series, I computed the mean percent change in closing price across all stocks. Using this I created a new time series for each stock, assigning a zero to the date if the percent change was below the daily mean, and a one to the date if the percent change was equal to or above the daily mean. This produced a series of ones and zeros for each stock.

I performed this discretization process to simplify my learning of Bayesian network learning procedures, taking the idea from gene expression modeling where indication of whether a gene is expressed or not might be dependent on its expression level in relation to a control gene’s expression level. Nonetheless, a more sophisticated model could have used a trinary discretization, or no discretization at all.

I next “transposed” the discretized time series as follows: Placed each time series for each stock as a row in a matrix, and then transposed the matrix so that each stock is a column and each row is a “system state” for the stocks on a given day. (The “system state” is made up of zeros and ones indicating movement of the percent change in price, per stock). I assigned a probability (hereafter called the “base state probability”) to each of these system states as one divided by the total number of system states, and then checked for duplicate system states, adjusting the base state probability accordingly.

I then created an initial Bayesian network graph, assigning a node to each NYSE DJIA stock and no edges, as shown by the following networkx image:


The learning procedure applies a greedy algorithm to search for an optimal graph: mutating the graph (randomly adding, reversing, or removing an edge) and scoring it against the data. If the mutated graph’s score exceeds the previous graph’s score, the new graph is retained and is further mutated until a set number of iterations has run. The scoring procedure is described below:

Bayes theorem provides a means to relate the probability of observing data given a graph structure to the probability of observing a graph given data


In this equation, P(Data) is a constant that does not matter because we are simply trying to maximize P(Graph|Data). The prior P(Graph) is a function of the graph structure that we use to penalize graphs with many edges over graphs with fewer edges, seeking the graph that provides the simplest explanation of the data. Taking the log of each side gives us the equation we use for scoring a graph given the data


We are left with the need to compute P(Graph|Data), given by the following equation


Here t = 1 through T are the individual time points, and i = 1 through n are the nodes (stocks). For each time point t, we compute the probability of Xi given the state of its parent nodes in the graph. This is computed using the base state probabilities defined above and the system state at the time point.

I mutated the graph as described above 10,000 times, scoring each iteration and comparing it to the previous high score, which generated the network shown in the Results section. However, for the final graph I computed P(Data|Graph) for each time point, and found that the resulting probabilities proved really small:


These low probabilities make me suspicious of the resulting network. Again, analysis of how the particular stocks that the algorithm joined together relate economically is necessary to evaluate the method.


Posted in econometrics, engineering, science | Tagged , , , , , , , , , , , , , , , , , | Leave a comment

reflections on living in an RV for six months

I’ve been living in my RV for six months now, and overall I like doing so. Recently, one of my friends expressed interest in full time RV living, so I wanted to write down all the challenges so that he knows what to expect before taking the plunge. Before I do that though, I’m going to list the good stuff:

Benefits of Full Time RV Living

Cheap Rent

The primary benefit is cheap rent. Monthly rates at RV parks are very inexpensive compared to apartment rentals in my area (San Diego county). When you add in the cost of financing the RV, the total monthly cost resembles that of a studio apartment, but you get to keep the RV. This assumes you live in an RV park. I suppose it would be possible to boondock frequently for even less money, but since I have a job I need a stable address. (See below for a listing of the RV park-related challenges that must be considered).

Less Stuff

Fitting into an RV means having less stuff, unless you rent storage space which I didn’t. I find owning few things liberating. I still have room in my RV for musical instruments (sans amplifiers), a few books, and a computer, so I have no trouble staying entertained. Not having many material things to manage has provided substantial freedom from intellectual distraction.

Easy to Move

If I get laid off from my job, which I don’t expect to happen, moving will be easy. I’ll just hitch my trailer to my truck and be off to wherever new economic opportunity resides. Alternatively, if I choose to pursue a PhD after completing my Master’s degree I’ll simply tow my rig to wherever I attend school.

Ecological Footprint (The Benefit)

An RV impacts the land it is set upon less than an apartment or house does, leaving less of an ecological footprint.

Feeling Less Screwed

While I still have to pay rent to the RV park, I don’t feel I’m paying as much to “the man” as I would in an apartment setting, which suits my slightly anarchistic world view. Saving more money increases my overall freedom.

Challenges of Full Time RV Living


My cat does not have as much room in my RV as he had in my last apartment. However, he is an old cat who doesn’t move around much. And he has lots of toys to play with so I don’t worry much about him. The biggest challenge was figuring out where to fit in litter boxes, and how to keep litter from getting all over the RV. (Litter gets into the floor-mounted heating ducts for instance). I partially solved the problem by laying plastic sheeting in a corner and placing the litter boxes on top. This is not a perfect solution but it works well enough:


Insulation and Propane Use

While I live in mild San Diego county, I live in the foothills of the mountains and therefore it can get cold at night during the winter. During these nights I go through a lot of propane since my RV’s insulation is poor. I estimate that during December I spent $100 on propane, which I used only for heating (not cooking or water heating). Fortunately, my RV park has a propane filling service where you leave an empty bottle on the driveway in the morning and they return it full in the afternoon, billing you at the end of the month.

The poor insulation was a problem in the summer too where I’d have to run my air conditioner during the day to keep my cat cool. I estimate the monthly electricity bill was $100.

I hear that the trick to getting an RV with good insulation is to buy one made for Canadians. I’ve thought of fixing the “reflective bubble wrap” insulation you can buy at Home Depot to the walls to improve the situation, but haven’t figured out a good way to attach the insulation in a way that doesn’t leave a mark when the insulation is removed.

Internet Access

My RV park has very poor WiFi, which also costs $30 a month. This did not meet my schooling needs so I bought mobile hotspot service from my phone provider. This provides only 4G of data at 4G LTE speed per month, which is enough for browsing but not enough for much video (I had to get rid of Netflix streaming service). I am fortunate that my RV park, which is rather remote, is still within 4G range. To adapt to the limited monthly download volume, I download my class lectures at work and save them to a flash drive, and also download large data files to an Amazon EC2 instance and manipulate the files there (more on Amazon EC2 below). I also take my laptop to the library often to use the WiFi there.

Computing Power

I do not have room for a large desktop computer, which limits the computing power I have available locally. Readers of this blog will be aware that I perform computationally demanding tasks regularly. To get around the problem I rent computing power from Amazon when I need it. Using Amazon, I probably break even with the overall cost of buying a powerful desktop computer, so nothing is really lost. All I need to make this arrangement work is the 4G internet access discussed above. This works because all data is processed and stored on the Amazon EC2 instances, so only results and commands need to be passed between my laptop and the EC2 instances, thereby minimizing the amount of the monthly data download availability I consume.

Lab Space

I am both a scientist and an engineer, and therefore am constantly tinkering with something. Usually this tinkering involves only a computer so there is no need for lab space. However, there are projects I would like to take on that require space, such as learning to use 3D printers, that I simply do not have. I only have one table and that holds my laptop and rather large monitor. The table provides enough room to mess with things like Raspberry Pis, but again, a 3D printer or an autonomous robot might be too big for the space I have.

Small Space (Need for Storage Efficiency)

Living in an RV requires storage efficiency. For example, I use my bunks as storage racks:


Similarly, under the bed I store books:


Note that when planning storage, you must not exceed your towable weight limit.

Having to Own a Truck

My RV is a travel trailer, so I have to own a truck to tow it with. Since it makes little sense for me to own two vehicles, I’m not buying a more economical hybrid vehicle that would make my compute cheaper and more ecologically friendly. So I feel I am stuck with my gas-hogging truck for the time I own the travel trailer.

I could have bought an RV with its own engine (at substantially greater cost) and then bought a vehicle suitable for towing behind it. However, this arrangement was out of my price range at the time I bought the travel trailer, since I already owned the truck with no debt.

Issues With RV Parks

I like the RV park I am in, having no complaints other than the great distance I have to travel to get to work. However, there are three issues one needs to be aware of when staying at a park:

1. Your RV must be younger than some date set by the park, and in good condition, or they won’t let you stay there. This precludes buying a very cheap, very used RV and fixing it up.

2. The RV parks in the cities around me (San Diego county) have limits on how long you can stay. Usually, they require you to leave for two days following a six-month stay. I think these are city laws but am not sure. Out in the county, I’ve not seen the same constraint, so I live in the county far from my workplace.

3. The RV parks close to where I work cost substantially more, up to twice as much per month. This is because they are closer to the beach. However, there is a trade off since living at one of these would reduce my commute, but I’d have to deal with issue #2 above.

Long Commute

As mentioned above, I live in a relatively remote part of San Diego county compared to where I work, so my commute is typically 45 minutes one way. Fortunately, the county has a good public radio station.


I haven’t mastered cooking in such a small environment yet, and therefore use a microwave for most of my meals. My RV has a propane stove top with three burners but no oven. There is also no dishwasher, making cleanup a bit more annoying. I also haven’t mastered food storage in the small space yet. The small refrigerator has proven big enough for my needs though.

Ecological Footprint (The Challenge)

My long commute, having to own a truck, and the poor insulation all contribute to the ecological footprint of my living in an RV.

Lack of Study Space

Similar to the lack of lab space is a lack of good study space. Again, my only table is taken up by my computer and monitor, leaving no room to spread out books and lecture notes. I find studying on my bed instead makes me tired so I usually do my homework at work (after hours).

Not Sure What To Do When Lifestyle Changes

I expect I won’t live in an RV forever, e.g., when I start a family. Therefore I’ll eventually need to go through the hassle of retiring the RV. This will mean either storing it or selling it. Storing it will be expensive, and selling it suffers the effect of great depreciation. Moreover, I might still owe on the RV loan after such a change in lifestyle. My response to this eventuality has been to pay off the loan aggressively.


My RV was not designed for full time living (some are). For example, latches are weak and cabinets are easily damaged. Moreover, full time living might void the warranty according to the users’ manual. I did not realize this until I had bought the unit. If there is a next time, I’ll buy a unit made for full time living.

Solar Power

Using solar power to charge batteries makes sense for boondocking, but mixing solar power with park-based living may not provide much benefit. I haven’t seen RV solar installations that feed power back into the grid once the batteries are charged (though admittedly I haven’t looked as hard as I could). This is something I’ll explore more in the future, but for now I’m suggesting that investing in solar power for park-based RV living needs to be carefully evaluated before taking the plunge.

Tank Sensors Giving Me Trouble

From the moment I started using my RV, my sensor on my black-water holding tank has been reading three-quarters full, no matter what chemicals I dump into it. Therefore I’m not always sure when to dump the tank. Even when living in a park, it is best for the holding tanks to let them fill before emptying them, rather than simply leaving the valve open all the time.

12 V

As expressed in my post “engineer moves into an RV“, there are no 12 VDC receptacles for running 12V appliances in my RV. This is no problem for park living, but if I boondock using the batteries for power I’d like to be able to use 12V devices. Furthermore, there is no easy place to connect an inverter to my 12V power supply. These are things I wish I checked for when shopping for RVs.


I’ve found plumbing to be a pain in my RV. There are two problems: First, you can’t use chemicals like Draino to clear blockages since they will damage your holding tanks, so you have to manually clear pipes. Second, the tight spaces make getting to the pipes a challenge.


If you let the water heater run for 45 minutes, you can take a warm, roughly 10 minute shower. However, you’ll lose warm water after that. In drought-stricken southern California, one should not take a long shower anyway. You have to remember to turn off your water heater when you are not using it to save propane.

Related Posts

selecting travel trailers by regression
engineer moves into an RV

Posted in Uncategorized | Tagged , , , | Leave a comment

gene annotation database with MongoDB

After reading Datanami’s recent post “9 Must-Have Skills to Land Top Big Data Jobs in 2015″ [1], I decided to round out my NoSQL knowledge by learning MongoDB. I have previously reported NoSQL work with Neo4j on this blog, where I discussed building a gene annotation graph database [2]. Here I build a similar gene annotation database, but using the MongoDB platform and its design patterns.


MongoDB is a database system that stores documents made up of key-value pairs. The values may themselves be other documents of key-value pairs, or arrays, resulting in an overall document format similar to JSON. These JSON-like documents are then compiled into sets called “collections”, which resemble tables in relational databases (typically documents describing similar things are grouped within the same collection). Unlike relational databases, however, is the fact that the documents in a collection do not have to have exactly the same keys (fields). Like a relational database table, a collection may be indexed by one or more of its keys, or by a key of embedded documents within the documents of the collection. Additionally, two collections may be matched on mutual identifiers, similar to a join in a relational database, but the matching is done on the client side rather than on the server.

These concepts will be illustrated by the example case shown below: a gene annotation database containing gene information for 15079 species, along with RefSeq information for each gene.

Gene Annotation Database

I first downloaded the “gene_info” and “gene2refseq” files from NCBI’s FTP site [3] and inserted them into MongoDB using the Python code appended to the bottom of this post. This created two collections, “gene_collection” which stores gene information–one gene per document–and “refseq_collection” which stores protein/transcript information–one RefSeq entry per document. Each document in the RefSeq collection points to the gene document that the RefSeq sequences correspond to.

Searching the gene collection by NCBI Entrez gene ID 675 yields a full gene document:


Here we can see that MongoDB assigned an internal ID “_id” to the document, which we will use later to reference it from the RefSeq documents. We see standard key-value pairs such as “GeneID” which contains an integer value specifying the Entrez gene ID and “type_of_gene” which contains a string describing whether the gene is known to code for a protein. For the key “dbXrefs”, notice that its value is an embedded document that contains its own key-value pairs linking the gene to other database identifiers. Keys “Synonyms” and “Other_designations” have arrays as values.

Searching the same collection by NCBI Entrez gene ID 544203 yields a gene document with less keys:


This shows how documents within a collection need not share the same keys. It is helpful to have some of the same keys to have a meaningful data model design, but by not forcing every document to have the same schema one avoids the heavy use of null values often found in relational database tables.

One can search a collection by a top-level key as demonstrated above. You can also search for documents with an array containing an element, e.g.,


In this case we searched for human genes with “FAD” contained in their documents’ “Synonyms” array and listed only the IDs and symbols from the resulting documents.

Finally, we can search by the key of an embedded document:


Here we searched for the Ensembl ID “ENSG00000080815″ in the documents embedded under the key “dbXrefs”.


Like a relational database system, MongoDB enables users to index collections to improve query performance. However, unlike a relational database, the indices need not only be on the top-level keys, they can also be built from keys inside embedded documents or built from array contents.

The following table shows the performance improvement (in query times) between an indexed and non-indexed version of this gene annotation database:



MongoDB does not allow server-side joins between collections like those found in relational databases. Instead, collections may be matched on the client side by an application through use of a common reference. For example, in the image below we see that gene 5663 has an internal “_id” of “54b067d44f2ff124afc15cbf” in the gene collection. When we loaded the RefSeq data into the RefSeq collection, we matched each RefSeq entry to their respective genes through the “gene_document_id” key, which points to the corresponding “_id” in the gene collection. The ID and reference are highlighted in green in the image below:


The following Python code shows how to implement a client-side match by these identifiers:

# useful libraries
from pymongo import MongoClient

# connect to MongoDB
client = MongoClient('localhost', 27017)
db = client.gene
refseq_collection = db.refseq_collection
gene_collection = db.gene_collection

# find gene 5663
gene = gene_collection.find_one({'GeneID' : 5663})
gene_document_id = gene['_id']

print 'Showing RefSeq sequences for gene 5663:'
print '\t', 'RNA', '\t', '\t', 'Genomic', '\t', 'RefSeq Collection Document ID'

# find RefSeq data for gene (by reference)
refseq = refseq_collection.find({'gene_document_id' : gene_document_id})
for r in refseq:
    if r.has_key('RNA_nucleotide_accession_version') and r.has_key('genomic_nucleotide_accession_version'):
        print '\t', r['RNA_nucleotide_accession_version'], '\t', r['genomic_nucleotide_accession_version'], '\t', r['_id']


Running this code shows all the RefSeq RNA and genomic nucleotide sequences associated with gene 5663:



I have not shown all the features of MongoDB using this gene annotation database example. For instance I left out discussion of aggregation, scripting, and scalability across multiple servers. However, I believe I have shown enough to demonstrate knowledge of MongoDB and to demonstrate that it is a viable alternative to traditional relational databases for Big Data applications.


Python code for loading NCBI’s “gene_info” table [3] into MongoDB is:

# useful libraries
import datetime
from pymongo import MongoClient

# connect to MongoDB
client = MongoClient('localhost', 27017)
db = client.gene
collection = db.gene_collection

# get header
f = open('data/gene_info')
for line in f:
    header = line.strip().split(' ')[1:16]

# load data
f = open('data/gene_info')
for line in f:
    line = line.strip()

    # remove commented out lines
    if line[0] == '#':  continue

    # deal with missing values
    split_line_with_dashes = line.split('\t')
    split_line = []
    for s in split_line_with_dashes:
        if s == '-':

    # specify data types of values
    split_line[0] = int(split_line[0])
    split_line[1] = int(split_line[1])
    year = int(split_line[14][0:4])
    month = int(split_line[14][4:6])
    day = int(split_line[14][6:8])
    split_line[14] = datetime.datetime(year, month, day, 0, 0)

    # split the aliases
    if split_line[4] != None:
        split_line[4] = split_line[4].replace(' ', '').split('|')

    # split the other designations
    if split_line[13] != None:
        split_line[13] = split_line[13].split('|')

    # split the DB cross references
    if split_line[5] != None:
        db_cross_refs = split_line[5].replace(' ', '').split('|')
        cross_refs = {}
        for dcr in db_cross_refs:
            db = dcr.split(':')[0]
            ref = dcr.split(':')[1]
            cross_refs[db] = ref
        split_line[5] = cross_refs

    # build document
    document = {}
    for h, s in zip(header, split_line):
        if s != None:
            document[h] = s

    # insert document into MongoDB


Python code for loading NCBI’s “gene2refseq” table [3] into MongoDB is:

# useful libraries
from pymongo import MongoClient

# connect to MongoDB
client = MongoClient('localhost', 27017)
db = client.gene
collection = db.refseq_collection
gene_collection = db.gene_collection

# get header
f = open('data/gene2refseq')
for line in f:
    header = line.strip().split(' ')[1:17]
    header = [h.replace('.', '_') for h in header]

# load data
count = 0
f = open('data/gene2refseq')
for line in f:
    line = line.strip()

    # remove commented out lines
    if line[0] == '#':  continue

    # deal with missing values
    split_line_with_dashes = line.split('\t')
    split_line = []
    for i, s in enumerate(split_line_with_dashes):
        if s == '-' and header[i] != 'orientation':

    # convert data types
    if split_line[0] != None:  split_line[0] = int(split_line[0])
    if split_line[1] != None:  split_line[1] = int(split_line[1])
    if split_line[4] != None:  split_line[4] = int(split_line[4])
    if split_line[6] != None:  split_line[6] = int(split_line[6])
    if split_line[8] != None:  split_line[8] = int(split_line[8])
    if split_line[9] != None:  split_line[9] = int(split_line[9])
    if split_line[10] != None:  split_line[10] = int(split_line[10])
    if split_line[14] != None:  split_line[14] = int(split_line[14])

    # build document
    document = {}
    for h, s in zip(header, split_line):
        if s != None:
            document[h] = s

    # get ID of corresponding gene document
    gene_id = split_line[1]
    gene = gene_collection.find_one({'GeneID' : gene_id})
    document['gene_document_id'] = gene['_id']

    # remove 'GeneID' from document
    del document['GeneID']
    # insert document into MongoDB



Posted in big data, bioinformatics, data science, science | Tagged , , , , , , , , , , , , , | Leave a comment

new year’s revolutions

One has to constantly remake themselves to adapt to the market and to the times. This activity serves those who practice it well economically and keeps them from becoming bored. It requires a constant state of personal “revolution”.

Here I describe my revolutionary plans for 2015; namely which technologies I intend to learn and which actions I will consider for the year.

Synthetic Biology

Synthetic biology has the potential to revolutionize manufacturing, first the manufacturing of drugs and biofuels, then the manufacturing of more exotic materials. To do this we (researchers and engineers) will have to learn how to model biological circuits effectively (see my recent post “iBioSim: a CAD package for genetic circuits“). My intention this year is to strengthen my knowledge of synthetic biology’s computing side, leveraging my experience as an engineer and a bioinformatician in the process. Particularly I intend to combine this effort with my work to improve my statistical and machine learning knowledge, as both can be applied to the identification of genetic circuit parameters.

(Also see my post “synthetic biology: an emerging engineering discipline“) for a description of where the subject is at the moment.

Unstructured Data

Unstructured data is the frontier in data science. Currently such data (images, prose) is called “dark data” because it is difficult for a computer algorithm to automatically derive actionable information from it. However, new tools are becoming available for dealing with unstructured data. I therefore intend to learn how to deal with the kind of data found in prose: For example, extracting useful biological information from a bulk analysis of PubMed abstracts or beneficial stock tips from financial writing.

Internet of Things

I have no plans to deploy any Internet of Things objects this year, but am certainly interested in working with the data made available by such devices. If someone or some company wants to partner with me on this, I’m game.

Machine Learning

Machine learning is becoming easier due to new tools and software. I want to get better at choosing among and applying machine learning techniques in the next year. I have already started this effort with two blog posts on clustered stock market analysis (“clustering stocks by price correlation, part 1” where I explored use of hierarchical clustering and “clustering stocks by price correlation, part 2” where I applied k-means clustering).

Particularly, I want to learn the intersection between Big Data and machine learning, e.g., how to combine basic Apache Spark with Spark’s MLlib toolkit.

I see opportunity to combine machine learning efforts with all the above-stated subjects, since my treatment of those subjects will be highly data driven.

Thinking Big: Warp Equations

I’m no physicist and certainly won’t become one in a year, but I want to start investigating the state of space-folding equations and related ideas. In particular I want to examine how close the ideas are to trial by commercial venture.

Posted in big data, bioinformatics, data science, engineering, science | Tagged , , , , , , , , , , , , | Leave a comment

iBioSim: a CAD package for genetic circuits

iBioSim is a CAD package for the design, analysis, and simulation of genetic circuits. It can also be used for modeling metabolic networks, pathways, and other biological/chemical processes [1]. The tool provides a graphical user interface (GUI) for specifying circuit design and parameters, and a GUI for running simulations on the resulting models and viewing results. iBioSim also provides a means to “learn” system parameters from measured lab data, since it is often unclear at the model design stage what the values of all the required equilibrium constants and reaction rates are.

Below I’ll demonstrate the specification and simulation of a genetic circuit in iBioSim. I have demonstrated the simulation of genetic circuits using CAD on this blog before [2], but at that time used a generic system dynamics package called Vensim. iBioSim is more specialized than Vensim, being designed explicitly for modeling the particulars of genetic circuits (such as handling cell division) rather than any abstract dynamic system. Under the hood though, the simulation engine of both tools is the same: ODE simulation using Runge-Kutta methods. I call such tools “drag and drop differential equations tools” due to their specification of ODE models using a mouse pointer.

iBioSim is written in Java and is therefore available on any modern computing platform. It saves models in Systems Biology Markup Language (SBML) to facilitate communication with other tools. SBML is an XML-based standard developed by the research community to enable portable descriptions of biological processes [3]. iBioSim is produced by the Myers Research Group at the University of Utah and is freely available.

Model Specification

Exercise 1.3 of Chris Myers’ Engineering Genetic Circuits text [4], provides the following genetic circuit:


In this model, protein CI represses promoter P1, which produces protein LacI. Similarly, LacI represses promoter P2 which produces protein TetR. Finally, TetR represses promoter P3 which generates protein CI. In the exercise, the author specifies that CI must dimerize before acting as a repressor to P1.

The author of the book’s exercise also specifies the following model constants:


Implemented in iBioSim, this circuit looks like:


Model constants are entered during the creation of each model element. For example, for the repressor relationship between CI2 and P1:



Once the model is specified, it can be simulated. First users must select the simulation parameters:


Then users can run the simulation and obtain time series results indicating the expected amounts of each protein:


From this we see that the genetic circuit behaves similar to the oscillator described in [2], but that due to protein degradation the oscillation quickly dampens and the system reaches steady state operation.


  4. C. J. Myers, Engineering Genetic Circuits, Chapman & Hall/CRC Press, July 2009.
Posted in bioinformatics, engineering, science, system dynamics | Tagged , , , , , , , , , , , , | 1 Comment

clustering stocks by price correlation (part 2)

In my last post, “clustering stocks by price correlation (part 1)“, I performed hierarchical clustering of NYSE stocks by correlation in weekly closing price. I expected the stocks to cluster by industry, and found that they did not. I proposed several explanations for this observation, including that perhaps I chose a poor distance metric for the clustering procedure, that perhaps daily rather than weekly sampling would have generated better results, or that stocks simply don’t cluster by industry very much. I also proposed generating k-means clusters to see if that revealed different results. This post details that k-means clustering effort.

K-Means Clustering

K-means clustering starts with a matrix of observation and feature values, resembling the data structure one might use in a regression [1]. This differs from hierarchical clustering where the data matrix is an N-by-N pairwise comparison of each of the observations by a similarity (or distance) metric. An example of an N-by-3 matrix for k-means clustering is:


This matrix is then normalized (or “whitened” in the language of k-means clustering).

Each observation in the “whitened” matrix is then randomly assigned to one of k clusters [2], and the following steps are repeated until a stop condition (usually minimal change in cluster assignment) is reached [2]:

  1. The algorithm computes the average value of each current cluster.
  2. The algorithm then assigns each observation to the nearest cluster (by the calculated average value).
  3. Finally, the algorithm tests for the stop condition and repeats if it is not reached.

The result is a set of k clusters grouped by similarity of feature values.

We will implement our calculation using Python’s scipy.cluster.vq [3] module.


I started with a list of daily closing stock prices for all the stocks listed in the New York Stock Exchange (NYSE), e.g.,


I then extracted the closing price for every stock for every Tuesday between October 22, 2013 and October 28, 2014, so that we are sampling weekly and avoiding holidays. Stocks that were removed from the NYSE or that entered the NYSE during this period were discarded, thereby ensuring that each stock in the resulting data set had the same number of time points. [4]

For every stock in the dataset, I next calculated the Pearson correlation coefficient between the stock’s price time series and each of the stocks in the DJIA (NYSE only) price time series. This treats each correlation with a DJIA stock as a feature as required by the data matrix used by k-means clustering. The resulting feature matrix had dimension of 2601-by-27, or 2601 stocks compared to each of the 27 DJIA stocks listed on the NYSE, e.g,:


Finally, I “whitened” the feature matrix and ran the Python k-means module on the whitened feature matrix.

Python code implementing these calculations is appended below.


Like the hierarchical clustering results shown in [4], the groups do not seem to be organized by industrial sector:


This suggests a couple of possibilities: First, correlation in stock prices may simply not cluster by industry very much. Second, my use of k=50 (50 groups) may have been inappropriate, perhaps a better division number exists. Third, maybe sampling prices daily rather than weekly would produce better results. Finally, perhaps correlation is not the best feature to use.

I also tried using the weekly change in prices rather than the actual prices, but that attempt did not show any better grouping by industry.


# import useful libraries
import scipy.cluster.vq
import scipy.stats
import numpy as np
import datetime

# user settings
number_of_groups_to_use = 50

# set seed for repeatibility

# load the Dow Jones Industrial Average symbols
djia = []
f = open('DJIA.tsv')
for line in f:
    line = line.strip()
    company = line.split('\t')[0]
    exchange = line.split('\t')[1]
    symbol = line.split('\t')[2]
    if exchange == 'NYSE':
djia = sorted(djia)

# specify the time points we want to sample and sort them
end_date = datetime.datetime(2014, 10, 28, 0, 0)
dates_to_use = [end_date]
dt = datetime.timedelta(days=-7)
for i in range(0, 53):
    dates_to_use.append(dates_to_use[-1] + dt)
dates_to_use = sorted(dates_to_use)
new_dates = []
for x in dates_to_use:
    year = str(x.year)
    month = str(x.month)
    day = str(
    if len(month) == 1:  month = '0' + month
    if len(day) == 1:  day = '0' + day
    new_dates.append(year + '-' + month + '-' + day)
dates_to_use = new_dates

# load the closing prices for the given dates into memory
prices = {}
f = open('../../combined_data.csv')
for line in f:
    line = line.strip()
    date_string = line.split(',')[1]
    if not date_string in dates_to_use:  continue
    symbol = line.split(',')[0]
    close = float(line.split(',')[5])
    if not prices.has_key(symbol):
        prices[symbol] = {}
    prices[symbol][date_string] = close

# delete cases without all of the dates in dates_to_use
symbol_list = prices.keys()
for symbol in symbol_list:
    date_list = prices[symbol].keys()
    if len(date_list) != len(dates_to_use):

# generate price time series
price_time_series = {}
for symbol in prices.keys():

    # actual prices
    price_time_series[symbol] = [prices[symbol][d] for d in dates_to_use]

    # change in prices
    #price_time_series[symbol] = [(prices[symbol][dates_to_use[n]] - prices[symbol][dates_to_use[n-1]]) / prices[symbol][dates_to_use[n-1]] for n in range(1, len(dates_to_use))]

# calculate R
new_symbol_list = sorted(price_time_series.keys())
R_dict = {}
for i in range(0, len(new_symbol_list)):
    if not R_dict.has_key(i):
        R_dict[i] = {}
    for j in range(0, len(djia)):
        symbol_i = new_symbol_list[i]
        symbol_j = djia[j]

        # Pearson's R
        R = 1.0
        if symbol_i != symbol_j:
            R = scipy.stats.pearsonr(price_time_series[symbol_i], price_time_series[symbol_j])[0]

        # record R
        R_dict[i][j] = R

# create the feature matrix
feature_matrix = np.zeros([len(new_symbol_list), len(djia)])
for i in R_dict.keys():
    for j in R_dict[i].keys():
        feature = R_dict[i][j] + 1.   # range 0 to 2
        feature_matrix[i][j] = feature

# cluster using k-means
W = scipy.cluster.vq.whiten(feature_matrix)
K = scipy.cluster.vq.kmeans(W, number_of_groups_to_use)
VQ = scipy.cluster.vq.vq(W, K[0])
groups = VQ[0]

# report results
f = open('output/groups.csv', 'w')
for g, s in zip(groups, new_symbol_list):
    f.write(s + ',' + str(g) + '\n')

# write the feature matrix to disk
f = open('output/features.csv', 'w')
f.write('symbol,' + ','.join(djia) + '\n')
for i in R_dict.keys():
    symbol_i = new_symbol_list[i]
    f.write(symbol_i + ',')
    values = []
    for j in R_dict[i].keys():
    f.write(','.join(values) + '\n')


Posted in data science, econometrics, engineering, statistics | Tagged , , , , , , , , , , , , , , , | Leave a comment