statistical reasoning in “The Simpsons”

FOX recently* broadcast a fundamental question that drives good science:

“I’m sure there’s a correlation, but could there be a causation?”

The intrepid Lisa Simpson, the greatest cartoon scientist of our time, spoke these words after observing a pair of scorpions become docile in the presence of a specific plant.

Quality statistical reasoning rarely gets expressed in popular media. Thanks, FOX!

* “The Scorpion’s Tale”, Aired 6 March 2011

Posted in Uncategorized

RNAfold and sequence length

I’ve been looking for a way to compare RNAfold [1] dG results for two RNA sequences, where the two sequences differ in length. My initial thought was to simply divide the computed dG’s by sequence length (i.e., normalize by sequence length) and then compare the results. The analysis presented below shows why this won’t work.

Given an RNA sequence of the form CCCCC…GGGGG…, in which a run of C’s is followed by an equal number of G’s, we use RNAfold to fold the RNA into a hairpin structure such as the one pictured below:

We follow this procedure for varying lengths of the hairpin:

and observe that the computed dG varies linearly with hairpin length:

However, despite the linearity demonstrated above, we cannot simply normalize these dG results by dividing them by sequence length, as I had hoped:

The problem is that the linear relation shown above does not intercept the y-axis at the origin:

If normalization by dividing dG by sequence length were valid, the above graph would show a horizontal line. Instead, the nonzero intercept contributes a term that shrinks in proportion to one over the sequence length, so the normalized values approach a constant only as sequences get very long.
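The argument can be made concrete with a short sketch. It assumes dG follows a line with a nonzero intercept, dG = intercept + slope × length. The coefficient values and function names below are hypothetical placeholders chosen for illustration; they are not the values fitted to the RNAfold results in this post.

```python
# Hypothetical coefficients for illustration only; NOT fitted to RNAfold output.
INTERCEPT = 5.0   # assumed nonzero y-intercept (kcal/mol)
SLOPE = -1.5      # assumed change in dG per nucleotide (kcal/mol)


def linear_dg(length):
    """dG predicted by the assumed linear model."""
    return INTERCEPT + SLOPE * length


def normalized_dg(length):
    """dG divided by sequence length, which equals SLOPE + INTERCEPT / length."""
    return linear_dg(length) / length


# The normalized value drifts toward SLOPE as length grows, so it is not constant.
for length in (10, 100, 1000, 10000):
    print(length, round(normalized_dg(length), 4))
```

Under these assumptions the ratio settles near the slope only once INTERCEPT / length becomes negligible, which is why dividing by length does not put short and long sequences on a common scale.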

In a later post I’ll analyze whether a folding dG result for a sequence of a given length can be normalized by dividing the dG by the maximum possible dG for that sequence length, similar to the BLAST score normalization approach described here.

Python and R code used in this analysis is posted on the Badass Data Science wiki here.

[1] Hofacker, I.L., Fontana, W., Stadler, P.F., Bonhoeffer, L.S., Tacker, M., Schuster, P. Fast Folding and Comparison of RNA Secondary Structures (The Vienna RNA Package).

Posted in science

data scientist claims squatter’s rights

After extensive research, Data Scientist discovered that title to abandoned land and buildings can be transferred to individuals who claim the land and occupy it for a specified number of years. (The required length of time differs by US state.) This legal practice is called “adverse possession” and it lies deeply rooted in English Common Law, where it was used to keep absentee nobles from retaining unused land which a poor farmer could make better use of.

But how could our hero efficiently locate abandoned real estate? By an automated property tax record search. The particular county that Data Scientist lived in at the time posted all property tax records online. The ID field in the online tax records corresponded to the lot numbers shown in publicly available survey maps. Data Scientist reasoned that tax delinquent properties are most likely to be abandoned, and set to work identifying these properties.

Using Perl and wget—our hero was less experienced in those days—Data Scientist iterated through all the lot numbers in the county, downloading the records via wget and parsing them in Perl. Using a simple filter (properties four years delinquent), Data Scientist identified places to search by car.

Many of the properties identified ended up being burnt, buried by landslides, or former drug houses. However, our hero found a sweet vacant property in the mountains, and rolled in with a camping trailer. Data Scientist immediately set up remedial solar power generation and direct compost-based waste disposal.

Our hero lived there for four months.

Posted in data science

designing a geodesic house (part 1)

Some badass design science:

Data Scientist plans to leave the suburbs for the sci-fi, design-science life off the grid, which motivated the hydroponics work featured in the DIY hydroponics post. Here our hero starts designing a suitable geodesic house.

Start with an icosahedron:

Divide each face into smaller, equal-sized triangles:

Project the points (triangle intersections) created in the last step onto the unit sphere while preserving the chord pattern:

Truncate the sphere at the equator:

Import the resulting set of points and chords into a CAD program (in this case QCAD) to enable further design work:
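The geometric steps above (icosahedron, face subdivision, projection onto the unit sphere, truncation at the equator) can be sketched in plain Python. This is not the code that produced the images in this post. The function names, the choice of vertex coordinates, and the rule of keeping every point with z at or above zero are assumptions made for illustration, and the cut at z = 0 depends on how the icosahedron is oriented.

```python
import math
from itertools import combinations


def unit(v):
    """Scale a 3D vector to length 1, i.e. project a point onto the unit sphere."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def icosahedron():
    """Return the 12 vertices (on the unit sphere) and 20 faces (index triples)."""
    phi = (1 + math.sqrt(5)) / 2
    verts = []
    for a in (-1, 1):
        for b in (-phi, phi):
            verts += [(0, a, b), (a, b, 0), (b, 0, a)]
    verts = [unit(v) for v in verts]
    edge = min(math.dist(p, q) for p, q in combinations(verts, 2))

    def adjacent(i, j):
        return math.dist(verts[i], verts[j]) < 1.01 * edge

    # Every triple of mutually adjacent vertices is a face of the icosahedron.
    faces = [t for t in combinations(range(12), 3)
             if all(adjacent(i, j) for i, j in combinations(t, 2))]
    return verts, faces


def geodesic(frequency):
    """Subdivide each face into frequency**2 triangles and project onto the sphere.

    Points are keyed by their integer weights on the face's corner vertices, so
    points shared by neighbouring faces are merged exactly.
    """
    verts, faces = icosahedron()
    points, chords = {}, set()
    for face in faces:
        def key(i, j):
            weights = (i, j, frequency - i - j)
            return frozenset((v, w) for v, w in zip(face, weights) if w)

        for i in range(frequency + 1):
            for j in range(frequency + 1 - i):
                weights = (i, j, frequency - i - j)
                flat = tuple(sum(w * verts[v][axis] for v, w in zip(face, weights))
                             for axis in range(3))
                points[key(i, j)] = unit(flat)
                # Link each grid point to its neighbours in the triangular grid.
                for di, dj in ((1, 0), (0, 1), (1, -1)):
                    ni, nj = i + di, j + dj
                    if nj >= 0 and ni + nj <= frequency:
                        chords.add(frozenset((key(i, j), key(ni, nj))))
    return points, chords


def truncate_at_equator(points, chords, tol=1e-9):
    """Keep points with z >= 0 (within tol) and chords whose ends both survive."""
    kept = {k: p for k, p in points.items() if p[2] >= -tol}
    kept_chords = {c for c in chords if all(k in kept for k in c)}
    return kept, kept_chords


points, chords = geodesic(2)
dome_points, dome_chords = truncate_at_equator(points, chords)
print(len(points), len(chords), len(dome_points), len(dome_chords))
```

Keying points by integer weights rather than by rounded coordinates avoids floating-point mismatches where faces meet. The resulting point and chord sets could then be written out in whatever format the CAD program accepts.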

Source code

The code used to create the above images is embarrassingly messy at the moment. A future post will publish the code, after it gets cleaned up and better documented.

Posted in engineering

comparing BLAST results by bit score ratio

I recently read that two separate BLAST alignments to the same reference sequence can be compared to each other by normalizing the alignments by the maximum bit score of the reference sequence BLASTed against itself [1].

In this procedure, the user first aligns the reference sequence to itself to find the maximum possible bit score, and then calculates the percentage aligned for each query sequence by dividing that query alignment’s bit score by the maximum bit score.
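The normalization itself is a single division, as the following sketch shows. The function name and the bit scores are hypothetical and serve only to illustrate the calculation; they are not taken from the BLAST results in this post.

```python
def bit_score_ratio(query_bit_score, self_bit_score):
    """Divide an alignment's bit score by the reference's self-alignment bit score."""
    if self_bit_score <= 0:
        raise ValueError("self-alignment bit score must be positive")
    return query_bit_score / self_bit_score


# Hypothetical bit scores for illustration only.
reference_vs_itself = 1800.0
query_a = 1350.0
query_b = 450.0

print(bit_score_ratio(query_a, reference_vs_itself))
print(bit_score_ratio(query_b, reference_vs_itself))
```

Because both queries are divided by the same self-alignment score, the two ratios can be compared directly.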

To convince myself that the method is valid, I aligned known-length subsets of a sequence against the whole sequence and calculated the percentage aligned by dividing each alignment’s bit score by the maximum possible bit score. The following plot shows the linear relationship between bit score and known percentage aligned, which supports the validity of the normalization approach:

Note that I used exact subsets; no insertions or deletions were considered.

BLAST results used in this analysis are posted below. Python code to generate these results is posted on the Badass Data Science wiki.

Example sequences used in this analysis (also available on the Badass Data Science wiki):

References

[1] Rasko, D.A., Myers, G.S., Ravel, J. Visualization of comparative genomic analyses by BLAST score ratio. BMC Bioinformatics, 2005, 6:2. doi:10.1186/1471-2105-6-2.

Posted in data science, science

DIY hydroponics

The coming freshwater supply crisis prompts a need to design food-growing methods that require less water than current methods do. Hydroponics provides one such method. Here I report on my recent effort to design and build a hydroponic strawberry grower.

But first, what does this have to do with data science? Not much at the moment, but I envision data scientists will soon model freshwater supply and demand as the crisis accelerates, from ecological, economic, and military viewpoints. It is likely that one of us will incorporate hydroponics into such a model. Similarly, large commercial growers using hydroponics will require systems modeling to optimize production, an engineering task well suited for data science talent.

Here is a photo of my first-generation grower design. I’ve posted instructions on how to build this design on the Badass Data Science wiki at http://badassdatascience.com/wiki/index.php?title=Hydroponics.

This design uses a “flood and drain” method where nutrient solution (compost tea) is allowed to saturate the growing media, and then excess nutrient solution is drained and stored for later use. A small amount of nutrient solution remains in the grower after draining the excess, which the growing media “wicks” up to the plants’ roots.

I’ll report in future posts how this is going.

Posted in engineering, hydroponics, science

artificial intelligence and algorithm ecologists

“In some sense, you can argue that the science fiction scenario is already starting to happen,” Thinking Machines’ Hillis says. “The computers are in control, and we just live in their world.” — Wired

Wired magazine recently reported that artificial intelligence (AI) has arrived in full force, though not in the manner anticipated by the field’s visionaries in the 70s and 80s [1].

AI never came in the form of processors that mimic the human brain, as envisioned by researchers 30 years ago. Instead AI emerged as the net effect of millions of specialized algorithms running simultaneously, each making decisions for niche tasks largely out of view. While these decisions are highly informed within the problem space they occupy (informed by extensive data, feedback control, machine-learning, etc.), the algorithms themselves are overall quite dumb. They were designed only to solve particular problems and therefore generalize poorly.

What is also new is the scale of our increasing reliance on such algorithms and the sheer number that are operating. Equally amazing are the effects of “crosstalk” between algorithms, where the response of one program to a circumstance triggers an unexpected cascade of reactions from other programs developed by different institutions.

We are entering a world where physical interactions and information synthesis are increasingly moderated by millions of dumb computer programs.

Power then resides in those able to bias this decision-making mass toward their goals. Some of this power resides in programmers, but most of it resides in the hands of the owners of the algorithms’ products—the individuals and institutions that direct programmers’ labor through wages.

Big Data businesses are not just Big Data anymore; they are the accumulation of decisions made with the data, many of which are highly automated. The future then belongs to “algorithm ecologists” who can mediate our interdependence with the algorithms and bias them toward profit.

[1] Wired magazine, 27 December 2010

Posted in Uncategorized

when you lack potential (energy), drive fast to compensate

Each gear on a car with a manual transmission offers the driver a specific level of acceleration. The lower the gear, the greater the acceleration available for use. If we imagine each gear setting as a distinct configuration of the automobile allowing it to perform work (e.g., accelerate out of a hazardous traffic setting) we can say each gear setting impacts the car’s potential energy. Taking this idea further, we can also say that a car in first gear offers more potential energy than a car in fifth gear.

Unfortunately, my car lacked first and second gear for a few days due to a shifting cable malfunction. I could accelerate out of a stopped position in third gear, but it took a while. Once going, I could continue the slow acceleration to build up kinetic energy in the form of my moving vehicle.

We need energy to effectively respond to traffic conditions while driving, and my car offered only low potential energy options. So I started driving extremely fast to store high kinetic energy. This allowed me to change speed rapidly if I needed to by applying the brakes. The approach worked very well for freeway driving, particularly when entering from on-ramps.

The take-home message: Compensate for your shortcomings by speeding.

Posted in science

British crime vs. internet use

Jordan Cashmore, a student at Nottingham Trent University, recently asked me for help determining if correlation and causality exist between the rise of internet use over the last 15 years and the drop in British crime over the same period. Jordan, for his dissertation, proposes that correlation and causality do exist based on criminology theory, and requested help quantifying the proposed relationship with statistics. Using data he provided, I took on the challenge to practice my statistical consulting:

Results

Correlation

After accounting for the possible distortions in linear models brought about by regressions involving time-series data, I concluded that internet usage in the UK inversely correlates with British crime over the last 15 years (Pearson’s r of -0.949 with a p-value of 6.079e-10). The correlation is clearly statistically significant.

Causality

The data is insufficient for statistical (Granger) causal inference. Analysts will have to rely on criminological theory to discern whether a causal relationship exists between the rise of internet use and Britain’s drop in crime.

Method

I obtained crime victimization counts from the British Crime Survey for the years 1991 through 2009, and used linear interpolation to fill in missing values. This interpolation may introduce noise into the calculations; I chose to accept this risk. For internet usage, the World Bank provides yearly data detailing the percentage of the UK population using the internet.
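Linear interpolation of missing yearly values can be sketched as follows. The function name is hypothetical and the sample values are made up for illustration; they are not the British Crime Survey figures, and this is not the R code used in the analysis.

```python
def fill_missing(values):
    """Linearly interpolate None entries that lie between two known values.

    Leading or trailing None entries are left untouched, since there is no
    known value on one side to interpolate from.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Made-up yearly counts with two missing years.
print(fill_missing([10.0, None, None, 4.0]))
```

Each filled value lies on the straight line between its nearest known neighbours, so the filled years add no independent information to the series.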

Correlation

Plotting the two time-series against each other and conducting OLS regression yields:

The regression provides strong evidence of correlation between the two time-series. However, because regressions involving time-series can be dodgy, I computed the Durbin-Watson statistic from the residuals to test for serial correlation. Since the Durbin-Watson statistic exceeds the R² value of the residuals vs. their lags, I concluded that no serial correlation distorts the model [1].
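For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. It ranges from 0 to 4, with values near 2 indicating little first-order serial correlation. The sketch below uses made-up residuals, not those from this regression, and the function name is an assumption.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Made-up residuals for illustration only.
print(durbin_watson([0.5, -0.3, 0.2, -0.4, 0.1]))
```

Residuals that alternate in sign push the statistic toward 4, while residuals that drift slowly in one direction push it toward 0.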

I also ensured the residuals are normally distributed:

After deciding from this analysis that linear models are appropriate, I computed the Pearson and Spearman correlations, along with tests for the null hypothesis that the true correlation is zero:
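The two coefficients named above can be computed as in the following sketch, which omits the significance tests. It is written in plain Python rather than the R used for the analysis, the function names are assumptions, and the sample data are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            result[i] = mean_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up series: one rises while the other falls.
usage = [5.0, 12.0, 30.0, 55.0, 70.0]
crime = [19.0, 17.5, 14.0, 11.0, 10.0]
print(pearson(usage, crime), spearman(usage, crime))
```

Spearman depends only on the ordering of the values, so it is less sensitive than Pearson to departures from a straight-line relationship.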

Causality

I tested for Granger causality between the differenced time-series with one through four lags, and in both directions. The data showed no Granger causality between the two series. However, this is unsurprising since there is relatively little data to work with (15 years, sampled yearly). Therefore, investigators will have to rely on criminology theory to infer causality; these Granger tests should be treated as inconclusive.

References

1. Marcus Marktanner, Chapter Four of online class notes, http://marcusmarktanner.com/Lecture%20Notes/Applied%20Econometrics/CHAPTER%204%20PERFORMING%20STEPS%20IN%20TIME%20SERIES%20REGRESSION.pdf, Accessed 4 March 2012.

Code

R code used for this analysis is posted at http://badassdatascience.com/wiki/index.php?title=British_Crime_and_Internet_Use.

Posted in data science, econometrics

principles of respectable self-promotion

I generally limit my writing about leadership to 128-character declarations (the length of a tweet minus the “#leadership” tag). Anything longer feels too verbose for the subject. However, I’ve been asked twice in the last week for advice on self-promotion, and need a bit more space to wrestle the ideas into prose:

Promote those around you

It is impossible to sustainably promote yourself without promoting the success of those around you. Therefore, promote your team and worksite shamelessly, but never at the expense of other teams or worksites. While engaged in this process, promotion of self just happens.

Cast your pearls liberally

Cast your pearls (i.e., your commitment and labor) liberally to team-oriented, caring, and forward-thinking coworkers and customers, no matter what their rank or skill level. However, for jerks and those who never express gratitude, do only the bare minimum (i.e., limit the effort cast to swine). Save your energy for uplifting those who would also uplift you.

The Fundamental Recipe

The Golden Rule is the single best recipe for self-promotion in existence. It is neither altruistic nor egotistical; instead, it provides a framework for creating mutually successful outcomes. Under the Golden Rule, self-promotion and team/site promotion prove inseparable.

Posted in marketing