In a previous post, I wrote about how Lisa Simpson applied statistical reasoning during an episode of The Simpsons. Another recent episode (“The Saga of Carl Carlson”, aired 19 May 2013) demonstrated probabilistic thinking: At the science museum, the Simpson family enters the Hall of Probability: There they see a demonstration of the binomial distribution, with…

## data scientist goes to graduate school

I’ve signed up for a graduate Statistics program at Texas A&M and am now attending the first course. It is an online program; I’ll stay in Austin to work full time while attending. This will change my blogging in two major ways: First, I’ll have less time for writing, so posts may appear less frequently….

## simulated confidence intervals

In my last post, I demonstrated how repeated sampling from any probability distribution produces an approximately normal distribution of the sample means, given a sufficiently large sample size. Here I describe how to use this distribution of sample means to define a confidence interval around the mean of any given sample, and simulate production of such…
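
A minimal Python sketch of this idea, not the code from the full post: it builds a normal-approximation confidence interval around each sample mean and counts how often the interval captures the true mean. The function names, the exponential population, the sample size of 50, and the 95% level are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval around the sample mean."""
    # z is the standard normal quantile, e.g. about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    mean = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - half_width, mean + half_width


def coverage(true_mean, draw, sample_size=50, trials=2000, seed=1):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [draw(rng) for _ in range(sample_size)]
        low, high = confidence_interval(sample)
        hits += low <= true_mean <= high
    return hits / trials


# Assumed population: exponential with rate 1, so the true mean is 1.0.
observed_coverage = coverage(1.0, lambda rng: rng.expovariate(1.0))
print(f"fraction of intervals containing the true mean: {observed_coverage:.3f}")
```

For a skewed population such as this one, the observed coverage should land near, though typically a little below, the nominal 95%.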

## demonstration of the central limit theorem through simulation

The central limit theorem (CLT) states that the sampling distribution of the mean is approximately normal and centered on the population mean, even when the population itself is not normally distributed, given a sufficiently large sample size. I do not currently have the math chops to prove the CLT, but can provide the following simulation to demonstrate…
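
As a hedged illustration of the statement above (a sketch in Python, not the simulation from the full post), the snippet below draws many samples from a clearly non-normal population and summarizes the resulting sample means. The exponential population, the sample size of 50, and the helper name `sample_means` are assumptions made for this example.

```python
import random
import statistics


def sample_means(draw, sample_size, n_samples, seed=0):
    """Mean of each of n_samples independent samples of size sample_size."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


# Assumed population: exponential with rate 1 (mean 1, standard deviation 1).
SAMPLE_SIZE = 50
means = sample_means(lambda rng: rng.expovariate(1.0), SAMPLE_SIZE, 2000)

mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)
theoretical_se = 1.0 / SAMPLE_SIZE ** 0.5  # sigma / sqrt(n)

# For a normal distribution, about 95% of values lie within 1.96 standard errors.
within = sum(abs(m - 1.0) <= 1.96 * theoretical_se for m in means) / len(means)

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"sd of sample means:   {sd_of_means:.3f} (theory: {theoretical_se:.3f})")
print(f"fraction within 1.96 standard errors: {within:.3f}")
```

If the CLT holds for this setup, the mean of the sample means sits close to 1, their spread is close to sigma / sqrt(n), and roughly 95% of them fall within 1.96 standard errors of the population mean.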

## SWIG, C++, Python, and Monte-Carlo simulation

In the previous post, I introduced MCS-libre, my C++ library for Monte-Carlo simulation. Here I show how to access it from Python using the Simplified Wrapper and Interface Generator (SWIG), while in the process demonstrating how to use SWIG with C++ classes. First we download and decompress the MCS-libre library code: Next, we create a SWIG…

## monte-carlo simulation in C++ with MCS-libre

Monte-Carlo simulation is a sometimes elegant (and sometimes crude) method for simulating complex systems. Parameters that affect the system are selected from random distributions and the system response to these values is then calculated. Repeating this process many times produces often useful information about the system. The method is especially useful for examining non-linear systems…
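
To make the recipe concrete, here is a small Python sketch (it does not use MCS-libre, which is a C++ library). The "system" is a deliberately simple non-linear one chosen for illustration: the area of a disc whose radius is uncertain. The function name, the normal distribution for the radius, and its parameters are all assumptions.

```python
import math
import random
import statistics


def simulate_disc_area(n_trials=100_000, mean_radius=1.0, sd_radius=0.1, seed=42):
    """Draw a random radius per trial and compute the system response (area)."""
    rng = random.Random(seed)
    return [
        math.pi * rng.gauss(mean_radius, sd_radius) ** 2
        for _ in range(n_trials)
    ]


areas = simulate_disc_area()
mean_area = statistics.fmean(areas)

# Response evaluated once at the mean input, for comparison.
naive_area = math.pi * 1.0 ** 2

print(f"area at the mean radius: {naive_area:.4f}")
print(f"mean simulated area:     {mean_area:.4f}")
```

Because the response is non-linear, the mean simulated area comes out slightly larger than the area at the mean radius: analytically, E[r^2] = mean^2 + sd^2, so the expected area is 1.01 times pi here. That kind of gap is what repeated random sampling reveals and a single evaluation at the mean inputs misses.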

## gene fusion variants mapped by shared PubMed IDs

Introduction A gene fusion occurs when parts of two genes’ RNA combine to form one hybrid mRNA molecule before translation into protein. A common set of fusions found in lung cancer are the multiple combinations of genes EML4 and ALK [1], hereafter denoted EML4-ALK. The COSMIC database [2] lists 29 distinct fusions of EML4-ALK, which…

## comparing mRNA half-life survival curves

In my last post, I illustrated how the Kaplan-Meier estimator can be used to estimate the survival curve of mRNA half-lives. In this post I will expand on that analysis and show how to compare two mRNA half-life Kaplan-Meier curves, each corresponding to a measured gene outcome, to see if mRNA half-life differs between outcomes….

## mRNA half-life survival curve estimation

In a recent post, I demonstrated the use of the Kaplan-Meier estimator for estimating survival curves of fictional characters undergoing treatment in a fictional drug trial. Here I illustrate the Kaplan-Meier estimator on real data, data that differs from typical survival analysis data in that the event under consideration is neither time until death…

## stalling an airplane mid-flight

Thanks to my brother, I recently had the opportunity to fly a small Cessna aircraft under supervision of a flight instructor. The instructor took off and landed, but gave me the controls during flight. During this time we went through a few instructive maneuvers, including stalling the plane mid-flight. Here I explain how stalling an…

## the Kaplan-Meier estimator

In my last post, I wrote about censored data. This post continues the survival analysis theme by focusing on estimation of survival curves. In survival and reliability analysis, it is useful to determine the survival curve for the population under study. This is the curve defined by the probability that a random variable indicating the…
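
For readers who want something concrete, the sketch below implements the standard Kaplan-Meier (product-limit) estimate in Python: at each time with at least one observed event, the running survival probability is multiplied by (1 - events / number at risk). This is a minimal illustration rather than the code from the full post; the function name, the argument names, and the example data in the test are hypothetical.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each time with an observed event.

    durations: time until the event or until censoring, one per subject.
    observed:  True if the event was observed, False if the subject was censored.
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # everyone leaving the risk set, event or censored
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Subjects censored at time t still count as at risk at time t,
        # so the risk set shrinks only after the update above.
        at_risk -= exits[t]
    return curve
```

Calling `kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])` gives a step curve that drops at times 1, 2, and 3; the censored subjects at times 2 and 4 shrink the risk set without causing a drop of their own.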

## “right censored” data

In clinical trials and reliability studies, researchers often measure the time until an event occurs for each patient or object in the study. That event may be patient death in the case of clinical trials for a new cancer drug, or bridge failure in the case of a reliability study of bridges. Sometimes, however, the…

## Excel mangles NCBI gene symbols

Using Microsoft’s Excel for bioinformatics work sucks, but sometimes a spreadsheet is the best format for communicating results to other scientists. The program’s default behavior mangles some NCBI gene symbols when you import them from a text file. Here is how to deal with it. Suppose you have the following list of gene symbols, and…

## the future orientation index

I’ve recently discovered Google Trends and have been looking for an opportunity to use it. Today I found such an opportunity in a paper [1] published last April that computes countries’ “future orientation index” from Google Trends data and correlates it with national per-capita GDP. The authors report correlation for 2010; my experiment with Google Trends…

## using Hadoop to examine county-level industrial diversity

In a previous post, I computed each U.S. county’s industrial diversity from the 2009 County Business Patterns data published by the U.S. Census Bureau. The diversity calculation made use of Shannon’s information entropy equation, which is similarly used by ecologists to calculate species diversity for a region. Here I perform the same calculation using Hadoop,…

## a Java class for generating box plot statistics

While working on a Hadoop project I found myself needing a lightweight Java class that computes box plot statistics (e.g., quartiles and outliers). So I wrote the class appended below. The code does not display the plots; it only computes the necessary values, since I’m planning on displaying the plots with matplotlib:

## Excel-like HTML table manipulation with Handsontable

I continually stress the importance of web programming skills to the data scientist’s repertoire, because most users of data interact with it through a web browser. (See my previous post “Data Scientist Makes Peace with Web Programming”.) Therefore I periodically showcase web development techniques from a data scientist’s perspective. Today I’m…

## simulating a synthetic biology circuit with system dynamics

McAdams and Arkin report the following synthetic biology oscillator circuit in their paper “Gene regulation: Towards a circuit engineering discipline” [1]: The circuit works by having gene R1’s protein inhibit production of R3, whose protein inhibits production of R2, which in turn inhibits production of R1. Delays in the inhibition processes cause sufficient expression of…

## CPAP and the “bends”

Can using a CPAP machine cause decompression sickness (a.k.a. the “bends”)? No. The following discussion outlines why. Decompression Sickness As they descend underwater, scuba divers breathe air compressed to the same pressure as the surrounding water. For example, at sea level they breathe air at a pressure of 1 atm, while at a depth of…

## writing a software pipeline manager (part 1)

I find myself regularly chaining programs together into software pipelines, and decided that having a pipeline management tool would be helpful. So I wrote one and posted the code here. To illustrate how the tool works, consider the following software pipeline: Here we see that the steps “get weather forecast” and “get stock quote” depend…

## industrial diversity vs percent change in unemployment rate

This analysis may exceed the bounds of my statistics knowledge, but I will deliver it anyway in the name of “process” blogging. I welcome experienced critique of the method! Result A modest positive correlation exists between a county’s industrial diversity and its percent change in unemployment rate over the period 2007-2010. Method Several months ago…

## when pi = 4

We are all familiar with the equation relating a circle’s radius to its circumference: $C = 2\pi r$. Rearranging, we get $\pi = C / (2r)$, and in Euclidean geometry we determine pi’s value of 3.14159… given a circle of any radius. But in the non-Euclidean taxicab geometry, pi equals four, which will be demonstrated below. Taxicab geometry is a two-dimensional geometry where points…
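
The claim can be checked numerically with a short Python sketch. It assumes the usual taxicab distance, |x1 - x2| + |y1 - y2|, under which the "circle" of radius r (all points at taxicab distance r from the origin) is a diamond with corners at (±r, 0) and (0, ±r). The function names are hypothetical.

```python
def taxicab_distance(p, q):
    """Distance between two points when travel is restricted to axis directions."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def taxicab_circle_points(radius, points_per_side=100):
    """Points, in order, around the set of points at taxicab distance radius from the origin."""
    corners = [(radius, 0.0), (0.0, radius), (-radius, 0.0), (0.0, -radius)]
    points = []
    for i, (x0, y0) in enumerate(corners):
        x1, y1 = corners[(i + 1) % 4]
        for k in range(points_per_side):
            t = k / points_per_side
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points


def taxicab_pi(radius, points_per_side=100):
    """Circumference divided by diameter, both measured with taxicab distance."""
    pts = taxicab_circle_points(radius, points_per_side)
    circumference = sum(
        taxicab_distance(pts[i], pts[(i + 1) % len(pts)])
        for i in range(len(pts))
    )
    return circumference / (2 * radius)


print(taxicab_pi(1.0))
print(taxicab_pi(2.5))
```

Each of the four sides has taxicab length 2r, so the circumference is 8r and the ratio is 4 (up to floating-point rounding) for any radius.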

## myers-briggs personality interaction map

After my last consulting gig for Yoyodyne Propulsion Systems (YPS), they invited me back to troubleshoot their R&D team’s group dynamics [1][2]. To get started, I administered a web-based Myers-Briggs Type Indicator (MBTI) assessment to each member of YPS’s R&D team to discern their personality types [3]. I then plotted the personality similarities between the individuals…

## statistical reasoning in “The Simpsons”

FOX recently* broadcast a fundamental question that drives good science: “I’m sure there’s a correlation, but could there be a causation?” The intrepid Lisa Simpson, the greatest cartoon scientist of our time, spoke these words after observing a pair of scorpions become docile in the presence of a specific plant. Quality statistical reasoning rarely gets…

## RNAfold and sequence length

I’ve been looking for a way to compare RNAfold [1] dG results for two RNA sequences, where the two sequences differ in length. My initial thought was to simply divide the computed dG’s by sequence length (i.e., normalize by sequence length) and then compare the results. The analysis presented below shows why this won’t work….

## data scientist claims squatter’s rights

After extensive research, Data Scientist discovered that titles on abandoned land and buildings can be transferred to individuals who claim the land and occupy it for a specified number of years. (The required length of time differs by US state.) This legal practice is called “adverse possession” and it lies deeply rooted in English Common…

## designing a geodesic house (part 1)

Some badass design science: Data Scientist plans to leave the suburbs for the sci-fi, design-science life off the grid, which motivated the hydroponics work featured in the DIY hydroponics post. Here our hero starts designing a suitable geodesic house. Start with an icosahedron: Divide each face into smaller, equal-sized triangles: Project the points (triangle intersections)…

## comparing BLAST results by bit score ratio

I recently read that two separate BLAST alignments to the same reference sequence can be compared to each other by normalizing the alignments by the maximum bit score of the reference sequence BLASTed against itself [1]. In this procedure, the user first aligns the reference sequence to itself to find the maximum possible bit score,…
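
The normalization described above amounts to a single division, sketched here in Python. The function and argument names are hypothetical, and the bit scores in the usage lines are made-up numbers for illustration, not real BLAST output.

```python
def bit_score_ratio(alignment_bit_score, self_bit_score):
    """Normalize an alignment's bit score by the reference's self-alignment score.

    alignment_bit_score: bit score of a query aligned to the reference.
    self_bit_score:      bit score of the reference aligned to itself,
                         i.e. the maximum possible score for that reference.
    """
    if self_bit_score <= 0:
        raise ValueError("self-alignment bit score must be positive")
    return alignment_bit_score / self_bit_score


# Made-up scores for two queries aligned to the same reference.
reference_self_score = 200.0
ratio_a = bit_score_ratio(150.0, reference_self_score)
ratio_b = bit_score_ratio(90.0, reference_self_score)
print(ratio_a, ratio_b)
```

Because both ratios share the same denominator, they sit on a common scale where a value near 1 means the alignment scores almost as well as the reference against itself.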

## DIY hydroponics

The coming freshwater supply crisis prompts a need to design food-growing methods that require less water than current methods do. Hydroponics provides one such method. Here I report on my recent effort to design and build a hydroponic strawberry grower. But first, what does this have to do with data science? Not much at the…

## artificial intelligence and algorithm ecologists

“In some sense, you can argue that the science fiction scenario is already starting to happen,” Thinking Machines’ Hillis says. “The computers are in control, and we just live in their world.” — Wired Wired magazine recently reported that artificial intelligence (AI) has arrived in full force, though not in the manner anticipated by the…

## when you lack potential (energy), drive fast to compensate

Each gear on a car with a manual transmission offers a specific acceleration level to the driver. The lower the gear, the greater the amount of acceleration available for use. If we imagine each gear setting as a distinct configuration of the automobile allowing it to perform work (e.g., accelerate out of a hazardous traffic setting)…

## British crime vs. internet use

Jordan Cashmore, a student at Nottingham Trent University, recently asked me for help determining if correlation and causality exist between the rise of internet use over the last 15 years and the drop in British crime over the same period. Jordan, for his dissertation, proposes that correlation and causality do exist based on criminology theory,…

## principles of respectable self-promotion

I generally limit my writing about leadership to 128-character declarations (the length of a tweet minus the “#leadership” tag). Anything longer feels too verbose for the subject. However, I’ve been asked twice in the last week for advice on self-promotion, and need a bit more space to wrestle the ideas into prose: Promote those…

## on leadership: risk-taking

About once a year I nearly get fired. This is not a virtue, but consider my calculus: I highly value risk-taking, believing it is the only way to initiate any real progress in business and in life. Here is how it usually turns out: 90% of the time, nothing good or bad happens. Null program….

## industry diversity (via Shannon entropy) per US county

Ecologists use Shannon entropy to measure species diversity in a given region. Here I apply the same equation to determine industry diversity in each US county. In the map below, darker color indicates greater industrial diversity: Method Downloaded the 2009 County Business Patterns data from the US Census Bureau and extracted the business establishment counts…
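
For reference, Shannon entropy over a set of categories is H = -sum(p_i * ln(p_i)), where p_i is the share of category i. The Python sketch below applies it to a list of per-industry establishment counts for one county. It is an illustration rather than the code used for the map; the function name is hypothetical, and the natural logarithm is an assumed choice of base.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in nats) of the proportions implied by a list of counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    # Categories with zero count contribute nothing (p * ln p tends to 0 as p tends to 0).
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical counties: one evenly spread across four industries, one dominated by a single industry.
even_county = shannon_entropy([10, 10, 10, 10])
skewed_county = shannon_entropy([37, 1, 1, 1])
print(even_county, skewed_county)
```

Entropy is highest when establishments are spread evenly across industries and falls toward zero as one industry dominates, which is why it serves as a diversity measure.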

## lunar ephemeris calculations with PyEphem

While analyzing data for my recent post demonstrating that the lunar cycle does not correlate with crime incidents, I needed to compute daily lunar ephemeris data to match with daily crime incident counts. To accomplish this I turned to PyEphem, a Python package that computes, among other things, lunar position and phase for any given date. I first…