This marks the 100th post to badass data science. I’ve written about everything from Lady Gaga to computational fluid dynamics, usually with a science- or data-related spin. I thought I’d look at my posts analytically rather than simply reminisce. First, here is a tag cloud for the first 99 posts: From this tag cloud,…

## engineer moves into an RV

I recently moved into a travel trailer to lessen the southern California cost of living (and because I like the idea of portable structures as an answer to housing scarcity). This living arrangement sparks my engineering creativity, which is the motivation for this post. Here I discuss RV living from a mechanical and software engineer’s…

## graph database for gene annotation

Lately I’ve been experimenting with graph databases using Neo4j and the Cypher query language. To get a feel for these tools, I created the following gene annotation network. The Cypher commands I used are discussed in this post, followed by a demonstration of querying the database. Creating the Graph Database We are creating the following…

## setting up an Amazon RDS instance on a VPC private subnet

As a scientist, I tend not to think about database security much. However, security is an important concern for the database-driven web applications I write, so I decided to learn more about how to use Amazon EC2 and RDS instances securely. As part of this effort, I created a virtual private cloud (VPC) to hide my…

## pyDome updates: tangential and spoke angles

In a previous post, I introduced pyDome, a Python program for calculating geodesic dome vertices, chords, and faces. I have since added two hub angle computations to the program, and report on that progress here. Face angle calculations still need to be implemented. Angles Between Chords and the Hub Tangent Plane The angle between a…

## building a forecasting application with AngularJS

Lately I’ve been working on an application that forecasts Amazon EC2 spot instance prices. (The forecasting element of this application will be described in a forthcoming blog post once I finalize the forecasting method). The tool needed a user interface, and I decided to write it with AngularJS as a learning exercise. This post describes…

## maximized entropy of a finite distribution

I received the following tweet yesterday from @ProbFact and decided to check it out in more detail: Two-Dimensional Case I generated the following test to investigate the claim: Create four-category discrete distributions where two of the categories have 0.25 probability each, and the third category probability varies between 0.1 and 0.4. The fourth category’s…
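The test described above can be sketched in a few lines of Python. The tweet itself is not reproduced in this excerpt, so the sketch assumes the claim is the standard result that a uniform distribution maximizes Shannon entropy. It also assumes the fourth category takes whatever probability remains (0.5 minus the third category's value), since the excerpt is cut off before saying so. The function name `shannon_entropy` and the 0.05 step size are illustrative choices, not taken from the post.

```python
from math import log2

def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability categories contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Two categories are fixed at 0.25; the third varies from 0.10 to 0.40.
# ASSUMPTION: the fourth category takes the remaining mass, 0.5 - p3.
results = []
for i in range(10, 41, 5):          # integer steps avoid float drift
    p3 = i / 100
    p4 = 0.5 - p3
    results.append((p3, shannon_entropy([0.25, 0.25, p3, p4])))

for p3, h in results:
    print(f"p3={p3:.2f}  H={h:.4f} bits")

# The entropy peaks where all four categories are equal.
best_p3, best_h = max(results, key=lambda r: r[1])
print(f"maximum at p3={best_p3:.2f} with H={best_h:.4f} bits")
```

Under these assumptions the peak sits at the uniform case, where the entropy equals log2(4) = 2 bits.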

## selecting travel trailers by regression

Data Scientist has been thinking recently of moving into a used travel trailer. However, the weight of the trailer to be purchased is limited by that which our hero’s truck can pull. But most online used travel trailer listings only specify length of the vehicle, not its weight. So Data Scientist needed a quick way…

## examining mRNA complexity by annotation region using MapReduce

I became interested in how annotated mRNA regions (e.g., 5′ UTR, coding, and 3′ UTR) vary in information content, speculating that coding regions (CDS) of transcripts will be generally more complex than other regions due to their role in specifying protein recipes. Measuring sequence complexity using Shannon entropy validated this hypothesis, at least with regard…

## test driving the Kepler scientific workflow system

The Kepler scientific workflow system enables scientists and engineers to specify their software pipelines as chains of visual dependencies. Each node in a pipeline runs a specific task, and it does not matter what programming language the task is written in since Kepler only manages the inputs and outputs of each step. Here I describe…

## test driving the Seven Bridges Genomics bioinformatics platform

I recently examined the Seven Bridges Genomics (SBG) platform, building and running a short-read alignment pipeline. Overall, I am impressed by the software. Here I describe my test of the program and then report on my investigation of how it works. Test Drive The test pipeline I devised consisted of two steps, FastQC analysis of…

## proportional-integral (PI) controller in Vensim

In my last post, I discussed an attempt at designing a PID controller using the Kepler Scientific Workflow system. Here I report on a similar (yet successful) development of a proportional-integral (PI) controller in Vensim PLE. Vensim is a software package for describing and simulating dynamic models, particularly those involving feedback. I’ve often described it…

## attempted PID controller with Kepler

I wanted to check out the Kepler scientific workflow system (https://kepler-project.org/), and decided to build a PID controller model with it. Here I report on my results. The following schematic, taken from the Wikipedia entry http://en.wikipedia.org/wiki/PID_controller, shows the basic configuration of a PID controller. PID stands for “proportional integral derivative”, reflecting the fact that the…

## command line Hadoop with a “live” Elastic MapReduce cluster

There are two ways to run Hadoop from the command line on an Elastic MapReduce (EMR) cluster that is active in “waiting” mode. First the hard way: Running Hadoop Directly by Logging into the Cluster’s Head Node The following commands show how you can log into the cluster’s head node and run Hadoop from the…

## listing an Amazon S3 directory’s contents in Java

After much struggle, I have figured out how to list an Amazon S3 directory’s contents in Java using the AWS SDK. Here is how to do it: First, you need to import the following libraries: Then, in your main function (or elsewhere in your code) you need: Be sure to change the “prefix” variable to…

## thank you to my Facebook fan!

An unknown reader consistently shares my links on Facebook. I just wanted to say thanks! – Emily

## chaining map operations in Hadoop

Suppose we have a list of RNA sequences (pictured below), and we want to calculate both the “GC” nucleotide content and the RNA folding energy for each sequence using Hadoop 2.2.0. Furthermore, we want to chain the two operations so that each GC content result is fed to the corresponding sequence’s dG calculation. We also…

## dynamically generated matplotlib images via django

I finally figured out how to serve a dynamically generated matplotlib image through Django. Here is the necessary views.py and urls.py code for an example case: views.py urls.py Result

## sorting out R’s distribution functions

R’s distribution functions come in four flavors: “d”, “p”, “q”, and “r” (e.g., “dnorm”, “pnorm”, “qnorm”, and “rnorm”). I regularly get them mixed up, so I am writing down here what they do for future reference. Density “d” produces the density curve as a function of a random variable. For example, “dnorm” produces: Cumulative Distribution Function…

## industrial diversity correlates with population

It seems logical that U.S. counties having greater populations would support more diverse industry than counties having lesser population. Perhaps this has been proven already, but I recently stumbled upon my own verification of the idea: The above plot shows industry diversity (expressed in the form of Shannon entropy, discussed below) as a function of…

## test driving Amazon Web Services’ Elastic MapReduce

Hadoop provides software infrastructure for running MapReduce tasks, but it requires substantial setup time and availability of a compute cluster to take full advantage of. Amazon’s Elastic MapReduce (EMR) solves these problems, delivering pre-configured Hadoop virtual machines running on the cloud for only the time they are required, and billing only for the computation minutes…

## SymPy: a computer algebra system for Python

In a previous post, I examined Maxima, a free computer algebra system (CAS). Yesterday I discovered SymPy, a Python library that adds CAS functionality to the Python language, and decided to give it the same test drive I gave Maxima. I report the results here, and then provide a brief summary of why using CAS…

## pyDome: a geodesic dome designer

I am pleased to announce the release of pyDome, a geodesic dome designer written in Python. The software is freely available on GitHub at https://github.com/badassdatascience/pyDome. User modification of the code is encouraged. In a previous post (http://badassdatascience.com/2012/04/15/geodesic-dome-design-part-1/), I described the procedure for calculating a geodesic dome’s vertices and chords for a class one dome. pyDome…

## the humble sum of the squared errors

As part of my effort to master statistical theory, I’m deconstructing basic statistics principles in blog posts, on the idea that writing about the principles is the best way to learn them more deeply. The humble sum of the squared errors (SSE) calculation has been a workhorse of statistics for the past 200 years. Here…

## why I read evolutionary psychology papers

It might seem odd that a bioinformatician with an engineer’s training would read evolutionary psychology papers on a regular basis. Here is why I do so: Suppose the human brain changes on an evolutionary time scale. It would follow then that not much has changed about our minds since before the invention of agriculture, and…

## overfitting in statistics and machine learning (part one)

Overfitting is a common risk when designing statistical and machine-learning models. Here I give a brief demonstration of overfitting in action, using simple regression models. A later post will more rigorously address how to quantify and avoid overfitting. We start by sampling data from the process using the R code: Then we produce a linear…

## the first Big Data recession

The “Great Recession” of 2007-2009 may be the first “Big Data” recession, i.e., the first recession which we can examine using the vast information delivered by the advent of Big Data. Certainly the next recession will be studied through that lens. To test whether new data is available that can be cast in an economic…

## designing a battery array to power a CPAP machine

My friend Irene Dubois enjoys camping (tent camping, not RV) but recently began using a CPAP machine while sleeping to treat obstructive sleep apnea. Failure to use this machine results in severe snoring, which would drive fellow campers crazy. The CPAP machine’s power adaptor runs off of AC power, which is usually unavailable at tent…

## diminishing returns on increased sample size

We often invoke the Central Limit Theorem to model the sampling distribution of the mean as a normal distribution, and in doing so usually calculate the standard error of the mean (SEM) using the formula SEM = s/√n. Here s is the sample standard deviation and n is the sample size. The SEM is then used as the…
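A small sketch of the diminishing returns named in the title, assuming the usual SEM definition of the sample standard deviation s divided by the square root of the sample size n. The value s = 10 and the sample sizes are hypothetical, chosen only to make the pattern visible.

```python
from math import sqrt

def sem(s, n):
    """Standard error of the mean for sample standard deviation s and size n."""
    return s / sqrt(n)

s = 10.0  # hypothetical sample standard deviation
sizes = [25, 100, 400, 1600]

# Each step quadruples n but only halves the SEM.
for n in sizes:
    print(f"n={n:5d}  SEM={sem(s, n):.3f}")
```

Because n sits under a square root, halving the standard error requires four times as many samples.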

## simulating the Monty Hall puzzle

The Monty Hall puzzle (named after the original host of Let’s Make a Deal) offers contestants three identical doors. One of these doors leads to the contestant winning a new car, while the other two lead to rooms containing goats. The contestant selects a door, and the host immediately opens up one of the two…
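A minimal simulation sketch, not the code from the original post. The excerpt is cut off before the rules are complete, so this assumes the standard version of the puzzle: the host always opens a goat door that the contestant did not pick, then offers the chance to switch. The names `play` and `win_rate`, the trial count, and the seed are illustrative.

```python
import random

def play(switch, rng):
    """Play one round; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # ASSUMPTION: the host opens a goat door other than the contestant's pick.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, trials=100_000, seed=0):
    """Fraction of rounds won under a fixed stay-or-switch strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials

stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(f"stay:   {stay_rate:.3f}")
print(f"switch: {switch_rate:.3f}")
```

With enough trials the stay strategy should win roughly one time in three and the switch strategy roughly two times in three.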

## comprehensive, anticipatory design in the age of Big Data

Buckminster (Bucky) Fuller wrote in the 1950s that a strategy of “comprehensive anticipatory design science” [1] was required to create technology and systems suitable to sustainable living and sustainable business. This post examines what Bucky meant by comprehensive anticipatory design and then explores how Big Data can play a role in its deployment. Bucky’s vision…

## Cartesian products in Python

Given Python lists A and B, the Cartesian product of the two lists is the set of all ordered pairs (a, b) such that a is an element of A and b is an element of B [1]. In mathematical notation, this is denoted: We can write a more general definition of the Cartesian product for…
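As a short illustration of the definition above, the standard library's `itertools.product` yields the ordered pairs directly, and a nested list comprehension gives the same result. The example lists are made up for the sketch.

```python
from itertools import product

A = [1, 2]
B = ["x", "y"]

# itertools.product yields every ordered pair (a, b) with a in A and b in B.
pairs = list(product(A, B))
print(pairs)  # [(1, 'x'), (1, 'y'), (2, 'x'), (2, 'y')]

# The same result written as a nested list comprehension.
pairs_manual = [(a, b) for a in A for b in B]
print(pairs_manual == pairs)  # True
```

`product` accepts any number of iterables, so the same call extends to three or more lists.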

## pondering Chebyshev’s inequality

Chebyshev’s inequality states that the probability that a random variable falls within k standard deviations of the mean of a probability distribution is at least 1 − 1/k². Checking this out in R for an arbitrary gamma distribution yields: We can compare the two areas by first plotting the area under the gamma curve within k standard deviations…
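The post does this check in R; the sketch below swaps in a Python simulation using only the standard library, so it is an analogue and not the original code. It assumes the usual form of Chebyshev's bound, 1 − 1/k², and uses hypothetical gamma parameters (shape 2, scale 2). For `random.gammavariate(alpha, beta)` the mean is alpha·beta and the variance is alpha·beta².

```python
import random

def fraction_within(samples, mean, sd, k):
    """Fraction of samples lying within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in samples) / len(samples)

shape, scale = 2.0, 2.0              # hypothetical gamma parameters
mean = shape * scale                 # gamma mean: shape * scale
sd = (shape ** 0.5) * scale          # gamma sd: sqrt(shape) * scale

rng = random.Random(0)
samples = [rng.gammavariate(shape, scale) for _ in range(50_000)]

# Compare the simulated coverage with Chebyshev's lower bound for several k.
checks = {}
for k in (1.5, 2, 3):
    observed = fraction_within(samples, mean, sd, k)
    bound = 1 - 1 / k**2
    checks[k] = (observed, bound)
    print(f"k={k}: observed={observed:.3f}  bound={bound:.3f}")
```

The bound holds for any distribution with finite variance, so the simulated coverage should sit at or above it for every k, usually with room to spare.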

## Maxima: a free symbolic algebra program

I recently discovered Maxima, a free (as in GNU) computer algebra system that can perform symbolic integration and differentiation, as well as numerical computation. Here is a test drive using the normal distribution: We can first verify that the area under the normal density curve equals one: We then compute the first moment using symbolic…

## a first look at SAS OnDemand

I recently started using SAS OnDemand, the SAS Institute’s web-based interface to their SAS computing platform, as part of a course I am taking in statistical computing. The program is one of the smoothest web applications I have ever used; shifting from the stand-alone SAS application to SAS OnDemand proved very intuitive. The code editor…

## church to bar ratio, by U.S. county

Church to bar ratio by county from U.S. Census Bureau data: The brighter the color, the higher the church to bar ratio. Counties missing data necessary for the computation are shown in black. Method From the 2011 County Business Patterns data published at http://www.census.gov/econ/cbp/download/index.htm, I extracted the number of establishments in each county that have…