The Challenge: I want to explore the statistics of RNA sequencing (RNA-seq) on next-generation sequencing (NGS) platforms in greater detail, so I thought I’d start by simulating read counts to experiment with. This post details how I constructed a simulated set of read counts, examines its concordance with the expected negative binomial distribution of the […]
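As a rough illustration of the idea in this excerpt, here is a minimal sketch of simulating negative-binomial read counts. It is not the post's own code (the post works in R, and its code is not shown here); it is written in Python using only the standard library. It relies on the fact that a Poisson count whose rate is itself gamma-distributed follows a negative binomial distribution. The mean of 20, the size (dispersion) parameter of 5, and all function names are assumptions chosen purely for illustration.

```python
import math
import random
from statistics import mean, variance


def sample_poisson(rate, rng):
    """Draw one Poisson variate by multiplying uniforms (Knuth's method).

    Adequate for modest rates; exp(-rate) underflows for very large rates.
    """
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_negative_binomial(mean_count, size, rng):
    """Draw one negative-binomial count as a gamma-Poisson mixture.

    The Poisson rate is drawn from a gamma distribution with shape `size`
    and scale `mean_count / size`, so the counts have mean `mean_count`
    and variance `mean_count + mean_count**2 / size`.
    """
    rate = rng.gammavariate(size, mean_count / size)
    return sample_poisson(rate, rng)


# Illustrative (assumed) parameters: mean 20 reads, size 5.
rng = random.Random(1)
counts = [sample_negative_binomial(20.0, 5.0, rng) for _ in range(5000)]

sample_mean = mean(counts)
sample_variance = variance(counts)
print(f"sample mean:     {sample_mean:.2f}  (theory: 20)")
print(f"sample variance: {sample_variance:.2f}  (theory: 20 + 20**2 / 5 = 100)")
```

The sample variance should land well above the sample mean, which is the overdispersion that distinguishes negative-binomial counts from plain Poisson counts.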

## EC2 spot instance price change: no correlation with day of week

My plans for world domination involve heavy use of Amazon EC2 instances, but I have to be frugal about it, so I’m running spot instances to save cash. A means of forecasting spot instance prices would therefore be helpful. Thus far I’ve had little success using mainstream forecasting tools such as ARIMA and exponential smoothing. […]

## sorting out R’s distribution functions

R’s distribution functions come in four flavors: “d”, “p”, “q”, and “r” (e.g., “dnorm”, “pnorm”, “qnorm”, and “rnorm”). I regularly get them mixed up, so I am writing down what they do here for future reference. Density: “d” produces the density curve as a function of a random variable. For example, “dnorm” produces: Cumulative Distribution Function […]
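To make the four flavors concrete, here is a small sketch for the standard normal distribution. It is written in Python with the standard library's `statistics.NormalDist` rather than R, so treat the mapping to R in the comments as a rough analogue, not as the post's own code. The input values (0, 1.96, 0.975) and the seed are arbitrary choices for illustration.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# "d" flavor (R: dnorm(0)): height of the density curve at a value.
density = std_normal.pdf(0.0)

# "p" flavor (R: pnorm(1.96)): probability of falling at or below a value.
cumulative = std_normal.cdf(1.96)

# "q" flavor (R: qnorm(0.975)): the value below which a given probability lies;
# this is the inverse of the "p" flavor.
quantile = std_normal.inv_cdf(0.975)

# "r" flavor (R: rnorm(5)): random draws from the distribution.
draws = std_normal.samples(5, seed=42)

print(f"density at 0:        {density:.4f}")
print(f"P(X <= 1.96):        {cumulative:.4f}")
print(f"97.5th percentile:   {quantile:.4f}")
print(f"five random draws:   {[round(x, 3) for x in draws]}")
```

A handy way to keep "p" and "q" straight is that they undo each other: feeding the result of the "q" call back into the "p" call returns the original probability.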

## overfitting in statistics and machine learning (part one)

Overfitting is a common risk when designing statistical and machine-learning models. Here I give a brief demonstration of overfitting in action, using simple regression models. A later post will more rigorously address how to quantify and avoid overfitting. We start by sampling data from the process using the R code: Then we produce a linear […]

## CPAP and the “bends”

Can using a CPAP machine cause decompression sickness (a.k.a. the “bends”)? No. The following discussion outlines why. Decompression Sickness: As they descend underwater, scuba divers breathe air compressed to the same pressure as the surrounding water. For example, at sea level they breathe air at a pressure of 1 atm, while at a depth of […]

## RNAfold and sequence length

I’ve been looking for a way to compare RNAfold [1] dG results for two RNA sequences that differ in length. My initial thought was to simply divide the computed dG values by sequence length (i.e., normalize by sequence length) and then compare the results. The analysis presented below shows why this won’t work. […]

## British crime vs. internet use

Jordan Cashmore, a student at Nottingham Trent University, recently asked me for help determining if correlation and causality exist between the rise of internet use over the last 15 years and the drop in British crime over the same period. Jordan, for his dissertation, proposes that correlation and causality do exist based on criminology theory, […]

## DIY mood tracker with LimeSurvey

My friend and fellow adventurer Irene Dubois (not her real name, of course) recently came home from a psychiatric appointment needing a way to track her daily mood. She tested several smartphone apps designed for this purpose, but found them too inflexible. Consequently, she asked me to create a custom mood tracker for her. The design requirements […]

## system dynamics model of the Oregon Health Plan’s client caseload

I developed this model and wrote this description in 2007 as an analyst for the State of Oregon. We ultimately never used or published this model; I’m posting it here in hopes that someone will find it useful when a Google search delivers it. Introduction: The State of Oregon offers medical assistance to low-income individuals […]

## Austin heat wave office wager

Data scientist enters the office betting pool: whoever most accurately predicts the day Austin’s heat wave breaks (the first day with a high temperature below 100 degrees Fahrenheit) wins. Seeking a defensible approach, our hero generates an ARIMA time-series forecast based on the last eleven years of daily high temperatures: The model suggests September 1st […]