I recently performed analysis for a media watchdog organization regarding Big Tech’s political influence. Consequently, I’ve received some (American) right-wing press cut. Not too happy that only one side of the aisle is paying attention to what, in my opinion, is a centrist concern, but I’ll take whatever attention (to the issue) that I can…

# Tag: statistics

## FOREX correlation and causality

Consider the seven major currency pairs, sampled hourly over the last six months. We calculate the pairwise Pearson correlation coefficients to determine the degree to which each pair “moves” together: values near one or negative one indicate high correlation; values with lower absolute value indicate less. Positive values indicate movement in the same direction; negative…
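As a minimal sketch of the calculation described above, the Pearson coefficient of two equal-length series can be computed as follows. The function name `pearson` and the toy series are illustrative assumptions, not the data or code used in the post.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical hourly price changes (made-up numbers, not real FOREX data).
pair_a = [0.1, -0.2, 0.3, 0.0, -0.1]
pair_b = [0.2, -0.4, 0.6, 0.0, -0.2]   # moves with pair_a
pair_c = [-0.1, 0.2, -0.3, 0.0, 0.1]   # moves against pair_a

print(pearson(pair_a, pair_b))  # close to +1: same direction
print(pearson(pair_a, pair_c))  # close to -1: opposite direction
```

Running this over every combination of the seven pairs would give the full correlation matrix.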

## women’s style recommendation with artificial intelligence (part #2)

In “women’s style recommendation with artificial intelligence (part #1)”, I introduced my work toward developing artificial intelligence (AI) for fashion and style recommendation. Essentially, it’s an expert system built on a Bayesian belief network. Now I discuss model validation and next steps in the design iteration process. I first wanted to see if the trained…

## women’s style recommendation with artificial intelligence (part #1)

Introduction

We know several basic style “rules” (ha!) based on body shape:

Skirts:

“Apple” Body Shape:

- IF body shape is apple AND skirt has front zipper THEN don’t wear
- IF body shape is apple AND skirt has side zipper THEN wear
- IF body shape is apple AND skirt has no zipper THEN wear

“Rectangular” Body…
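The three “apple” rules quoted above could be encoded as a simple lookup table, sketched below. The names `SKIRT_RULES` and `should_wear_skirt` are hypothetical, and this is only a plain rule table, not the Bayesian belief network the series goes on to describe.

```python
# Rule table keyed by (body shape, skirt zipper position).
# Only the three "apple" rules quoted in the excerpt are encoded.
SKIRT_RULES = {
    ("apple", "front"): False,  # front zipper: don't wear
    ("apple", "side"): True,    # side zipper: wear
    ("apple", "none"): True,    # no zipper: wear
}

def should_wear_skirt(body_shape, zipper):
    """Return True/False when a rule applies, or None when no rule is known."""
    return SKIRT_RULES.get((body_shape, zipper))

print(should_wear_skirt("apple", "front"))       # False
print(should_wear_skirt("apple", "side"))        # True
print(should_wear_skirt("rectangular", "side"))  # None (no rule encoded here)
```

Returning `None` for unknown combinations keeps "no rule" distinct from "don’t wear".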

## machine learning in FOREX (part one: establishing a performance baseline)

Introduction We’ve been applying machine learning to FOREX price prediction. The performance of our models varies widely, so to establish a baseline we created a simple linear regression model against which we can compare the performance of more sophisticated models. What We Are Trying To Do Given a time series of 26 four-hour price samples, we…

## autocorrelation in FOREX

To inform the construction of a machine learning-based price prediction algorithm, we want to understand how many lags prove statistically significant with regard to autocorrelation in the seven major FOREX pairs. So we first choose 10,000 random time points between January 1, 2000 and January 1, 2017 for each of the seven pairs. Then we…
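For readers unfamiliar with the quantity involved, here is a minimal sketch of the sample autocorrelation at a given lag. The function name, the toy series, and the ±1.96/√n significance band (the usual large-sample approximation under a white-noise null) are assumptions for illustration; the excerpt does not say which significance test the post uses.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer `lag`."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(series)")
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return num / denom

# Toy series that alternates sign: strongly negative at lag 1, positive at lag 2.
series = [1.0, -1.0] * 50
bound = 1.96 / math.sqrt(len(series))  # approximate 95% band under white noise

for lag in (1, 2, 3):
    r = autocorrelation(series, lag)
    print(lag, round(r, 3), "significant" if abs(r) > bound else "not significant")
```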

## artificial intelligence in fashion (part one: brainstorming)

Brainstorming as usual: Fashion dictums involve many IF-THEN-ELSE rules. One can convert this into a decision engine (inference engine). User specifies their body shape, and a recommendation engine selects suitable clothing for them, taking into account the user’s tastes. Upload an image of a dress you want to buy, and specify the dress’s given size….

## Bayesian method for filtering out mRNA turnover rate bias from siRNA knockdown measurements

Abstract siRNA performance prediction calculations for a given siRNA may be divided into two broad categories: functions of the siRNA’s sequence, hereafter referred to as “intrinsic” properties of the siRNA, and functions of the target mRNA, hereafter referred to as “extrinsic” properties of the siRNA. When training a statistical or machine learning model to select…

## DIY Twitter analytics (part 2: correlations)

I’ve been working with the Twitter API to develop my own Twitter analytics tool chain, and have been documenting the results on this blog. My last post on the subject described clustering my followers by their hashtag use to see whose tweets are most like mine. My goal for this project is to figure out best…

## HRC Corporate Equality Index correlates with Fortune’s 50 most admired companies

The Human Rights Campaign, one of America’s largest civil rights groups, scores companies in its yearly Corporate Equality Index (CEI) according to their treatment of lesbian, gay, bisexual, and transgender employees [1]. The companies automatically evaluated are the Fortune 1000 and American Lawyer’s top 200. Additionally, any sufficiently large private sector organization can request inclusion…

## clustering stocks by price correlation (part 2)

In my last post, “clustering stocks by price correlation (part 1)”, I performed hierarchical clustering of NYSE stocks by correlation in weekly closing price. I expected the stocks to cluster by industry, and found that they did not. I proposed several explanations for this observation, including that perhaps I chose a poor distance metric for…

## clustering stocks by price correlation (part 1)

I’ve been building my knowledge of clustering techniques to apply to genetic circuit engineering, and decided to try the same tools for stock price analysis. In this post I describe building a hierarchical cluster of stocks by pairwise correlation in weekly price, to see how well the stocks cluster by industry, and compare the derived…

## simulating RNA-seq read counts

The Challenge I want to explore the statistics of RNA sequencing (RNA-seq) on next-generation sequencing (NGS) platforms in greater detail, so I thought I’d start by simulating read counts to experiment with. This post details how I constructed a simulated set of read counts, examines its concordance with the expected negative binomial distribution of the…

## reporting negative results

Two of my recent posts have reported negative results, meaning that no meaningful effects were found during the investigations. Had these investigations been framed as hypothesis tests, we would have failed to reject the null hypotheses. Sounds boring. However, there are good reasons to report these results. The first is that negative results still generate…

## fuzzy logic toolkit in C++

I recently came across some old C++ code I wrote about 10 years ago to assist fuzzy logic reasoning. This program is now posted on GitHub at https://github.com/badassdatascience/fuzzy-logic-toolkit and is described below. An example of the tool in action (with code) follows the description. Fuzzy Logic Suppose we have a numerical value for distance to…

## Apache Spark and stock price causality

The Challenge I wanted to compute Granger causality (described below) for each pair of stocks listed in the New York Stock Exchange. Moreover, I wanted to analyze between one and thirty lags for each pair’s comparison. Needless to say, this requires massive computing power. I used Amazon EC2 as the computing platform, but needed a…

## Kaplan-Meier estimator in Python

The following Python class computes and draws Kaplan-Meier product limit estimators for given data. An example of how to use the class follows the code. Code Usage Example
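The post’s own class is not reproduced in this excerpt. As a stand-in, here is a minimal sketch of the product-limit calculation itself, written as a plain function rather than a class and without the plotting. The name `kaplan_meier` and the toy data are assumptions for illustration.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each distinct event time.

    durations: time to event or to censoring for each subject
    observed:  True if the event occurred, False if the subject was censored
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Events (not censorings) occurring exactly at time t.
        events = sum(1 for d, e in zip(durations, observed) if e and d == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve

# Toy data: five subjects, two of them censored (made-up numbers).
durations = [1, 2, 2, 3, 4]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)

for time, prob in curve:
    print(time, round(prob, 3))
```

With this toy data the estimate steps down to roughly 0.8, 0.6, and 0.3 at times 1, 2, and 3. Censored subjects reduce the at-risk count without causing a step.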

## GNU Octave: a free, open source MATLAB-like language for numerical computing

I tend to use Python with the Numpy, SciPy, and Matplotlib stack whenever I have to do scientific computing. For statistical computing I use R whenever this Python stack does not provide the necessary features. However, I want to draw readers’ attention to another tool for free, open-source numerical computing: GNU Octave (hereafter called “Octave”),…

## maximized entropy of a finite distribution

I received the following tweet yesterday from @ProbFact and decided to check it out in more detail: Two-Dimensional Case I generated the following test to investigate the claim: Create four-category discrete distributions where two of the categories have 0.25 probability each, and the third category probability varies between 0.1 and 0.4. The fourth category’s…
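The test setup described above can be sketched as follows. Because the four probabilities must sum to one, the fourth category is taken as 0.5 minus the third. The 0.01 step size and the use of base-2 logarithms are assumptions; the tweet itself is not reproduced in this excerpt.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability categories contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Vary the third category from 0.10 to 0.40 in steps of 0.01 (assumed step).
results = []
for i in range(10, 41):
    p3 = i / 100
    p4 = 0.5 - p3  # remainder, so that all four probabilities sum to one
    results.append((p3, entropy([0.25, 0.25, p3, p4])))

best_p3, best_h = max(results, key=lambda r: r[1])
print(best_p3, best_h)
```

On this grid the entropy peaks where all four categories have probability 0.25, at 2 bits.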

## selecting travel trailers by regression

Data Scientist has been thinking recently of moving into a used travel trailer. However, the weight of the trailer to be purchased is limited by that which our hero’s truck can pull. But most online used travel trailer listings only specify length of the vehicle, not its weight. So Data Scientist needed a quick way…

## sorting out R’s distribution functions

R’s distribution functions come in four flavors: “d”, “p”, “q”, and “r” (e.g., “dnorm”, “pnorm”, “qnorm”, and “rnorm”). I regularly get them mixed up, so am writing down here what they do for future reference. Density “d” produces the density curve as a function of a random variable. For example, “dnorm” produces: Cumulative Distribution Function…
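For readers who work in Python as well, a rough analogue of the four flavors for the normal distribution is available in the standard library’s `statistics.NormalDist`. The mapping below is a sketch for comparison only; the variable names are illustrative, and random draws will not match R’s `rnorm` output.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

density = std_normal.pdf(0.0)            # "d": like dnorm(0)
cumulative = std_normal.cdf(1.96)        # "p": like pnorm(1.96)
quantile = std_normal.inv_cdf(0.975)     # "q": like qnorm(0.975)
draws = std_normal.samples(5, seed=42)   # "r": like rnorm(5)

print(round(density, 4))     # about 0.3989
print(round(cumulative, 4))  # about 0.975
print(round(quantile, 4))    # about 1.96
print(len(draws))            # 5 random values
```

Note that "p" and "q" are inverses of each other: `inv_cdf(cdf(x))` returns `x` up to rounding.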

## SymPy: a computer algebra system for Python

In a previous post, I examined Maxima, a free computer algebra system (CAS). Yesterday I discovered SymPy, a Python library that adds CAS functionality to the Python language, and decided to give it the same test drive I gave Maxima. I report the results here, and then provide a brief summary of why using CAS…

## the humble sum of the squared errors

As part of my effort to master statistical theory, I’m deconstructing basic statistics principles in blog posts, on the idea that writing about the principles is the best way to learn them more deeply. The humble sum of the squared errors (SSE) calculation has been a workhorse of statistics for the past 200 years. Here…

## overfitting in statistics and machine learning (part one)

Overfitting is a common risk when designing statistical and machine-learning models. Here I give a brief demonstration of overfitting in action, using simple regression models. A later post will more rigorously address how to quantify and avoid overfitting. We start by sampling data from the process using the R code: Then we produce a linear…

## diminishing returns on increased sample size

We often invoke the Central Limit Theorem to model the sampling distribution of the mean as a normal distribution, and in doing so usually calculate the standard error of the mean (SEM) using the formula $\mathrm{SEM} = s / \sqrt{n}$. Here s is the sample standard deviation and n is the sample size. The SEM is then used as the…
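The diminishing return in the title follows directly from the standard SEM formula, s divided by the square root of n: quadrupling the sample size only halves the SEM. A minimal sketch, with an assumed sample standard deviation of 10 and illustrative sample sizes:

```python
import math

def sem(sample_sd, n):
    """Standard error of the mean: s / sqrt(n)."""
    return sample_sd / math.sqrt(n)

s = 10.0  # assumed sample standard deviation, for illustration only
for n in (25, 100, 400, 1600):
    print(n, sem(s, n))
# Each fourfold increase in n halves the SEM: 2.0, 1.0, 0.5, 0.25
```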

## pondering Chebyshev’s inequality

Chebyshev’s inequality states that the probability that a random variable falls within k standard deviations of the mean of a probability distribution is at least $1 - 1/k^2$. Checking this out in R for an arbitrary gamma distribution yields: We can compare the two areas by first plotting the area under the gamma curve within k standard deviations…
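The post’s R output is not reproduced in this excerpt. As a rough stand-in, the sketch below checks the bound 1 - 1/k² by simulation in Python. The gamma parameters (shape 2, scale 1), the seed, and the sample size are arbitrary assumptions, and the post itself compares areas under the curve rather than simulating.

```python
import math
import random

def fraction_within(samples, mean, sd, k):
    """Fraction of samples strictly within k standard deviations of the mean."""
    return sum(1 for x in samples if abs(x - mean) < k * sd) / len(samples)

rng = random.Random(0)       # fixed seed so the run is repeatable
shape, scale = 2.0, 1.0      # arbitrary gamma parameters (assumed)
mean = shape * scale         # gamma mean = shape * scale
sd = math.sqrt(shape) * scale  # gamma sd = sqrt(shape) * scale

samples = [rng.gammavariate(shape, scale) for _ in range(100_000)]

for k in (2, 3, 4):
    observed = fraction_within(samples, mean, sd, k)
    bound = 1 - 1 / k ** 2
    print(k, round(observed, 4), ">=", round(bound, 4))
```

The observed fractions sit well above the bound, which is expected: Chebyshev’s inequality holds for any distribution with finite variance, so it is loose for any particular one.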

## Maxima: a free symbolic algebra program

I recently discovered Maxima, a free (as in GNU) computer algebra system that can perform symbolic integration and differentiation, as well as numerical computation. Here is a test drive using the normal distribution: We can first verify that the area under the normal density curve equals one: We then compute the first moment using symbolic…

## a first look at SAS OnDemand

I recently started using SAS OnDemand, the SAS Institute’s web-based interface to their SAS computing platform, as part of a course I am taking in statistical computing. The program is one of the smoothest web applications I have ever used; shifting from the stand-alone SAS application to SAS OnDemand proved very intuitive. The code editor…

## statistical reasoning in “The Simpsons”, part two

In a previous post, I wrote about how Lisa Simpson applied statistical reasoning during an episode of The Simpsons. Another recent episode (“The Saga of Carl Carlson”, aired 19 May 2013) demonstrated probabilistic thinking: At the science museum, the Simpsons family enters the Hall of Probability: There they see a demonstration of the binomial distribution, with…

## data scientist goes to graduate school

I’ve signed up for a graduate Statistics program at Texas A&M and am now attending the first course. It is an online program; I’ll stay in Austin to work full time while attending. This will change my blogging in two major ways: First, I’ll have less time for writing, so posts may appear less frequently….

## simulated confidence intervals

In my last post, I demonstrated how repeated sampling from any probability distribution produces a normally-distributed distribution of the sample means, given a sufficiently large sample size. Here I describe how to use this distribution of sample means to define a confidence interval around the mean of any given sample, and simulate production of such…
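One way to simulate such intervals is sketched below: draw many samples, build a normal-approximation 95% interval around each sample mean, and count how often the interval covers the true mean. The exponential population, sample size, trial count, and seed are assumptions, not the post’s settings.

```python
import math
import random
import statistics

rng = random.Random(1)   # fixed seed so the run is repeatable
POP_MEAN = 1.0           # true mean of an Exponential(rate=1) population
N = 50                   # observations per sample (assumed)
TRIALS = 2000            # number of simulated samples (assumed)
Z = 1.96                 # normal critical value for a 95% interval

covered = 0
for _ in range(TRIALS):
    sample = [rng.expovariate(1.0) for _ in range(N)]
    m = statistics.fmean(sample)
    sem = statistics.stdev(sample) / math.sqrt(N)
    if m - Z * sem <= POP_MEAN <= m + Z * sem:
        covered += 1

coverage = covered / TRIALS
print(coverage)  # should land near, though typically a little below, 0.95
```

Coverage tends to fall slightly short of the nominal 95% here because the population is skewed and the sample standard deviation is itself an estimate.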

## demonstration of the central limit theorem through simulation

The central limit theorem (CLT) states that the sampling distribution of the mean of a population that is not normally distributed is approximately normally distributed around the population mean, given a sufficiently large sample size. I do not currently have the math chops to prove the CLT, but can provide the following simulation to demonstrate…
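A simulation of this kind can be sketched in a few lines. The exponential population, sample size, replicate count, and seed below are assumptions chosen for illustration rather than the post’s own settings.

```python
import math
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable
N = 50                    # observations per sample (assumed)
REPLICATES = 5000         # number of sample means to collect (assumed)

# Exponential(rate=1) is strongly skewed, with mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(N))
    for _ in range(REPLICATES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(round(mean_of_means, 3))   # close to the population mean, 1.0
print(round(sd_of_means, 3))     # close to sigma / sqrt(N)
print(round(1 / math.sqrt(N), 3))
```

A histogram of `sample_means` would show the roughly bell-shaped curve the theorem predicts, even though the underlying population is far from normal.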

## monte-carlo simulation in C++ with MCS-libre

Monte-Carlo simulation is a sometimes elegant (and sometimes crude) method for simulating complex systems. Parameters that affect the system are selected from random distributions and the system response to these values is then calculated. Repeating this process many times often produces useful information about the system. The method is especially useful for examining non-linear systems…
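The loop described above can be sketched generically as follows. This is plain Python and does not use the MCS-libre API; the `system_response` function, the parameter distributions, and the seed are hypothetical, chosen only to show a non-linear response.

```python
import random
import statistics

def system_response(load, stiffness):
    """Hypothetical non-linear system: response is load divided by stiffness."""
    return load / stiffness

rng = random.Random(7)   # fixed seed so the run is repeatable
responses = []
for _ in range(20_000):
    load = rng.gauss(100.0, 10.0)      # assumed normal input parameter
    stiffness = rng.uniform(4.0, 6.0)  # assumed uniform input parameter
    responses.append(system_response(load, stiffness))

mean_response = statistics.fmean(responses)
naive_response = system_response(100.0, 5.0)  # response at the mean inputs

print(round(mean_response, 2), naive_response)
```

The simulated mean comes out slightly above the response at the mean inputs (20.0), because the response is non-linear in `stiffness`. Plugging in average parameter values would miss that effect.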

## comparing mRNA half-life survival curves

In my last post, I illustrated how the Kaplan-Meier estimator can be used to estimate the survival curve of mRNA half-lives. In this post I will expand on that analysis and show how to compare two mRNA half-life Kaplan-Meier curves, each corresponding to a measured gene outcome, to see if mRNA half-life differs between outcomes….

## mRNA half-life survival curve estimation

In a recent post, I demonstrated the use of the Kaplan-Meier estimator for estimating survival curves of fictional characters undergoing treatment in a fictional drug trial. Here I illustrate the Kaplan-Meier estimator on real data, data that differs from typical survival analysis data in that the event under consideration is neither time until death…

## the Kaplan-Meier estimator

In my last post, I wrote about censored data. This post continues the survival analysis theme by focusing on estimation of survival curves. In survival and reliability analysis, it is useful to determine the survival curve for the population under study. This is the curve defined by the probability that a random variable indicating the…