why this article? A journalist recently asked me to comment on the feasibility of a conspiracy theory involving one of Facebook’s AI algorithms. He wanted to know whether it was likely, or even possible, that Facebook was using its existing algorithm for suicide video detection to screen and censor conservative media sources. To answer the…

# Category: statistics

## FOREX correlation and causality

Consider the seven major currency pairs, sampled hourly over the last six months. We calculate the pairwise Pearson correlation coefficients to determine the degree to which the pairs “move” together: values near one or negative one indicate high correlation, while values with lower absolute value indicate weaker correlation. Positive values indicate movement in the same direction; negative…
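A minimal sketch of the pairwise calculation, in plain Python. The price lists are made up for illustration, only three pairs are shown, and the excerpt does not say whether the original analysis used raw prices or returns, so treat the inputs as an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical hourly closes, invented for this example only.
prices = {
    "EUR/USD": [1.10, 1.11, 1.12, 1.11, 1.13],
    "GBP/USD": [1.30, 1.31, 1.33, 1.32, 1.34],
    "USD/CHF": [0.95, 0.94, 0.93, 0.94, 0.92],
}

# Correlation for every ordered pair of symbols (the matrix is symmetric).
pairs = sorted(prices)
matrix = {(a, b): pearson(prices[a], prices[b]) for a in pairs for b in pairs}

for a in pairs:
    print(a, [round(matrix[(a, b)], 3) for b in pairs])
```

In this toy data USD/CHF is an exact mirror of EUR/USD, so that entry comes out at negative one, the "opposite direction" case described above.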

## fans control the music: using AI to measure fan enthusiasm at EDC

We invented technology to enhance the fan/performer connection. Vote for Team Ambience at EDC! DJs and more traditional musicians require real-time audience feedback during performances. However, we often cannot see our audience (their movement, their facial expressions, etc.) during shows due to stage lighting. So we cannot gauge their enthusiasm, and cannot alter our performance to respond….

## women’s style recommendation with artificial intelligence (part #2)

In “women’s style recommendation with artificial intelligence (part #1)”, I introduced my work toward developing artificial intelligence (AI) for fashion and style recommendation. Essentially, it’s an expert system built on a Bayesian belief network. Now I discuss model validation and next steps in the design iteration process. I first wanted to see if the trained…

## women’s style recommendation with artificial intelligence (part #1)

Introduction We know several basic style “rules” (ha!) based on body shape:

Skirts, “Apple” Body Shape:

- IF body shape is apple AND skirt has front zipper THEN don’t wear
- IF body shape is apple AND skirt has side zipper THEN wear
- IF body shape is apple AND skirt has no zipper THEN wear

“Rectangular” Body…
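The apple-shape skirt rules quoted above can be written as a simple lookup. This is only an illustrative sketch: the function name and the string labels are hypothetical, and it is not the Bayesian belief network the series goes on to describe.

```python
def skirt_advice(body_shape, zipper):
    """Look up the apple-shape skirt rules from the excerpt.

    body_shape: e.g. "apple" (label chosen for this sketch)
    zipper:     "front", "side", or "none" (labels chosen for this sketch)
    Returns "wear", "don't wear", or None when the excerpt gives no rule.
    """
    rules = {
        ("apple", "front"): "don't wear",
        ("apple", "side"): "wear",
        ("apple", "none"): "wear",
    }
    return rules.get((body_shape, zipper))

print(skirt_advice("apple", "front"))
print(skirt_advice("apple", "side"))
```

Combinations the excerpt does not cover return `None` rather than a guess.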

## machine learning in FOREX (part one: establishing a performance baseline)

Introduction We’ve been applying machine learning to FOREX price prediction. The performance of our models varies widely, so to establish a baseline we created a simple linear regression model against which we can compare the performance of more sophisticated models. What We Are Trying To Do Given a time-series of 26 four-hour price samples, we…

## autocorrelation in FOREX

To inform the construction of a machine learning-based price prediction algorithm, we want to understand how many lags prove statistically significant with regard to autocorrelation in the seven major FOREX pairs. So we first choose 10,000 random time points between January 1, 2000 and January 1, 2017 for each of the seven pairs. Then we…
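As a rough illustration of what "statistically significant lags" can mean, here is a sketch that computes the sample autocorrelation and flags lags falling outside an approximate 95% band of ±1.96/√n. That band is a common white-noise approximation and an assumption of this sketch; the excerpt does not state which test the post used. The series is synthetic, not FOREX data.

```python
import math
import random

def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom

def significant_lags(series, max_lag):
    """Lags whose autocorrelation lies outside the approximate 95% band."""
    bound = 1.96 / math.sqrt(len(series))
    return [k for k in range(1, max_lag + 1)
            if abs(autocorrelation(series, k)) > bound]

# Synthetic AR(1) series, used here only to exercise the functions.
random.seed(0)
series = [0.0]
for _ in range(499):
    series.append(0.8 * series[-1] + random.gauss(0, 1))

lags = significant_lags(series, 10)
print(lags)
```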

## summary of our FOREX experiments and next steps

We started by building a support vector machine model based on features used in harmonic trading, with the idea that ideal “harmonic” ratios can be learned rather than explicitly specified. This worked on testing sets but not when we started trading with it. We abandoned the model before we realized that we need to manage…

## artificial intelligence in fashion (part one: brainstorming)

Brainstorming as usual: Fashion dictums involve many IF-THEN-ELSE rules. One can convert this into a decision engine (inference engine). User specifies their body shape, and a recommendation engine selects suitable clothing for them, taking into account the user’s tastes. Upload an image of a dress you want to buy, and specify the dress’s given size….

## rapidly identifying potential CRISPR/Cas9 off-target sites (part one)

Before we can score segments in the genome having a small number of mismatches to a CRISPR for their off-target risk, we must first find these segments. Searching for every possible mismatch permutation proves computationally expensive, so we apply the following heuristic: We only search for mismatches in the top positions relevant to CRISPR efficiency….

## applying market basket analysis to the stock market

I’ve started learning market basket analysis and decided to test drive my knowledge against the stock market: I own a (proprietary) database of predicted stock causality relationships. An export to tabular form looks something like this: I won’t tell you what the “causality” is, as that is the proprietary part, and the example data shown…

## pseudo-harmonic FOREX prediction with machine learning (part one)

“Harmonic” trading methods seek patterns in the relationships between neighboring peaks and valleys in the time series. Particularly, harmonic traders seek pre-specified ratios in the price differences among a series of peaks and valleys. For example, a trader might observe the following pattern: Let A, B, C, D, and E be the points in the…

## picking stocks by graph database (part one)

Historical stock price data comes readily available at daily resolution. So we calculated the Granger causality for each pair of stocks we hold data for, at one and two day lags (testing the question “does daily percent change in volume for stock X Granger cause daily percent change in adjusted close price for stock Y?”)….

## Bayesian method for filtering out mRNA turnover rate bias from siRNA knockdown measurements

Abstract siRNA performance prediction calculations for a given siRNA may be divided into two broad categories: functions of the siRNA’s sequence, hereafter referred to as “intrinsic” properties of the siRNA, and functions of the target mRNA, hereafter referred to as “extrinsic” properties of the siRNA. When training a statistical or machine learning model to select…

## how I make a living: what is bioinformatics? (part #1)

I’m constantly asked to explain what I do for a living. Here is an attempt to do so in laypersons’ terms. I’ll assume my readers are non-scientists and non-engineers, but that they’ve taken a high school biology class. “Bioinformatics” is the application of mathematics and computer science to biological data, particularly molecular biology data. By…

## DIY Twitter analytics (part 3: hashtag network)

I’ve been mathematically analyzing my Twitter feed to determine how best to position my tweets for maximum impact, and have been documenting the work on this blog. While I’ve not come to any brilliant conclusions yet, I’ve made progress. My first post on the subject described clustering my followers by their hashtag use to see…

## the science of gender identity (part 4: summary)

To prepare for a book I intend to write on the science of gender identity, I drafted the following three blog posts to collect my thoughts. They are highly technical; I need to recast the content for the layperson. I also assembled some of my own biological data to analyze. The first blog post, http://badassdatascience.com/2015/06/06/sci-gender-identity-01/,…

## DIY Twitter analytics (part 2: correlations)

I’ve been working with the Twitter API to develop my own Twitter analytics tool chain, and have been documenting the results on this blog. My last post on the subject described clustering my followers by their hashtag use to see whose tweets are most like mine. My goal for this project is to figure out best…

## DIY Twitter analytics (part 1: clustering related users)

I’ve started working with the Twitter API to develop my own Twitter analytics tool chain. My goals are to figure out who the influencers in my subjects are, figure out how best to position my tweets, etc. I could certainly pay for this service, but then I wouldn’t learn any new technical skills in the…

## the science of gender identity (part 3: psychology)

This is the third post in a multi-part series surveying the current science of gender identity, particularly with regard to the transgendered population. My first post on the subject covered proposed genetic associations and corresponding research. The second post on the matter discussed observed differences in brain anatomy between transgendered and cisgendered individuals. Here I…

## the science of gender identity (part 2: brain anatomy)

This is the second post in a multi-part series surveying the current science of gender identity, particularly with regard to the transgendered population. In my previous post I discussed the proposed genetic associations and corresponding research. A future post, if I can find sufficient data, will address neuropsychology research related to the transgender experience. Here…

## the science of gender identity (part 1: genetics)

This is the first in a multi-part series surveying the current science of gender identity, particularly with regard to the transgendered population. I intend to discuss the genetic, brain anatomic, and neuropsychological findings of recent studies on the matter. As always, I will incorporate my own statistical analysis of raw study data wherever possible. Here…

## HRC Corporate Equality Index correlates with Fortune’s 50 most admired companies

The Human Rights Campaign, one of America’s largest civil rights groups, scores companies in its yearly Corporate Equality Index (CEI) according to their treatment of lesbian, gay, bisexual, and transgender employees [1]. The companies automatically evaluated are the Fortune 1000 and American Lawyer’s top 200. Additionally, any sufficiently large private sector organization can request inclusion…

## clustering stocks by price correlation (part 2)

In my last post, “clustering stocks by price correlation (part 1)”, I performed hierarchical clustering of NYSE stocks by correlation in weekly closing price. I expected the stocks to cluster by industry, and found that they did not. I proposed several explanations for this observation, including that perhaps I chose a poor distance metric for…

## clustering stocks by price correlation (part 1)

I’ve been building my knowledge of clustering techniques to apply to genetic circuit engineering, and decided to try the same tools for stock price analysis. In this post I describe building a hierarchical cluster of stocks by pairwise correlation in weekly price, to see how well the stocks cluster by industry, and compare the derived…

## simulating RNA-seq read counts

The Challenge I want to explore the statistics of RNA sequencing (RNA-seq) on next-generation sequencing (NGS) platforms in greater detail, so I thought I’d start by simulating read counts to experiment with. This post details how I constructed a simulated set of read counts, examines its concordance with the expected negative binomial distribution of the…

## reporting negative results

Two of my recent posts have reported negative results, meaning that no meaningful effects were found during the investigations. Had these investigations been framed as hypothesis tests, we would have failed to reject the null hypotheses. Sounds boring. However, there are good reasons to report these results. The first is that negative results still generate…

## fuzzy logic toolkit in C++

I recently came across some old C++ code I wrote about 10 years ago to assist fuzzy logic reasoning. This program is now posted on GitHub at https://github.com/badassdatascience/fuzzy-logic-toolkit and is described below. An example of the tool in action (with code) follows the description. Fuzzy Logic Suppose we have a numerical value for distance to…

## Apache Spark and stock price causality

The Challenge I wanted to compute Granger causality (described below) for each pair of stocks listed in the New York Stock Exchange. Moreover, I wanted to analyze between one and thirty lags for each pair’s comparison. Needless to say, this requires massive computing power. I used Amazon EC2 as the computing platform, but needed a…

## hacking the stock market (part 1)

Caveat: I am not a technical investor–just a hobbyist, so take this analysis with a grain of salt. I am also just beginning with my Master’s work in statistics. I wanted to examine the correlation between changes in the daily closing price of the Dow Jones Industrial Average (DJIA) and lags of those changes, to…

## Kaplan-Meier estimator in Python

The following Python class computes and draws Kaplan-Meier product limit estimators for given data. An example of how to use the class follows the code. Code Usage Example
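The class itself is not reproduced in this excerpt. As a stand-in, here is a minimal sketch of the product-limit calculation only. It is not the original class: the function name and the sample data are hypothetical, and plotting is left out.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: time to event or censoring for each subject
    observed:  True if the event occurred, False if the subject was censored
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Events (not censorings) occurring exactly at time t.
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical data: six subjects, two of them censored.
durations = [1, 2, 2, 3, 4, 5]
observed = [True, True, False, True, False, True]
curve = kaplan_meier(durations, observed)

for t, s in curve:
    print(t, round(s, 4))
```

Censored subjects count toward the at-risk set until their censoring time but never trigger a drop in the curve.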

## data natives

We hear a lot of marketing yammer about “digital natives”, that is, folks fluent in social media and in particular marketing using social media. Writers who use this term often juxtapose such digital natives against “analog natives”, i.e., individuals who matured or were educated before online social media became such a significant part of our…

## GNU Octave: a free, open source MATLAB-like language for numerical computing

I tend to use Python with the Numpy, SciPy, and Matplotlib stack whenever I have to do scientific computing. For statistical computing I use R whenever this Python stack does not provide the necessary features. However, I want to draw readers’ attention to another tool for free, open-source numerical computing: GNU Octave (hereafter called “Octave”),…

## maximized entropy of a finite distribution

I received the following tweet yesterday from @ProbFact and decided to check it out in more detail: Two-Dimensional Case I generated the following test to investigate the claim: Create four-category discrete distributions where two of the categories have 0.25 probability each, and the third category probability varies between 0.1 and 0.4. The fourth category’s…
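A sketch of the test described above, assuming Shannon entropy in bits (the log base is my choice; the excerpt does not specify it) and a 0.01 step for the third category. The fourth category takes whatever probability remains, since the four must sum to one.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability categories contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

results = []
for i in range(10, 41):           # third category from 0.10 to 0.40
    third = i / 100
    fourth = 1.0 - 0.25 - 0.25 - third
    results.append((third, entropy([0.25, 0.25, third, fourth])))

best_third, best_entropy = max(results, key=lambda r: r[1])
print(best_third, best_entropy)
```

Over this sweep the entropy peaks where all four categories are equal.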

## selecting travel trailers by regression

Data Scientist has been thinking recently of moving into a used travel trailer. However, the weight of the trailer to be purchased is limited by that which our hero’s truck can pull. But most online used travel trailer listings only specify length of the vehicle, not its weight. So Data Scientist needed a quick way…

## sorting out R’s distribution functions

R’s distribution functions come in four flavors: “d”, “p”, “q”, and “r” (e.g., “dnorm”, “pnorm”, “qnorm”, and “rnorm”). I regularly get them mixed up, so I am writing down what they do here for future reference. Density “d” produces the density curve as a function of a random variable. For example, “dnorm” produces: Cumulative Distribution Function…
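For readers who think in Python, the four flavors have rough counterparts in the standard library’s `statistics.NormalDist`. This is a side-by-side sketch for the normal distribution only, not a replacement for the R functions; the variable names are mine.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

density = std_normal.pdf(0.0)            # "d": like dnorm(0)
cumulative = std_normal.cdf(1.96)        # "p": like pnorm(1.96)
quantile = std_normal.inv_cdf(0.975)     # "q": like qnorm(0.975)
draws = std_normal.samples(5, seed=42)   # "r": like rnorm(5)

print(density, cumulative, quantile)
print(draws)
```

Note that “p” and “q” are inverses of each other: the quantile at 0.975 lands back near 1.96.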