Consider the seven major currency pairs, sampled hourly over the last six months. We calculate the pairwise Pearson correlation coefficients to determine the degree to which each pair “moves” together: values near one or negative one indicate strong correlation, values with lower absolute value less so. Positive values indicate movement in the same direction; negative…

# Category: econometrics

## machine learning in FOREX (part one: establishing a performance baseline)

Introduction We’ve been applying machine learning to FOREX price prediction. The performance of our models varies widely, so to establish a baseline we created a simple linear regression model against which we can compare the performance of more sophisticated models. What We Are Trying To Do Given a time-series of 26 four-hour price samples, we…

## Emily’s laws of system complexity

First Law For every reduction, there is a greater and opposite clusterfuck. Second Law The first law is a reductionist statement.

## autocorrelation in FOREX

To inform the construction of a machine learning-based price prediction algorithm, we want to understand how many lags prove statistically significant with regard to autocorrelation in the seven major FOREX pairs. So we first choose 10,000 random time points between January 1, 2000 and January 1, 2017 for each of the seven pairs. Then we…
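
As a rough illustration of what "statistically significant lags" can mean, here is a minimal sketch that computes the sample autocorrelation at a given lag and compares it against the common approximate 95% bound of 1.96/sqrt(n). The function names and the toy series are hypothetical, and this bound is an assumption on our part; the excerpt does not say which test the post actually uses.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at `lag` (series must not be constant)."""
    n = len(series)
    mean = sum(series) / n
    # Denominator: total sum of squared deviations from the mean.
    denom = sum((x - mean) ** 2 for x in series)
    # Numerator: sum of products of deviations separated by `lag` steps.
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


def significance_bound(n, z=1.96):
    """Approximate two-sided 95% bound for white noise (an assumed rule of thumb)."""
    return z / math.sqrt(n)


# Toy data, not FOREX prices: a perfectly alternating series.
toy_series = [1.0, -1.0] * 50
bound = significance_bound(len(toy_series))
for lag in (1, 2, 3):
    r = autocorrelation(toy_series, lag)
    print(lag, round(r, 3), abs(r) > bound)
```

A lag whose autocorrelation falls outside the bound would be flagged as significant under this rule of thumb.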

## summary of our FOREX experiments and next steps

We started by building a support vector machine model based on features used in harmonic trading, with the idea that ideal “harmonic” ratios can be learned rather than explicitly specified. This worked on testing sets but not when we started trading with it. We abandoned the model before we realized that we need to manage…

## applying market basket analysis to the stock market

I’ve started learning market basket analysis and decided to test drive my knowledge against the stock market: I own a (proprietary) database of predicted stock causality relationships. An export to tabular form looks something like this: I won’t tell you what the “causality” is, as that is the proprietary part, and the example data shown…

## pseudo-harmonic FOREX prediction with machine learning (part one)

“Harmonic” trading methods seek patterns in the relationships between neighboring peaks and valleys in the time series. Particularly, harmonic traders seek pre-specified ratios in the price differences among a series of peaks and valleys. For example, a trader might observe the following pattern: Let A, B, C, D, and E be the points in the…

## picking stocks by graph database (part 2: machine learning)

In our last post, we demonstrated a graph database created to enable study of the stock market, particularly the study of causality relationships. So how to proceed from there? At this stage we want to pick winning stocks, not write an academic paper, so our focus turns toward practical machine learning. Source Data We start…

## picking stocks by graph database (part one)

Historical stock price data comes readily available at daily resolution. So we calculated the Granger causality for each pair of stocks we hold data for, at one and two day lags (testing the question “does daily percent change in volume for stock X Granger cause daily percent change in adjusted close price for stock Y?”)….

## church to bar ratio, by U.S. county (3rd edition)

Church to bar ratio by county from U.S. Census Bureau data: The brighter the color, the higher the church to bar ratio. Counties missing data necessary for the computation are shown in black. Method From the 2013 County Business Patterns data published at http://www.census.gov/econ/cbp/download/, I extracted the number of establishments in each county that have…

## HRC Corporate Equality Index correlates with Fortune’s 50 most admired companies

The Human Rights Campaign, one of America’s largest civil rights groups, scores companies in its yearly Corporate Equality Index (CEI) according to their treatment of lesbian, gay, bisexual, and transgender employees [1]. The companies automatically evaluated are the Fortune 1000 and American Lawyer’s top 200. Additionally, any sufficiently large private sector organization can request inclusion…

## clustering stocks by price correlation (part 2)

In my last post, “clustering stocks by price correlation (part 1)”, I performed hierarchical clustering of NYSE stocks by correlation in weekly closing price. I expected the stocks to cluster by industry, and found that they did not. I proposed several explanations for this observation, including that perhaps I chose a poor distance metric for…

## clustering stocks by price correlation (part 1)

I’ve been building my knowledge of clustering techniques to apply to genetic circuit engineering, and decided to try the same tools for stock price analysis. In this post I describe building a hierarchical cluster of stocks by pairwise correlation in weekly price, to see how well the stocks cluster by industry, and compare the derived…

## net change of zero between closing and opening stock prices

I decided to investigate the variation between trading days’ closing prices and the following trading days’ opening prices for stocks listed on the New York Stock Exchange. I started with data in the following format for all trading days between January 2nd 2000 and October 30th 2014: I then calculated the percent change between one…

## Apache Spark and stock price causality

The Challenge I wanted to compute Granger causality (described below) for each pair of stocks listed in the New York Stock Exchange. Moreover, I wanted to analyze between one and thirty lags for each pair’s comparison. Needless to say, this requires massive computing power. I used Amazon EC2 as the computing platform, but needed a…

## hacking the stock market (part 1)

Caveat: I am not a technical investor, just a hobbyist, so take this analysis with a grain of salt. I am also just beginning my Master’s work in statistics. I wanted to examine the correlation between changes in the daily closing price of the Dow Jones Industrial Average (DJIA) and lags of those changes, to…

## industrial diversity correlates with population

It seems logical that U.S. counties having greater populations would support more diverse industry than counties having lesser population. Perhaps this has been proven already, but I recently stumbled upon my own verification of the idea: The above plot shows industry diversity (expressed in the form of Shannon entropy, discussed below) as a function of…

## test driving Amazon Web Services’ Elastic MapReduce

Hadoop provides software infrastructure for running MapReduce tasks, but taking full advantage of it requires substantial setup time and access to a compute cluster. Amazon’s Elastic MapReduce (EMR) solves these problems, delivering pre-configured Hadoop virtual machines running on the cloud for only the time they are required, and billing only for the computation minutes…

## the first Big Data recession

The “Great Recession” of 2007-2009 may be the first “Big Data” recession, i.e., the first recession which we can examine using the vast information delivered by the advent of Big Data. Certainly the next recession will be studied through that lens. To test whether new data is available that can be cast in an economic…

## church to bar ratio, by U.S. county

Church to bar ratio by county from U.S. Census Bureau data: The brighter the color, the higher the church to bar ratio. Counties missing data necessary for the computation are shown in black. Method From the 2011 County Business Patterns data published at http://www.census.gov/econ/cbp/download/index.htm, I extracted the number of establishments in each county that have…

## the future orientation index

I’ve recently discovered Google Trends and have been looking for an opportunity to use it. Today I found such an opportunity in a paper [1] published last April that computes countries’ “future orientation index” from Google Trends data and correlates it with national per-capita GDP. The authors report correlation for 2010; my experiment with Google Trends…

## using Hadoop to examine county-level industrial diversity

In a previous post, I computed each U.S. county’s industrial diversity from the 2009 County Business Patterns data published by the U.S. Census Bureau. The diversity calculation made use of Shannon’s information entropy equation, which is similarly used by ecologists to calculate species diversity for a region. Here I perform the same calculation using Hadoop,…

## industrial diversity vs percent change in unemployment rate

This analysis may exceed the bounds of my statistics knowledge, but I will deliver it anyway in the name of “process” blogging. I welcome experienced critique of the method! Result A modest positive correlation exists between a county’s industrial diversity and its percent change in unemployment rate over the period 2007-2010. Method Several months ago…

## British crime vs. internet use

Jordan Cashmore, a student at Nottingham Trent University, recently asked me for help determining if correlation and causality exist between the rise of internet use over the last 15 years and the drop in British crime over the same period. Jordan, for his dissertation, proposes that correlation and causality do exist based on criminology theory,…

## industry diversity (via Shannon entropy) per US county

Ecologists use Shannon entropy to measure species diversity in a given region. Here I apply the same equation to determine industry diversity in each US county. In the map below, darker color indicates greater industrial diversity: Method Downloaded the 2009 County Business Patterns data from the US Census Bureau and extracted the business establishment counts…
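
For readers unfamiliar with the equation, the sketch below computes Shannon entropy, H = -Σ p_i ln(p_i), from a list of per-industry establishment counts. The function name and the two example counties are made up for illustration; the choice of natural log and the way counts are grouped by industry are assumptions, not details taken from the post.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in nats) of a list of non-negative category counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:  # categories with zero count contribute nothing
            p = count / total
            entropy -= p * math.log(p)
    return entropy


# Hypothetical establishment counts per industry for two invented counties.
diverse_county = [25, 25, 25, 25]
concentrated_county = [97, 1, 1, 1]

print(shannon_entropy(diverse_county))       # evenly spread: higher entropy
print(shannon_entropy(concentrated_county))  # one dominant industry: lower entropy
```

Entropy is highest when establishments are spread evenly across industries and falls toward zero as one industry dominates.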

## simulated ROC curves

How receiver operating characteristic (ROC) curves vary with simulated data having stepped degrees of separation: Computational Notes These were created in R using the “ROCR” package. Be sure to say “ROCR” really fast! The simulated data are normally distributed within each group.

## system dynamics model of the Oregon Health Plan’s client caseload

Developed this model and wrote this description in 2007 as an analyst for the State of Oregon. We ultimately never used or published this model; I’m posting it here in hopes that someone will find it useful when a Google search delivers it. Introduction The State of Oregon offers medical assistance to low-income individuals…

## spinach superpowers and Granger causality

While traversing the darker residuals of the blogosphere, Data Scientist happens upon a blogger in distress. Our hero quickly swallows a can of Red Bull-infused spinach and springs to action: The Popeye Challenge Dr. Mike Sutton of Dysology.org requested assistance demonstrating (or debunking) the proposed causal link between high spinach production and the popularity of Popeye…

## 21504 to 1 odds the sun will rise tomorrow: an illustration of Bayesian reasoning

The following preposterous case illustrates the Bayesian worldview: Prior estimate If you ask a mathematically-gifted newborn for the probability that the sun will rise tomorrow, they might reply: “The probability that the sun will rise tomorrow follows a beta distribution with parameters a = b = 2.” Since the mean of the above distribution is…
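
To make the beta prior concrete, here is a minimal sketch of the standard conjugate update: a Beta(a, b) prior combined with observed successes and failures gives a Beta(a + successes, b + failures) posterior. The function names and the count of ten observed sunrises are hypothetical; this is the textbook beta-binomial update, and we are assuming rather than asserting that the post proceeds the same way.

```python
def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def update_beta(a, b, successes, failures):
    """Conjugate update of a Beta(a, b) prior with binary observations."""
    return a + successes, b + failures


# Prior from the excerpt: a = b = 2.
prior = (2, 2)
print(beta_mean(*prior))

# Hypothetical evidence: ten sunrises observed, none missed.
posterior = update_beta(*prior, successes=10, failures=0)
print(posterior, beta_mean(*posterior))
```

Each observed sunrise shifts the posterior mean closer to one without ever reaching it.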

## the lunar cycle: not a partner in crime

Emily Williams and Stacie Dutton, SETEC Astronomy, San Francisco, California, USA Despite abundant scientific evidence refuting the connection, the “lunar effect” persists as a common explanation for temporal variation in human behavior. Adherents of this idea implicate the lunar cycle in outcomes as diverse as lost elections and hemophilic episodes. We find the myth woven…

## cluster analysis of marketing survey ranking questions

Recently I’ve become extremely interested in survey analysis and, more broadly, the social consequences of survey-based decision making. So when a friend asked for help extracting business intelligence from a market research survey they conducted, I jumped at the opportunity to test out some ideas. The analysis presented below details a use of hierarchical clustering…

## Stephen Colbert teaches proper data normalization

Colbert Report devotees recently witnessed a true miracle—Stephen Colbert spoke data science: Due to the massive volume [of suggestions received], we … used computers to crunch the data. Mr. Colbert had just received approximately 53,000 e-mailed suggestions from his minions proposing social issues for the newly formed Colbert Super PAC to address. Financial contributions to…

## Austin heat wave office wager

Data scientist enters the office betting pool: Whoever most accurately predicts the day Austin’s heat wave breaks (first day with a high temperature less than 100 degrees Fahrenheit) wins. Seeking a defendable approach, our hero generates an ARIMA time-series forecast based on the last eleven years of daily high temperatures: The model suggests September 1st…

## church to bar ratio in the lower 48, by county

Calculated the church to bar ratio by county from US Census Bureau data: The color partitions were derived from the log-transformed ratio distribution to facilitate visual clarity. Method From the 2009 County Business Patterns data published at http://www.census.gov/econ/cbp/download/index.htm, extracted the number of establishments in each county that have NAICS codes 813110 (places of worship including…

## data scientist walks into a bar…

A data scientist walks into a bar and observes a large crowd cheering intermittently. The crowd’s eyes track events on a large TV screen, following a bouncing spherical projectile’s motion as fit actors throw it through one of two metal rings. One of these rings elicits cheers from the crowd as the object passes through,…