industrial diversity correlates with population

It seems logical that U.S. counties with larger populations would support more diverse industry than counties with smaller populations. Perhaps this has been proven already, but I recently stumbled upon my own verification of the idea: The above plot shows industry diversity (expressed in the form of Shannon entropy, discussed below) as a function of […]

test driving Amazon Web Services’ Elastic MapReduce

Hadoop provides software infrastructure for running MapReduce tasks, but taking full advantage of it requires substantial setup time and access to a compute cluster. Amazon’s Elastic MapReduce (EMR) solves these problems, delivering pre-configured Hadoop virtual machines running on the cloud for only the time they are required, and billing only for the computation minutes […]

the first Big Data recession

The “Great Recession” of 2007-2009 may be the first “Big Data” recession, i.e., the first recession that we can examine using the vast information delivered by the advent of Big Data. Certainly the next recession will be studied through that lens. To test whether new data is available that can be cast in an economic […]

church to bar ratio, by U.S. county

Church to bar ratio by county from U.S. Census Bureau data: The brighter the color, the higher the church to bar ratio. Counties missing data necessary for the computation are shown in black. Method From the 2011 County Business Patterns data published at, I extracted the number of establishments in each county that have […]

the future orientation index

I’ve recently discovered Google Trends and have been looking for an opportunity to use it. Today I found such an opportunity in a paper [1] published last April that computes countries’ “future orientation index” from Google Trends data and correlates it with national per-capita GDP. The authors report correlation for 2010; my experiment with Google Trends […]

using Hadoop to examine county-level industrial diversity

In a previous post, I computed each U.S. county’s industrial diversity from the 2009 County Business Patterns data published by the U.S. Census Bureau. The diversity calculation made use of Shannon’s information entropy equation, which is similarly used by ecologists to calculate species diversity for a region. Here I perform the same calculation using Hadoop, […]

industrial diversity vs percent change in unemployment rate

This analysis may exceed the bounds of my statistics knowledge, but I will deliver it anyway in the name of “process” blogging. I welcome experienced critique of the method! Result A modest positive correlation exists between a county’s industrial diversity and its percent change in unemployment rate over the period 2007-2010. Method Several months ago […]

British crime vs. internet use

Jordan Cashmore, a student at Nottingham Trent University, recently asked me for help determining if correlation and causality exist between the rise of internet use over the last 15 years and the drop in British crime over the same period. Jordan, for his dissertation, proposes that correlation and causality do exist based on criminology theory, […]

industry diversity (via Shannon entropy) per US county

Ecologists use Shannon entropy to measure species diversity in a given region. Here I apply the same equation to determine industry diversity in each US county. In the map below, darker color indicates greater industrial diversity: Method Downloaded the 2009 County Business Patterns data from the US Census Bureau and extracted the business establishment counts […]
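
For readers who want to see the arithmetic, here is a minimal sketch of the Shannon entropy calculation, H = -Σ p_i ln(p_i), where p_i is the share of a county's establishments that fall in industry i. This is not the code from the post: the function name, the use of the natural logarithm, and the two example counties are assumptions made up for illustration.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (natural log) of a list of establishment counts.

    Each entry is the number of establishments in one industry.
    Industries with zero establishments contribute nothing to the sum.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:
            share = count / total
            entropy -= share * math.log(share)
    return entropy


# Hypothetical counties, each with 100 establishments across 4 industries.
even = shannon_entropy([25, 25, 25, 25])    # evenly spread
skewed = shannon_entropy([97, 1, 1, 1])     # dominated by one industry
print(f"even: {even:.3f}, skewed: {skewed:.3f}")
```

An evenly spread county reaches the maximum possible entropy for its number of industries (ln 4 in this toy case), while a county dominated by a single industry scores close to zero.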

simulated ROC curves

How receiver operating characteristic (ROC) curves vary with simulated data having stepped degrees of separation: Computational Notes These were created in R using the “ROCR” package. Be sure to say “ROCR” really fast! The simulated data are normally distributed within each group.
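
The plots in the post were made in R with ROCR. As a rough stand-in, the same idea can be sketched in plain Python: draw two normally distributed groups, shift one by a chosen separation, and trace the ROC curve by sweeping a threshold. The separation steps, sample size, unit variance, seed, and function names below are all assumptions for illustration, not values from the post.

```python
import random


def simulate_scores(separation, n=500, seed=0):
    """Draw n scores per group; the positive group's mean is shifted by `separation`."""
    rng = random.Random(seed)
    negatives = [rng.gauss(0.0, 1.0) for _ in range(n)]
    positives = [rng.gauss(separation, 1.0) for _ in range(n)]
    return negatives, positives


def roc_points(negatives, positives):
    """Return (false positive rate, true positive rate) pairs, threshold swept high to low."""
    labeled = [(s, 0) for s in negatives] + [(s, 1) for s in positives]
    labeled.sort(reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in labeled:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / len(negatives), tp / len(positives)))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Assumed separation steps, in units of the groups' standard deviation.
aucs = {sep: auc(roc_points(*simulate_scores(sep)))
        for sep in (0.0, 0.5, 1.0, 2.0, 3.0)}
for sep, area in aucs.items():
    print(f"separation {sep}: AUC = {area:.3f}")
```

With no separation the curve hugs the diagonal and the area is near 0.5; as the groups move apart the curve bows toward the upper-left corner and the area approaches 1.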