It seems logical that U.S. counties with larger populations would support more diverse industry than counties with smaller populations. Perhaps this has been shown already, but I recently stumbled upon my own verification of the idea:
The above plot shows industry diversity (expressed as Shannon entropy, discussed below) as a function of county population. As county population goes up, so does industry diversity, at least until a saturation level is reached for counties with very high populations. This result agrees with our intuition: higher-population centers produce more economic activity than lower-population centers, presumably for a wider array of markets and therefore with a wider array of industries.
Ecologists use the Shannon entropy equation to measure species diversity in a given region. Here I apply the same equation to determine industry diversity in each U.S. county.
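For reference, the standard form of the Shannon entropy, written with the H(X) and p(xi) notation used in this post, is given below. The base of the logarithm is not stated in this post, so it is left here as a parameter b (an assumption on my part); base 2 (bits) and base e (nats) are the usual choices. The symbol n, the number of distinct industries present in the county, is also introduced only for this illustration.

```latex
H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)
```

H(X) is zero when every business in a county belongs to a single industry, and it reaches its maximum, log_b(n), when all n industries are equally common.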
We calculate H(X) for each U.S. county. In the equation, p(xi) is the probability that a business picked at random in the county belongs to industry xi. It is estimated from the business establishment counts keyed to NAICS industry codes in the 2011 County Business Patterns dataset. (Here X is the set of all industries in the county.) Source code used to compute H(X) for each county is described at http://badassdatascience.com/2014/01/07/test-driving-amazon-web-services-elastic-mapreduce/.
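The per-county calculation can be sketched in a few lines of Python. This is a minimal illustration, not the MapReduce code linked above: the function name, the base-2 logarithm, and the two made-up counties with their establishment counts are all assumptions chosen for the example.

```python
import math


def shannon_entropy(counts, base=2.0):
    """Shannon entropy of a distribution given as raw counts per category.

    `counts` holds the number of establishments in each industry.
    The base-2 default is an assumption; the post does not state a base.
    """
    counts = list(counts)
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one establishment")
    entropy = 0.0
    for n in counts:
        if n > 0:  # an industry with zero establishments contributes nothing
            p = n / total  # estimated probability of drawing this industry
            entropy -= p * math.log(p, base)
    return entropy


# Hypothetical establishment counts keyed by NAICS code for two made-up counties.
uniform_county = {"11": 25, "23": 25, "44": 25, "72": 25}
skewed_county = {"11": 97, "23": 1, "44": 1, "72": 1}

print(shannon_entropy(uniform_county.values()))
print(shannon_entropy(skewed_county.values()))
```

Both hypothetical counties have 100 establishments across the same four industries, but the evenly spread county scores log2(4) = 2 bits while the county dominated by a single industry scores much lower, which is the sense in which entropy measures diversity.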
Matching with Population
The Shannon entropy results computed above were matched to 2010 Census population counts for each county using state and county FIPS codes. The population data comes from http://www.census.gov/popest/data/counties/totals/2012/CO-EST2012-alldata.html.
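The matching step described above amounts to a join on a combined state-plus-county key. Below is a minimal sketch of that idea in plain Python. The function names, the tuple layout, and all values are hypothetical; the actual column names in the Census and County Business Patterns files are not assumed here. Zero-padding matters because FIPS codes often lose their leading zeros when read as integers.

```python
def fips_key(state, county):
    """Build a 5-character key: 2-digit state FIPS + 3-digit county FIPS."""
    return f"{int(state):02d}{int(county):03d}"


def match_entropy_to_population(entropy_rows, population_rows):
    """Inner-join (state, county, entropy) rows with (state, county, population) rows.

    Returns a list of (fips, population, entropy) tuples in the order of
    `entropy_rows`; counties missing from the population table are dropped.
    """
    population_by_fips = {
        fips_key(state, county): population
        for state, county, population in population_rows
    }
    matched = []
    for state, county, entropy in entropy_rows:
        key = fips_key(state, county)
        if key in population_by_fips:
            matched.append((key, population_by_fips[key], entropy))
    return matched


# Made-up rows: codes appear with and without leading zeros on purpose.
entropy_rows = [("01", "001", 3.2), ("6", "37", 5.1), ("99", "999", 1.0)]
population_rows = [(1, 1, 50000), ("06", "037", 9000000)]

for row in match_entropy_to_population(entropy_rows, population_rows):
    print(row)
```

An inner join is used here so that only counties present in both tables reach the regression; whether the original analysis dropped or kept unmatched counties is not stated.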
We confirm the correlation with a linear model of Shannon entropy versus the base-10 logarithm of population, where each data point used in the regression corresponds to a U.S. county:
The model shows that log10(population) explains 85% of the variance in Shannon entropy.
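The "variance explained" figure is the R-squared of the fit. As a rough sketch of how such a number is obtained, the snippet below fits an ordinary least-squares line of entropy against log10(population) and reports R-squared. It is not the original analysis: the function name is hypothetical and the (population, entropy) pairs are invented purely to make the example runnable, so its output says nothing about real counties.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variance in y accounted for by the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Invented (population, entropy) pairs, one per hypothetical county.
counties = [(900, 2.1), (12000, 3.4), (85000, 4.0), (640000, 4.9), (3100000, 5.2)]

log_population = [math.log10(population) for population, _ in counties]
entropy = [h for _, h in counties]

slope, intercept, r_squared = fit_line(log_population, entropy)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.3f}")
```

Regressing on log10(population) rather than raw population reflects the shape described earlier: entropy rises quickly across small counties and flattens for very large ones, which is closer to a straight line on a logarithmic axis.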