using Hadoop to examine county-level industrial diversity

In a previous post, I computed each U.S. county’s industrial diversity from the 2009 County Business Patterns data published by the U.S. Census Bureau. The diversity calculation made use of Shannon’s information entropy equation, which is similarly used by ecologists to calculate species diversity for a region.

Here I perform the same calculation using Hadoop, since the calculation method fits nicely into the MapReduce framework. The computation consists of a single map procedure followed by a single reduce procedure.

The map operation (below) first parses the source data to collect the number of business establishments in each county for each NAICS code. The header row is skipped, as are rows whose NAICS code contains "-" or "/". The state-county FIPS code is saved as the key, while the number of establishments is saved as the value. The result is multiple key/value pairs per county, one pair for each remaining NAICS code present in that county. Note that the NAICS codes themselves are not preserved.

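// Imports needed by the enclosing IndustryDiversity class (old org.apache.hadoop.mapred API).
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public static class MapToEstablishments extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, DoubleWritable> {

    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
        // Strip the quotes around the CSV fields and any surrounding whitespace.
        String line = value.toString().replace("\"", "").trim();

        // Skip the header row.
        if (line.contains("fipstate")) { return; }

        String[] line_split = line.split(",");
        String fipstate = line_split[0];
        String fipcounty = line_split[1];
        String naics = line_split[2];

        // Skip rows whose NAICS code contains "-" or "/".
        if (naics.contains("-")) { return; }
        if (naics.contains("/")) { return; }

        int total_establishments = Integer.parseInt(line_split[10]);

        // Emit (state_county FIPS, establishment count).
        word.set(fipstate + "_" + fipcounty);
        output.collect(word, new DoubleWritable(total_establishments));
    }
}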
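
The parsing rules in the mapper can be tried out without a Hadoop cluster. The following is a minimal sketch in plain Java, assuming the same column positions as the mapper (0, 1, 2 and 10). The class name `CbpLineParser`, the extra guards against short or non-numeric rows, and the sample rows in `main` are invented for illustration and are not part of the original job or real CBP records.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper that mirrors the mapper's parsing rules outside Hadoop.
final class CbpLineParser {

    // Returns (state_county FIPS, establishment count), or empty if the row is skipped.
    static Optional<Map.Entry<String, Integer>> parse(String rawLine) {
        String line = rawLine.replace("\"", "").trim();

        // Skip the header row.
        if (line.contains("fipstate")) { return Optional.empty(); }

        String[] fields = line.split(",");
        // Extra guard (an assumption, not in the original mapper): skip short rows.
        if (fields.length <= 10) { return Optional.empty(); }

        String naics = fields[2];
        // Skip rows whose NAICS code contains "-" or "/".
        if (naics.contains("-") || naics.contains("/")) { return Optional.empty(); }

        int establishments;
        try {
            establishments = Integer.parseInt(fields[10].trim());
        } catch (NumberFormatException e) {
            // Extra guard (an assumption): skip rows with a non-numeric count.
            return Optional.empty();
        }
        return Optional.of(new SimpleEntry<>(fields[0] + "_" + fields[1], establishments));
    }

    public static void main(String[] args) {
        // Made-up rows; the "x" fields stand in for columns the mapper ignores.
        String[] rows = {
            "\"fipstate\",\"fipscty\",\"naics\",x,x,x,x,x,x,x,\"est\"",
            "\"01\",\"001\",\"11----\",x,x,x,x,x,x,x,9",
            "\"01\",\"001\",\"113310\",x,x,x,x,x,x,x,7"
        };
        for (String row : rows) {
            System.out.println(parse(row));
        }
    }
}
```

Only the last sample row yields a key/value pair; the header and the row with a dash-padded NAICS code are dropped, which is the same filtering the mapper applies.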

The reduce computation (below) first makes a list of all establishment counts for the given key (state-county FIPS). It then computes each count’s proportion of the county’s total establishment count. Finally, Shannon’s information entropy equation is applied to the list of proportions to compute the Shannon entropy, in bits, for that state-county FIPS key.
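
For reference, the two steps can be written out as follows. The symbols are my own notation rather than anything from the original post: n_i is the establishment count for the i-th NAICS code kept for a county, k is the number of such codes, and p_i is that code's share of the county total. The base-2 logarithm matches the division by log(2) in the reduce code.

```latex
p_i = \frac{n_i}{\sum_{j=1}^{k} n_j},
\qquad
H = -\sum_{i=1}^{k} p_i \log_2 p_i
```

H is 0 when a single industry accounts for every establishment, and it reaches its maximum of \log_2 k when all k industries have equal shares.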

public static class ReduceToFindIndustrialDiversity extends MapReduceBase
        implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    public void reduce(Text key, Iterator<DoubleWritable> values,
                       OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {

        // Collect every establishment count emitted for this county.
        ArrayList<Double> establishment_counts = new ArrayList<Double>();
        while (values.hasNext()) { establishment_counts.add(values.next().get()); }

        // Total number of establishments in the county.
        double sum = 0.0;
        for (double i : establishment_counts) { sum += i; }

        // Each count's share of the county total.
        ArrayList<Double> probabilities = new ArrayList<Double>();
        for (double i : establishment_counts) { probabilities.add(i / sum); }

        // Shannon entropy in bits; zero shares are skipped because 0 * log(0) evaluates to NaN.
        double H = 0.0;
        for (double p : probabilities) {
            if (p > 0.0) { H -= p * Math.log(p) / Math.log(2.0); }
        }

        output.collect(key, new DoubleWritable(H));
    }
}
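
To sanity-check the entropy arithmetic on its own, here is a small standalone sketch in plain Java. The class and method names (`ShannonEntropySketch`, `entropyBits`) and the counts in `main` are hypothetical and chosen only so the answers are easy to verify by hand. It folds the two passes of the reducer into one loop, which is a simplification on my part and not the original code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone version of the reducer's entropy calculation.
final class ShannonEntropySketch {

    // Shannon entropy, in bits, of the shares implied by a list of counts.
    static double entropyBits(List<Double> counts) {
        double total = 0.0;
        for (double c : counts) { total += c; }
        if (total <= 0.0) { return 0.0; } // no establishments, so no diversity to measure

        double h = 0.0;
        for (double c : counts) {
            if (c <= 0.0) { continue; } // a zero share contributes nothing
            double p = c / total;
            h -= p * Math.log(p) / Math.log(2.0);
        }
        return h;
    }

    public static void main(String[] args) {
        // Four industries with equal counts: entropy should be about log2(4) = 2 bits.
        System.out.println(entropyBits(Arrays.asList(5.0, 5.0, 5.0, 5.0)));
        // A single industry: entropy should be 0 bits.
        System.out.println(entropyBits(Arrays.asList(12.0)));
        // Uneven shares of 1/2, 1/4, 1/4: entropy should be about 1.5 bits.
        System.out.println(entropyBits(Arrays.asList(2.0, 1.0, 1.0)));
    }
}
```

The three cases bracket the behavior one would expect from a diversity measure: a county dominated by one industry scores 0, and the score rises as establishments spread more evenly across more industries.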

The following code in the Java class’s “main” function drives the computation described above:

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IndustryDiversity.class);
    conf.setJobName("industry_diversity");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(DoubleWritable.class);

    conf.setMapperClass(MapToEstablishments.class);
    conf.setReducerClass(ReduceToFindIndustrialDiversity.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}

The result is a list of state-county FIPS keys, each paired with its Shannon entropy value. With TextOutputFormat, every output line holds the key and the value separated by a tab.
