proportional-integral (PI) controller in Vensim

In my last post, I discussed an attempt at designing a PID controller using the Kepler scientific workflow system. Here I report on a similar (but successful) development of a proportional-integral (PI) controller in Vensim PLE. Vensim is a software package for describing and simulating dynamic models, particularly those involving feedback. I’ve often described it as a tool for “point and click differential equations,” since it lets users draw, with a mouse, models that would otherwise have to be specified as a set of differential equations.

A PI controller is a feedback control configuration in which the difference (error) between a system’s response and a desired set point is fed both to a proportional multiplier and to an integrator. The proportional and integral values are each multiplied by a constant, called a “gain”, and the two results are summed. The sum is then applied to the system’s input. As the system response approaches the set point, the computed input value approaches a constant.

Most real-world controllers of this kind also have a “D” (derivative) component, and are called “PID” controllers. I was not able to make Vensim PLE calculate derivatives, so I left that feature out.

The basic system we are controlling looks like:


We want to control the value of the reservoir by adjusting the inward flow. The outward flow is not controlled, and it equals a positive constant multiplied by the reservoir’s value. Because the outward flow is greater than zero whenever the reservoir value is greater than zero, we always have to replenish the reservoir using the inward flow if we want its value to stay constant. For this exercise we want to control the inward flow so that the reservoir’s value equals ten. The reservoir’s initial value is zero for this experiment.

We start with a simple proportional controller:


Here we have a set point equal to ten. The inward flow is modulated by a constant (Kp) multiplied by the error (set point minus the reservoir value). Simulating the system yields:


We can see from these simulation results that the system is unable to track the set point of ten. It does, however, settle at a constant value greater than zero, so we are on the right track.

When we add an integral term to the control system, we are able to track the set point. Here the error is integrated, the result is multiplied by the gain “Ki”, and the product is added to the proportional component and directed toward the system input:


Now the reservoir’s value fills to and remains at ten:
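The same closed-loop behavior can be reproduced numerically. Below is a minimal Euler-integration sketch (a re-implementation of the idea, not the Vensim model itself; the gains and outflow constant are illustrative values of my own):

```python
def simulate_pi(kp=2.0, ki=1.0, k_out=0.5, set_point=10.0, dt=0.01, steps=5000):
    """Euler-method simulation of the reservoir under PI control."""
    reservoir = 0.0   # initial value, as in the experiment
    integral = 0.0    # running integral of the error
    for _ in range(steps):
        error = set_point - reservoir
        integral += error * dt
        inflow = kp * error + ki * integral   # PI control law
        outflow = k_out * reservoir           # uncontrolled drain
        reservoir += (inflow - outflow) * dt
    return reservoir

print(round(simulate_pi(), 3))         # PI: tracks the set point -> 10.0
print(round(simulate_pi(ki=0.0), 3))   # P only: settles below it -> 8.0
```

Setting `ki=0` reproduces the proportional-only result above: the reservoir settles at a constant short of the set point, and only the integral term removes that steady-state offset.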


Posted in engineering, system dynamics

attempted PID controller with Kepler

I wanted to check out the Kepler scientific workflow system, and decided to build a PID controller model with it. Here I report on my results.

The following schematic, taken from the Wikipedia entry, shows the basic configuration of a PID controller. PID stands for “proportional integral derivative”, reflecting the fact that the error between a set point r(t) and the “plant” (process) output y(t) is subjected to a proportional multiplier (gain P), an integration of the error times a gain (gain I), and a differentiation of the error times a gain (gain D). The three results are then added together and used as the input u(t) to the plant. As the plant output y(t) approaches the set point, the error decreases, and therefore so does the modulation of the plant input u(t) by the proportional, integral, and derivative components. When the gains are tuned correctly, such a controller causes the plant output to track the set point.
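The control law described above reduces to a few lines in discrete time. A sketch (this is a generic textbook form, not the Kepler model; the class and parameter names are mine):

```python
class PID:
    """Textbook discrete-time PID controller; gain values are up to the user."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, set_point, measurement):
        """Return the control signal u(t) for the current measurement y(t)."""
        error = set_point - measurement
        self.integral += error * self.dt
        # Backward difference approximates the derivative of the error;
        # on the first call there is no previous error, so the D term is zero.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

The derivative term is the troublesome one: as discussed below, it amplifies any discontinuity in the error signal.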


My resulting Kepler model implementing a PID controller for a simulated plant looks like:


The process to be controlled is a sine wave plus input u(t), where u(t) is modulated by the control system. The set point is 10. Note the “Sensor Lag” objects which are time delays necessary to prevent causality loop errors.

Simulating the system with naïve gains gives us:


Here the system oscillates around the value five, sometimes nearing 10. I ran out of the time allotted for this project before I could tune the gain parameters to track the set point more satisfactorily. I believe the spikes at the discontinuities are either the derivative step trying to cope with theoretically infinite slopes, or artifacts of the underlying Runge-Kutta simulation procedure (or both). In either case, the spikes grow over simulation time, eventually causing the simulation to fail by producing infinite values.

It may be that my choice of a sine wave plus u(t) for the plant is a poor one for PID control. If I come up with a more appropriate plant I’ll try this again.

Posted in engineering, system dynamics

command line Hadoop with a “live” Elastic MapReduce cluster

There are two ways to run Hadoop from the command line on an Elastic MapReduce (EMR) cluster that is active in “waiting” mode. First the hard way:

Running Hadoop Directly by Logging into the Cluster’s Head Node

The following commands show how you can log into the cluster’s head node and run Hadoop from the shell, thereby obtaining the same results you would if you submitted the job through the web interface.

First, log into the head node of your EMR cluster (change your key filename and URL as necessary to match your information):

ssh -i my-key-pair.pem hadoop@<your-cluster-master-public-DNS>

Set necessary environment variables. Note that “/home/hadoop/MyJar.jar” is included, as we’ll need this when we run Hadoop:

export HADOOP_CLASSPATH=./:/home/hadoop/share/hadoop/common/hadoop-common-2.2.0.jar:/home/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:/home/hadoop/MyJar.jar:/home/hadoop/src
export CLASSPATH=./:/home/hadoop/share/hadoop/common/hadoop-common-2.2.0.jar:/home/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:/home/hadoop/MyJar.jar:/home/hadoop/src

Compile your Java classes and make a JAR file with them:

cd src
javac *.java
jar cf MyJar.jar *.class
mv MyJar.jar ../
cd ..

Copy your input file to the Hadoop filesystem:

hadoop fs -put input hdfs:///

Make sure your output destination doesn’t exist:

hadoop fs -rm -R output

Run Hadoop:

hadoop jar MyJar.jar MyJar hdfs:///input output

View the results:

hadoop fs -ls output
hadoop fs -cat output/part-r-00000

Running a Job on a “Live” EMR Cluster from any Terminal

Now for the easy way to run a job on an existing EMR cluster, assuming you’ve installed and configured the Amazon EMR Command Line Interface program. In this case the class you are running is stored on S3, along with the input. The output will be written to S3 as well. You do not need to be logged into the cluster’s head node for this to work.

elastic-mapreduce -j your-cluster-id --jar s3n://your-S3-bucket/code/MyJar.jar --arg MyJar --arg s3n://your-S3-bucket/input/ --arg s3n://your-S3-bucket/output
Posted in big data, data science, engineering

listing an Amazon S3 directory’s contents in Java

After much struggle, I have figured out how to list an Amazon S3 directory’s contents in Java using the AWS SDK. Here is how to do it:

First, you need to import the following libraries:

import java.util.List;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import com.amazonaws.util.StringUtils;

Then, in your main function (or elsewhere in your code) you need:

String accessKey = "your access key";
String secretKey = "your secret key";
AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
AmazonS3 conn = new AmazonS3Client(credentials);

String prefix = "directory-listing-example";
String bucketName = "badass-data-science";
String delimiter = "/";
if (!prefix.endsWith(delimiter)) {
    prefix += delimiter;
}
ListObjectsRequest listObjectsRequest = new ListObjectsRequest().withBucketName(bucketName).withPrefix(prefix).withDelimiter(delimiter);
ObjectListing new_objects = conn.listObjects(listObjectsRequest);

for (S3ObjectSummary objectSummary : new_objects.getObjectSummaries()) {
    System.out.println(objectSummary.getKey() + "\t" +
                       objectSummary.getSize() + "\t" +
                       StringUtils.fromDate(objectSummary.getLastModified()));
}
Be sure to change the “prefix” variable to the directory name you want to list and the “bucketName” variable to the name of your bucket.

You’ll need a CLASSPATH similar to the following, as there are substantial dependencies.

export CLASSPATH=./:/home/ubuntu/apps/packages/aws-java-sdk-1.7.1/lib/aws-java-sdk-1.7.1.jar:/home/ubuntu/apps/packages/commons-logging-1.1.3/commons-logging-1.1.3.jar:/home/ubuntu/apps/packages/httpcomponents-core-4.3.2/lib/httpcore-4.3.2.jar:/home/ubuntu/apps/packages/httpcomponents-client-4.3.2/lib/httpclient-4.3.2.jar:/home/ubuntu/apps/packages/jackson-annotations-2.0.1.jar:/home/ubuntu/apps/packages/jackson-databind-2.1.4.jar:/home/ubuntu/apps/packages/jackson-core-2.3.1.jar:/home/ubuntu/apps/packages/commons-codec-1.9/commons-codec-1.9.jar

The directory we are querying looks like:


Running this code gives you:


Posted in big data, data science, engineering

thank you to my Facebook fan!

An unknown reader consistently shares my links on Facebook. I just wanted to say thanks!


Posted in marketing

chaining map operations in Hadoop

Suppose we have a list of RNA sequences (pictured below), and we want to calculate both the “GC” nucleotide content and the RNA folding energy for each sequence using Hadoop 2.2.0. Furthermore, we want to chain the two operations so that each GC content result is fed to the corresponding sequence’s dG calculation. We also want both calculations to be map operations. (We are organizing the procedure so there are no reduce operations).
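Outside Hadoop, the chained computation is just function composition over the stream of sequences. A minimal Python sketch of the idea, with a stand-in for the folding-energy step (the real pipeline shells out to RNAfold, as shown below):

```python
def gc_content(seq):
    """First map step: pass the sequence along with its G/C fraction."""
    seq = seq.strip().upper()
    return seq, sum(1 for c in seq if c in "GC") / len(seq)

def fold_energy_stub(seq, gc):
    """Second map step: stand-in for the RNAfold call, which would
    compute the real folding energy (dG) for the sequence."""
    return seq, gc, None  # dG would replace None

sequences = ["GCGCAUAU", "AUAUAUAU"]
chained = [fold_energy_stub(*gc_content(s)) for s in sequences]
print(chained[0][1])  # GC content of the first sequence -> 0.5
```

ChainMapper does exactly this composition, except distributed: the key/value pair emitted by the first mapper becomes the input pair of the second.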


The way to accomplish this in Hadoop 2.2.0 is to “chain” the map operations together using the “ChainMapper” class. In the driver code posted below I will show how this is done.

First though, we need the code for the GC content map operation:

import java.io.IOException;
import java.lang.InterruptedException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;

public class GcContent extends Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.toString();
        line = line.trim();

        Double GCContent = 0.;
        for (int i = 0; i < line.length(); i++) {
            if (line.toUpperCase().charAt(i) == 'G') { GCContent += 1; }
            if (line.toUpperCase().charAt(i) == 'C') { GCContent += 1; }
        }
        GCContent = GCContent / line.length();

        context.write(new Text(line), new Text(GCContent.toString()));
    }
}

And we need the code for the RNAfold map operation:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.lang.InterruptedException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;

public class RnaFold extends Mapper<Text, Text, Text, Text> {

    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {

        String GCContent = value.toString();
        String sequence = key.toString();

        String cmd = "echo " + sequence + " | /home/user01/apps/ViennaRNA-1.8.5/bin/RNAfold -noPS";
        String[] full_cmd = {"/bin/sh", "-c", cmd};

        Runtime r = Runtime.getRuntime();
        Process p = r.exec(full_cmd);
        BufferedReader b = new BufferedReader(new InputStreamReader(p.getInputStream()));

        // The folding energy appears in parentheses at the end of RNAfold's second output line.
        String line = "";
        int line_count = 0;
        while ((line = b.readLine()) != null) {
            line_count += 1;
            if (line_count == 2) {
                String[] scoreArray = line.split(" ");
                String score = scoreArray[scoreArray.length - 1].replace("(", "").replace(")", "");
                context.write(new Text(sequence), new Text(GCContent + " " + score));
            }
        }
    }
}

Finally, here is the code for the Hadoop driver, which I called “Pilot.java”:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Pilot {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

        Job job = new Job(conf, "Pilot");
        job.setJarByClass(Pilot.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Chain the two map operations: GcContent's output types must
        // match RnaFold's input types (Text, Text).
        Configuration mapconf1 = new Configuration(false);
        ChainMapper.addMapper(job, GcContent.class, LongWritable.class, Text.class, Text.class, Text.class, mapconf1);

        Configuration mapconf2 = new Configuration(false);
        ChainMapper.addMapper(job, RnaFold.class, Text.class, Text.class, Text.class, Text.class, mapconf2);

        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
In this code we can see the use of the “ChainMapper” class to link the output of the GcContent class to the input of the RnaFold class. When using ChainMapper, be sure the data types agree with those expected by each class; this caused me substantial difficulty until I figured it out.

After running Hadoop, the results look like:


In the image above we have each input sequence, alongside its corresponding GC content and dG calculation results.

Posted in big data, data science, engineering

dynamically generated matplotlib images via django

I finally figured out how to serve a dynamically generated matplotlib image through Django. Here is the necessary view and URL configuration code for an example case:

from django.http import HttpResponse

from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
import matplotlib.pyplot as plt
import numpy as np

def index(request):
    x = np.random.randn(100000)
    f = plt.figure(111)
    plt.hist(x, color='lightblue')
    plt.title('Histogram of Randomly Generated Values')
    canvas = FigureCanvas(f)
    response = HttpResponse(content_type='image/png')
    canvas.print_png(response)  # render the figure into the response body
    return response

from django.conf.urls import patterns, url

from histogram import views

urlpatterns = patterns('',
    url(r'^$', views.index, name='index'),
)


Posted in data science, engineering

sorting out R’s distribution functions

R’s distribution functions come in four flavors: “d”, “p”, “q”, and “r” (e.g., “dnorm”, “pnorm”, “qnorm”, and “rnorm”). I regularly get them mixed up, so I am writing down here what they do, for future reference.


Density Function

“d” produces the density curve as a function of a random variable. For example, “dnorm” produces:

x <- seq(-4, 4, 0.001)
y <- dnorm(x, mean=0, sd=1)
plot(x, y, type="l", col="blue", main="Density:  dnorm(mean=0, sd=1)", xlab="X", ylab="Density")


Cumulative Distribution Function

“p” produces the CDF as a function of a random variable. For example, “pnorm” produces:

y <- pnorm(x, mean=0, sd=1)
plot(x, y, type="l", col="blue", main="Cumulative Distribution:  pnorm(mean=0, sd=1)", xlab="X", ylab="Probability")


Quantile Function

“q” produces the quantiles as a function of probabilities. For example, “qnorm” produces:

p <- seq(0, 1, 0.001)
y <- qnorm(p, mean=0, sd=1)
plot(p, y, type="l", col="blue", main="Quantile Function:  qnorm(mean=0, sd=1)", xlab="Probability", ylab="X")
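As an aside for Python users, the standard library’s `statistics.NormalDist` exposes the same operations for the normal distribution (`pdf`, `cdf`, and `inv_cdf` correspond to “d”, “p”, and “q”), which makes it easy to check that “q” is the inverse of “p”:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)
p = std_normal.cdf(1.0)       # analogue of R's pnorm(1)
x = std_normal.inv_cdf(p)     # analogue of R's qnorm(p); recovers 1.0
print(round(p, 4))            # -> 0.8413
```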


Random Samples

“r” produces random samples from the given distribution. For example, “rnorm” produces:

r <- rnorm(1000, mean=0, sd=1)
hist(r, col="lightblue", xlab="X", main="Histogram of rnorm(1000, mean=0, sd=1)")


Posted in data science, statistics

industrial diversity correlates with population

It seems logical that U.S. counties with larger populations would support more diverse industry than counties with smaller populations. Perhaps this has been proven already, but I recently stumbled upon my own verification of the idea:


The above plot shows industry diversity (expressed as Shannon entropy, discussed below) as a function of county population. As county population goes up, so does industry diversity, at least until a saturation level is reached for the most populous counties. This result accords with intuition: higher-population centers produce more economic activity than lower-population centers, presumably serving a wider array of markets and therefore hosting a wider array of industries.


Industry Diversity

Ecologists use the Shannon entropy equation to measure species diversity in a given region. Here I apply the same equation to determine industry diversity in each US county.
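For reference, the Shannon entropy of a county’s industry distribution is the standard formula:

```latex
H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
```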


We calculate H(X) for each U.S. county. In the equation, p(xi) is the probability of observing a business from industry xi in the county if a business is picked at random, which is calculated from the business establishment counts keyed to NAICS industry codes in the 2011 County Business Patterns dataset. (Here x is the set of all industries in the county). Source code used to compute H(X) for each county is described at
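The per-county calculation is small enough to sketch outside Hadoop; a minimal Python version (the establishment counts below are made up):

```python
from math import log2

def industry_diversity(establishment_counts):
    """Shannon entropy H(X), in bits, over industry establishment counts."""
    total = sum(establishment_counts)
    probabilities = [c / total for c in establishment_counts]
    return -sum(p * log2(p) for p in probabilities if p > 0)

# Four equally represented industries: H = log2(4) = 2 bits.
print(industry_diversity([25, 25, 25, 25]))  # -> 2.0
```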

Matching with Population

The Shannon entropy results computed above were matched to 2010 Census population counts for each county using state and county FIPS codes. The source data comes from

Statistical Model

We confirm the correlation with a linear model of Shannon entropy versus population, where each data point used in the regression corresponds to a U.S. county:


The model shows that Log10(population) explains 85% of the variance in Shannon entropy.

Posted in econometrics

test driving Amazon Web Services’ Elastic MapReduce

Hadoop provides software infrastructure for running MapReduce tasks, but it requires substantial setup time and an available compute cluster to take full advantage of. Amazon’s Elastic MapReduce (EMR) solves these problems, delivering pre-configured Hadoop virtual machines running in the cloud for only the time they are required, and billing only for the computation minutes used. Here I test drive EMR by demonstrating its use on a Hadoop computation I previously wrote about.

Computation Overview

In February of 2012, I wrote about computing each U.S. county’s “industrial diversity” index from the U.S. Census Bureau’s County Business Patterns (CBP) data. The CBP dataset provides, for each NAICS industry code, the number of business establishments operating in each county. Using an equation similar to that used by ecologists to study species diversity in a region, we can derive industrial diversity for each county using the number of NAICS codes and establishments as input.

At its core the diversity equation is Shannon’s information entropy equation, and the per-county computation is well suited to the MapReduce framework. I previously described an adaptation of the computation to Hadoop, where I ran the calculation with a Hadoop instance configured for single-node use (basically, I ran it on my desktop). In this post I expand from running the computation on a single node to running it on a cluster using EMR.

Source Code

First, I had to change the code from my previous post to reflect updates in Hadoop. The new map step code is:

import java.io.IOException;
import java.lang.InterruptedException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;

public class MapToEstablishments extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.toString().replace("\"", "");
        line = line.trim();
        String[] line_split = line.split(",");

        // Skip the header line.
        if (line.contains("fipstate")) { return; }

        String naics = line_split[2];
        Integer total_establishments = Integer.parseInt(line_split[10]);
        String fipstate = line_split[0];
        String fipcounty = line_split[1];

        // Skip NAICS roll-up and special codes.
        if (naics.contains("-")) { return; }
        if (naics.contains("/")) { return; }

        DoubleWritable total_establishments_writable = new DoubleWritable(total_establishments);
        context.write(new Text(fipstate + "_" + fipcounty), total_establishments_writable);
    }
}
The map step works by selecting the state/county FIPS numbers as the keys, and the business establishment counts for each NAICS code as the values. We do not need to save the NAICS codes themselves. When the map step is complete each state/county FIPS key will have multiple values, one for each establishment count.

The new reduce step code is:

import java.io.IOException;
import java.lang.InterruptedException;
import java.util.ArrayList;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;

public class ReduceToFindIndustrialDiversity extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    public void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {

        ArrayList<Double> establishment_counts = new ArrayList<Double>();
        for (DoubleWritable val : values) {
            establishment_counts.add(val.get());
        }

        // I'm sure there is a better way to do this:
        double sum = 0.0;
        for (double i : establishment_counts) { sum += i; }

        // I'm sure there is a better way to do this too:
        ArrayList<Double> probabilities = new ArrayList<Double>();
        for (double i : establishment_counts) { probabilities.add(i / sum); }

        // Shannon entropy calculation
        double H = 0.0;
        for (double p : probabilities) {
            H += p * Math.log(p) / Math.log(2.0);
        }
        H = H * -1.0;

        context.write(key, new DoubleWritable(H));
    }
}

The reduce step works by calculating the Shannon entropy (equation above) across all the establishment count values per key, producing a new key/value pair consisting of state/county FIPS numbers as the key, and the resulting Shannon entropy as the value. By the end of the reduce step each state/county FIPS key will have only one value.

Finally, the new pilot code is:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class IndustrialDiversity {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        Job job = new Job(conf, "IndustrialDiversity");

        job.setJarByClass(IndustrialDiversity.class);
        job.setMapperClass(MapToEstablishments.class);
        job.setReducerClass(ReduceToFindIndustrialDiversity.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
This code contains the “main” function used to configure and run the analysis.

Uploading the Code and Data to S3

I next compiled these three classes into a JAR file called IndustrialDiversity.jar and uploaded the JAR file to Amazon Web Services’ S3 under the “code” folder of my bucket:


Then I uploaded the 2011 County Business Patterns data file to S3 into the “input” folder of my bucket:


Starting the Elastic MapReduce Cluster

At the Elastic MapReduce page available from the Amazon Web Services management console we see a list of previous clusters:


From here we press the “Create cluster” button and get a form with configuration options. Note that I set the log folder to point to my bucket on S3 in the image below:


At the bottom of the form we press the “Create cluster” button:


Next we see the new cluster page showing that it is starting up:


Once the cluster is up, three configuration steps run:


Finally, the cluster startup is complete:


Running the Hadoop Job

We run the Hadoop job by pressing the “Add step” button to add a new step, configuring it to run the JAR file we uploaded previously. Note that the class containing the “main” function is the first argument:


The step immediately begins running once “Add” is pressed:


Finally, the step completes:


Output Files

The output files may now be viewed on S3:


Looking at one of the results files in particular shows per county industrial diversity indices:


Terminating the Cluster

We now terminate the cluster:


The process takes a bit, but finally completes:



This procedure cost $1.36 USD.

Posted in big data, data science, econometrics