proportional-integral (PI) controller in Vensim

In my last post, I discussed an attempt at designing a PID controller using the Kepler Scientific Workflow system. Here I report on a similar (and this time successful) development of a proportional-integral (PI) controller in Vensim PLE. Vensim is a software package for describing and simulating dynamic models, particularly those involving feedback. I’ve often described it as a tool for “point and click differential equations,” since it lets users draw, with a mouse, simulation models that would ordinarily have to be specified as a set of differential equations.

A PI controller is a feedback control configuration in which the difference (error) between a system’s response and a desired set point is fed to both a proportional multiplier and an integrator. The proportional and integral terms are each multiplied by a constant, called a “gain”, and the two results are summed. The resulting value is then applied to the system’s input. As the system response gets closer to the set point, the calculated input value approaches a constant.
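
In symbols, writing e(t) for the error between the set point and the system response, the controller output is u(t) = Kp·e(t) + Ki·∫ e(τ) dτ, where Kp and Ki are the proportional and integral gains.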

Most real-world controllers of this kind also include a “D” (derivative) component and are called “PID” controllers. I was not able to make Vensim PLE calculate derivatives, however, so I left that feature out.

The basic system we are controlling looks like:

[Figure: schematic of the basic reservoir system]

We want to control the value of the reservoir by adjusting the inward flow. The outward flow is not controlled; it equals a positive constant multiplied by the reservoir’s value. Because the outward flow is greater than zero whenever the reservoir’s value is greater than zero, we must continually replenish the reservoir through the inward flow if we want its value to stay constant. For this exercise we want to control the inward flow so that the reservoir’s value equals ten, starting from an initial value of zero.

We start with a simple proportional controller:

[Figure: schematic of the proportional (P) controller]

Here we have a set point equal to ten. The inward flow is set to a gain constant (Kp) multiplied by the error (set point minus the reservoir value). Simulating the system yields:

[Figure: simulation results for the P controller]

We can see from these simulation results that the system is unable to track the set point of ten. It does settle at a constant value greater than zero, however, so we are on the right track.
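
A quick steady-state check explains the gap. Writing k for the outflow constant, the reservoir stops changing when the inflow equals the outflow, i.e. when Kp·(10 - V) = k·V. Solving gives V = 10·Kp / (Kp + k), which is always less than ten for any finite Kp: increasing Kp shrinks the error but never eliminates it.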

When we add an integral term to the control system, we are able to track the set point. Here the error is integrated, the result is multiplied by the gain “Ki”, and that product is added to the proportional component and directed toward the system input:

[Figure: schematic of the PI controller]

Now the reservoir’s value fills to ten and remains there:

[Figure: simulation results for the PI controller]
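
For readers without Vensim handy, here is a minimal numerical sketch of the same setup in Python (forward-Euler integration; the outflow constant, gains, and time step are illustrative values I chose, not the ones used in the Vensim model):

import numpy as np

# Plant: a reservoir drained at a rate proportional to its value,
#     dV/dt = inflow - k_out * V
# PI control of the inflow:
#     inflow = Kp * error + Ki * integral(error),  error = set_point - V

def simulate(Kp, Ki, set_point=10.0, k_out=0.5, dt=0.01, t_end=30.0):
    V = 0.0               # the reservoir starts empty
    error_integral = 0.0  # running integral of the error
    history = []
    for _ in np.arange(0.0, t_end, dt):
        error = set_point - V
        error_integral += error * dt
        inflow = max(Kp * error + Ki * error_integral, 0.0)  # flow cannot be negative
        V += (inflow - k_out * V) * dt                       # forward Euler step
        history.append(V)
    return np.array(history)

print(simulate(Kp=1.0, Ki=0.0)[-1])  # proportional only: settles short of the set point
print(simulate(Kp=1.0, Ki=0.5)[-1])  # proportional plus integral: settles at the set point

The proportional-only run settles below ten for the reason described above, while the run with the integral term converges to the set point.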


attempted PID controller with Kepler

I wanted to check out the Kepler scientific workflow system (https://kepler-project.org/), and decided to build a PID controller model with it. Here I report on my results.

The following schematic, taken from the Wikipedia entry http://en.wikipedia.org/wiki/PID_controller, shows the basic configuration of a PID controller. PID stands for “proportional integral derivative”, reflecting the fact that the error between a set point r(t) and the “plant” (process) output y(t) is subjected to a proportional multiplier (gain P), an integration of the error times a gain (gain I), and a derivative of the error times a gain (gain D). These results are then added together and used as the input u(t) to the plant. As the plant output y(t) approaches the set point, the error decreases, and therefore the modulation of the plant input u(t) by the proportional, integral, and derivative components decreases. When the gains are tuned correctly, such a controller causes the plant output to track the set point.

[Figure: PID controller block diagram, from Wikipedia]
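
In discrete time the controller in the diagram reduces to a few lines. The following Python sketch is my own illustration rather than the Kepler model itself; it approximates the derivative with a backward difference of the error:

class PIDController:
    """Minimal discrete-time PID controller (illustrative sketch)."""

    def __init__(self, kp, ki, kd, set_point, dt):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.set_point = set_point
        self.dt = dt
        self.error_integral = 0.0
        self.previous_error = 0.0

    def update(self, y):
        """Given the latest plant output y(t), return the next plant input u(t)."""
        error = self.set_point - y
        self.error_integral += error * self.dt                # integral term accumulator
        derivative = (error - self.previous_error) / self.dt  # backward-difference derivative
        self.previous_error = error
        return (self.kp * error
                + self.ki * self.error_integral
                + self.kd * derivative)

At each simulation tick the latest plant output is passed to update(), and the returned value becomes the next plant input.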

My resulting Kepler model implementing a PID controller for a simulated plant looks like:

[Figure: the Kepler model of the PID controller and simulated plant]

The process to be controlled is a sine wave plus the input u(t), where u(t) is set by the control system. The set point is 10. Note the “Sensor Lag” objects, which are time delays needed to prevent causality-loop errors.

Simulating the system with naïve gains gives us:

[Figure: output of the PID-controlled plant]

Here the system oscillates around the value five, sometimes nearing 10. I ran out of the time allotted for this project before I could tune the gain parameters to track the set point more satisfactorily. I believe the spikes at the discontinuities are the derivative step trying to deal with theoretically infinite slopes, or artifacts of the underlying Runge-Kutta simulation procedure (or both). In either case the spikes grow over simulation time, eventually causing the simulation to fail by producing infinite values.

It may be that my choice of a sine wave plus u(t) for the plant is a poor one for PID control. If I come up with a more appropriate plant I’ll try this again.


command line Hadoop with a “live” Elastic MapReduce cluster

There are two ways to run Hadoop from the command line on an Elastic MapReduce (EMR) cluster that is active in “waiting” mode. First the hard way:

Running Hadoop Directly by Logging into the Cluster’s Head Node

The following commands show how you can log into the cluster’s head node and run Hadoop from the shell, obtaining the same results you would get by submitting the job through the web interface.

First, log into the head node of your EMR cluster (change the key filename and hostname as necessary to match your own setup):

ssh -i my-key-pair.pem hadoop@ec2-53-84-234-157.compute-1.amazonaws.com

Set the necessary environment variables. Note that “/home/hadoop/MyJar.jar” is included in the classpath, as we’ll need it when we run Hadoop:

export HADOOP_CLASSPATH=./:/home/hadoop/share/hadoop/common/hadoop-common-2.2.0.jar:/home/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:/home/hadoop/MyJar.jar:/home/hadoop/src
export CLASSPATH=./:/home/hadoop/share/hadoop/common/hadoop-common-2.2.0.jar:/home/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:/home/hadoop/MyJar.jar:/home/hadoop/src

Compile your Java classes and make a JAR file with them:

cd src
javac *.java
jar cf MyJar.jar *.class
mv MyJar.jar ../
cd ..

Copy your input file to the Hadoop filesystem:

hadoop fs -put input hdfs:///

Make sure your output destination doesn’t already exist (the following removes it if it does):

hadoop fs -rm -R output

Run Hadoop:

hadoop jar MyJar.jar MyJar hdfs:///input output

View the results:

hadoop fs -ls output
hadoop fs -cat output/part-r-00000

Running a Job on a “Live” EMR Cluster from any Terminal

Now for the easy way to run a job on an existing EMR cluster, assuming you’ve installed and configured the Amazon EMR Command Line Interface program. In this case the class you are running is stored on S3, along with the input. The output will be written to S3 as well. You do not need to be logged into the cluster’s head node for this to work.

elastic-mapreduce -j your-cluster-id --jar s3n://your-S3-bucket/code/MyJar.jar --arg MyJar --arg s3n://your-S3-bucket/input/ --arg s3n://your-S3-bucket/output

listing an Amazon S3 directory’s contents in Java

After much struggle, I have figured out how to list an Amazon S3 directory’s contents in Java using the AWS SDK. Here is how to do it:

First, you need to import the following libraries:

import java.util.List;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.model.Bucket;
import com.amazonaws.util.StringUtils;
import com.amazonaws.services.s3.model.S3ObjectSummary;

Then, in your main function (or elsewhere in your code) you need:

String accessKey = "your access key";
String secretKey = "your secret key";
AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
AmazonS3 conn = new AmazonS3Client(credentials);

String prefix = "directory-listing-example";
String bucketName = "badass-data-science";
String delimiter = "/";
if (!prefix.endsWith(delimiter)) {
    prefix += delimiter;
}

ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
    .withBucketName(bucketName)
    .withPrefix(prefix)
    .withDelimiter(delimiter);
ObjectListing new_objects = conn.listObjects(listObjectsRequest);

for (S3ObjectSummary objectSummary : new_objects.getObjectSummaries()) {
    System.out.println(objectSummary.getKey() + "\t" +
                       objectSummary.getSize() + "\t" +
                       StringUtils.fromDate(objectSummary.getLastModified()));
}

Be sure to change the “prefix” variable to the directory name you want to list and the “bucketName” variable to the name of your bucket.

You’ll need a CLASSPATH similar to the following, as there are substantial dependencies:

export CLASSPATH=./:/home/ubuntu/apps/packages/aws-java-sdk-1.7.1/lib/aws-java-sdk-1.7.1.jar:/home/ubuntu/apps/packages/commons-logging-1.1.3/commons-logging-1.1.3.jar:/home/ubuntu/apps/packages/httpcomponents-core-4.3.2/lib/httpcore-4.3.2.jar:/home/ubuntu/apps/packages/httpcomponents-client-4.3.2/lib/httpclient-4.3.2.jar:/home/ubuntu/apps/packages/jackson-annotations-2.0.1.jar:/home/ubuntu/apps/packages/jackson-databind-2.1.4.jar:/home/ubuntu/apps/packages/jackson-core-2.3.1.jar:/home/ubuntu/apps/packages/commons-codec-1.9/commons-codec-1.9.jar

The directory we are querying looks like:

[Figure: S3 console view of the directory being listed]

Running this code gives you:

[Figure: program output listing the directory’s contents]


thank you to my Facebook fan!

An unknown reader consistently shares my links on Facebook. I just wanted to say thanks!

Dan


chaining map operations in Hadoop

Suppose we have a list of RNA sequences (pictured below), and we want to calculate both the “GC” nucleotide content and the RNA folding free energy (dG) for each sequence using Hadoop 2.2.0. Furthermore, we want to chain the two operations so that each GC content result is fed to the corresponding sequence’s dG calculation. We also want both calculations to be map operations. (We are organizing the procedure so there are no reduce operations.)

[Figure: the input list of RNA sequences]

The way to accomplish this in Hadoop 2.2.0 is to “chain” the map operations together using the “ChainMapper” class. In the driver code posted below I will show how this is done.

First though, we need the code for the GC content map operation:

import java.io.IOException;
import java.lang.InterruptedException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class GcContent extends Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // Each input value is one RNA sequence
        String line = value.toString();
        line = line.trim();
        if (line.length() == 0) { return; }  // skip blank lines

        // Fraction of the sequence that is 'G' or 'C'
        String upper = line.toUpperCase();
        Double GCContent = 0.;
        for (int i = 0; i < upper.length(); i++) {
            if (upper.charAt(i) == 'G') { GCContent += 1; }
            if (upper.charAt(i) == 'C') { GCContent += 1; }
        }
        GCContent = GCContent / line.length();

        // Emit the sequence as the key and its GC content as the value
        context.write(new Text(line), new Text(GCContent.toString()));
    }
}

And we need the code for the RNAfold map operation:

import java.io.IOException;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.lang.InterruptedException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class RnaFold extends Mapper<Text, Text, Text, Text> {

    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {

        // The key is the sequence and the value is its GC content,
        // exactly as emitted by the GcContent mapper
        String GCContent = value.toString();
        String sequence = key.toString();

        // Shell out to RNAfold, piping the sequence to its standard input
        String cmd = "echo " + sequence + " | /home/user01/apps/ViennaRNA-1.8.5/bin/RNAfold -noPS";
        String[] full_cmd = {"/bin/sh", "-c", cmd};

        Runtime r = Runtime.getRuntime();
        Process p = r.exec(full_cmd);
        p.waitFor();
        BufferedReader b = new BufferedReader(new InputStreamReader(p.getInputStream()));

        // The folding energy (dG) appears at the end of the second line of RNAfold's output
        String line = "";
        int line_count = 0;
        while ((line = b.readLine()) != null) {
            line_count += 1;
            if (line_count == 2) {
                String[] scoreArray = line.split(" ");
                String score = scoreArray[scoreArray.length - 1].replace("(", "").replace(")", "");
                context.write(new Text(sequence), new Text(GCContent + " " + score));
            }
        }
    }
}

Finally, here is the code for the Hadoop driver, which I called “Pilot.java”:

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;

public class Pilot {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

        Job job = new Job(conf, "Pilot");

        job.setJarByClass(Pilot.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        Configuration mapconf1 = new Configuration(false);
        ChainMapper.addMapper(job, GcContent.class, LongWritable.class, Text.class, Text.class, Text.class, mapconf1);

        Configuration mapconf2 = new Configuration(false);
        ChainMapper.addMapper(job, RnaFold.class, Text.class, Text.class, Text.class, Text.class, mapconf2);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

In this code we can see the use of the “ChainMapper” class to link the output of the GcContent class to the input of the RnaFold class. When using ChainMapper, be sure the data types declared in each addMapper() call agree with the data types expected by the classes: here GcContent emits Text keys and Text values, so RnaFold must accept Text for both its input key and input value. This caused me substantial difficulty until I figured it out.

After running Hadoop, the results look like:

[Figure: Hadoop output showing each sequence with its GC content and dG]

In the image above we see each input sequence alongside its corresponding GC content and dG results.


dynamically generated matplotlib images via django

I finally figured out how to serve a dynamically generated matplotlib image through Django. Here is the necessary views.py and urls.py code for an example case:

views.py

from django.http import HttpResponse

from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
import matplotlib.pyplot as plt
import numpy as np

def index(request):
    x = np.random.randn(100000)
    f = plt.figure(111)
    plt.hist(x, color='lightblue')
    plt.xlabel('X')
    plt.ylabel('Frequency')
    plt.title('Histogram of Randomly Generated Values')
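    # Render the figure as PNG directly into the HTTP response (no temporary file needed)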
    canvas = FigureCanvas(f)
    response = HttpResponse(content_type='image/png')
    canvas.print_png(response)
    plt.close()
    return response

urls.py

from django.conf.urls import patterns, url

from histogram import views

urlpatterns = patterns('',
    url(r'^$', views.index, name='index')
)
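
If it helps, the app’s URLconf above can be wired into the project-level urls.py like this (a sketch assuming the app is named “histogram”, matching the import above, and the same pre-1.8 Django URL API):

from django.conf.urls import patterns, include, url

urlpatterns = patterns('',
    url(r'^histogram/', include('histogram.urls')),
)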

Result

[Figure: the histogram rendered in the browser]


sorting out R’s distribution functions

R’s distribution functions come in four flavors: “d”, “p”, “q”, and “r” (e.g., “dnorm”, “pnorm”, “qnorm”, and “rnorm”). I regularly get them mixed up, so I am writing down here what they do for future reference.

Density

“d” produces the probability density as a function of the random variable’s value. For example, “dnorm” produces:

x <- seq(-4, 4, 0.001)
y <- dnorm(x, mean=0, sd=1)
plot(x, y, type="l", col="blue", main="Density:  dnorm(mean=0, sd=1)", xlab="X", ylab="Density")

[Figure: dnorm density curve]

Cumulative Distribution Function

“p” produces the cumulative distribution function (CDF) as a function of the random variable’s value. For example, “pnorm” produces:

y <- pnorm(x, mean=0, sd=1)
plot(x, y, type="l", col="blue", main="Cumulative Distribution:  pnorm(mean=0, sd=1)", xlab="X", ylab="Probability")

[Figure: pnorm cumulative distribution curve]

Quantile Function

“q” produces quantiles as a function of probability; it is the inverse of “p”. For example, “qnorm” produces:

p <- seq(0, 1, 0.001)
y <- qnorm(p, mean=0, sd=1)
plot(p, y, type="l", col="blue", main="Quantile Function:  qnorm(mean=0, sd=1)", xlab="Probability", ylab="X")

[Figure: qnorm quantile function curve]

Random Samples

“r” produces random samples from the given distribution. For example, “rnorm” produces:

r <- rnorm(1000, mean=0, sd=1)
hist(r, col="lightblue", xlab="X", main="Histogram of rnorm(1000, mean=0, sd=1)")

[Figure: histogram of rnorm samples]


industrial diversity correlates with population

It seems logical that U.S. counties with larger populations would support more diverse industry than counties with smaller populations. Perhaps this has been shown already, but I recently stumbled upon my own evidence for the idea:

[Figure: industrial diversity (Shannon entropy) versus county population]

The above plot shows industry diversity (expressed as Shannon entropy, discussed below) as a function of county population. As county population goes up, so does industry diversity, at least until a saturation level is reached for the most populous counties. This result matches intuition: higher-population centers produce more economic activity than lower-population centers, presumably serving a wider array of markets and therefore hosting a wider array of industries.

Method

Industry Diversity

Ecologists use the Shannon entropy equation to measure species diversity in a given region. Here I apply the same equation to determine industry diversity in each US county.

H(X) = -Σ p(xi) log2 p(xi), with the sum taken over all industries xi in the county

We calculate H(X) for each U.S. county. In the equation, p(xi) is the probability of observing a business from industry xi when a business in the county is picked at random; it is calculated from the business establishment counts keyed to NAICS industry codes in the 2011 County Business Patterns dataset. (Here X is the set of all industries in the county.) Source code used to compute H(X) for each county is described at http://badassdatascience.com/2014/01/07/test-driving-amazon-web-services-elastic-mapreduce/.
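
To make the calculation concrete, here is a tiny Python illustration of the entropy computation for a single hypothetical county (the NAICS codes and establishment counts below are made up for the example):

import math

# Hypothetical establishment counts for one county, keyed by NAICS sector code
establishments = {"23": 120, "44": 310, "54": 95, "62": 180, "72": 150}

total = float(sum(establishments.values()))
probabilities = [count / total for count in establishments.values()]

# Shannon entropy in bits: H(X) = -sum over industries of p * log2(p)
H = -sum(p * math.log(p, 2) for p in probabilities)
print(round(H, 3))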

Matching with Population

The Shannon entropy results computed above were matched to 2010 Census population counts for each county using state and county FIPS codes. The source data comes from http://www.census.gov/popest/data/counties/totals/2012/CO-EST2012-alldata.html.

Statistical Model

We quantify the correlation with a linear model of Shannon entropy versus log10(population), where each data point used in the regression corresponds to one U.S. county:

[Figure: OLS regression summary for Shannon entropy versus log10(population)]

The model shows that Log10(population) explains 85% of the variance in Shannon entropy.
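
For reference, a regression of this shape takes only a few lines; the sketch below assumes a pandas DataFrame loaded from a hypothetical CSV with one row per county and “population” and “entropy” columns (the file name and column names are placeholders, not the actual code behind the screenshot above):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# One row per county, with hypothetical 'population' and 'entropy' columns
df = pd.read_csv("county_entropy_population.csv")

X = sm.add_constant(np.log10(df["population"]))  # predictor: log10(population) plus an intercept
model = sm.OLS(df["entropy"], X).fit()
print(model.summary())                           # reports R-squared and coefficients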


test driving Amazon Web Services’ Elastic MapReduce

Hadoop provides software infrastructure for running MapReduce tasks, but it requires substantial setup time and an available compute cluster to use to full advantage. Amazon’s Elastic MapReduce (EMR) addresses both problems, delivering pre-configured Hadoop virtual machines in the cloud for only the time they are required and billing only for the computation time used. Here I test drive EMR by demonstrating its use on a Hadoop computation I previously wrote about.

Computation Overview

In February of 2012, I wrote about computation of each U.S. county’s “industrial diversity” index from the U.S. Census Bureau’s County Business Patterns (CBP) data (see http://badassdatascience.com/2012/02/21/industry-diversity/). The CBP dataset provides, for each NAICS industry code, the number of business establishments operating in each county. Using an equation similar to that used by ecologists to study species diversity in a region, we can derive industrial diversity for each county using the number of NAICS codes and establishments as input.

At its core the diversity equation is Shannon’s information entropy equation, and the per-county computation is well suited to the MapReduce framework. I described an adaptation of the computation to Hadoop in an earlier post, where I ran the calculation with a Hadoop instance configured for single-node use (basically, I ran it on my desktop). In this post I expand from running the computation on a single node to running it on a cluster using EMR.

Source Code

First, I had to change the code from my previous post to reflect updates in Hadoop. The new map step code is:

import java.io.IOException;
import java.lang.InterruptedException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class MapToEstablishments extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

	String line = value.toString().replace("\"", "");
	line = line.trim();
	String[] line_split = line.split(",");

	if (line.contains("fipstate")) { return; }

	String naics = line_split[2];
	Integer total_establishments = Integer.parseInt(line_split[10]);
	String fipstate = line_split[0];
	String fipcounty = line_split[1];

	if (naics.contains("-")) { return;  }
	if (naics.contains("/")) { return;  }

	DoubleWritable total_establishments_writable = new DoubleWritable(total_establishments);
	context.write(new Text(fipstate + "_" + fipcounty), total_establishments_writable);
    }
}

The map step works by selecting the state/county FIPS numbers as the keys, and the business establishment counts for each NAICS code as the values. We do not need to save the NAICS codes themselves. When the map step is complete each state/county FIPS key will have multiple values, one for each establishment count.

The new reduce step code is:

import java.io.IOException;
import java.lang.InterruptedException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import java.util.ArrayList;

public class ReduceToFindIndustrialDiversity extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    public void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {

        // Collect the establishment counts emitted by the map step for this county
        ArrayList<Double> establishment_counts = new ArrayList<Double>();
        for (DoubleWritable val : values) {
            establishment_counts.add(val.get());
        }

        // I'm sure there is a better way to do this:
        double sum = 0.0;
        for (double i : establishment_counts) { sum += i; }

        // I'm sure there is a better way to do this too:
        ArrayList<Double> probabilities = new ArrayList<Double>();
        for (double i : establishment_counts) { probabilities.add( i / sum ); }

        // Shannon entropy calculation
        double H = 0.0;
        for (double p : probabilities) {
            H += p * Math.log(p) / Math.log(2.0);
        }
        H = H * -1.0;

        context.write(key, new DoubleWritable(H));
    }
}

H = -Σ p log2 p, where each p is an establishment count divided by the county’s total establishment count

The reduce step works by calculating the Shannon entropy (equation above) across all the establishment count values per key, producing a new key/value pair consisting of state/county FIPS numbers as the key, and the resulting Shannon entropy as the value. By the end of the reduce step each state/county FIPS key will have only one value.

Finally, the new pilot code is:

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class IndustrialDiversity {

	public static void main(String[] args) throws Exception {

	    Configuration conf = new Configuration();
	    Job job = new Job(conf, "IndustrialDiversity");

	    job.setJarByClass(IndustrialDiversity.class);
	    job.setOutputKeyClass(Text.class);
	    job.setOutputValueClass(DoubleWritable.class);

	    job.setMapOutputKeyClass(Text.class);
	    job.setMapOutputValueClass(DoubleWritable.class);

	    job.setMapperClass(MapToEstablishments.class);
	    job.setReducerClass(ReduceToFindIndustrialDiversity.class);

	    job.setInputFormatClass(TextInputFormat.class);
	    job.setOutputFormatClass(TextOutputFormat.class);

	    FileInputFormat.addInputPath(job, new Path(args[0]));
	    FileOutputFormat.setOutputPath(job, new Path(args[1]));

	    job.waitForCompletion(true);
	}
}

This code contains the “main” function used to configure and run the analysis.

Uploading the Code and Data to S3

I next compiled these three classes into a JAR file called IndustrialDiversity.jar and uploaded the JAR file to Amazon Web Services’ S3 under the “code” folder of my bucket:

[Figure: uploading the JAR file to the “code” folder of the S3 bucket]

Then I uploaded the 2011 County Business Patterns data file to S3 into the “input” folder of my bucket:

[Figure: uploading the 2011 County Business Patterns data file to the “input” folder of the S3 bucket]

Starting the Elastic MapReduce Cluster

At the Elastic MapReduce page available from the Amazon Web Services management console we see a list of previous clusters:

[Figure: the EMR cluster list]

From here we press the “Create cluster” button and get a form of configuration options. Note in the image below that I set the log folder to point to my S3 bucket:

[Figure: the “Create cluster” form]

At the bottom of the form we press the “Create cluster” button:

[Figure: the bottom of the “Create cluster” form]

Next we see the new cluster page showing that it is starting up:

[Figure: the new cluster starting up]

Once the cluster is up, three configuration steps run:

[Figure: the cluster’s configuration steps running]

Finally, the cluster startup is complete:

[Figure: cluster startup complete]

Running the Hadoop Job

We run the Hadoop job by pressing the “Add step” button to add a new step, configuring it to run the JAR file we uploaded previously. Note that the class containing the “main” function is the first argument:

[Figure: the “Add step” form]

The step immediately begins running once “Add” is pressed:

[Figure: the new step running]

Finally, the step completes:

[Figure: the step completed]

Output Files

The output files may now be viewed on S3:

[Figure: the list of output files on S3]

Looking at one of the results files in particular shows the per-county industrial diversity indices:

[Figure: a results file showing per-county industrial diversity indices]

Terminating the Cluster

We now terminate the cluster:

[Figure: terminating the cluster]

The process takes a bit, but finally completes:

[Figure: the cluster terminated]

Cost

This entire procedure cost $1.36 (USD).
