frugal anarchy

Of all the systems that we seek freedom within and from, none pervades our lives as much as the “econosphere” we inhabit. By “econosphere”, I mean the global network of economic activity whose nodes are individuals and whose edges are trade relationships between individuals.

Even if we had no government, we’d still likely be trading goods and services. Therefore the econosphere may be more significant than the existence of government to those seeking freedom. Anarchists traditionally focus on the elimination of government as the means of increasing freedom. However, I propose that limited reliance on the econosphere is a more comprehensive goal for anarchistic thinking.

There are two paths to individual economic freedom in a free-market economy: The first is to be wealthy enough to afford whatever transactions one wants to make whenever one wants to make them. This is unfortunately out of reach for most people. The second path is voluntary frugality; limiting the transactions one makes to well thought out targets, such that utility and satisfaction of purchases is maximized and very few dollars are spent on things outside those targets.

This strategy of voluntary frugality limits individual reliance on the econosphere by limiting the amount of money that an individual needs to acquire and spend, thereby enhancing their freedom to choose their path in life. I cannot think of a more practical expression of anarchism within the “real world” that we inhabit today.

Related Post

engineer moves into an RV

Posted in Uncategorized | Tagged , , , , , , , | Leave a comment

100th post to badass data science

This marks the 100th post to badass data science. I’ve written about everything from Lady Gaga to computational fluid dynamics, usually with a science or data related spin.

I thought I’d look at my posts analytically rather than simply reminisce. First, here is a tag cloud for the first 99 posts:


From this tag cloud, I can see that either Python or R is used in many posts, and that most posts cover statistical and data science topics. Engineering is also a frequent tag.

I then produced a graph view using Networkx, where the nodes are tags and the edges are formed by tags that occur in the same post. Displaying this graph as VRML:


It is a little hard to see, so here is a closer view:


In this image one can see that “statistics” is a primary hub. In rotated views of the graph (not shown), Python, R, and data science show similar prominence.

Finally, I computed the frequency of the top occurring tags:


From this I see that I wrote about Python more than R, which surprised me. I expected an even split. However this insight matches the fact that I favor using Python rather than R whenever possible, because Python is a full-featured programming language capable of easy string parsing and web deployment. It also looks like the number of engineering posts and the number of science posts is split evenly, which accurately reflects my technical interactions with the world.

The Future

I do not think my posts are “badass” enough for the blog’s title, so I’ll try to up the ante. Maybe I need something involving sharks and tornadoes. The post on claiming squatters’ rights is the closest I’ve come to my goal of “badass data science”.

I actually want to branch out and write more detailed analyses of less technical things, such as policy. Or take highly technical topics, such as synthetic biology, and write about them in a laypersons’ voice.

Most importantly, I’ll just keep writing. I’m bound to hit on something good.


Code used to create the VRML shown above is attached.

Posted in Uncategorized | Tagged , , | Leave a comment

engineer moves into an RV

I recently moved into a travel trailer to lessen the southern California cost of living (and because I like the idea of portable structures as an answer to housing scarcity). This living arrangement sparks my engineering creativity, which is the motivation for this post. Here I discuss RV living from a mechanical and software engineer’s viewpoint.

The following sections explore renting computing power since I no longer can carry a large desktop, modeling the trailer’s computational fluid dynamics (CFD) and mechanical dynamics, setting up an online registry of good boondocking sites, examining the 12 VDC system and figuring out how to add solar and wind power generation to enable off-grid living, improving the insulation and adding a thermostat to the air conditioner, and learning how to get on without my engineering textbooks (which do not fit in the RV).

Renting Computing Power

I no longer have room for a large desktop computer; instead I’m working with a laptop. This is a severe reduction in computing power, so whenever I’m working with a high computation, memory, or disk load I’ll rent computing power from Amazon’s Web Services. Amazon EC2 instances are inexpensive and easy to set up so this strategy is very practical.

Computational Fluid Dynamics

My truck manual limits the amount of surface area perpendicular to the direction of travel that can be towed, presumably due to drag forces. So I tried to use computational fluid dynamics (CFD) to find out if making the front of a trailer more aerodynamic buys one more surface area. I was unsuccessful at this experiment (other than generating cool pictures, below) since I couldn’t figure out how to get the OpenFOAM CFD package to work with compressible flow. First I ran incompressible CFD on a block-shaped trailer model:


Then I ran incompressible CFD on a trailer model where the front is arced:


The pressure at the front appears to be slightly less for the arced model, which says that it is more aerodynamic. However, because this modeled flow is incompressible, I could not really say much about whether towing performance would be improved by using a trailer with a curved shape at the front. Nonetheless, I bought a trailer with an angled front assuming it will reduce towing drag.

Online Boondocking Site Registry

I won’t be boondocking (parking somewhere for the night without utility hookup) anytime soon since I have a day job that requires 40 hours/week presence. I’m staying in an RV park where there are showers and a laundry facility. However, I’ve been thinking of the needs of boondockers, and plan to create an online registry of good boondocking sites, if one does not exist.

The web application for such a registry will have fields for site location and ratings that users can fill in. I’ll have it generate maps if possible. For the infrastructure of the web application, I envision writing it in Python using Django, with MySQL as the supporting database. However, it may be better to use a NoSQL database like a document store—I’ll have to investigate this more thoroughly. For hosting the application, I’ll likely use an Amazon EC2 instance.

Part of my motivation for creating such a site is that I can see myself boondocking in the future, if I find myself between jobs.

Vehicle Mechanical Dynamics

While towing the trailer, I noticed that whenever a semi-truck passed me, the front of my truck was pulled toward the passing semi, and I had to turn the steering wheel to keep my truck straight. This prompted me to explore the dynamics of towing:


In this very simple model, a wind gust from the semi-truck is modeled as force Fair, which rotates the trailer clockwise (about the center of mass which is between the wheels). The force at the hitch pin is also modeled such that it rotates the trailer clockwise when it is positive. For this “back of the envelop” calculation, I ignored the forces caused by the tires meeting the pavement to simplify things—assuming that they slip when lateral forces are applied.

On the tow vehicle, I again ignored the forces at the tires and joined the vehicle to the trailer with the hitch pin forces. I then constructed the above equations of motion and solved for the angular velocity of the trailer. This shows that when the trailer is forced to rotate clockwise by a wind gust from a passing semi, my truck is rotated counter-clockwise toward the passing semi, as observed.

12 VDC

The RV’s lights and fan run on 12 Volts DC, so they can run off the grid, but there are no 12 VDC receptacles inside the unit for running arbitrary 12 Volt appliances. However, I have 12 VDC tools that I would like to be able to run off the grid. I can get around this by removing the fuse box cover and attaching power clips to the DC input terminals for the converter:


The problem with this arrangement is that the negative terminal is dangerously close to a positive one, so that if for example my cat bumps a clip, it might cause a short circuit and potentially a fire. I could insulate the terminal with electrical tape but that is too much of a “hack” for my taste. If I start boondocking regularly I’ll wire in real 12 VDC receptacles, and be sure to add fuses to them to prevent high current loads.

I added a second marine battery when I purchased the RV, so the overall amp-hours I can power is double the original amount (the batteries are identical, both new, and wired in parallel). A power converter below the fuse box converts grid 120 VAC to 12 VDC to keep the batteries charged. I check the water level in the batteries weekly to ensure the water is not boiled away during charging.

Solar Power

My RV does not have solar panels installed, which I’ll add if I start boondocking regularly. I currently have a small six Watt panel (pictured below) which works as little more than a trickle charger. Since I have no charge controller I have to manually check the batteries while using it, and have to disconnect it at night to prevent battery discharge through it (unless I add a diode to the circuit, which has the downside of lowering the current supplied to the batteries during the day). This six Watt panel certainly does not produce enough charge during the day to recover a night’s worth of DC power use.


For true off-the-grid use I’ll buy two 130 Watt panels along with a charge controller. If I can find a charge controller that feeds power back to the grid when I’m plugged in—even better. (It will have to have an inverter built in to do this). I expect the whole package to cost about 1200 USD overall. I’m not ready to commit those funds right now while I’m living in an RV park and have easy grid access, but plan to research the equipment needed now so that I’m ready to buy in case my living arrangements change.

It may be best to design and build a charge controller from scratch, given that I may want to mix solar, grid, and wind power (below) in one system, and I’m not sure if existing controllers can accommodate all three very well. Designing such a controller would be a fun project, but would take a substantial amount of time. If I do this I’ll use an Arduino or Raspberry Pi platform for computing, and use optical relays to separate the computing circuits from the power delivery circuits.

Wind Power

I have never seen an RV with a wind generator installed, but see no reason why a small turbine such as the Primus Air 30 (pictured) cannot be used.


Depending on where I’m staying, I’ll perhaps want a marine-grade turbine. However, my RV is made of aluminum, which is not marine grade, so staying near the coast may be a bad idea no matter what turbine I buy.

I envision mounting it on the bumper and the upper rear of the trailer, as shown in the cheesy image below, with attachments that allow easy removal for when the RV is in motion.



Living in San Diego County requires little use of air conditioning or heating, but while traveling from Texas to San Diego things got hot in the trailer. The “R value” for the wall, floor, ceiling is R-7, which leaves much to be desired. The next time I take a trip across the desert I’ll cover the trailer’s windows with reflective bubble insulation while driving. My thinking is that these window covers can be held on with Velcro for easy removal.

Air Conditioner Thermostat

My trailer’s air conditioner has no thermostat to stop it from running once a desired temperature is reached. This causes problems in the middle of the night when things get cold but I don’t want to get up to turn the machine off. Furthermore, continuing to blow past the point of comfort wastes electricity. To deal with this matter I plan to splice in a household thermostat as soon as I measure the voltage at the AC’s on/off switch, assuming I can make the voltages match. If a household thermostat proves unsuitable, I’ll design a controller from scratch, again using an Arduino or Raspberry Pi system.

Relying on Wikipedia and the Web for Engineering Knowledge

To fit into the RV, I had to get rid of all my engineering and science textbooks. Now I’m relying on the internet whenever I need to look something up.

Related Post

selecting travel trailers by regression

Posted in engineering | Tagged , , | 2 Comments

graph database for gene annotation

Lately I’ve been experimenting with graph databases using Neo4j and the Cypher query language. To get a feel for these tools, I created the following gene annotation network. The Cypher commands I used are discussed in this post, followed by a demonstration of querying the database.

Creating the Graph Database

We are creating the following graph:


Here genes 5663 (PSEN1) and 675 (BRCA2) connect with the species Homo sapiens and also connect to the gene alias FAD. (Both genes have the same gene alias). RefSeq “NM” transcripts connect to their respective genes.

In Cypher, we first create two nodes representing genes PSEN1 and BRCA2, along with a node representing the human species:

CREATE (gid5663:Gene {symbol: "PSEN1", id: 5663, full_name: "presenilin 1"})
CREATE (gid675:Gene {symbol: "BRCA2", id: 675, full_name: "breast cancer 2, early onset"})
CREATE (human:Species {taxonomy_id: 9606, id: "Homo Sapiens"})

The Neo4j console replies that three nodes were created.


We then link each of the gene nodes we created to the species node. These steps require selecting the nodes to connect with a MATCH query and then feeding the selection results into a CREATE clause.

MATCH (a:Gene), (b:Species)
WHERE = 5663 AND b.taxonomy_id = 9606
CREATE (a)-[r:SPECIES]->(b)

MATCH (a:Gene), (b:Species)
WHERE = 675 AND b.taxonomy_id = 9606
CREATE (a)-[r:SPECIES]->(b)

The console shows each MATCH/CREATE query result:


We then create a gene alias node, which the gene nodes will connect to:

CREATE (FAD:GeneSymbolAlias {id: "FAD"})


Like before with the species node, we now connect the genes to the gene alias node:

MATCH (a:Gene), (b:GeneSymbolAlias)
CREATE (a)-[r:ALIAS]->(b)

MATCH (a:Gene), (b:GeneSymbolAlias)
CREATE (a)-[r:ALIAS]->(b)

The console again shows each MATCH/CREATE query result:


We next create three nodes to represent the three RefSeq transcripts associated with our two genes:

CREATE (NM_000059:RefSeqTranscript {id: "NM_000059", version: 3})
CREATE (NM_000021:RefSeqTranscript {id: "NM_000021", version: 3})
CREATE (NM_007318:RefSeqTranscript {id: "NM_007318", version: 2})


Finally, we connect these transcripts to their respective genes:

MATCH (a:Gene), (b:RefSeqTranscript)
WHERE = 675 AND"NM_000059"

MATCH (a:Gene), (b:RefSeqTranscript)
WHERE = 5663 AND"NM_000021"

MATCH (a:Gene), (b:RefSeqTranscript)
WHERE = 5663 AND"NM_007318"


We can view the whole network with:

MATCH (n)-[r]-()



We can start with a gene and trace its transcripts, returning the transcript nodes:

MATCH (:Gene {id: 5663})-[:TRANSCRIPT]->(t) return t


Similarly, we can start with a transcript and trace through its gene to the transcript’s species:

MATCH (:RefSeqTranscript {id: "NM_000021"})-[:TRANSCRIPT]-(gene)-[:SPECIES]-(species) return species


Posted in big data, bioinformatics, data science, science | Tagged , , , , , , , , , , , , , , | Leave a comment

setting up an Amazon RDS instance on a VPC private subnet

As a scientist, I tend not to think about database security much. However, security is an important concern for the database-driven web applications I write, so I decided to learn more about how to use Amazon EC2 and RDS instances securely. As part of this effort, I created a virtual private cloud (VPC) to hide my Amazon RDS database from the public internet. To make this easier for others who find themselves in the same position, I’m posting instructions here on how to set up an RDS instance in a VPC using the Amazon AWS console.



We are creating a VPC that is divided into two subnets, one public and one private. EC2 instances on the public subnet can access the internet (and serve web applications), while EC2 and RDS instances on the private subnet can only communicate with instances inside the VPC. We will store our database on the private subnet and access it through an EC2 instance on the public subnet.

Creating the VPC

Start with the AWS Console and select “VPC”:


Press the “Start VPC Wizard” button:


We want a VPC with both public and private subnets. Our database will reside on the private subnet while our web application that communicates with the database will go on the public subnet:


Enter a name for the VPC, as well as names for the public and private subnet. I recommend numbering your private subnet as shown in the image below, since we will need to create another private subnet later:


Click the “Create VPC” button, and you should receive a page saying the VPC was successfully created.


Click on “Your VPCs” on the left and find your new VPC. Record the VPC ID number.

Adding a Subnet

To run an RDS instance in the VPC, Amazon requires at least two subnets, each in a different availability zone. Additionally, we want both subnets to be private, so we have to create another subnet to meet the requirements. To do this we first click on the “Subnets” link on the left:


We see the public and private subnets we created with the VPC. Record the VPC numbers shown in the fourth column of each of these two subnets; we’ll need them later. Also record the Subnet IDs for these two subnets which are shown in the second column. Then click on the private subnet row and record the availability zone. Click on “Create Subnet” and add an additional subnet with the dialog box that appears:


This step requires specifying a different availability zone than that of the first private subnet, an issue that confused me at first. When you are finished creating the subnet, record the subnet IDs of the two private subnets.

Create a Public-Facing EC2 Instance

Create an EC2 instance in the public subnet of your new VPC. Detailed instructions on how to do this are omitted in this post for brevity. This is just like setting up any EC2 instance, but you have to specify which VPC and subnet to use in the launch wizard. Also, be sure to check the box indicating that you want a public IP address created with the instance. Record the private IP address of this instance as we’ll need it later.

Creating a DB Subnet Group

We now need to create a database subnet group. Go to the AWS Console page and press “RDS”.


Press the “Subnet Groups” link on the left and then press “Create DB Subnet Group”:


Give your subnet group a name and description, select your VPC (by ID), and add the two private subnets to the group using the “Add” button. Note that the two private subnets have different availability zones associated with them:


Click on “Yes, Create” and you should get the following output showing your new subnet group:


Creating an RDS Instance

Click on “Instances” and press the “Launch DB Instance” button:


Select the database system you want to use. In this case we are using MySQL:


Select an option for multi-availability zone operation. In this case we are selecting “No”.


This next screen asks again about use of multi-availability zone operation. Select “No” again. Fill in the DB Instance Class you want to use, the amount of allocated storage you want, a name for the DB Instance, a username, and a password. Note that the name for the DB Instance is not the same as the name for the database in MySQL, which will be specified later:


In the advanced settings, select the VPC you created earlier, as well as the DB Subnet Group you created. Add a database name (this will be the name of the database inside MySQL):


You will now see a confirmation page showing the active DB Instance:


Click on “Instances” to see the details of the RDS instance you created:


Expand your new DB Instance and click on the security group:


Add an inbound rule for MySQL traffic from your EC2 instance. Use the private address of the EC2 instance you created earlier.


That’s it! You’ll now be able to connect to your private RDS instance through your EC2 instance.



Posted in big data, data science, engineering | Tagged , , , , , , , , , , | Leave a comment

pyDome updates: tangential and spoke angles

In a previous post, I introduced pyDome, a Python program for calculating geodesic dome vertices, chords, and faces. I have since added two hub angle computations to the program, and report on that progress here. Face angle calculations still need to be implemented.


Example output in DXF format.

Angles Between Chords and the Hub Tangent Plane

The angle between a chord and the plane tangent to the sphere at the chord’s hub measures the amount of inward deflection a hub spoke for that chord must make. The following diagram illustrates this idea:


In this image, the view is directly facing the side of the tangent plane, so that it appears as a line. Two chords are shown here for illustrative purposes, but there are actually either five or six chords for a hub depending on the hub’s position in the geodesic sphere.

The program now reports these angles for each hub as part of the standard output:


As expected, these are small angles.

Angles of Chords Around the Hub

The other hub angles considered here are the angles between chords centered around a hub. Here we first project the chords onto the tangent plane, then select one of the chords as a reference, and then report the angle between the projected reference chord and each other projected chord. I call these angles “spoke” angles. The following image illustrates the idea:


Here the view is orthogonal to the sphere’s tangent plane defined at the hub. (This is the same tangent plane as that used above).

The program reports these angles in its standard output:



pyDome’s source code, which implements these new angle reporting features, is available from GitHub at HTML code for drawing the graphics shown above follows:

<svg width="500" height="500">
  <circle cx="250" cy="250" r="5" stroke="black" fill="black"/>
  <line x1="250" y1="250" x2="210" y2="150" style="stroke:rgb(0,0,0);stroke-width:3;"/>
  <line x1="250" y1="250" x2="210" y2="350" style="stroke:rgb(0,0,0);stroke-width:3;"/>
  <line x1="250" y1="160" x2="250" y2="340" style="stroke:blue;stroke-width:2;"/>
  <path d="M250 200 A10,10 0 0,0 230,200" stroke="green" style="fill:none;"/>
  <path d="M250 300 A10,10 0 0,1 230,300" stroke="green" style="fill:none;"/>
  <text x="260" y="255" fill="black">Hub</text>
  <text x="170" y="145" fill="black">Chord</text>
  <text x="240" y="155" fill="blue">Tangent Plane</text>
  <text x="255" y="200" fill="green">Angle to Tangent Plane</text>
<svg width="500" height="500">
  <circle cx="250" cy="250" r="5" stroke="black" fill="black"/>
  <line x1="250" y1="250" x2="375" y2="250" style="stroke:rgb(0,0,0);stroke-width:3;"/>
  <line x1="250" y1="250" x2="289" y2="369" style="stroke:rgb(0,0,0);stroke-width:3;"/>
  <line x1="250" y1="250" x2="149" y2="323" style="stroke:rgb(0,0,0);stroke-width:3;"/>
  <line x1="250" y1="250" x2="149" y2="177" style="stroke:rgb(0,0,0);stroke-width:3;"/>
  <line x1="250" y1="250" x2="289" y2="131" style="stroke:rgb(0,0,0);stroke-width:3;"/>
  <path d="M263,210 A32,32 0 0,0 216,274" fill="none" stroke="blue" stroke-width="5" />
  <text x="260" y="267" fill="black">Hub</text>
  <text x="110" y="250" fill="blue">Spoke Angle</text>
  <text x="250" y="120" fill="black">Reference Chord/Spoke</text>
  <text x="110" y="340" fill="black">Chord/Spoke</text>
Posted in engineering | Tagged , , , , , , , , | Leave a comment

building a forecasting application with AngularJS

Lately I’ve been working on an application that forecasts Amazon EC2 spot instance prices. (The forecasting element of this application will be described in a forthcoming blog post once I finalize the forecasting method). The tool needed a user interface, and I decided to write it with AngularJS as a learning exercise. This post describes the AngularJS user interface.

First, a demonstration of the application workflow. The user is offered drop-down menus of Amazon EC2 regions, instance types, and operating systems. The options for instance type and operating system change when the region is changed, and the options for operating system change when the instance type is changed. This page looks like:


When all three selections are made, an image giving the forecast appears:


The images are pre-computed daily with a cron job.

The challenge I faced with AngularJS was getting the menus to automatically update according to changes in parent menu selection. (E.g., changing operating system by region selection). This required some tricky binding to accomplish, which I will demonstrate below.

When the application is loaded, it retrieves from a JSON file information about which options belong in which drop-down menus:


The top level of the JSON file dictionary (I’m using Python terminology here) contains the regions as keys. The second layer contains the instance type as keys. The third layer gives the operating system as keys. The final layer, the array of letters, gives the availability zones. The use of select options as keys caused difficulty when I started with AngularJS, which I will illustrate in the code that follows.

The HTML code looks like:

<!DOCTYPE html>
<meta charset="UTF-8">
<title>EC2 Spot Instance Forecaster</title>
<script src="angular.min.js" type="text/javascript"></script>
<script src="ec2-forecast.js" type="text/javascript"></script>
<body ng-app="ForecastApp">
<div ng-controller="ForecastController">
<h1>EC2 Spot Instance Price Forecaster</h1>
<p>Select Amazon EC2 Region</p>
<select ng-model="myLocationKey" ng-change="locationChange()" ng-options="location as location for (location, value) in location_dict"></select>
<p>Select Amazon EC2 Instance Type</p>
<select ng-model="myInstanceKey" ng-change="instanceChange()" ng-options="instance as instance for (instance, value) in location_dict[myLocationKey]"></select>
<p>Select Operating System</p>
<select ng-model="myOSKey" ng-change="osChange()" ng-options="os as os for (os, value) in location_dict[myLocationKey][myInstanceKey]"></select>
<li ng-repeat="plotfile in filenames">
<img ng-src="{{plotfile}}"/>

I assigned the “ng-app” parameter to the body tag, and set the outer div tag to use the “ng-contoller” “ForecastController”. The entire application is contained within this div tag.

For the first select tag, I set the options to the top-level keys of the JSON input. Originally I set “ng-options” to

location for (location, value) in location_dict

where “location_dict” is the entire JSON input object and is defined in the ForecastController scope. However, doing so delivered the entire contents of the dictionary keyed by the selected location to the “ng-model” variable “myLocationKey”. I only wanted the selected key value, so had to use

location as location for (location, value) in location_dict

instead. This delivered the selected key to the myLocationKey variable. I found this aspect of AngularJS very confusing at first.

The next two select tags extracted options using the values of the key set at the previous level. AngularJS automatically took care of binding the correct option list to the menus.

Each select tag has an “ng-change” function. The first two functions ensure that subsequent select options are reset to an unselected state if a change is made. The final “ng-change” function instructs AngularJS to draw the image. AngularJS automatically binds the correct filename of the image to show to the img tag.

JavaScript code implementing these functions follows:

 * declare the application
var app = angular.module("ForecastApp", []);

 * declare the main controller for the application
app.controller("ForecastController", function ($scope, $http, $templateCache) {
	 * retrieve the data structure indicating instance types available
	$http({method: 'GET', url: 'forecast_images/dictionary.json', cache: templateCache}).
		success(function(data, status) {
			$scope.location_dict = data;
	 * function to handle changes to location types
	$scope.locationChange = function() {
	    $scope.myInstanceKey = {};
	    $scope.myOSKey = {};

	  * function to handle changes to instance types
	 $scope.instanceChange = function() {
	     $scope.myOSKey = {};

      * function to handle changes to OS type
     $scope.osChange = function() {
	 $scope.alist = $scope.location_dict[$scope.myLocationKey][$scope.myInstanceKey][$scope.myOSKey]
	 $scope.filenames = []
	 $scope.alist.forEach(function(alistdata) {
	     var filename = 'forecast_images/' + $scope.myInstanceKey + '---' + $scope.myOSKey + '---' + $scope.myLocationKey + alistdata + '.ts.png';
	     filename = filename.replace(' ', '_').replace(' ', '_').replace(' ', '_').replace(' ', '_');
	     filename = filename.replace('Linux/UNIX', 'Linux.UNIX');
Posted in engineering | Tagged , , , , , , , , , , , | Leave a comment

maximized entropy of a finite distribution

I received the following tweet yesterday from @ProbFact and decided to check it out in more detail:


Two-Dimensional Case

I generated the following test to investigate the claim: Create four category discrete distributions where two of the categories have 0.25 probability each, and the third category probability varies between 0.1 and 0.4. The fourth category’s probability equals 0.5 minus the third probability, to form a total of one across all four probabilities. I then computed the Shannon entropy of the distributions formed from these probabilities using the equation:


I stored the value of H(X) for each value of the third probability, and then plotted H(X) as a function of the third probability. The entropy in the plot maximized when all four probabilities equal 0.25.


Three-Dimension Case

I then created another set of four category distributions where the first and third probabilities vary between 0.1 and 0.4 and the second and fourth probabilities equal 0.5 minus the first and third probabilities, respectively. Again I computed the Shannon entropy of the distributions formed by these probabilities. I then plotted the computed entropies on the following contour plot, which shows that entropy is maximized when all four probabilities equal 0.25:


Source Code

import numpy as np
import matplotlib.pyplot as plt

# two dimensions
px1 = 0.25
px2 = 0.25
px3 = np.arange(0.1, 0.4, 0.01)
px4 = 1 - px1 - px2 - px3

H = -1 * (px1*np.log2(px1) + px2*np.log2(px2) + px3*np.log2(px3) + px4*np.log2(px4))

plt.plot(px3, H)
plt.ylabel('Shannon Entropy (bits)')
plt.title('Entropy for Distribution of Four Categories')

# three dimensions
px1 = np.arange(0.1, 0.4, 0.01)
px2 = 0.5 - px1
px3 = np.arange(0.1, 0.4, 0.01)
px4 = 0.5 - px3

entropy_list = []
for i in range(len(px1)):
    entropy_sub_list = []
    for j in range(len(px3)):
        H = -1 * (px1[i]*np.log2(px1[i]) + px2[i]*np.log2(px2[i]) + px3[j]*np.log2(px3[j]) + px4[j]*np.log2(px4[j]))

X, Y = np.meshgrid(px1, px3)
Z = np.array(np.matrix(entropy_list))
V = np.arange(1.7, 2, 0.01)

plt.contour(X, Y, Z.transpose(), V)
plt.title('Entropy for Distribution of Four Categories')
Posted in statistics | Tagged , , , , , | Leave a comment

selecting travel trailers by regression

Data Scientist has been thinking recently of moving into a used travel trailer. However, the weight of the trailer to be purchased is limited by that which our hero’s truck can pull. But most online used travel trailer listings only specify length of the vehicle, not its weight. So Data Scientist needed a quick way to estimate the weight based on length. Enter regression analysis:


Our hero found 21 new trailer listings [1,2] and recorded their dry (tanks empty) weight and length. Data Scientist then performed regression analysis on these known weights and lengths to obtain an equation for estimating weight based on advertised length. Using this equation, our hero minimizes trips to go inspect potential purchases by ruling out cases most likely to exceed the weight limit.



1. Accessed 23 May 2014
2. Accessed 23 May 2014

Related Post

engineer moves into an RV

Posted in statistics | Tagged , , , , | 1 Comment

examining mRNA complexity by annotation region using MapReduce

I became interested in how annotated mRNA regions (e.g., 5′ UTR, coding, and 3′ UTR) vary in information content, speculating that coding regions (CDS) of transcripts will be generally more complex than other regions due to their role in specifying protein recipes. Measuring sequence complexity using Shannon entropy validated this hypothesis, at least with regard to the calculation method described herein.



I iterated through all human “NM” transcripts in RefSeq 64 in 21-mer segments, calculating the Shannon entropy (discussed below) for each segment. For each of these segments, I calculated the nucleotide distribution required by Shannon’s equation solely from that of the bases contained in the segment. Segments containing an “N” as one of the bases were ignored (there were only 21 of these cases). I also recorded which mRNA region (CDS, 3′ UTR, and 5′ UTR) the segment originated from, and discarded segments that crossed a boundary between regions.

Shannon Entropy

Shannon’s entropy equation computes the minimum number of bits necessary to encode a signal (e.g., a sequence of nucleotides) without information loss. Information-rich sequences require more bits to encode than information-poor sequences. The equation reads:


where X is a random variable with values A, C, G, or U/T. We calculate the distribution p of the nucleotides separately for each 21-mer segment, rather than use the overall distribution of nucleotides in the source transcript or transcriptome. For example, segment UACGUAGAUUUGGAAUUCCAG has a nucleotide distribution of p = {A: 0.29, C: 0.14, U: 0.33, G: 0.24}, resulting in a Shannon entropy of 1.94 bits.

MapReduce with Hadoop

All 21-mer segments from RefSeq’s human NM transcripts add up to a lot of data. I therefore tackled the calculation using Hadoop via Amazon’s Elastic MapReduce (EMR). The EMR cluster I employed used three “m2.4xlarge” instances, since the final reduce step was very memory intensive. I started with an input sequence containing RefSeq accession number, coding region start and stop, and transcript sequence on each line:


The first mapping operation iterated through each line in the input file and mapped each 21-mer with the region name appended to a value of one:


The first reduce operation counted the instances of identical 21-mers/region names, thereby producing a list of unique 21-mer/region combinations. This cancels out bias that might be produced by the fact that some genes have more splice variants than others.


The second map operation mapped each 21-mer’s region to its Shannon entropy:


The second reduce operation computed the Shannon entropy boxplot statistics (median, IRQ, etc.) per region:


In this image the outliers are visible as pipe-delimited numbers.

T-Test for Equality

I tested the CDS mean against the 3′ UTR mean for equality using the t-test, realizing that this is not the best test for this data given the skew visible in the boxplot above. However, we have large sample sizes, which helps the t-test work effectively. I was not able to assemble a non-parametric test due to the vastness of the data (and, more accurately, due to the fact that I’ve run out of time alloted to this project). The t-test produced a very low p-value, leading to rejection of the null hypothesis that the two means are equal.


Source Code



Posted in big data, bioinformatics, data science, science | Tagged , , , , , , , , , , , , , , | Leave a comment