hacking the stock market (part 1)

Caveat: I am not a technical investor, just a hobbyist, so take this analysis with a grain of salt. I am also just beginning my Master's work in statistics.

I wanted to examine the correlation between changes in the daily closing price of the Dow Jones Industrial Average (DJIA) and lags of those changes, to see if there is a pattern I could exploit. First, I downloaded the DJIA data from Yahoo using pandas:

#
# load useful libraries
#
from pandas.io.data import DataReader  # note: later pandas versions moved this to the separate pandas-datareader package
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
import math
from scipy.stats.stats import pearsonr
from scipy.stats.stats import spearmanr

#
# load DJIA data from Yahoo server
#
djia = DataReader("DJIA",  "yahoo", datetime(2000,1,1), datetime.today())

I then generated the autocorrelation plots for the one-day differenced closing prices and the signs of the one-day differenced closing prices:

#
# investigate the diff between DJIA closing prices
#
diff_1st_order = djia["Adj Close"].diff()
diff_1st_order_as_list = []
for d in diff_1st_order:
    if not np.isnan(d):
        diff_1st_order_as_list.append(d)
plt.subplot(2, 1, 1)
plt.acorr(diff_1st_order_as_list, maxlags=10)
plt.title("Autocorrelation of Diff of DJIA Adjusted Close")
plt.xlabel("Lag")
plt.ylabel("Correlation")

#
# sign of diff, not diff itself
#
diff_1st_order_sign = []
for d in diff_1st_order_as_list:
    if not np.isnan(d / abs(d)):
        diff_1st_order_sign.append(d / abs(d))
    else:
        diff_1st_order_sign.append(0)
plt.subplot(2, 1, 2)
plt.acorr(diff_1st_order_sign, maxlags=10)
plt.title("Autocorrelation of Sign of Diff of DJIA Adjusted Close")
plt.xlabel("Lag")
plt.ylabel("Correlation")
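
Incidentally, the NaN-filtering and sign loops above can be written much more compactly with NumPy. A sketch on a synthetic price series (synthetic because the real `djia` frame requires a network fetch):

```python
import numpy as np

# synthetic stand-in for the adjusted close series
prices = 10000.0 + np.cumsum(np.random.default_rng(0).normal(0.0, 50.0, 500))

diffs = np.diff(prices)   # one-day differences; np.diff sidesteps the leading-NaN issue
signs = np.sign(diffs)    # -1, 0, or +1 for each difference, with no division involved
```

Both arrays can be passed straight to plt.acorr.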

[Figure: autocorrelation plots of the diff of the DJIA adjusted close and of its sign]

There is a very small negative correlation between the one-day closing price difference and the one-day lag of the one-day closing price difference. Similarly, there is an even smaller positive correlation between the one-day closing price difference and the three-day lag of the one-day closing price difference.

So I set out to find the proportion of days on which the sign of the closing price difference changes from one day to the next:

#
# frequencies of 1-day lag changes in direction of closing price
#
count_opposite = 0
count_same = 0
i_list = []
j_list = []
for i in range(0, len(diff_1st_order_as_list) - 1):
    price_diff_i = diff_1st_order_as_list[i]
    price_diff_j = diff_1st_order_as_list[i+1]  # one trading day ahead

    i_list.append(price_diff_i)
    j_list.append(price_diff_j)

    sign_of_price_diff_i = 0
    if not np.isnan(price_diff_i / abs(price_diff_i)):
        sign_of_price_diff_i = int(price_diff_i / abs(price_diff_i))

    sign_of_price_diff_j = 0
    if not np.isnan(price_diff_j / abs(price_diff_j)):
        sign_of_price_diff_j = int(price_diff_j / abs(price_diff_j))

    if sign_of_price_diff_i == sign_of_price_diff_j:
        count_same += 1
    else:
        count_opposite += 1

print
print 'Correlation coefficients for the diff lists:'
print '\t', 'Pearson R: ', pearsonr(i_list, j_list)[0]
print '\t', 'Spearman R: ', spearmanr(i_list, j_list)[0]
print

print 'Amount of time closing value direction remains the same: ', round(float(count_same) / (float(count_same) + float(count_opposite)), 3)
amount_time_changes = float(count_opposite) / (float(count_same) + float(count_opposite))
print 'Amount of time closing value direction changes: ', round(amount_time_changes, 3)
L = amount_time_changes - 1.959964*((math.sqrt(amount_time_changes*(1.0 - amount_time_changes)))/math.sqrt(float(len(diff_1st_order_as_list))))
U = amount_time_changes + 1.959964*((math.sqrt(amount_time_changes*(1.0 - amount_time_changes)))/math.sqrt(float(len(diff_1st_order_as_list))))
print 'Agresti-Coull C.I.: ', round(L, 3), '< p <', round(U, 3)
print
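
One caution: the interval computed above is really the simple Wald interval. The Agresti-Coull interval it is named after first adds z²/2 pseudo-successes and z² pseudo-trials before applying the same formula. A sketch of the adjusted version (with illustrative counts, not the actual DJIA results):

```python
import math

def agresti_coull_ci(successes, n, z=1.959964):
    # add z^2/2 pseudo-successes and z^2 pseudo-trials, then apply the Wald formula
    n_tilde = n + z ** 2
    p_tilde = (successes + z ** 2 / 2.0) / n_tilde
    half_width = z * math.sqrt(p_tilde * (1.0 - p_tilde) / n_tilde)
    return p_tilde - half_width, p_tilde + half_width

# hypothetical counts standing in for count_opposite and the total number of days
L, U = agresti_coull_ci(1900, 3600)
```

For sample sizes in the thousands the two intervals are nearly identical, so the conclusions below are unaffected.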

[Console output: correlation coefficients and confidence interval for the 1-day direction change]

This analysis tells me that in the long run (at least over the period for which I pulled DJIA data), betting on one-day changes of direction of the DJIA closing price would slowly pay off. (However, this analysis ignores the magnitudes of the changes; they may be too small to cover the cost of a trade. A future analysis will investigate this.) The test for correlation between the two time series (one-day difference and lag of one-day difference) shows that the Agresti-Coull confidence interval is appropriate (we have near independence), although this betting scheme relies on the thinnest correlation detected by the autocorrelation plot.

There was another possible pattern in the autocorrelation plot above: a positive correlation at the three-day lag alongside the negative correlation at the one-day lag. I decided to check whether combining the two would yield a prediction success rate greater than the null of 25%, for the case where the current day's change is in the opposite direction from the one-day lag and in the same direction as the three-day lag:

#
# frequencies of combination of 1-day lag and 3-day lag changes in direction
# of closing price
#
count_matches = 0
count_non_matches = 0
i_list = []
j_list = []
k_list = []
for i in range(0, len(diff_1st_order_as_list) - 3):
    price_diff_i = diff_1st_order_as_list[i]
    price_diff_j = diff_1st_order_as_list[i+2]  # two trading days ahead of i: the 1-day lag of the current day
    price_diff_k = diff_1st_order_as_list[i+3]  # three trading days ahead of i: the current day

    i_list.append(price_diff_i)
    j_list.append(price_diff_j)
    k_list.append(price_diff_k)

    # price_diff_i represents the 3-day lag
    sign_of_price_diff_i = 0
    if not np.isnan(price_diff_i / abs(price_diff_i)):
        sign_of_price_diff_i = int(price_diff_i / abs(price_diff_i))

    # price_diff_j represents the 1-day lag
    sign_of_price_diff_j = 0
    if not np.isnan(price_diff_j / abs(price_diff_j)):
        sign_of_price_diff_j = int(price_diff_j / abs(price_diff_j))

    # price_diff_k represents the current day
    sign_of_price_diff_k = 0
    if not np.isnan(price_diff_k / abs(price_diff_k)):
        sign_of_price_diff_k = int(price_diff_k / abs(price_diff_k))

    if sign_of_price_diff_k != sign_of_price_diff_j and sign_of_price_diff_k == sign_of_price_diff_i:
        count_matches += 1
    else:
        count_non_matches += 1

print 'Correlation coefficients for the diff lists:'
print '\t', 'Pearson R (i, j): ', pearsonr(i_list, j_list)[0]
print '\t', 'Spearman R (i, j): ', spearmanr(i_list, j_list)[0]
print '\t', 'Pearson R (i, k): ', pearsonr(i_list, k_list)[0]
print '\t', 'Spearman R (i, k): ', spearmanr(i_list, k_list)[0]
print '\t', 'Pearson R (j, k): ', pearsonr(j_list, k_list)[0]
print '\t', 'Spearman R (j, k): ', spearmanr(j_list, k_list)[0]
print

amount_time_changes = float(count_matches) / (float(count_matches) + float(count_non_matches))
print 'Amount of time 1-day change is opposite direction and 3-day change is same direction: ', round(amount_time_changes, 3)
L = amount_time_changes - 1.959964*((math.sqrt(amount_time_changes*(1.0 - amount_time_changes)))/math.sqrt(float(len(diff_1st_order_as_list))))
U = amount_time_changes + 1.959964*((math.sqrt(amount_time_changes*(1.0 - amount_time_changes)))/math.sqrt(float(len(diff_1st_order_as_list))))
print 'Agresti-Coull C.I.: ', round(L, 3), '< p <', round(U, 3)
print
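
The 25% null rate used above follows from assuming independent, equally likely up/down days: the current sign must disagree with one independent sign (probability 1/2) and agree with another (probability 1/2), giving 1/4. A quick simulation check of that arithmetic:

```python
import numpy as np

# three independent sequences of fair +/-1 daily signs, standing in for the
# current day, the 1-day lag, and the 3-day lag under the null hypothesis
rng = np.random.default_rng(0)
i, j, k = rng.choice([-1, 1], size=(3, 200000))

p_hat = np.mean((k != j) & (k == i))   # should be close to 0.25 under the null
```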

[Console output: correlation coefficients and confidence interval for the combined 1-day/3-day pattern]

To complete the code, we need to show the plot:

#
# show the plot
#
plt.show()

I’m not certain the narrow margin of opportunity detected by this analysis is sufficient for the development of a trading strategy. More investigation is needed.

Posted in econometrics, statistics

Kaplan-Meier estimator in Python

The following Python class computes and draws Kaplan-Meier product limit estimators for given data. An example of how to use the class follows the code.

Code

# load useful libraries
import matplotlib.pyplot as plt

#
# class for building Kaplan-Meier product limit estimator
#
class KM(object):

    #
    # constructor
    #
    def __init__(self, measured_values, censored_or_not):
        self.measured_values = measured_values
        self.censored_or_not = censored_or_not

        # log measured values where event occurred
        self.event_occurred_times = {}
        self.censored_occurred_times = {}
        for i, mv in enumerate(self.measured_values):
            if self.censored_or_not[i] == 1:
                if not self.event_occurred_times.has_key(mv):
                    self.event_occurred_times[mv] = 0
                self.event_occurred_times[mv] += 1
            else:
                if not self.censored_occurred_times.has_key(mv):
                    self.censored_occurred_times[mv] = 0
                self.censored_occurred_times[mv] += 1

        # construct list of j values
        j_list = [0]
        j_list.extend(sorted(self.event_occurred_times.keys()))
        j_list.append(max(max(self.event_occurred_times.keys()), max(self.censored_occurred_times.keys())) + 1)
        self.j_list = j_list

        # count censored values in interval [j, j+1), index by j
        self.number_of_units_censored_in_interval_j_to_j_plus_one = {}
        for i in range(0, len(j_list)-1):
            j = j_list[i]
            j_plus_one = j_list[i+1]
            m_count = 0
            for m in self.censored_occurred_times.keys():
                if j <= m and m < j_plus_one:
                    m_count = m_count + self.censored_occurred_times[m]
            self.number_of_units_censored_in_interval_j_to_j_plus_one[j] = m_count

        # calculate number of units at risk just prior to time t_j
        self.number_of_units_at_risk_just_prior_to_time_j = {}
        for i in range(0, len(j_list)-1):
            j = j_list[i]
            n_count = 0.
            for k in range(i, len(j_list)-1):
                jk = j_list[k]
                if self.event_occurred_times.has_key(jk):
                    n_count = n_count + self.event_occurred_times[jk]
                if self.number_of_units_censored_in_interval_j_to_j_plus_one.has_key(jk):
                    n_count = n_count + self.number_of_units_censored_in_interval_j_to_j_plus_one[jk]
            self.number_of_units_at_risk_just_prior_to_time_j[j] = n_count

        # add time zero count to self.event_occurred_times
        self.event_occurred_times[0] = 0

        # build the estimator for each time j
        self.S = {}
        for i in range(0, len(j_list)-1):
            j = j_list[i]
            prod = 1.
            for k in range(0, i+1):
                jk = j_list[k]
                prod = prod * ((self.number_of_units_at_risk_just_prior_to_time_j[jk] - self.event_occurred_times[jk]) / self.number_of_units_at_risk_just_prior_to_time_j[jk])
            self.S[j] = prod

    #
    # display the estimator in the console
    #
    def display(self):

        print
        print '\ttime\tn.risk\tn.event\tsurvival'
        for i in range(1, len(self.j_list) - 1):
            j = self.j_list[i]
            print '\t' + str(j) + '\t' + str(int(self.number_of_units_at_risk_just_prior_to_time_j[j])) + '\t' + str(self.event_occurred_times[j]) + '\t' + str(round(self.S[j], 3))
        print

    #
    # plot
    #
    def plot(self, color, xlabel, title):
        j_list_sans_end = self.j_list[0:-1]

        # plot the curve
        plt.figure()
        for i in range(0, len(j_list_sans_end) - 1):
            j = j_list_sans_end[i]
            j_plus_one = j_list_sans_end[i+1]
            plt.plot([j, j_plus_one], [self.S[j], self.S[j]], color=color)
            plt.plot([j_plus_one, j_plus_one], [self.S[j], self.S[j_plus_one]], color=color)

        # set the axis limits
        plt.xlim([0, self.j_list[-1]])
        plt.ylim([-0.05, 1.05])

        # add the censored cases within the curve
        for i in sorted(self.censored_occurred_times.keys()):
            last_S = 1.
            for j in sorted(self.S.keys()):
                if j >= i:
                    plt.scatter(i, last_S, color=color)
                    break
                last_S = self.S[j]

        # add the censored cases beyond the curve
        for i in sorted(self.censored_occurred_times.keys()):
            max_S_key = max(self.S.keys())
            if i > max_S_key:
                plt.scatter(i, self.S[max_S_key], color=color)  # use the passed-in color, not a hard-coded one

        # show the plot
        plt.ylabel("Survival")
        plt.xlabel(xlabel)
        plt.title(title)
        plt.show()

Usage Example

import matplotlib.pyplot as plt
times = [3, 4, 4, 5, 7, 6, 8, 8, 12, 14, 14, 19, 20, 20]
events = [1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0]  # 0 for censored cases, 1 for event occurred
my_km = KM(times, events)
my_km.display()
my_km.plot("blue", "Time", "Kaplan-Meier Product Limit Estimator Example")
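
As a sanity check on the class's arithmetic, the same product-limit estimate can be computed independently in a few lines using the textbook formula (a sketch on the example data above):

```python
from collections import Counter

def km_survival(times, events):
    # standard product-limit formula: S(t) = prod over event times t_j <= t of (1 - d_j / n_j)
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    S, out = 1.0, {}
    for t in sorted(deaths):
        n_at_risk = sum(1 for u in times if u >= t)   # still under observation just before t
        S *= (n_at_risk - deaths[t]) / float(n_at_risk)
        out[t] = S
    return out

times = [3, 4, 4, 5, 7, 6, 8, 8, 12, 14, 14, 19, 20, 20]
events = [1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0]
survival = km_survival(times, events)   # e.g. survival[3] == 13/14
```

The values should match the `survival` column printed by `my_km.display()`.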

[Console output and plot of the Kaplan-Meier estimator example]

Posted in data science, engineering, science, statistics

setting up a Nexus OSS server on an Amazon EC2 instance

It took me a while to figure out how to set up a Nexus OSS server on an Amazon EC2 instance, so I’m writing down the instructions here in case they prove useful to anyone in the future.

In this case we are also changing the default Nexus OSS server port of 8081 to port 80 due to a firewall issue.

Start up an Amazon Linux EC2 instance with incoming port 22 set to “My IP” and incoming port 80 set to “Anywhere”. For the sake of these instructions, we assume the IP address of the new EC2 instance is xxx.xxx.xxx.xxx.

Log into the server using a terminal:

ssh -i your_key.pem ec2-user@xxx.xxx.xxx.xxx

Execute the following commands:

sudo yum update
sudo yum install emacs
wget http://download.sonatype.com/nexus/oss/nexus-latest-bundle.tar.gz
sudo cp nexus-latest-bundle.tar.gz /usr/local
cd /usr/local
sudo tar -xvzf nexus-latest-bundle.tar.gz
sudo rm nexus-latest-bundle.tar.gz 
sudo ln -s nexus-2.9.2-01 nexus   # adjust the version number to match the bundle you downloaded
sudo chown -R ec2-user:ec2-user nexus
sudo chown -R ec2-user:ec2-user nexus-2.9.2-01
sudo chown -R ec2-user:ec2-user sonatype-work

Edit the Nexus server executable, changing the value of “RUN_AS_USER” to “root”:

emacs nexus/bin/nexus

Set a root password, which we’ll need later:

sudo su - root
passwd
exit

Enter the “nexus” directory and edit the conf/nexus.properties file to set the “application-port” to “80”.

cd nexus
emacs conf/nexus.properties

Start the server. Enter the root password when prompted:

./bin/nexus start

Wait about a minute or two for the server to fire up, and then point a web browser to http://xxx.xxx.xxx.xxx/nexus to use the server.

[Screenshot: the Nexus web application]

Posted in engineering

data natives

We hear a lot of marketing yammer about “digital natives”, that is, folks fluent in social media and in particular marketing using social media. Writers who use this term often juxtapose such digital natives against “analog natives”, i.e., individuals who matured or were educated before online social media became such a significant part of our lives. These writers often imply that such analog natives are unable to understand the world today. This is of course ridiculous; anyone with an open mind and tenacity can develop marketing skill with social media.

I offer a different grouping of individuals that might be considered an analog (pun intended) of digital nativity: “data natives”. By this I mean individuals fluent in using heterogeneous numerical and textual data sources, along with mathematical techniques, to reach conclusions about many facets of their lives and work.

Consider the following example: When I was shopping for a used RV trailer recently, I needed to filter out trailers from the pool of candidates that were too heavy for my truck to pull. However, used RV listings only specified length of the RV trailers, rather than the weight. Therefore I used regression analysis to predict weight from length based on about twenty known length-weight data points. This simple example illustrates a data native’s approach to solving problems.
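
For the curious, that trailer regression takes only a few lines. A sketch with invented length/weight pairs (the original twenty data points are not reproduced here):

```python
import numpy as np

# invented length (ft) / dry weight (lb) pairs standing in for the scraped listings
lengths = np.array([16, 18, 19, 21, 23, 25, 26, 28, 30, 32], dtype=float)
weights = np.array([2900, 3300, 3600, 4100, 4700, 5200, 5500, 6100, 6700, 7300], dtype=float)

slope, intercept = np.polyfit(lengths, weights, 1)   # ordinary least squares, degree 1
predicted_weight = slope * 24.0 + intercept          # estimated weight of a 24 ft trailer
```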

In my usage of the phrase, "data native" is a more comprehensive designation than "data scientist", though there is certainly crossover between the two. By the word "native" I imply an intimate comfort with data and data-driven decision making: an immersion in and fluency with data flows akin to one's mother tongue. Data science is a job; being a data native is a way of being.

For now, data natives are as rare as data scientists. But the new world of Big Data is producing both at a rapid clip. This will likely enrich the world.

Posted in big data, data science, marketing, statistics

GNU Octave: a free, open source MATLAB-like language for numerical computing

I tend to use Python with the NumPy, SciPy, and Matplotlib stack whenever I have to do scientific computing. For statistical computing I use R whenever this Python stack does not provide the necessary features. However, I want to draw readers' attention to another tool for free, open-source numerical computing: GNU Octave (hereafter called "Octave"), an interactive language closely related to MATLAB. In fact, most MATLAB code runs in Octave with no modification, and for code that does need a change, the necessary modification is usually slight. MATLAB is common in engineering and scientific environments, but it costs money; for budget-constrained startups or non-profits, Octave is a viable alternative. This post introduces Octave and shows its operation in both a statistical and a control system design role.

Octave is covered by the GNU General Public License, which enables programmers to modify it as needed, provided the modifications are made publicly available. This license also makes Octave effectively free in the financial sense. The program runs in both Linux and Windows, although the Windows setup is more complicated since you have to use Cygwin, MinGW, or Visual Studio.

Octave has a full range of plotting abilities that MATLAB users will be familiar with. The program automatically uses a third-party program, Gnuplot, to create the plots. For example, a 3D plot of a geodesic sphere rendered by Octave is shown below [1]:

[Figure: 3D plot of a geodesic sphere rendered by Octave via Gnuplot]

As another example, a Nyquist plot of a dynamic system is:

[Figure: Nyquist plot of a dynamic system]

Like MATLAB, Octave has many toolkits, which are published as part of Octave-Forge [2], a central repository for Octave packages. Examples of such toolkits are packages for control systems analysis/design and fuzzy logic.

Under the hood, Octave uses the BLAS and LAPACK libraries to facilitate linear algebra computations, in much the same way R and Python’s NumPy package do.

Recent versions of Octave come with a GUI. Below, the GUI is used to show the iconic "sombrero" plot, with code given by [3]:

[Screenshot: the Octave GUI displaying the sombrero plot]

Statistical Calculations with Octave

Since I’m a statistician in training, I thought I’d demonstrate a few of Octave’s basic statistical tools. First, a box plot of samples from two normal distributions:

a = normrnd(20, 3, 100, 1);
b = normrnd(22, 3, 100, 1);
boxplot([a, b]);
title("Comparison of Samples From Two Normal Distributions")
xlabel("Distribution")
ylabel("Sample")

[Figure: box plots of the two normal samples]

Conducting a t-test on the two sets of samples from normal distributions:

[Console output: t-test results]

Conducting linear regression on simulated data:

x = 0:0.01:10;
noise = normrnd(0, 2, 1, length(x));
hist(noise);
title("Noise");
y = 2*x + noise + 5;
figure;
hold on;
scatter(x, y);
F = polyfit(x, y, 1);
a = F(1);
b = F(2);
y_pred = a*x + b;
plot(x, y_pred, 'r');
title("Scatter Plot and Regression Line");
hold off;

[Figures: histogram of the simulated noise and scatter plot with fitted regression line]

Control Systems Calculations with Octave

I also have experience designing control systems, which Octave is well suited for. Here is an example of a PID controller’s step response implemented in Octave’s control system toolkit:

pkg load control
Kp = 4
Kd = 1
Ki = 1
P=tf([1], [1 Kp/Kd Ki/Kd])
step(P, 20)

[Figure: step response of the PID-controlled system]

Creating the Nyquist plot shown above for this system:

nyquist(P)

[Figure: Nyquist plot of the system]

Conclusion

Octave is a viable, free alternative to MATLAB for many scientific computing applications.

References

1. http://octave-dome.sourceforge.net/ (I have stopped working on this. Use pyDome instead).
2. http://octave.sourceforge.net/
3. http://en.wikipedia.org/wiki/GNU_Octave

Posted in engineering, science, statistics

synthetic biology: an emerging engineering discipline

In the last decade a new engineering discipline called "synthetic biology" has emerged. It differs from the science of biology in that it applies engineering strategies to the creation of cells that perform a desired task, such as the production of drugs or biofuels. It also differs from previous genetic engineering approaches by stressing the assembly of systems composed of modular, repeatable genetic components selected from a pool of well-described candidate components. This post introduces the subject at a high level.

Modularity in Engineering Design

All mature branches of engineering stress modularity in design of systems and products, such that the designed systems are composed of simpler systems having known input and output behaviors. Examples of this design ethic are the electrical circuits (Butterworth filters from the LTspice example circuits) shown below:

[Schematic: Butterworth filter circuits from the LTspice example set]

These circuits are made of simpler components: resistors, capacitors, and inductors, each with known physical properties. Knowing the physical properties of each of these parts enables simulation of the whole combined circuits, allowing prediction of circuit outcome. For example, the LTspice predicted voltage responses of the top two of these filters are:

[Figure: LTspice-simulated voltage responses of the top two filters]

Computer-Aided Design (CAD)

In the discussion above the example was expressed in a CAD program called LTspice, which facilitates the specification, communication, and simulation of electrical circuits. Other branches of engineering use CAD for these purposes as well, for example Pro/ENGINEER by mechanical engineers and AutoCAD by civil engineers. These CAD packages also encourage modularity, as demonstrated by the multi-component system shown in Pro/ENGINEER below [1]:

[Screenshot: multi-component assembly in Pro/ENGINEER]

Synthetic Biology

Synthetic biology is an approach to genetic engineering that draws from traditional engineering’s use of modular, well-described parts. DNA components of genes such as ribosome binding sites, protein coding regions, promoters, etc. are abstracted into “parts” that can be assembled with other parts—not necessarily from the same gene—into “devices”. These devices can in turn be combined into “systems” that result in a desired cellular behavior once DNA encoding the designed system is inserted into a cell. Key to this design strategy is that the engineer has a suite of genetic parts to choose from when designing the genes that they combine into larger systems. These larger systems are often said to be made of genetic “circuits”, since the designed operations can resemble switching and logic gates. We will explore sources of genetic parts shortly, but first consider CAD.

It is logical that synthetic biologists would seek CAD programs to help facilitate this design process, and such tools are beginning to emerge out of academic labs and commercial institutions. Most of these tools cover a single task in the design process, and therefore must be chained together if the designer is to go from part selection to simulation to DNA specification of the final design. This has driven the creation of markup languages to describe designs (CellML [2] and SBML [3]) so that multiple tools can work with the same design.

An example of the specification of protein production and interaction by a genetic circuit is provided by the iBioSim CAD package’s [4] tutorial:

[Screenshot: genetic circuit specification in the iBioSim tutorial]

Here proteins are shown in the blue boxes, and a promoter is shown in the diagonal box. The promoter is repressed by protein Cl2 and activates transcription leading to the production of protein CII. An event (green box) specifies that cell division is to occur at a predefined point during the simulation. In iBioSim, all parameters of the chemical reaction dynamics must be specified prior to simulation, which is challenging because often these parameters are unknown and have to be estimated. iBioSim then enables simulation of the genetic circuit. Below we can see that the proteins created by the cell are expected to reach steady state:

[Figure: iBioSim simulation showing protein concentrations reaching steady state]

Another CAD package in development for synthetic biology is Cello, which stands for “Cell Logic” [5]. In Cello the user specifies their desired logic in a truth table, where intracellular chemicals and signals make up the inputs, and the program selects genetic parts necessary to implement the logic. Cello then specifies the DNA necessary to implement the logic [5]. In the NOR gate example shown below, one or both of two promoters activate production of a protein that represses another promoter that activates the output protein [5].
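
Stripped of its biochemistry, the repressor cascade just described implements NOR logic. A toy Boolean sketch (not Cello's actual model) makes the mapping explicit:

```python
def genetic_nor(a, b):
    # either input promoter drives the repressor, which shuts off the output promoter
    repressor_expressed = a or b
    output_expressed = not repressor_expressed
    return output_expressed

# the output protein appears only when both input signals are absent
truth_table = {(a, b): genetic_nor(a, b) for a in (False, True) for b in (False, True)}
```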

[Figures: Cello NOR gate overview and the DNA sequence that implements it]

Genetic Part Registries

Several repositories have emerged to store descriptions of genetic parts and devices [6, 7], with the goal of mimicking the specification sheets associated with semiconductor parts today. As semiconductor specification sheets encourage repeatable, modular design of electrical circuits, the plan for genetic part specifications is to encourage repeatable, modular design of genetic circuits. An example of such a repository is the Registry of Standard Biological Parts [6], which provides specification sheets for promoters, ribosome binding sites, coding regions, terminators, etc., e.g., [8]:

[Screenshot: example part specification sheet from the Registry of Standard Biological Parts]

References

1. http://www.aras.com/integrations/MCAD/creo-parametric-connector.aspx
2. http://www.cellml.org/
3. http://sbml.org/Main_Page
4. http://www.async.ece.utah.edu/iBioSim/
5. http://cidar.bu.edu/cello/server/html/login.html
6. http://parts.igem.org/Catalog?title=Catalog
7. http://www.ncbi.nlm.nih.gov/pubmed/20160009
8. http://parts.igem.org/Part:BBa_K1216007

Posted in bioinformatics, engineering, science

building a web-enabled temperature logger

Not wanting to miss out on the "Internet of Things", I decided to learn some of its foundational technology, namely microcontroller programming. Actually, I used a Raspberry Pi in this project instead of a classic microcontroller, but the idea is the same. Here I describe building a web-enabled temperature logger, complete with a web application to display its results.

The Challenge

I live in an RV with my cat. When I go to work I have to decide whether to leave the windows open or turn on the air conditioner to keep my cat cool, as there is no thermostat on my air conditioner. I usually decide based on the weather forecast, but really don’t know how hot it gets in the RV during the peak temperature of the day. I needed a data logger that reads the temperature regularly and stores it. Ideally such a data logger would report to a web application, so I can monitor the temperature from work.

The Solution

Raspberry Pi

I first bought a Raspberry Pi B+ computer and configured it to run Raspbian Linux. Then I added a USB WiFi dongle so the device can communicate with the Internet.

DS18B20 Digital Temperature Sensor

Next, I bought a DS18B20 digital temperature probe, and connected it to the Raspberry Pi according to the following schematic, which is slightly modified from that specified by [1]:

[Schematic: DS18B20 wiring to the Raspberry Pi GPIO header]

The 4.7 kOhm resistor came with the temperature sensor.

The resulting hardware looks like:

[Photo: the assembled Raspberry Pi and temperature sensor]

Running the Data Logger and Connecting to the Web

On the Raspberry Pi I run the following Python code, which is slightly modified from that shown in [1]. My modification simply calls a URL containing the temperature reading that is processed by the web application described below. This code sends a reading to the URL every two minutes.

import os
import glob
import time
import urllib2

os.system('modprobe w1-gpio')
os.system('modprobe w1-therm')

base_dir = '/sys/bus/w1/devices/'
device_folder = glob.glob(base_dir + '28*')[0]
device_file = device_folder + '/w1_slave'

def read_temp_raw():
    f = open(device_file, 'r')
    lines = f.readlines()
    f.close()
    return lines

def read_temp():
    lines = read_temp_raw()
    while lines[0].strip()[-3:] != 'YES':
        time.sleep(0.2)
        lines = read_temp_raw()
    equals_pos = lines[1].find('t=')
    if equals_pos != -1:
        temp_string = lines[1][equals_pos + 2:]
        temp_c = float(temp_string) / 1000.
        return temp_c

while True:
    try:
        temp_c = read_temp()
        req = urllib2.urlopen('http://my.url.com/logtemp.php?temp=' + str(temp_c))
        time.sleep(2. * 60.)
    except:
        time.sleep(5.)
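
The parsing inside read_temp can be checked against a canned w1_slave payload. A sketch with an invented sensor reading:

```python
# invented contents of a typical two-line w1_slave read (CRC line, then data line)
lines = [
    '72 01 4b 46 7f ff 0e 10 57 : crc=57 YES',
    '72 01 4b 46 7f ff 0e 10 57 t=23125',
]

# same extraction logic as read_temp above
equals_pos = lines[1].find('t=')
temp_c = float(lines[1][equals_pos + 2:]) / 1000.0
temp_f = temp_c * 9.0 / 5.0 + 32.0   # the conversion the PHP viewer applies later
```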

Web Application for Storing Temperature Readings

A PHP program receives the temperature reading sent by the Python script as a GET argument. It then places the temperature value with a time stamp into a MySQL database. I chose PHP for this task because my web hosting company makes PHP deployment much easier than Django or JSP deployment:

<html>
 <head>
  <title>Log Temperature</title>
 </head>
 <body>
 <?php 

    $temp = floatval($_GET['temp']);  // cast to a float so the value is safe to interpolate into the SQL below

    $con = mysqli_connect("host", "user", "password", "database");

    // Check connection
    if (mysqli_connect_errno()) {
    echo "Failed to connect to MySQL: " . mysqli_connect_error();
    }

    $sql = "INSERT INTO temperature_log VALUES (now(), $temp)";

    if (!mysqli_query($con,$sql)) {
    die('Error: ' . mysqli_error($con));
    }
    echo "1 record added";

    mysqli_close($con);
?>
 </body>
</html>

Web Application for Displaying Temperature Readings

The web application for viewing the temperature readings displays a run chart and a log. The run chart is implemented in JavaScript with the jqPlot library. The application queries the MySQL database for the last 24 hours’ readings. Again, I used PHP just because it is easy to deploy on my web hosting platform.

[Screenshot: the temperature viewer web application]

The code for this application is:

<html>
 <head>
<link rel="stylesheet" type="text/css" href="viewtemp.css">

<script language="javascript" type="text/javascript" src="jqplot/jquery.min.js"></script>
<script language="javascript" type="text/javascript" src="jqplot/jquery.jqplot.min.js"></script>
<script language="javascript" type="text/javascript" src="jqplot/plugins/jqplot.canvasTextRenderer.min.js"></script>
<script language="javascript" type="text/javascript" src="jqplot/plugins/jqplot.canvasAxisTickRenderer.min.js"></script>
<link rel="stylesheet" type="text/css" href="jqplot/jquery.jqplot.css" />

  <title>View Temperature Log</title>
 </head>
 <body>

<center><h1>Trailer Temperature</h1></center>

<div id="chart"></div>
<br><br>

 <?php 

    $con = mysqli_connect("host", "user", "password", "database");

    // Check connection
    if (mysqli_connect_errno()) {
    echo "Failed to connect to MySQL: " . mysqli_connect_error();
    }

    $sql = "select * from temperature_log tl where tl.time >= DATE_SUB(NOW(), INTERVAL 1 DAY) order by tl.time desc";

    $result = mysqli_query($con, $sql);

?>
<table id='time_temp_table'><thead><tr><th>Time</th><th>Temperature (C)</th><th>Temperature (F)</th></tr></thead><tbody>
<?php

    $temp_array = array();
    $time_array = array();
    while($row = mysqli_fetch_array($result)) {
    $temp_f = round($row['temperature'] * 1.8 + 32.0, 2);
    echo "<tr><td id='time_entry'>" . $row['time'] . "</td><td id='temp_entry'>" . $row['temperature'] . "</td><td>" . $temp_f . "</td></tr>";	
    array_push($temp_array, $temp_f);
    array_push($time_array, $row['time']);
    }   

    $temp_array_reverse = array_reverse($temp_array);
    $time_array_reverse = array_reverse($time_array);
?>
    </tbody></table>
<?php
    mysqli_close($con);
?>

<script type="text/javascript">
<?php
$js_array = json_encode($temp_array_reverse);
echo "var tempArrayAsString = " . $js_array . ";\n";
$js_array = json_encode($time_array_reverse);
echo "var timeArrayAsString = " . $js_array . ";\n";
?>

$(document).ready(function() {

	// convert the temperature strings emitted by PHP into floats
	var tempArray = [];
	$.each(tempArrayAsString, function(index, value) {
		tempArray.push(parseFloat(value));
	});

	// parse a "YYYY-MM-DD HH:MM:SS" timestamp into a JavaScript Date;
	// the radix 10 prevents zero-padded fields from being read as octal
	function parseDateTime(s) {
		var d = s.split(' ')[0];
		var t = s.split(' ')[1];
		return new Date(parseInt(d.split('-')[0], 10),
		                parseInt(d.split('-')[1], 10) - 1,
		                parseInt(d.split('-')[2], 10),
		                parseInt(t.split(':')[0], 10),
		                parseInt(t.split(':')[1], 10),
		                parseInt(t.split(':')[2], 10));
	}

	// express each reading as minutes elapsed since the first reading
	var dt0 = parseDateTime(timeArrayAsString[0]);
	var data = [[0, tempArray[0]]];
	var timeDiffList = [0];
	for (var i=1; i<timeArrayAsString.length; i++) {
		var timeDiff = (parseDateTime(timeArrayAsString[i]) - dt0) / (1000. * 60.);
		timeDiffList.push(timeDiff);
		data.push([timeDiff, tempArray[i]]);
	}

	// choose tick marks: keep one label near each two-hour boundary,
	// rounding each offset to the nearest five minutes before testing
	var ticksToUse = [];
	var position_dict = {};
	for (var i=0; i<timeDiffList.length; i++) {
		var a = Math.round(timeDiffList[i] / 5) * 5;
		if (a % 120 == 0) {
			if (!position_dict.hasOwnProperty(a)) {
				ticksToUse.push([timeDiffList[i], timeArrayAsString[i]]);
				position_dict[a] = true;
			}
		}
	}
	// always label the final reading
	ticksToUse.push([timeDiffList[i-1], timeArrayAsString[i-1]]);


	$.jqplot('chart',  [data], 
	{
		series: [{showMarker: false, lineWidth: 2}],
		axesDefaults : {
			tickRenderer: $.jqplot.CanvasAxisTickRenderer ,
			tickOptions: {
				angle: -80
			}
		},
		axes: {
			xaxis: {
			label: 'Time (Minutes)',
			ticks: ticksToUse,
			},
			yaxis: {
			label: 'Temperature (Fahrenheit)',
			tickOptions: { angle: 0 }
			},
		},
	});

});
</script>
 </body>
</html>
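The elapsed-time arithmetic above (minutes between the first reading and each later one) is the part most worth double-checking. Here is a minimal stand-alone sketch of the same calculation in Java, assuming the timestamps keep MySQL's default "YYYY-MM-DD HH:MM:SS" DATETIME format (the class and method names are just for illustration):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class MinutesBetween {

    // parse a MySQL-style "yyyy-MM-dd HH:mm:ss" timestamp
    static LocalDateTime parse(String s) {
        return LocalDateTime.parse(s, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
    }

    // minutes elapsed between the first reading and a later one
    static double minutesBetween(String first, String later) {
        return Duration.between(parse(first), parse(later)).getSeconds() / 60.0;
    }

    public static void main(String[] args) {
        // two hypothetical readings thirty minutes apart
        System.out.println(minutesBetween("2014-06-01 12:00:00", "2014-06-01 12:30:00"));
    }
}
```

Dividing the duration's seconds by 60 matches the (dti - dt0) / (1000 * 60) expression in the JavaScript, and java.time handles month lengths and leap years for us.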

The CSS for the application is:

#time_temp_table {
    border-collapse: collapse;
    background-color: lightblue;
}

#time_temp_table th,
#time_temp_table td {
    border: 1px solid black;
    text-align: center;
    padding: 2px 15px 2px 15px;
}

Future Plans

For the web application that displays the recorded temperatures, it would be nice to add a box plot to summarize the results.

Ultimately, I’d like to connect this hardware to my air conditioner so that it will automatically turn on when a set temperature point is reached. I’ll need some high-amp relays for this.

References

1. https://learn.adafruit.com/adafruits-raspberry-pi-lesson-11-ds18b20-temperature-sensing

Related Post

engineer moves into an RV

Posted in engineering, science

rapidly extracting a subsequence from chromosome sequence data in Java

The Challenge

We have a text file containing the nucleotides of a chromosome, say human chromosome 11, and need to be able to quickly extract a subsequence from the chromosome text given a nucleotide position and number of subsequent nucleotides to include. The problem is that chromosome files are huge, e.g. 135 megabytes for chromosome 11, so we don’t want to use typical string processing tools. An example of the text we are extracting nucleotides from is:

chromosome_snapshot

Note that this file contains no line breaks or header, so that the byte position of a nucleotide corresponds to its position in the genome.

The Solution

Our solution is to use random file access to jump to the desired start nucleotide position in the file and read forward from that position for the required number of bases. Java code that implements this strategy is:

import java.io.*;

public class ReadNucleotidePositions {

    public static void main(String[] args) {

	// user settings
	long start = 68081400 - 1;   // base zero
	int numberOfCharactersToRead = 200;
	String chromosomeFile = "chromosome_11.txt";

	String output = "";

	// open the file for reading; try-with-resources closes it automatically
	try (RandomAccessFile raf = new RandomAccessFile(chromosomeFile, "r")) {

	    // seek to the start position
	    raf.seek(start);

	    // repeat the read operation for the number of characters specified
	    for (int i=0; i<numberOfCharactersToRead; i++) {

		// this has to be type "int", not type "Integer", for the cast to work;
		// read() returns the next byte, or -1 at end of file
		int someCharInteger = raf.read();
		if (someCharInteger == -1) {
		    break;
		}

		// cast as character and append to output string
		output += (char) someCharInteger;
	    }
	}
	catch (IOException ex) {
	    ex.printStackTrace();
	}

	System.out.println();
	System.out.println(output);
	System.out.println();
    }
}

Here we start at position 68081400 and subtract one to make the coordinate system base zero. We specify that we want to read the nucleotide at this position followed by the next 199 nucleotides in chromosome 11.

We open a “RandomAccessFile” and use the “seek” method to move the file pointer to the position we intend to start reading from. Then we loop for the number of nucleotides we are reading, using the “read” method at each iteration to extract the character at that position in the chromosome text. The value returned from the “read” method is of type “int”, which we must cast to type “char” before adding it to our String object containing the extracted sequence.
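Reading one byte per "read" call is fine for 200 nucleotides, but "RandomAccessFile" can also fill a whole byte array in a single call, which scales better for long subsequences. Here is a minimal sketch of that variant; since chromosome_11.txt is not included here, it writes a small stand-in "chromosome" to a temporary file:

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ReadBlock {

    // seek to 'start' and read up to 'length' bytes in one call
    static String extract(String path, long start, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(start);
            byte[] buffer = new byte[length];
            int bytesRead = raf.read(buffer);   // may be < length near end of file
            return new String(buffer, 0, Math.max(bytesRead, 0), "US-ASCII");
        }
    }

    public static void main(String[] args) throws IOException {
        // stand-in for a chromosome file: plain ASCII, no line breaks or header
        File f = File.createTempFile("chromosome", ".txt");
        f.deleteOnExit();
        try (FileWriter w = new FileWriter(f)) {
            w.write("ACGTACGTACGTACGT");
        }
        System.out.println(extract(f.getPath(), 4, 8));   // prints "ACGTACGT"
    }
}
```

For 200 characters the difference is negligible; for extracting, say, a whole gene, the single bulk read avoids thousands of per-character method calls.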

run_results

Finally, we check the extracted sequence against a reference (in this case the UCSC Genome Browser) to ensure our sequence extraction is accurate:

UCSC_genome_browser

Posted in bioinformatics, engineering, science

invasive species spreads to low Earth orbit

After colonizing six continents and setting up scientific outposts on the seventh, Earth’s major invasive species sent some of its members into orbit. Not long after starting to use its opposable thumbs to cultivate grain, the species built rockets capable of landing individuals on Earth’s moon and delivering its members regularly to a series of low orbit scientific test labs.

The species has now set its sights on Mars, which it has not yet visited in person but on which it has landed several probes. These probes are collecting the data needed to facilitate colonization, such as evidence of readily available water. The invasive species is also eyeing the asteroid belt as a source of minerals needed for further expansion.

Known for long-term thinking, the invasive species has set up laboratories on Earth to develop advanced propulsion technologies that will make space travel faster, facilitating a full-blown invasion of the solar system. Pressure for such an outward migration from the planet is building as the invasive species’ population reaches seven billion, a number that strains the planet’s food, water, and energy resources.

Posted in humor, science

using bug tracking software to keep track of life’s tasks

I’ve tried a few task-tracking apps on my smartphone, but have found none as useful for keeping track of life’s responsibilities as a web-based bug-tracking program called MantisBT. Software engineers use MantisBT to log reported bugs, assign them to staff for correction, and record progress toward bug resolution.

mantis_logo

To get started, I first deployed MantisBT (a PHP application) on the web so I can access it from anywhere with an internet connection. For the underlying database, I selected MySQL.

I then logged in and created five “projects”: “Badass Data Science” to log tasks related to this blog, “Bills” to track when payments are due, “School” to track school related deadlines, “Repairs” to log repair tasks for my RV or truck, and “Miscellaneous” to track tasks that don’t fit cleanly into the other four topics. Since it is easy to add projects, I am not limited to only five categories.

all_projects

When a new task comes my way, I select the relevant project from the drop-down menu and then click on the “Report Issue” link. The following screenshot shows adding a task (building a GitHub page for the pyDome geodesic dome designer) in the “Badass Data Science” project:

creating_a_new_issue

We can then click “View Issues” to see all the tasks related to the project:

BDS_all_issues

At a glance we can see indicators of task priority (column “P”) and task severity. The rows are color-coded by status with a color key on the very bottom row. For example, five of the tasks shown in the image above have status “assigned”.

When a task is worked on or completed, one can select the task from the “View Issues” page and change its status and/or add notes.

editing_an_issue

The status-change workflow provides an additional opportunity to add notes, for example, information related to resolving a task:

resolving_an_issue

We can also filter issues by status, for example hiding all resolved issues:

hide_resolved


Posted in data science, engineering