fuzzy logic toolkit in C++

I recently came across some old C++ code I wrote about 10 years ago to assist fuzzy logic reasoning. This program is now posted on GitHub at https://github.com/badassdatascience/fuzzy-logic-toolkit and is described below. An example of the tool in action (with code) follows the description.

Fuzzy Logic

Suppose we have a numerical value for distance to a place of interest, and we want to fit it into one of the following categories: “very near”, “near”, “neither near nor far”, “far”, or “very far”. To complicate matters, suppose that experts can’t agree on which distance values define membership in each category, so they instead define functions that estimate the degree of membership in each category for each possible value of the distance. For example, such functions might place a given distance value at 5% “very near”, 20% “near”, and 75% “neither near nor far”. Anyone asked to choose a category for that distance could therefore only specify “fuzzy” levels of membership in each category. This is the crux of fuzzy set theory.

Now suppose that some reasoning based on the selection of the category is required. Perhaps another fuzzy group of categories is involved, such as cost (“cheap”, “modest”, “expensive”). The reasoning could look like Boolean logic, e.g., IF “cheap” AND “near” THEN “eat lunch there”. However, the boundaries for “cheap” and “near” are fuzzy and therefore the computation is imprecise. This is fuzzy logic.

The code described below defines C++ classes for describing functions specifying membership in a category, a class for specifying the categories themselves, and a class for compiling the possible categories into a domain of related categories. There is also a class for working with the numerical values of interest (e.g., distance in the case described above), enabling fuzzy boolean operations on those numerical values and estimating “crisp” (non-fuzzy) numerical conclusions from the fuzzy operations.

UML Model

A UML model of the program’s classes is shown below. Each of the classes is described following the image.


Membership Functions

Membership functions are used to define the ranges and computation method with which numerical values are evaluated for degree of membership in a fuzzy category. Each fuzzy category (called a LinguisticSet in the code and UML model) will have a membership function associated with it.

We start with an abstract class called MembershipFunction. From this abstract class we generate four child classes corresponding to different ways of evaluating membership. They are StandardMBF_S, StandardMBF_Z, StandardMBF_Pi, and StandardMBF_Lambda. Objects instantiated from these child classes are associated with a LinguisticSet object during instantiation of the LinguisticSet object.

We now examine one of the membership functions, StandardMBF_Z, in more detail: This is a “Z” function that returns one if a given numerical value is below a minimum value, zero if the test value is above a maximum value, and a linear interpolation between one and zero if the test value falls between the minimum and maximum. An example of what this function looks like is:
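The piecewise definition just described can be sketched in a few lines of Python. This is only an illustration of the “Z” shape, not the toolkit's C++ implementation; the function name `z_membership` and its argument names are hypothetical.

```python
def z_membership(x, lower, upper):
    """Z-shaped membership: 1 at or below `lower`, 0 at or above `upper`,
    and a straight line from 1 down to 0 in between."""
    if x <= lower:
        return 1.0
    if x >= upper:
        return 0.0
    # linear interpolation between (lower, 1) and (upper, 0)
    return (upper - x) / float(upper - lower)
```

With the bounds -5 and 0, a value of -2.5 sits halfway down the slope and gets a membership of 0.5.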


Examples of the other three membership functions are:


Categories (LinguisticSet)

Categories, called “linguistic sets” in the code, encapsulate a membership function and a name. The membership function is used to compute the degree of membership of a value in the category. Linguistic sets are combined into LinguisticDomain objects according to a common variable, such as distance in the description above.


“Linguistic domains” are collections of categories that are related by a common variable, such as distance in the case described above. They might define, for instance, the set (“very near”, “near”, “neither near nor far”, “far”, “very far”).


Once a linguistic domain is established, a FuzzyValue object can be defined on it that enables fuzzy reasoning using the categories in the domain. This class plays the role of the variable under consideration by the elements of the linguistic domain, and can be set to an exact number if that number is known at the beginning of the computations. For example, one might be executing a fuzzy reasoning procedure involving domains A, B, and C, where the FuzzyValue result for C is computed using known measurements and membership functions for A and B. The example program shown below will illustrate how to use FuzzyValue.

Example Program

The following example comes from Constantin von Altrock’s book “Fuzzy Logic & Neurofuzzy Applications Explained”, which is a very accessible introduction to the subject. It consists of a crane operation where a program must decide through fuzzy reasoning how much power to apply to the crane, given known measurements of distance and angle. This code is also contained in the file “main.cc” that comes with the program when you check it out from GitHub.

The program starts by defining membership functions and linguistic sets for the various options for distance, e.g., “too far” or “medium”. These are then compiled into a linguistic domain for distance. Similarly, membership functions, linguistic sets, and linguistic domains are defined for angle measurements and power values (power values will be computed).

The program then declares values for the known quantities distance (12 yards) and angle (4 degrees). These values are associated with their respective domain upon declaration. It then declares a fuzzy value for power, but does not specify a known measurement, since we are computing it using fuzzy reasoning.

We then apply the following logic: If distance is medium and angle is small positive, apply positive medium power. If distance is medium and angle is zero, apply zero power. Finally, if distance is far and angle is zero, apply positive medium power. Note that “small positive”, “positive medium”, “medium”, “zero”, and “far” are all categories, not exact values.

To reason through these categories to choose an outcome, the measured values of distance and angle are each assigned a degree of membership in each category of their respective domains. The rules are then applied through fuzzy reasoning to assign power a degree of membership in each of the power categories. Finally, an exact value for power is inferred from the computed degrees of membership and reported.
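The three stages (fuzzify, apply rules, defuzzify) can be sketched in Python for the sensor readings and the three rules above. This sketch rests on assumptions that the toolkit may not share: AND is taken as the minimum, OR as the maximum, and the crisp value is the membership-weighted average of each power category's peak. The helper names `lambda_mbf` and `s_mbf` are hypothetical; the breakpoints are copied from the C++ listing.

```python
def lambda_mbf(x, left, peak, right):
    """Triangular ("Lambda") membership: 0 outside (left, right), 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / float(peak - left)
    return (right - x) / float(right - peak)

def s_mbf(x, lower, upper):
    """S-shaped membership: 0 at or below lower, 1 at or above upper."""
    if x <= lower:
        return 0.0
    if x >= upper:
        return 1.0
    return (x - lower) / float(upper - lower)

distance, angle = 12.0, 4.0  # sensor readings from the example

# 1. Fuzzify: degree of membership of each reading in the categories the rules use
dist_medium = lambda_mbf(distance, 5, 10, 30)
dist_far = s_mbf(distance, 10, 30)
angle_pos_small = lambda_mbf(angle, 0, 5, 45)
angle_zero = lambda_mbf(angle, -5, 0, 5)

# 2. Apply the rules (assumption: AND = min, OR = max)
power = {"pos_medium": 0.0, "zero": 0.0}
power["pos_medium"] = max(power["pos_medium"], min(dist_medium, angle_pos_small))
power["zero"] = max(power["zero"], min(dist_medium, angle_zero))
power["pos_medium"] = max(power["pos_medium"], min(dist_far, angle_zero))

# 3. Defuzzify (assumption: weighted average of the category peaks)
peaks = {"pos_medium": 8.0, "zero": 0.0}
crisp_power = sum(power[k] * peaks[k] for k in power) / sum(power.values())
```

Under these assumptions the memberships are 0.8 for “pos_medium” and 0.2 for “zero”, giving a crisp power of about 6.4. A different AND/OR operator or defuzzification method would give a different number.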

#include "FuzzyValue.hh"
#include <iostream>

int main()
{
  std::cout << std::endl;
  std::cout << "This program demonstrates Dan Williams' yet untitled" << std::endl;
  std::cout << "fuzzy logic toolkit." << std::endl;
  std::cout << std::endl;
  std::cout << "It implements the crane example in Constantin von Altrock's book " << std::endl;
  std::cout << "'Fuzzy Logic & Neurofuzzy Applications Explained'" << std::endl;

  std::cout << std::endl;
  std::cout << "Distance sensor reads 12 yards." << std::endl;
  std::cout << "Angle sensor reads 4 degrees." << std::endl;

  // create and name a linguistic domain for distance
  LinguisticDomain* distance_domain = new LinguisticDomain("distance_domain");

  // define some linguistic values and membership functions for the distance domain

  // too_far
  StandardMBF_Z* distance_domain_too_far_membership = new StandardMBF_Z(-5, 0);
  LinguisticSet* distance_domain_too_far = new LinguisticSet(distance_domain_too_far_membership, "too_far");

  // zero
  StandardMBF_Lambda* distance_domain_zero_membership = new StandardMBF_Lambda(-5, 0, 5);
  LinguisticSet* distance_domain_zero = new LinguisticSet(distance_domain_zero_membership, "zero");

  // close
  StandardMBF_Lambda* distance_domain_close_membership = new StandardMBF_Lambda(0, 5, 10);
  LinguisticSet* distance_domain_close = new LinguisticSet(distance_domain_close_membership, "close");

  // medium
  StandardMBF_Lambda* distance_domain_medium_membership = new StandardMBF_Lambda(5, 10, 30);
  LinguisticSet* distance_domain_medium = new LinguisticSet(distance_domain_medium_membership, "medium");

  // far
  StandardMBF_S* distance_domain_far_membership = new StandardMBF_S(10, 30);
  LinguisticSet* distance_domain_far = new LinguisticSet(distance_domain_far_membership, "far");

  // add the linguistic values to the distance domain

  // create and name a linguistic domain for angle
  LinguisticDomain* angle_domain = new LinguisticDomain("angle_domain");

  // define some linguistic values and membership functions for the angle domain

  // neg_big
  StandardMBF_Z* angle_domain_neg_big_membership = new StandardMBF_Z(-45, -5);
  LinguisticSet* angle_domain_neg_big = new LinguisticSet(angle_domain_neg_big_membership, "neg_big");

  // neg_small
  StandardMBF_Lambda* angle_domain_neg_small_membership = new StandardMBF_Lambda(-45, -5, 0);
  LinguisticSet* angle_domain_neg_small = new LinguisticSet(angle_domain_neg_small_membership, "neg_small");

  // zero
  StandardMBF_Lambda* angle_domain_zero_membership = new StandardMBF_Lambda(-5, 0, 5);
  LinguisticSet* angle_domain_zero = new LinguisticSet(angle_domain_zero_membership, "zero");

  // pos_small
  StandardMBF_Lambda* angle_domain_pos_small_membership = new StandardMBF_Lambda(0, 5, 45);
  LinguisticSet* angle_domain_pos_small = new LinguisticSet(angle_domain_pos_small_membership, "pos_small");

  // pos_big
  StandardMBF_S* angle_domain_pos_big_membership = new StandardMBF_S(5, 45);
  LinguisticSet* angle_domain_pos_big = new LinguisticSet(angle_domain_pos_big_membership, "pos_big");

  // add the linguistic values to the angle domain

  // create and name a linguistic domain for power
  LinguisticDomain* power_domain = new LinguisticDomain("power_domain");

  // define some linguistic values and membership functions for the power domain

  // neg_high
  StandardMBF_Lambda* power_domain_neg_high_membership = new StandardMBF_Lambda(-30, -25, -8);
  LinguisticSet* power_domain_neg_high = new LinguisticSet(power_domain_neg_high_membership, "neg_high");

  // neg_medium
  StandardMBF_Lambda* power_domain_neg_medium_membership = new StandardMBF_Lambda(-25, -8, 0);
  LinguisticSet* power_domain_neg_medium = new LinguisticSet(power_domain_neg_medium_membership, "neg_medium");

  // zero
  StandardMBF_Lambda* power_domain_zero_membership = new StandardMBF_Lambda(-8, 0, 8);
  LinguisticSet* power_domain_zero = new LinguisticSet(power_domain_zero_membership, "zero");

  // pos_medium
  StandardMBF_Lambda* power_domain_pos_medium_membership = new StandardMBF_Lambda(0, 8, 25);
  LinguisticSet* power_domain_pos_medium = new LinguisticSet(power_domain_pos_medium_membership, "pos_medium");

  // pos_high
  StandardMBF_Lambda* power_domain_pos_high_membership = new StandardMBF_Lambda(8, 25, 30);
  LinguisticSet* power_domain_pos_high = new LinguisticSet(power_domain_pos_high_membership, "pos_high");

  // add the linguistic values to the power domain

  // "fuzzify" sensor readings
  FuzzyValue* distance = new FuzzyValue(distance_domain);
  FuzzyValue* angle = new FuzzyValue(angle_domain);

  // create a fuzzy variable to store power inference calculations
  FuzzyValue* power = new FuzzyValue(power_domain);

  // fuzzy inference of power value
  power->OR_setSetMembership( distance->AND("medium", angle, "pos_small"), "pos_medium" );
  power->OR_setSetMembership( distance->AND("medium", angle, "zero"), "zero" );
  power->OR_setSetMembership( distance->AND("far", angle, "zero"), "pos_medium" );

  // "defuzzify" inferred power value
  long double power_setting;
  power_setting = power->getCrispValue();
  std::cout << "Set power to " << power_setting << " kW." << std::endl;

  std::cout << std::endl;
  return 0;
}



EC2 spot instance price change: no correlation with day of week

My plans for world domination involve heavy use of Amazon EC2 instances, but I have to be frugal about it so I’m running spot instances to save cash. Therefore a means of forecasting spot instance prices would be helpful.

Thus far I’ve had little success using mainstream forecasting tools such as ARIMA and exponential smoothing. So I’m now looking for explanatory variables to use in regressions. I thought I’d start with day of the week as a variable to see if there is a pattern.

I sampled 11 weeks of price changes (code below) from the two US west coast EC2 regions and counted, for each day of the week, how many price changes were increases and how many were decreases. Doing so produced the following measurements:


From these results we see that price change variance goes down on weekends, and that price increases tend to occur slightly more frequently than price decreases on weekdays, though not by much.

I also logged the percent change in price for each price change and produced the following box plot (outliers are not displayed for now since they flood the image):


Again from these results we see that price change variance decreases on weekends. However, overall percent change in price for each price change shows a net of zero for all days. Therefore, I do not think the day of the week will be a useful predictor in a future model of EC2 spot prices. It may be possible to modify forecast confidence intervals based on day of the week due to the decrease in price change variance that occurs on weekends, but I’m not sure yet.

Plotting the box plot with the outliers displayed shows that the distributions are right skewed. I haven’t yet figured out how these extreme values are balanced on the negative side, but there might be some useful information in this occurrence.



Here is Python code to fetch the source data from Amazon. Your results will vary as Amazon only provides a certain number of past values depending on when you run the code:

# import useful libraries
import datetime
import os

# user settings
days_to_go_back = 200
access_key = 'your access key'
secret_key = 'your secret key'

# compute the start time
today = datetime.date.today()
delta = datetime.timedelta(days=-days_to_go_back)
start_time = str(today + delta) + 'T00:00:00'

# produce the cmd
cmd = 'ec2-describe-spot-price-history -H --aws-access-key ' + access_key + ' --aws-secret-key ' + secret_key + ' --start-time ' + start_time

# regions
cmd_list = []
regions = ['us-east-1', 'us-west-1', 'us-west-2']
for region in regions:
    cmd_list.append(cmd + ' --region ' + region)

# execute commands
os.system('rm output/AWS_price_history_data.txt')
os.system('touch output/AWS_price_history_data.txt')
for c in cmd_list:
    os.system(c + ' >> output/AWS_price_history_data.txt')

Here is Python code to produce the table and data for the box plots:

# load useful libraries
import datetime
import pytz
from pytz import timezone
import math

# timezones
eastern = timezone('US/Eastern')
pacific = timezone('US/Pacific')

# load data
data = {}
f = open('output/AWS_price_history_data.txt')
for i, line in enumerate(f):
    line = line.strip()
    if line.find('AvailabilityZone') >= 0:  continue
    price = line.split('\t')[1]
    timestamp = line.split('\t')[2]
    instance_type = line.split('\t')[3]
    description = line.split('\t')[4]
    zone = line.split('\t')[5]

    if zone.find('east') >= 0:  continue  # consider only west coast for now since date ranges differ between east and west

    year = int(timestamp.split('T')[0].split('-')[0])
    month = int(timestamp.split('T')[0].split('-')[1])
    day = int(timestamp.split('T')[0].split('-')[2])
    hour = int(timestamp.split('T')[1].split(':')[0])
    minute = int(timestamp.split('T')[1].split(':')[1])
    ts = datetime.datetime(year, month, day, hour, minute, 0, tzinfo=pytz.utc)

    if zone.find('east') >= 0:
        ts = ts.astimezone(eastern)
    if zone.find('west') >= 0:
        ts = ts.astimezone(pacific)

    if not data.has_key(instance_type):
        data[instance_type] = {}
    if not data[instance_type].has_key(description):
        data[instance_type][description] = {}
    if not data[instance_type][description].has_key(zone):
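        data[instance_type][description][zone] = {'price' : [], 'timestamp' : []}

    # store the price as a number along with its localized timestamp
    data[instance_type][description][zone]['price'].append(float(price))
    data[instance_type][description][zone]['timestamp'].append(ts)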


# sort the prices by date
for instance_type in data.keys():
    for description in data[instance_type].keys():
        for zone in data[instance_type][description].keys():

            price_list = data[instance_type][description][zone]['price']
            timestamp_list = data[instance_type][description][zone]['timestamp']

            indices = [i[0] for i in sorted(enumerate(timestamp_list), key=lambda x:x[1])]
            new_timestamp_list = []
            new_price_list = []
            for i in indices:
                new_timestamp_list.append(timestamp_list[i])
                new_price_list.append(price_list[i])

            data[instance_type][description][zone]['price'] = new_price_list
            data[instance_type][description][zone]['timestamp'] = new_timestamp_list

# get days of week
for instance_type in data.keys():
    for description in data[instance_type].keys():
        for zone in data[instance_type][description].keys():
            timestamp_list = data[instance_type][description][zone]['timestamp']
            day_of_week_list = []
            for t in timestamp_list:
                day_of_week_list.append(t.weekday())  # 0 = Monday ... 6 = Sunday
            data[instance_type][description][zone]['day_of_week_list'] = day_of_week_list

# need to make sure there is an exact number of full weeks in the analysis
for instance_type in data.keys():
    for description in data[instance_type].keys():
        for zone in data[instance_type][description].keys():
            timestamp_list = data[instance_type][description][zone]['timestamp']
            days_diff = (timestamp_list[-1] - timestamp_list[0]).days
            weeks_diff_to_use = int(math.floor(float(days_diff) / 7.)) - 1  # subtracting one allows us to shift the days without error
            data[instance_type][description][zone]['weeks'] = weeks_diff_to_use

            shift_dt = datetime.timedelta(days = 3)  # we are shifting by three days to deal with possible artifacts at beginning of data set
            dt = datetime.timedelta(weeks = weeks_diff_to_use)
            cutoff_time = timestamp_list[0] + dt + shift_dt
            start_cutoff_time = timestamp_list[0] + shift_dt

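            # find the first index past the end cutoff
            for i in range(0, len(timestamp_list)):
                if timestamp_list[i] > cutoff_time:
                    break

            # find the first index past the start cutoff
            for j in range(0, len(timestamp_list)):
                if timestamp_list[j] > start_cutoff_time:
                    break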

            data[instance_type][description][zone]['price'] = data[instance_type][description][zone]['price'][j:i]
            data[instance_type][description][zone]['timestamp'] = data[instance_type][description][zone]['timestamp'][j:i]
            data[instance_type][description][zone]['day_of_week_list'] = data[instance_type][description][zone]['day_of_week_list'][j:i]

# count price rises
price_rises_by_weekday = {0 : 0, 1 : 0, 2 : 0, 3 : 0, 4 : 0, 5 : 0, 6 : 0}
price_falls_by_weekday = {0 : 0, 1 : 0, 2 : 0, 3 : 0, 4 : 0, 5 : 0, 6 : 0}
percent_change_by_weekday = {0 : [], 1 : [], 2 : [], 3 : [], 4 : [], 5 : [], 6 : []}
for instance_type in data.keys():
    for description in data[instance_type].keys():
        for zone in data[instance_type][description].keys():
            price_list = data[instance_type][description][zone]['price']
            day_of_week_list = data[instance_type][description][zone]['day_of_week_list']
            for i in range(1, len(price_list)):
                change_in_price = price_list[i] - price_list[i-1]

                percent_change = change_in_price / price_list[i-1]
                percent_change_by_weekday[day_of_week_list[i]].append(percent_change)

                if change_in_price > 0.00001:
                    price_rises_by_weekday[day_of_week_list[i]] += 1
                if change_in_price < -0.00001:
                    price_falls_by_weekday[day_of_week_list[i]] += 1

# output percent change
weekdays = {0 :'Monday', 1 : 'Tuesday', 2 : 'Wednesday', 3  : 'Thursday', 4 : 'Friday', 5 : 'Saturday', 6 : 'Sunday'}
f = open('output/percent_change.csv', 'w')
for i in sorted(percent_change_by_weekday.keys()):
    for value in percent_change_by_weekday[i]:
        f.write(weekdays[i] + ',' + str(value) + '\n')

# output counts
for i in sorted(price_rises_by_weekday.keys()):
    print weekdays[i] + ':  ' + str(price_rises_by_weekday[i])
for i in sorted(price_falls_by_weekday.keys()):
    print weekdays[i] + ':  ' + str(price_falls_by_weekday[i])
for i in sorted(price_falls_by_weekday.keys()):
    print weekdays[i] + ':  ' + str(price_rises_by_weekday[i] - price_falls_by_weekday[i])

Here is the R code used to make the plots:

# load the data
data <- read.csv("percent_change.csv")

# reorder the factors
data$y <- factor(data$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

# boxplot with no outliers
boxplot(percent.change ~ y, data=data, outline=FALSE, main="Percent Change in Spot Instance Price for all Changes in 11-Week Period", ylab="Percent Change")

# boxplot with outliers
boxplot(percent.change ~ y, data=data, main="Percent Change in Spot Instance Price for all Changes in 11-Week Period", ylab="Percent Change")

Apache Spark and stock price causality

The Challenge

I wanted to compute Granger causality (described below) for each pair of stocks listed in the New York Stock Exchange. Moreover, I wanted to analyze between one and thirty lags for each pair’s comparison. Needless to say, this requires massive computing power. I used Amazon EC2 as the computing platform, but needed a smooth way to architect the computation and parallelize it. Therefore I turned to Apache Spark (also described below). Code implementing the computation is included at the bottom of this text.

Granger Causality

Granger causality provides a statistical measure of how much information in one time series predicts the information in another. It works by examining the correlation between the two series where the expected predictor series is lagged by a specified number of time points. The result is a test statistic indicating the degree of Granger causality. Note that this is not a measure of true causality, as both time series might be actually caused by a third process.

Consider for example the daily change in closing prices of stocks BQH and WMB. Using the Python StatsModels package’s Granger test procedure (code below), I computed a p-value of 2.205e-26 for the null hypothesis that the daily change in BQH’s closing price does not Granger cause the daily change in WMB’s closing price at five lags. I concluded therefore that change in BQH does Granger cause change in WMB by five trading days. I then plotted the two normalized changes in price (lagged for BQH by five days) for 20 trading days. In the plot below one can see that the WMB price and the five-day lagged BQH price moves in the same direction for 14 of the 20 days, indicating that change in BQH closing price might be a useful predictor of future change in WMB closing price.
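The direction comparison in the plot (how often the lagged predictor and the target move the same way) can be sketched as follows. This is not the Granger test itself, only the simple same-direction count described above; the function name `direction_agreement` and the toy series are made up for illustration and are not BQH or WMB data.

```python
def sign(v):
    """Return 1 for positive, -1 for negative, 0 for zero."""
    return (v > 0) - (v < 0)

def direction_agreement(predictor, target, lag):
    """Fraction of days t on which target[t] and predictor[t - lag]
    are both non-zero and move in the same direction."""
    pairs = [(predictor[t - lag], target[t]) for t in range(lag, len(target))]
    same = sum(1 for p, q in pairs if sign(p) != 0 and sign(p) == sign(q))
    return same / float(len(pairs))

# toy daily changes: the target repeats the predictor's direction two days later
toy_predictor = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
toy_target = [0.0, 0.0, 1.0, -1.0, 1.0, -1.0]
agreement_at_lag_2 = direction_agreement(toy_predictor, toy_target, 2)
```

In the example from the text, 14 matching days out of 20 corresponds to an agreement fraction of 0.7.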


Apache Spark

Apache Spark is the new “it thing” in data science these days, so I thought it a good idea to learn it. Essentially it generalizes MapReduce, running much faster than Hadoop, and it provides Java, Scala, and Python interfaces. It is not limited to the two-step MapReduce procedure, easily allowing multiple map and reduce operations to be chained together, along with other useful data processing steps such as set unions, sorting, and generation of Cartesian products. The program largely takes care of parallelization during these steps, though it offers some opportunity for fine tuning.

Apache Spark can be run on a local machine using multiple cores or on a cluster, but I’ve had more luck with parallelization using a single-worker cluster than using a local machine. The learning curve was comparable to that for Hadoop; writing applications that use the paradigm is quite easy once you “get it”. Map and reduce operations are specified as functions in one of the three languages listed above, and are parallelized by the various map and reduce commands provided by Spark. Variables can be shared across the parallel processes to aid procedures requiring storage of global values.

The Source Data

I started with daily closing stock prices for each trading day between January 1, 2000 and October 30, 2014, which I pulled from Yahoo Finance using Python’s Pandas package. I placed this data in one CSV file with the following format:


The Code

Here is the Python code, with explanation, necessary to run the pairwise Granger causality computation in Apache Spark.

First, we import the necessary libraries:

from pyspark import SparkContext, SparkConf
import datetime
import pandas as pd
from statsmodels.tsa import stattools
import numpy as np

We next declare a SparkContext object with configuration settings. The “setMaster” argument is specific to the cluster you create. The ports listed in the port arguments must be open on the cluster instances, a “gotcha” on EC2-based clusters especially:

conf = (SparkConf().setMaster("spark://ip-123-123-123-123:7077").setAppName("spark test").set('spark.executor.memory', '3g').set('spark.driver.port', 53442).set('spark.blockManager.port', 54321).set('spark.executor.port', 12345).set('spark.broadcast.port', 22222).set('spark.fileserver.port', 33333))
sc = SparkContext(conf = conf)

We then read the source data into a Spark data object:

combined_file = sc.textFile('/home/ec2-user/data.csv')

Next, we extract from each line the stock symbol as a string, the date as a datetime object, and the closing price as a floating point number:

close = combined_file.map(lambda line: ((line.split(',')[0], datetime.datetime(int(line.split(',')[1].split('-')[0]), int(line.split(',')[1].split('-')[1]), int(line.split(',')[1].split('-')[2]))), float(line.split(',')[5])))

We reorganize the last dataset as a key-value set with the symbol as the key. We could have combined this with the last step, but the process is fast and I didn’t want to break working code.

close_by_symbol = close.map(lambda a: (a[0][0], (a[0][1], a[1])))

We now group each entry by stock symbol. This produces a key-value set where each key is a stock symbol and each value is an iterable of (date, price) tuples.

close_grouped_by_symbol = close_by_symbol.groupByKey()

We define a function to compute the one-day difference in closing prices for each symbol. This is because we are conducting the Granger causality analysis on the daily price differences rather than the prices themselves.

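def make_diff(a):
    # a is (symbol, iterable of (date, price) tuples)
    datelist = []
    pricelist = []
    for i in a[1]:
        datelist.append(i[0])
        pricelist.append(i[1])
    # sort dates and prices together by date
    sorted_indices = [x[0] for x in sorted(enumerate(datelist), key=lambda q:q[1])]
    datelist_sorted = []
    pricelist_sorted = []
    for i in sorted_indices:
        datelist_sorted.append(datelist[i])
        pricelist_sorted.append(pricelist[i])
    # one-day differences; each is dated by the later of its two days (assumed)
    price_diff = []
    diff_date = []
    for i in range(1, len(pricelist_sorted)):
        price_diff.append(pricelist_sorted[i] - pricelist_sorted[i-1])
        diff_date.append(datelist_sorted[i])
    return a[0], zip(diff_date, price_diff)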

Next, we execute a map operation using the function defined above to produce a dataset containing the one-day difference values:

close_diff_by_symbol = close_grouped_by_symbol.map(make_diff)

We have key-value pairs where the keys are stock symbols and the values are lists of (date, price difference) tuples. We now want the Cartesian product of this data set against itself, so that every stock symbol’s price difference data (with dates) is matched to every other stock symbol’s price difference data. Apache Spark really shines at this task, providing the “cartesian” function for the purpose:

cartesian_matched = close_diff_by_symbol.cartesian(close_diff_by_symbol)
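To see what the Cartesian product hands to the next step, here is a small pure-Python sketch using itertools.product on three made-up symbols. It does not use Spark; the symbols and values are hypothetical.

```python
from itertools import product

# toy key-value pairs shaped like close_diff_by_symbol: (symbol, [(date, diff), ...])
toy = [
    ("AAA", [(1, 0.5)]),
    ("BBB", [(1, -0.2)]),
    ("CCC", [(1, 0.1)]),
]

# every element paired with every element, including itself and both orderings
pairs = list(product(toy, toy))
symbol_pairs = [(a[0], b[0]) for a, b in pairs]
```

Three symbols give nine pairs. Both (AAA, BBB) and (BBB, AAA) appear, which matters here because the Granger test is directional. Self-pairs such as (AAA, AAA) appear too.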

We define a function that will match the price difference values for each symbol pair by date. This returns a tuple (symbol 1, symbol 2, price difference list 1, price difference list 2), where the lists are aligned by trading date:

def match(a):
    case_1 = a[0]
    case_2 = a[1]
    symbol1 = case_1[0]
    symbol2 = case_2[0]
    ts1 = case_1[1]
    ts2 = case_2[1]
    time1 = []
    price1 = []
    for i, j in ts1:
        time1.append(i)
        price1.append(j)
    time2 = []
    price2 = []
    for i, j in ts2:
        time2.append(i)
        price2.append(j)
    series_1 = pd.Series(price1, index=time1)
    series_2 = pd.Series(price2, index=time2)
    df_1 = pd.DataFrame(series_1, columns=['1'])
    df_2 = pd.DataFrame(series_2, columns=['2'])
    merged = df_1.join(df_2, how='inner')
    return (symbol1, symbol2, merged['1'].tolist(), merged['2'].tolist())

We then execute this function in a map operation:

matched = cartesian_matched.map(match)

We can now define our function to compute Granger causality from the date-matched pairs of price differences for each stock symbol pair. Lag values are used as keys in the dictionary:

def gc(a):
    m = 30
    s1 = a[0]
    s2 = a[1]
    list1 = a[2]
    list2 = a[3]
    test_array = np.array([list1, list2], np.int64)
    p_ssr_ftest_dict = {}
    try:
        gc = stattools.grangercausalitytests(test_array.T, m, verbose=False)
        for q in gc.keys():
            p_ssr_ftest_dict[q] = gc[q][0]['ssr_ftest'][1]
    except Exception:
        # if the test cannot be computed for this pair, report no p-values
        for q in range(1, m+1):
            p_ssr_ftest_dict[q] = None
    return (s1, s2, p_ssr_ftest_dict)

We execute this function in a map operation:

symbol_pair_gc = matched.map(gc)

We define a function for converting the results to CSV format:

def report(a):
    s1 = a[0]
    s2 = a[1]
    fdict = a[2]
    report_list = []
    for q in sorted(fdict.keys()):
        report_list.append(str(fdict[q]))
    report_string = s1 + ',' + s2 + ',' + ','.join(report_list)
    return report_string

We execute a map operation using this function to create CSV output:

report_csv = symbol_pair_gc.map(report)

Finally, we save the results to a file. Note that when running on a cluster, for the path given below, a path “/home/ec2-user/output/gc_output/_temporary” must exist on the worker nodes before starting the program. “/home/ec2-user/output/gc_output” must not exist on the head node at start time.

report_csv.saveAsTextFile('/home/ec2-user/output/gc_output')


The Output

The program generates Granger causality test p-values for each stock symbol pair for lags one through 30, e.g.,



There you have it; a way to compute pairwise Granger causality for each stock pair in the New York Stock Exchange using Apache Spark. I’ll let you know if an investment strategy emerges from this work.


hacking the stock market (part 1)

Caveat: I am not a technical investor–just a hobbyist, so take this analysis with a grain of salt. I am also just beginning with my Master’s work in statistics.

I wanted to examine the correlation between changes in the daily closing price of the Dow Jones Industrial Average (DJIA) and lags of those changes, to see if there is a pattern I could use. First I downloaded the DJIA data from Yahoo using Pandas:

# load useful libraries
from pandas.io.data import DataReader
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
import math
from scipy.stats.stats import pearsonr
from scipy.stats.stats import spearmanr

# load DJIA data from Yahoo server
djia = DataReader("DJIA",  "yahoo", datetime(2000,1,1), datetime.today())

I then generated the autocorrelation plots for the one-day differenced closing prices and the signs of the one-day differenced closing prices:

# investigate the diff between DJIA closing prices
diff_1st_order = djia["Adj Close"].diff()
diff_1st_order_as_list = []
for d in diff_1st_order:
    if not np.isnan(d):
        diff_1st_order_as_list.append(d)
plt.subplot(2, 1, 1)
plt.acorr(diff_1st_order_as_list, maxlags=10)
plt.title("Autocorrelation of Diff of DJIA Adjusted Close")

# sign of diff, not diff itself
diff_1st_order_sign = []
for d in diff_1st_order_as_list:
    if not np.isnan(d / abs(d)):
        diff_1st_order_sign.append(d / abs(d))
plt.subplot(2, 1, 2)
plt.acorr(diff_1st_order_sign, maxlags=10)
plt.title("Autocorrelation of Sign of Diff of DJIA Adjusted Close")


There is a very small negative correlation between the one-day closing price difference and the one-day lag of the one-day closing price difference. Similarly, there is an even smaller positive correlation between the one-day closing price difference and the three-day lag of the one-day closing price difference.

So I set out to find the proportion of days on which the sign of the difference is opposite to the sign of its one-day lag, i.e., how often the closing price changes direction from one day to the next:

# frequencies of 1-day lag changes in direction of closing price
count_opposite = 0
count_same = 0
i_list = []
j_list = []
for i in range(0, len(diff_1st_order_as_list) - 1):
    price_diff_i = diff_1st_order_as_list[i]
    price_diff_j = diff_1st_order_as_list[i+1]  # one trading day ahead

    # keep the paired differences for the correlation tests below
    i_list.append(price_diff_i)
    j_list.append(price_diff_j)

    sign_of_price_diff_i = 0
    if not np.isnan(price_diff_i / abs(price_diff_i)):
        sign_of_price_diff_i = int(price_diff_i / abs(price_diff_i))

    sign_of_price_diff_j = 0
    if not np.isnan(price_diff_j / abs(price_diff_j)):
        sign_of_price_diff_j = int(price_diff_j / abs(price_diff_j))

    if sign_of_price_diff_i == sign_of_price_diff_j:
        count_same += 1
    else:
        count_opposite += 1

print 'Correlation coefficients for the diff lists:'
print '\t', 'Pearson R: ', pearsonr(i_list, j_list)[0]
print '\t', 'Spearman R: ', spearmanr(i_list, j_list)[0]

n_obs = count_same + count_opposite
print 'Amount of time closing value direction remains the same: ', round(float(count_same) / n_obs, 3)
amount_time_changes = float(count_opposite) / n_obs
print 'Amount of time closing value direction changes: ', round(amount_time_changes, 3)
# Agresti-Coull interval: add z^2/2 pseudo-successes and z^2 pseudo-trials
z = 1.959964
n_adj = n_obs + z**2
p_adj = (count_opposite + z**2 / 2.0) / n_adj
half_width = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
L, U = p_adj - half_width, p_adj + half_width
print 'Agresti-Coull C.I.: ', round(L, 3), '< p <', round(U, 3)


This analysis tells me that in the long run (at least over the period that I pulled DJIA data for), betting on one-day changes of direction in the DJIA closing price would slowly pay off. However, this analysis ignores the magnitudes of the changes, which may be too small to be worth the price of a trade; a future analysis will investigate this. The test for correlation between the two time series (the one-day difference and its one-day lag) shows that the Agresti-Coull confidence interval is appropriate (we have near independence), although this betting scheme relies on the thinnest of the correlations detected by the autocorrelation plot.
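
For reference, the Agresti-Coull interval can be wrapped in a small function and compared with the plain normal-approximation (Wald) interval. This is a sketch under stated assumptions: the function names and the counts in the example are hypothetical, not results from the DJIA data.

```python
import math

def wald_interval(successes, trials, z=1.959964):
    """Normal-approximation interval: p +/- z * sqrt(p(1-p)/n)."""
    p = successes / float(trials)
    half = z * math.sqrt(p * (1.0 - p) / trials)
    return p - half, p + half

def agresti_coull_interval(successes, trials, z=1.959964):
    """Agresti-Coull: add z^2/2 pseudo-successes and z^2 pseudo-trials first."""
    n_adj = trials + z * z
    p_adj = (successes + z * z / 2.0) / n_adj
    half = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    return p_adj - half, p_adj + half

# hypothetical counts: 2050 direction changes out of 4000 day pairs
print(wald_interval(2050, 4000))
print(agresti_coull_interval(2050, 4000))
```

With thousands of observations the two intervals nearly coincide; the adjustment matters mostly for small samples or proportions near 0 or 1.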

There was another possible pattern in the autocorrelation plot above: a positive correlation at the three-day lag combined with a negative correlation at the one-day lag. I decided to check how often the current day’s change moves in the opposite direction to the one-day lag and in the same direction as the three-day lag, and whether that proportion exceeds the 25% expected under the null:

# frequencies of combination of 1-day lag and 3-day lag changes in direction
# of closing price
count_matches = 0
count_non_matches = 0
i_list = []
j_list = []
k_list = []
for i in range(0, len(diff_1st_order_as_list) - 3):
    price_diff_i = diff_1st_order_as_list[i]
    price_diff_j = diff_1st_order_as_list[i+2]  # two trading days ahead of i, i.e., one day before k
    price_diff_k = diff_1st_order_as_list[i+3]  # three trading days ahead

    # keep the aligned differences for the correlation tests below
    i_list.append(price_diff_i)
    j_list.append(price_diff_j)
    k_list.append(price_diff_k)

    # price_diff_i represents 3-day lag
    sign_of_price_diff_i = 0
    if not np.isnan(price_diff_i / abs(price_diff_i)):
        sign_of_price_diff_i = int(price_diff_i / abs(price_diff_i))

    # price_diff_j represents 1-day lag
    sign_of_price_diff_j = 0
    if not np.isnan(price_diff_j / abs(price_diff_j)):
        sign_of_price_diff_j = int(price_diff_j / abs(price_diff_j))

    # price_diff_k represents current day
    sign_of_price_diff_k = 0
    if not np.isnan(price_diff_k / abs(price_diff_k)):
        sign_of_price_diff_k = int(price_diff_k / abs(price_diff_k))

    if sign_of_price_diff_k != sign_of_price_diff_j and sign_of_price_diff_k == sign_of_price_diff_i:
        count_matches += 1
    else:
        count_non_matches += 1

print 'Correlation coefficients for the diff lists:'
print '\t', 'Pearson R: ', pearsonr(i_list, j_list)[0]
print '\t', 'Spearman R: ', spearmanr(i_list, j_list)[0]
print '\t', 'Pearson R: ', pearsonr(i_list, k_list)[0]
print '\t', 'Spearman R: ', spearmanr(i_list, k_list)[0]
print '\t', 'Pearson R: ', pearsonr(j_list, k_list)[0]
print '\t', 'Spearman R: ', spearmanr(j_list, k_list)[0]

n_obs = count_matches + count_non_matches
amount_time_changes = float(count_matches) / n_obs
print 'Amount of time 1-day change is opposite direction and 3-day change is same direction: ', round(amount_time_changes, 3)
# Agresti-Coull interval: add z^2/2 pseudo-successes and z^2 pseudo-trials
z = 1.959964
n_adj = n_obs + z**2
p_adj = (count_matches + z**2 / 2.0) / n_adj
half_width = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
L, U = p_adj - half_width, p_adj + half_width
print 'Agresti-Coull C.I.: ', round(L, 3), '< p <', round(U, 3)
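
The 25% null can be checked by enumeration. The sketch below assumes that up and down days are equally likely and independent and that unchanged days are ignored; these are simplifying assumptions, not properties established by the data.

```python
from itertools import product

# every sign pattern for (3-day lag i, 1-day lag j, current day k)
patterns = list(product([-1, 1], repeat=3))

# a "match" means k moves against the 1-day lag and with the 3-day lag
matches = [(i, j, k) for (i, j, k) in patterns if k != j and k == i]

null_rate = len(matches) / float(len(patterns))
print(null_rate)  # 0.25
```

Two of the eight equally likely patterns qualify, so a match rate reliably above one quarter is what would indicate a usable signal.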


To complete the code, we need to show the plot:

# show the plot
plt.show()

I’m not certain the narrow margin of opportunity detected by this analysis is sufficient for the development of a trading strategy. More investigation is needed.

Posted in econometrics, statistics

Kaplan-Meier estimator in Python

The following Python class computes and draws Kaplan-Meier product limit estimators for given data. An example of how to use the class follows the code.


# load useful libraries
import matplotlib.pyplot as plt

# class for building Kaplan-Meier product limit estimator
class KM(object):

    # constructor
    def __init__(self, measured_values, censored_or_not):
        self.measured_values = measured_values
        self.censored_or_not = censored_or_not

        # log measured values where the event occurred (1) or was censored (0)
        self.event_occurred_times = {}
        self.censored_occurred_times = {}
        for i, mv in enumerate(self.measured_values):
            if self.censored_or_not[i] == 1:
                if not self.event_occurred_times.has_key(mv):
                    self.event_occurred_times[mv] = 0
                self.event_occurred_times[mv] += 1
            else:
                if not self.censored_occurred_times.has_key(mv):
                    self.censored_occurred_times[mv] = 0
                self.censored_occurred_times[mv] += 1

        # construct list of j values: time zero, each distinct event time, and an end point
        j_list = [0]
        j_list.extend(sorted(self.event_occurred_times.keys()))
        j_list.append(max(max(self.event_occurred_times.keys()), max(self.censored_occurred_times.keys())) + 1)
        self.j_list = j_list

        # count censored values in interval [j, j+1), index by j
        self.number_of_units_censored_in_interval_j_to_j_plus_one = {}
        for i in range(0, len(j_list)-1):
            j = j_list[i]
            j_plus_one = j_list[i+1]
            m_count = 0
            for m in self.censored_occurred_times.keys():
                if j <= m and m < j_plus_one:
                    m_count = m_count + self.censored_occurred_times[m]
            self.number_of_units_censored_in_interval_j_to_j_plus_one[j] = m_count

        # calculate number of units at risk just prior to time t_j
        self.number_of_units_at_risk_just_prior_to_time_j = {}
        for i in range(0, len(j_list)-1):
            j = j_list[i]
            n_count = 0.
            for k in range(i, len(j_list)-1):
                jk = j_list[k]
                if self.event_occurred_times.has_key(jk):
                    n_count = n_count + self.event_occurred_times[jk]
                if self.number_of_units_censored_in_interval_j_to_j_plus_one.has_key(jk):
                    n_count = n_count + self.number_of_units_censored_in_interval_j_to_j_plus_one[jk]
            self.number_of_units_at_risk_just_prior_to_time_j[j] = n_count

        # add time zero count to self.event_occurred_times
        self.event_occurred_times[0] = 0

        # build the estimator for each time j
        self.S = {}
        for i in range(0, len(j_list)-1):
            j = j_list[i]
            prod = 1.
            for k in range(0, i+1):
                jk = j_list[k]
                prod = prod * ((self.number_of_units_at_risk_just_prior_to_time_j[jk] - self.event_occurred_times[jk]) / self.number_of_units_at_risk_just_prior_to_time_j[jk])
            self.S[j] = prod

    # display the estimator in the console
    def display(self):

        print '\ttime\tn.risk\tn.event\tsurvival'
        for i in range(1, len(self.j_list) - 1):
            j = self.j_list[i]
            print '\t' + str(j) + '\t' + str(int(self.number_of_units_at_risk_just_prior_to_time_j[j])) + '\t' + str(self.event_occurred_times[j]) + '\t' + str(round(self.S[j], 3))

    # plot
    def plot(self, color, xlabel, title):
        j_list_sans_end = self.j_list[0:-1]

        # plot the curve
        for i in range(0, len(j_list_sans_end) - 1):
            j = j_list_sans_end[i]
            j_plus_one = j_list_sans_end[i+1]
            plt.plot([j, j_plus_one], [self.S[j], self.S[j]], color=color)
            plt.plot([j_plus_one, j_plus_one], [self.S[j], self.S[j_plus_one]], color=color)

        # set the axis limits
        plt.xlim([0, self.j_list[-1]])
        plt.ylim([-0.05, 1.05])

        # add the censored cases within the curve
        for i in sorted(self.censored_occurred_times.keys()):
            last_S = 1.
            for j in sorted(self.S.keys()):
                if j >= i:
                    plt.scatter(i, last_S, color=color)
                    break  # mark each censored time once, at the survival level in effect
                last_S = self.S[j]

        # add the censored cases beyond the curve
        for i in sorted(self.censored_occurred_times.keys()):
            max_S_key = max(self.S.keys())
            if i > max_S_key:
                plt.scatter(i, self.S[max_S_key], color=color)

        # label and show the plot
        plt.xlabel(xlabel)
        plt.title(title)
        plt.show()

Usage Example

import matplotlib.pyplot as plt
times = [3, 4, 4, 5, 7, 6, 8, 8, 12, 14, 14, 19, 20, 20]
events = [1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0]  # 0 for censored cases, 1 for event occurred
my_km = KM(times, events)
my_km.plot("blue", "Time", "Kaplan-Meier Product Limit Estimator Example")
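
As a cross-check on the numbers, the product-limit estimate can also be computed in a few lines without the class. This is a minimal sketch: `km_table` is a hypothetical helper, not part of the class above. At each distinct event time it multiplies the running survival by (n at risk - n events) / n at risk.

```python
def km_table(times, events):
    """Return (time, n_at_risk, n_events, survival) rows, one per distinct event time."""
    rows = []
    survival = 1.0
    event_times = sorted(set(t for t, e in zip(times, events) if e == 1))
    for t in event_times:
        # units still under observation just before time t
        n_at_risk = sum(1 for u in times if u >= t)
        # events (not censorings) that happen exactly at time t
        n_events = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= (n_at_risk - n_events) / float(n_at_risk)
        rows.append((t, n_at_risk, n_events, survival))
    return rows

times = [3, 4, 4, 5, 7, 6, 8, 8, 12, 14, 14, 19, 20, 20]
events = [1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0]  # 0 for censored, 1 for event
for row in km_table(times, events):
    print(row)
```

For the example data the first rows should be time 3 with 14 at risk and time 4 with 13 at risk, giving survival values of 13/14 and 11/14.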



Posted in data science, engineering, science, statistics

setting up a Nexus OSS server on an Amazon EC2 instance

It took me a while to figure out how to set up a Nexus OSS server on an Amazon EC2 instance, so I’m writing down the instructions here in case they prove useful to anyone in the future.

In this case we are also changing the default Nexus OSS server port of 8081 to port 80 due to a firewall issue.

Start up an Amazon Linux EC2 instance with incoming port 22 set to “My IP” and incoming port 80 set to “Anywhere”. For the sake of these instructions, we assume the IP address of the new EC2 instance is xxx.xxx.xxx.xxx.

Log into the server using a terminal:

ssh -i your_key.pem ec2-user@xxx.xxx.xxx.xxx

Execute the following commands:

sudo yum update
sudo yum install emacs
wget http://download.sonatype.com/nexus/oss/nexus-latest-bundle.tar.gz
sudo cp nexus-latest-bundle.tar.gz /usr/local
cd /usr/local
sudo tar -xvzf nexus-latest-bundle.tar.gz
sudo rm nexus-latest-bundle.tar.gz 
sudo ln -s nexus-2.9.2-01 nexus
sudo chown -R ec2-user:ec2-user nexus
sudo chown -R ec2-user:ec2-user nexus-2.9.2-01
sudo chown -R ec2-user:ec2-user sonatype-work

Edit the Nexus server executable, changing the value of “RUN_AS_USER” to “root”:

emacs nexus/bin/nexus

Set a root password, which we’ll need later:

sudo su - root
passwd
exit

Enter the “nexus” directory and edit the conf/nexus.properties file to set the “application-port” to “80”.

cd nexus
emacs conf/nexus.properties

Start the server. Enter the root password when prompted:

./bin/nexus start

Wait about a minute or two for the server to fire up, and then point a web browser to http://xxx.xxx.xxx.xxx/nexus to use the server.


Posted in engineering

data natives

We hear a lot of marketing yammer about “digital natives”, that is, folks fluent in social media and in particular marketing using social media. Writers who use this term often juxtapose such digital natives against “analog natives”, i.e., individuals who matured or were educated before online social media became such a significant part of our lives. These writers often imply that such analog natives are unable to understand the world today. This is of course ridiculous; anyone with an open mind and tenacity can develop marketing skill with social media.

I offer a different grouping of individuals that might be considered an analog (pun intended) of digital nativity: “data natives”. By this I mean individuals fluent in using heterogeneous numerical and textual data sources, along with mathematical techniques, to reach conclusions about many facets of their lives and work.

Consider the following example: When I was shopping for a used RV trailer recently, I needed to filter out trailers from the pool of candidates that were too heavy for my truck to pull. However, used RV listings only specified length of the RV trailers, rather than the weight. Therefore I used regression analysis to predict weight from length based on about twenty known length-weight data points. This simple example illustrates a data native’s approach to solving problems.
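
A minimal sketch of that kind of fit, using ordinary least squares in plain Python, is shown here. The length-weight pairs are invented for illustration and are not the listings I actually used; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# hypothetical (length in feet, weight in pounds) pairs
lengths = [16.0, 18.0, 20.0, 22.0, 24.0]
weights = [2600.0, 3000.0, 3400.0, 3800.0, 4200.0]

slope, intercept = fit_line(lengths, weights)
predicted_weight = slope * 21.0 + intercept  # estimate for a 21-foot trailer
print(predicted_weight)
```

Any listing whose predicted weight exceeds the truck’s towing capacity can then be dropped from the candidate pool.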

In my usage of the phrase, “data native” is meant to be a more comprehensive designation than “data scientist”, though certainly there is crossover between the two. In using the word “native” I’m implying an intimate comfort with data and data-driven decision making, as if immersion in and skill with data came as naturally as one’s mother tongue. Data science is a job; data native is a way of being.

For now, data natives are as rare as data scientists. But the new world of Big Data is producing both at a rapid clip. This will likely enrich the world.

Posted in big data, data science, marketing, statistics

GNU Octave: a free, open source MATLAB-like language for numerical computing

I tend to use Python with the NumPy, SciPy, and Matplotlib stack whenever I have to do scientific computing. For statistical computing I use R whenever this Python stack does not provide the necessary features. However, I want to draw readers’ attention to another tool for free, open-source numerical computing: GNU Octave (hereafter called “Octave”), which is an interactive language closely related to MATLAB. In fact, most MATLAB code can be run in Octave with no modification; for MATLAB code that does need a change to run in Octave, the necessary modification is usually slight. MATLAB is common in engineering and scientific environments, but it costs money. For budget-constrained startups or non-profits Octave is a viable alternative. This post introduces Octave and shows its operation in both a statistical and a control system design role.

Octave is covered by the GNU General Public License, which lets programmers modify it as needed, provided that any modified versions they distribute are released under the same license along with their source code. This license also makes Octave effectively free in the financial sense. The program runs on both Linux and Windows, although the Windows setup is more complicated since you have to use Cygwin, MinGW, or Visual Studio.

Octave has a full range of plotting abilities that MATLAB users will be familiar with. The program automatically uses a third-party program, Gnuplot, to create the plots. For example, a 3D plot of a geodesic sphere rendered by Octave is shown below [1]:


As another example, a Nyquist plot of a dynamic system is:


Like MATLAB, Octave has many toolkits, which are published as part of Octave-Forge [2], a central repository for Octave packages. Examples of such toolkits are packages for control systems analysis/design and fuzzy logic.

Under the hood, Octave uses the BLAS and LAPACK libraries to facilitate linear algebra computations, in much the same way R and Python’s NumPy package do.

Recent versions of Octave come with a GUI. Here the GUI shows the iconic “sombrero” plot, using code given by [3]:


Statistical Calculations with Octave

Since I’m a statistician in training, I thought I’d demonstrate a few of Octave’s basic statistical tools. First, a box plot of samples from two normal distributions:

a = normrnd(20, 3, 100, 1);
b = normrnd(22, 3, 100, 1);
boxplot([a, b]);
title("Comparison of Samples From Two Normal Distributions")


Conducting a t-test on the two sets of samples from normal distributions:


Conducting linear regression on simulated data:

x = 0:0.01:10;
noise = normrnd(0, 2, 1, length(x));
y = 2*x + noise + 5;
hold on;
scatter(x, y);
F = polyfit(x, y, 1);
a = F(1);
b = F(2);
y_pred = a*x + b;
plot(x, y_pred, 'r');
title("Scatter Plot and Regression Line");
hold off;



Control Systems Calculations with Octave

I also have experience designing control systems, which Octave is well suited for. Here is an example of a PID controller’s step response implemented in Octave’s control system toolkit:

pkg load control
Kp = 4
Kd = 1
Ki = 1
P=tf([1], [1 Kp/Kd Ki/Kd])
step(P, 20)


Creating the Nyquist plot shown above for this system:




Octave is a viable, free alternative to MATLAB for many scientific computing applications.


1. http://octave-dome.sourceforge.net/ (I have stopped working on this. Use pyDome instead).
2. http://octave.sourceforge.net/
3. http://en.wikipedia.org/wiki/GNU_Octave

Posted in engineering, science, statistics

synthetic biology: an emerging engineering discipline

In the last decade a new engineering discipline called “synthetic biology” has emerged. It differs from the science of biology in that it applies engineering strategies to the creation of cells that perform a desired task, such as the production of drugs or biofuels. It also differs from previous genetic engineering approaches by stressing the assembly of systems composed of modular, repeatable genetic components selected from a pool of well-described candidate components. This post introduces the subject at a high level.

Modularity in Engineering Design

All mature branches of engineering stress modularity in design of systems and products, such that the designed systems are composed of simpler systems having known input and output behaviors. Examples of this design ethic are the electrical circuits (Butterworth filters from the LTspice example circuits) shown below:


These circuits are made of simpler components: resistors, capacitors, and inductors, each with known physical properties. Knowing the physical properties of each of these parts enables simulation of the whole combined circuits, allowing prediction of circuit outcome. For example, the LTspice predicted voltage responses of the top two of these filters are:


Computer-Aided Design (CAD)

In the discussion above the example was expressed in a CAD program called LTspice, which facilitates the specification, communication, and simulation of electrical circuits. Other branches of engineering use CAD for these purposes as well, for example Pro/ENGINEER by mechanical engineers and AutoCAD by civil engineers. These CAD packages also encourage modularity, as demonstrated by the multi-component system shown in Pro/ENGINEER below [1]:


Synthetic Biology

Synthetic biology is an approach to genetic engineering that draws from traditional engineering’s use of modular, well-described parts. DNA components of genes such as ribosome binding sites, protein coding regions, promoters, etc. are abstracted into “parts” that can be assembled with other parts—not necessarily from the same gene—into “devices”. These devices can in turn be combined into “systems” that result in a desired cellular behavior once DNA encoding the designed system is inserted into a cell. Key to this design strategy is that the engineer has a suite of genetic parts to choose from when designing the genes that they combine into larger systems. These larger systems are often said to be made of genetic “circuits”, since the designed operations can resemble switching and logic gates. We will explore sources of genetic parts shortly, but first consider CAD.

It is logical that synthetic biologists would seek CAD programs to help facilitate this design process, and such tools are beginning to emerge out of academic labs and commercial institutions. Most of these tools cover a single task in the design process, and therefore must be chained together if the designer is to go from part selection to simulation to DNA specification of the final design. This has driven the creation of markup languages to describe designs (CellML [2] and SBML [3]) so that multiple tools can work with the same design.

An example of the specification of protein production and interaction by a genetic circuit is provided by the iBioSim CAD package’s [4] tutorial:


Here proteins are shown in the blue boxes, and a promoter is shown in the diagonal box. The promoter is repressed by protein CI2 and activates transcription leading to the production of protein CII. An event (green box) specifies that cell division is to occur at a predefined point during the simulation. In iBioSim, all parameters of the chemical reaction dynamics must be specified prior to simulation, which is challenging because these parameters are often unknown and have to be estimated. iBioSim then enables simulation of the genetic circuit. Below we can see that the proteins created by the cell are expected to reach steady state:


Another CAD package in development for synthetic biology is Cello, which stands for “Cell Logic” [5]. In Cello the user specifies their desired logic in a truth table, where intracellular chemicals and signals make up the inputs, and the program selects genetic parts necessary to implement the logic. Cello then specifies the DNA necessary to implement the logic [5]. In the NOR gate example shown below, one or both of two promoters activate production of a protein that represses another promoter that activates the output protein [5].
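
To make the NOR logic concrete, here is a small sketch that prints its truth table. It only illustrates the logic itself and does not use Cello’s actual input format; the function name `nor` is my own.

```python
def nor(a, b):
    """NOR: the output is on only when neither input is present."""
    return int(not (a or b))

print("in1 in2 out")
for a in (0, 1):
    for b in (0, 1):
        print("%d   %d   %d" % (a, b, nor(a, b)))
```

In the genetic version, each input corresponds to an active promoter, and the output protein is produced only when neither promoter drives the repressor.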



Genetic Part Registries

Several repositories have emerged to store descriptions of genetic parts and devices [6, 7], with the goal of mimicking the specification sheets associated with semiconductor parts today. As semiconductor specification sheets encourage repeatable, modular design of electrical circuits, the plan for genetic part specifications is to encourage repeatable, modular design of genetic circuits. An example of such a repository is the Registry of Standard Biological Parts [6], which provides specification sheets for promoters, ribosome binding sites, coding regions, terminators, etc., e.g., [8]:



1. http://www.aras.com/integrations/MCAD/creo-parametric-connector.aspx
2. http://www.cellml.org/
3. http://sbml.org/Main_Page
4. http://www.async.ece.utah.edu/iBioSim/
5. http://cidar.bu.edu/cello/server/html/login.html
6. http://parts.igem.org/Catalog?title=Catalog
7. http://www.ncbi.nlm.nih.gov/pubmed/20160009
8. http://parts.igem.org/Part:BBa_K1216007

Posted in bioinformatics, engineering, science

building a web-enabled temperature logger

Not wanting to miss out on the “Internet of Things”, I decided to learn some of its foundational technology, namely microcontroller programming. Actually, I used a Raspberry Pi in this project instead of a classic microcontroller, but the idea is the same. Here I describe building a web-enabled temperature logger, complete with a web application to display its results.

The Challenge

I live in an RV with my cat. When I go to work I have to decide whether to leave the windows open or turn on the air conditioner to keep my cat cool, as there is no thermostat on my air conditioner. I usually decide based on the weather forecast, but really don’t know how hot it gets in the RV during the peak temperature of the day. I needed a data logger that reads the temperature regularly and stores it. Ideally such a data logger would report to a web application, so I can monitor the temperature from work.

The Solution

Raspberry Pi

I first bought a Raspberry Pi B+ computer and configured it to run Raspbian Linux. Then I added a USB WiFi dongle so the device can communicate with the Internet.

DS18B20 Digital Temperature Sensor

Next, I bought a DS18B20 digital temperature probe, and connected it to the Raspberry Pi according to the following schematic, which is slightly modified from that specified by [1]:


The 4.7 kOhm resistor came with the temperature sensor.

The resulting hardware looks like:


Running the Data Logger and Connecting to the Web

On the Raspberry Pi I run the following Python code, which is slightly modified from that shown in [1]. My modification simply calls a URL containing the temperature reading that is processed by the web application described below. This code sends a reading to the URL every two minutes.

import os
import glob
import time
import urllib2

os.system('modprobe w1-gpio')
os.system('modprobe w1-therm')

base_dir = '/sys/bus/w1/devices/'
device_folder = glob.glob(base_dir + '28*')[0]
device_file = device_folder + '/w1_slave'

def read_temp_raw():
    # read the sensor's two-line report and release the file handle
    f = open(device_file, 'r')
    lines = f.readlines()
    f.close()
    return lines

def read_temp():
    lines = read_temp_raw()
    while lines[0].strip()[-3:] != 'YES':
        # CRC check failed; wait briefly before reading again
        time.sleep(0.2)
        lines = read_temp_raw()
    equals_pos = lines[1].find('t=')
    if equals_pos != -1:
        temp_string = lines[1][equals_pos + 2:]
        temp_c = float(temp_string) / 1000.
        return temp_c

while True:
    temp_c = read_temp()
    req = urllib2.urlopen('http://my.url.com/logtemp.php?temp=' + str(temp_c))
    req.close()
    time.sleep(2. * 60.)
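
The parsing step can be tried without any hardware. The sketch below applies the same logic to two made-up lines laid out the way the w1_slave file is assumed to look: a CRC status ending in YES on the first line, and the temperature in thousandths of a degree Celsius after `t=` on the second. The function name `parse_w1_slave` and the sample values are illustrative.

```python
def parse_w1_slave(lines):
    """Return the temperature in Celsius, or None if the reading is unusable."""
    # the first line must end in YES, meaning the CRC check passed
    if len(lines) < 2 or not lines[0].strip().endswith('YES'):
        return None
    pos = lines[1].find('t=')
    if pos == -1:
        return None
    # the value after 't=' is in thousandths of a degree Celsius
    return float(lines[1][pos + 2:]) / 1000.0

# made-up sample lines in the assumed w1_slave layout
sample = [
    '72 01 4b 46 7f ff 0e 10 57 : crc=57 YES\n',
    '72 01 4b 46 7f ff 0e 10 57 t=23125\n',
]
print(parse_w1_slave(sample))  # 23.125
```

Returning None for a bad reading lets the caller decide whether to retry or skip a report.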

Web Application for Storing Temperature Readings

A PHP program receives the temperature reading sent by the Python script as a GET argument. It then places the temperature value with a time stamp into a MySQL database. I chose PHP for this task because my web hosting company makes PHP deployment much easier than Django or JSP deployment:

<html>
<head>
  <title>Log Temperature</title>
</head>
<body>
<?php
    // cast to float so that only a number can reach the SQL statement
    $temp = floatval($_GET['temp']);

    $con = mysqli_connect("host", "user", "password", "database");

    // Check connection
    if (mysqli_connect_errno()) {
        echo "Failed to connect to MySQL: " . mysqli_connect_error();
    }

    $sql = "INSERT INTO temperature_log VALUES (now(), $temp)";

    if (!mysqli_query($con, $sql)) {
        die('Error: ' . mysqli_error($con));
    }
    echo "1 record added";
?>
</body>
</html>


Web Application for Displaying Temperature Readings

The web application for viewing the temperature readings displays a run chart and a log. The run chart is implemented in JavaScript with the jqPlot library. The application queries the MySQL database for the last 24 hours’ readings. Again, I used PHP just because it is easy to deploy on my web hosting platform.


The code for this application is:

<html>
<head>
<link rel="stylesheet" type="text/css" href="viewtemp.css">

<script language="javascript" type="text/javascript" src="jqplot/jquery.min.js"></script>
<script language="javascript" type="text/javascript" src="jqplot/jquery.jqplot.min.js"></script>
<script language="javascript" type="text/javascript" src="jqplot/plugins/jqplot.canvasTextRenderer.min.js"></script>
<script language="javascript" type="text/javascript" src="jqplot/plugins/jqplot.canvasAxisTickRenderer.min.js"></script>
<link rel="stylesheet" type="text/css" href="jqplot/jquery.jqplot.css" />

  <title>View Temperature Log</title>
</head>
<body>

<center><h1>Trailer Temperature</h1></center>

<div id="chart"></div>

<?php
    $con = mysqli_connect("host", "user", "password", "database");

    // Check connection
    if (mysqli_connect_errno()) {
        echo "Failed to connect to MySQL: " . mysqli_connect_error();
    }

    $sql = "select * from temperature_log tl where tl.time >= DATE_SUB(NOW(), INTERVAL 1 DAY) order by tl.time desc";

    $result = mysqli_query($con, $sql);
?>

<table id='time_temp_table'><thead><tr><th>Time</th><th>Temperature (C)</th><th>Temperature (F)</th></tr></thead><tbody>

<?php
    $temp_array = array();
    $time_array = array();
    while($row = mysqli_fetch_array($result)) {
        $temp_f = round($row['temperature'] * 1.8 + 32.0, 2);
        echo "<tr><td id='time_entry'>" . $row['time'] . "</td><td id='temp_entry'>" . $row['temperature'] . "</td><td>" . $temp_f . "</td></tr>";
        array_push($temp_array, $temp_f);
        array_push($time_array, $row['time']);
    }

    // the query returns the newest reading first; reverse so the chart runs oldest to newest
    $temp_array_reverse = array_reverse($temp_array);
    $time_array_reverse = array_reverse($time_array);
?>

</tbody></table>

<script type="text/javascript">
<?php
    $js_array = json_encode($temp_array_reverse);
    echo "var tempArrayAsString = " . $js_array . ";\n";
    $js_array = json_encode($time_array_reverse);
    echo "var timeArrayAsString = " . $js_array . ";\n";
?>

$(document).ready(function() {

	tempArray = [];
	$.each(tempArrayAsString, function(index, value) {
		tempArray.push(parseFloat(value));
	});

	// x values are minutes elapsed since the first reading
	data = [];
	data.push([0, tempArray[0]]);
	timeDiffList = [0];
	for (var i=1; i<timeArrayAsString.length; i++) {
		d = timeArrayAsString[0].split(' ')[0];
		year = parseInt(d.split('-')[0]);
		month = parseInt(d.split('-')[1]) - 1;
		day = parseInt(d.split('-')[2]);
		t = timeArrayAsString[0].split(' ')[1];
		hour = parseInt(t.split(':')[0]);
		minute = parseInt(t.split(':')[1]);
		second = parseInt(t.split(':')[2]);
		dt0 = new Date(year, month, day, hour, minute, second);

		d = timeArrayAsString[i].split(' ')[0];
		year = parseInt(d.split('-')[0]);
		month = parseInt(d.split('-')[1]) - 1;
		day = parseInt(d.split('-')[2]);
		t = timeArrayAsString[i].split(' ')[1];
		hour = parseInt(t.split(':')[0]);
		minute = parseInt(t.split(':')[1]);
		second = parseInt(t.split(':')[2]);
		dti = new Date(year, month, day, hour, minute, second);

		var timeDiff = (dti - dt0) / (1000. * 60.);
		timeDiffList.push(timeDiff);

		data.push([timeDiff, tempArray[i]]);
	}

	// label roughly every two hours, plus the most recent reading
	var ticksToUse = [];
	var position_dict = {};
	for (i=0; i<timeDiffList.length; i++) {
		var a = Math.round(timeDiffList[i] / 5) * 5;
		if (a % 120 == 0) {
			var label = timeArrayAsString[i];
			if (!position_dict.hasOwnProperty(a)) {
			    ticksToUse.push([timeDiffList[i], label]);
			    position_dict[a] = true;
			}
		}
	}
	label = timeArrayAsString[i-1];
	ticksToUse.push([timeDiffList[i-1], label]);

	$.jqplot('chart',  [data], {
		series: [{showMarker: false, lineWidth: 2}],
		axesDefaults : {
			tickRenderer: $.jqplot.CanvasAxisTickRenderer ,
			tickOptions: {
				angle: -80
			}
		},
		axes: {
			xaxis: {
				label: 'Time (Minutes)',
				ticks: ticksToUse
			},
			yaxis: {
				label: 'Temperature (Fahrenheit)',
				tickOptions: { angle: 0 }
			}
		}
	});
});
</script>

</body>
</html>


The CSS for the application is:

#time_temp_table {
    border-collapse: collapse;
    background-color: lightblue;
}

#time_temp_table th {
    border: 1px solid black;
    text-align: center;
    padding: 2px 15px 2px 15px;
}

#time_temp_table td {
    border: 1px solid black;
    text-align: center;
    padding: 2px 15px 2px 15px;
}

Future Plans

For the web application that displays the recorded temperatures, it would be nice to add a box plot to summarize the results.

Ultimately, I’d like to connect this hardware to my air conditioner so that it will automatically turn on when a set temperature point is reached. I’ll need some high-amp relays for this.


1. https://learn.adafruit.com/adafruits-raspberry-pi-lesson-11-ds18b20-temperature-sensing

Related Post

engineer moves into an RV

Posted in engineering, science