clustering stocks by price correlation (part 2)

In my last post, “clustering stocks by price correlation (part 1)”, I performed hierarchical clustering of NYSE stocks by correlation in weekly closing price. I expected the stocks to cluster by industry, and found that they did not. I proposed several explanations for this observation, including that perhaps I chose a poor distance metric for the clustering procedure, that perhaps daily rather than weekly sampling would have generated better results, or that stocks simply don’t cluster by industry very much. I also proposed generating k-means clusters to see if that revealed different results. This post details that k-means clustering effort.

K-Means Clustering

K-means clustering starts with a matrix of observations and their feature values, resembling the data structure one might use in a regression [1]. This differs from hierarchical clustering, where the data matrix is an N-by-N pairwise comparison of the observations under a similarity (or distance) metric. For k-means, an N-by-3 matrix would hold N observations, each described by three feature values.


This matrix is then normalized, or “whitened” in the language of k-means clustering: each feature column is scaled by its standard deviation so that no single feature dominates the distance calculations.
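As a sketch of what whitening does (the matrix values here are made up purely for illustration), scipy.cluster.vq.whiten rescales each feature column to unit standard deviation:

```python
import numpy as np
from scipy.cluster.vq import whiten

# a toy 4-by-3 feature matrix (illustrative values only)
features = np.array([
    [1.0, 200.0, 0.5],
    [2.0, 180.0, 0.7],
    [1.5, 220.0, 0.2],
    [3.0, 210.0, 0.9],
])

# "whitening" divides each column by its standard deviation,
# so every feature contributes on a comparable scale
W = whiten(features)

# after whitening, each column has unit standard deviation
print(W.std(axis=0))
```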

Each observation in the “whitened” matrix is then randomly assigned to one of k clusters, and the following steps are repeated until a stop condition (usually minimal change in cluster assignment) is reached [2]:

  1. The algorithm computes the centroid (mean feature vector) of each current cluster.
  2. The algorithm then assigns each observation to the cluster with the nearest centroid.
  3. Finally, the algorithm tests for the stop condition and repeats from step 1 if it is not met.

The result is a set of k clusters grouped by similarity of feature values.
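The steps above can be sketched in plain NumPy (a minimal illustration, not the implementation used below; the function name and toy data are mine):

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Minimal k-means following the three steps above: random initial
    assignment, then alternating mean/assignment updates until the
    cluster assignments stop changing."""
    rng = np.random.RandomState(seed)
    labels = rng.randint(0, k, size=len(X))  # random initial clusters
    for _ in range(max_iter):
        # step 1: average value (centroid) of each current cluster;
        # an empty cluster is re-seeded with a random observation
        centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c)
            else X[rng.randint(len(X))]
            for c in range(k)
        ])
        # step 2: reassign each observation to the nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # step 3: stop when no assignment changed
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# two well-separated blobs should end up in different clusters
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10.0])
labels, centroids = kmeans_sketch(X, k=2)
```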

We will implement our calculation using Python’s scipy.cluster.vq [3] module.
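A minimal example of the module’s whiten/kmeans/vq pipeline on toy data (not the stock matrix):

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

np.random.seed(0)  # kmeans initializes randomly, so fix the seed

# toy observations: two obvious groups in a 2-feature space
obs = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 8.0])

W = whiten(obs)                      # scale each feature to unit variance
codebook, distortion = kmeans(W, 2)  # find 2 cluster centroids
groups, _ = vq(W, codebook)          # assign each observation to a centroid
```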


I started with a list of daily closing prices for all the stocks listed on the New York Stock Exchange (NYSE).


I then extracted the closing price of every stock for every Tuesday between October 22, 2013 and October 28, 2014, sampling weekly while avoiding holidays. Stocks that left or joined the NYSE during this period were discarded, ensuring that every stock in the resulting data set had the same number of time points [4].
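The weekly Tuesday sampling can be sketched with the standard library alone (the date range matches the one described above):

```python
import datetime

# walk backwards from the last sample date in 7-day steps,
# collecting 54 Tuesdays (October 22, 2013 through October 28, 2014)
end_date = datetime.datetime(2014, 10, 28)
dates = [end_date - datetime.timedelta(days=7 * i) for i in range(54)]
dates = sorted(d.strftime('%Y-%m-%d') for d in dates)

# every sampled date falls on a Tuesday (weekday() == 1)
assert all(datetime.datetime.strptime(d, '%Y-%m-%d').weekday() == 1 for d in dates)
```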

For every stock in the data set, I next calculated the Pearson correlation coefficient between that stock’s price time series and the price time series of each DJIA stock listed on the NYSE. Each correlation with a DJIA stock serves as one feature, as required by the data matrix used in k-means clustering. The resulting feature matrix had dimensions 2601-by-27: 2601 stocks, each compared to the 27 DJIA stocks listed on the NYSE.
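Each feature comes from a correlation computed like this (the price series here are toy values, not real quotes):

```python
import scipy.stats

# toy weekly price series (illustrative values only)
stock = [10.0, 10.5, 11.2, 10.9, 11.8, 12.1]
djia_member = [150.0, 152.0, 155.1, 154.0, 158.2, 159.0]

# pearsonr returns (correlation, p-value); the correlation is the feature
r, p_value = scipy.stats.pearsonr(stock, djia_member)

# shift from [-1, 1] to [0, 2], as done in the script below
feature = r + 1.0
```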


Finally, I “whitened” the feature matrix and ran the Python k-means module on the whitened feature matrix.

Python code implementing these calculations is appended below.


Like the hierarchical clustering results shown in [4], the groups do not appear to be organized by industrial sector.


This suggests several possibilities: First, correlation in stock prices may simply not cluster by industry very much. Second, my use of k=50 (50 groups) may have been inappropriate; perhaps a better number of groups exists. Third, sampling prices daily rather than weekly might produce better results. Finally, perhaps correlation is not the best feature to use.
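On the second point, one way to probe the choice of k is to sweep it and watch the mean distortion that scipy.cluster.vq.kmeans returns; an "elbow" in that curve suggests a reasonable number of groups. A sketch on toy data with three genuine clusters:

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans

np.random.seed(0)

# toy data with three genuine groups in a 3-feature space
obs = np.vstack([np.random.randn(30, 3) + offset for offset in (0.0, 6.0, 12.0)])
W = whiten(obs)

# kmeans returns (codebook, mean distortion); distortion falls as k grows,
# with diminishing returns past the "true" number of clusters
distortions = {k: kmeans(W, k)[1] for k in range(1, 7)}
```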

I also tried using the weekly change in prices rather than the actual prices, but that attempt did not show any better grouping by industry.


# import useful libraries
import scipy.cluster.vq
import scipy.stats
import numpy as np
import datetime

# user settings
number_of_groups_to_use = 50

# set seed for repeatability (the value is arbitrary; any fixed seed works)
np.random.seed(0)

# load the Dow Jones Industrial Average symbols
djia = []
f = open('DJIA.tsv')
for line in f:
    company, exchange, symbol = line.strip().split('\t')
    if exchange == 'NYSE':
        djia.append(symbol)
f.close()
djia = sorted(djia)

# specify the time points we want to sample and sort them
end_date = datetime.datetime(2014, 10, 28, 0, 0)
dates_to_use = [end_date]
dt = datetime.timedelta(days=-7)
for i in range(0, 53):
    dates_to_use.append(dates_to_use[-1] + dt)
dates_to_use = sorted(dates_to_use)
dates_to_use = [x.strftime('%Y-%m-%d') for x in dates_to_use]

# load the closing prices for the given dates into memory
prices = {}
f = open('../../combined_data.csv')
for line in f:
    fields = line.strip().split(',')
    symbol = fields[0]
    date_string = fields[1]
    if date_string not in dates_to_use:  continue
    close = float(fields[5])
    if symbol not in prices:
        prices[symbol] = {}
    prices[symbol][date_string] = close
f.close()

# delete cases without all of the dates in dates_to_use
symbol_list = list(prices.keys())
for symbol in symbol_list:
    if len(prices[symbol]) != len(dates_to_use):
        del prices[symbol]

# generate price time series
price_time_series = {}
for symbol in prices.keys():

    # actual prices
    price_time_series[symbol] = [prices[symbol][d] for d in dates_to_use]

    # change in prices
    #price_time_series[symbol] = [(prices[symbol][dates_to_use[n]] - prices[symbol][dates_to_use[n-1]]) / prices[symbol][dates_to_use[n-1]] for n in range(1, len(dates_to_use))]

# calculate R
new_symbol_list = sorted(price_time_series.keys())
R_dict = {}
for i in range(0, len(new_symbol_list)):
    R_dict[i] = {}
    for j in range(0, len(djia)):
        symbol_i = new_symbol_list[i]
        symbol_j = djia[j]

        # Pearson's R
        R = 1.0
        if symbol_i != symbol_j:
            R = scipy.stats.pearsonr(price_time_series[symbol_i], price_time_series[symbol_j])[0]

        # record R
        R_dict[i][j] = R

# create the feature matrix
feature_matrix = np.zeros([len(new_symbol_list), len(djia)])
for i in R_dict.keys():
    for j in R_dict[i].keys():
        feature = R_dict[i][j] + 1.   # range 0 to 2
        feature_matrix[i][j] = feature

# cluster using k-means
W = scipy.cluster.vq.whiten(feature_matrix)
K = scipy.cluster.vq.kmeans(W, number_of_groups_to_use)
VQ = scipy.cluster.vq.vq(W, K[0])
groups = VQ[0]

# report results
f = open('output/groups.csv', 'w')
for g, s in zip(groups, new_symbol_list):
    f.write(s + ',' + str(g) + '\n')
f.close()

# write the feature matrix to disk
f = open('output/features.csv', 'w')
f.write('symbol,' + ','.join(djia) + '\n')
for i in sorted(R_dict.keys()):
    symbol_i = new_symbol_list[i]
    values = [str(feature_matrix[i][j]) for j in sorted(R_dict[i].keys())]
    f.write(symbol_i + ',' + ','.join(values) + '\n')
f.close()



Post Author: badassdatascience
