In my last post, “clustering stocks by price correlation (part 1)“, I performed hierarchical clustering of NYSE stocks by correlation in weekly closing price. I expected the stocks to cluster by industry, and found that they did not. I proposed several explanations for this observation, including that perhaps I chose a poor distance metric for the clustering procedure, that perhaps daily rather than weekly sampling would have generated better results, or that stocks simply don’t cluster by industry very much. I also proposed generating k-means clusters to see if that revealed different results. This post details that k-means clustering effort.

# K-Means Clustering

K-means clustering starts with a matrix of observation and feature values, resembling the data structure one might use in a regression [1]. This differs from hierarchical clustering where the data matrix is an N-by-N pairwise comparison of each of the observations by a similarity (or distance) metric. An example of an N-by-3 matrix for k-means clustering is:

This matrix is then normalized (or “whitened” in the language of k-means clustering).

Each observation in the “whitened” matrix is then randomly assigned to one of k clusters [2], and the following steps are repeated until a stop condition (usually minimal change in cluster assignment) is reached [2]:

- The algorithm computes the average value of each current cluster.
- The algorithm then assigns each observation to the nearest cluster (by the calculated average value).
- Finally, the algorithm tests for the stop condition and repeats if it is not reached.

The result is a set of k clusters grouped by similarity of feature values.

We will implement our calculation using Python’s scipy.cluster.vq [3] module.

# Method

I started with a list of daily closing stock prices for all the stocks listed in the New York Stock Exchange (NYSE), e.g.,

I then extracted the closing price for every stock for every Tuesday between October 22, 2013 and October 28, 2014, so that we are sampling weekly and avoiding holidays. Stocks that were removed from the NYSE or that entered the NYSE during this period were discarded, thereby ensuring that each stock in the resulting data set had the same number of time points. [4]

For every stock in the dataset, I next calculated the Pearson correlation coefficient between the stock’s price time series and each of the stocks in the DJIA (NYSE only) price time series. This treats each correlation with a DJIA stock as a feature as required by the data matrix used by k-means clustering. The resulting feature matrix had dimension of 2601-by-27, or 2601 stocks compared to each of the 27 DJIA stocks listed on the NYSE, e.g,:

Finally, I “whitened” the feature matrix and ran the Python k-means module on the whitened feature matrix.

Python code implementing these calculations is appended below.

# Results

Like the hierarchical clustering results shown in [4], the groups do not seem to be organized by industrial sector:

This suggests a couple of possibilities: First, correlation in stock prices may simply not cluster by industry very much. Second, my use of k=50 (50 groups) may have been inappropriate, perhaps a better division number exists. Third, maybe sampling prices daily rather than weekly would produce better results. Finally, perhaps correlation is not the best feature to use.

I also tried using the weekly change in prices rather than the actual prices, but that attempt did not show any better grouping by industry.

# Code

# import useful libraries import scipy.cluster.vq import scipy.stats import numpy as np import datetime # user settings number_of_groups_to_use = 50 # set seed for repeatibility np.random.seed(1000) # load the Dow Jones Industrial Average symbols djia = [] f = open('DJIA.tsv') for line in f: line = line.strip() company = line.split('\t')[0] exchange = line.split('\t')[1] symbol = line.split('\t')[2] if exchange == 'NYSE': djia.append(symbol) f.close() djia = sorted(djia) # specify the time points we want to sample and sort them end_date = datetime.datetime(2014, 10, 28, 0, 0) dates_to_use = [end_date] dt = datetime.timedelta(days=-7) for i in range(0, 53): dates_to_use.append(dates_to_use[-1] + dt) dates_to_use = sorted(dates_to_use) new_dates = [] for x in dates_to_use: year = str(x.year) month = str(x.month) day = str(x.day) if len(month) == 1: month = '0' + month if len(day) == 1: day = '0' + day new_dates.append(year + '-' + month + '-' + day) dates_to_use = new_dates # load the closing prices for the given dates into memory prices = {} f = open('../../combined_data.csv') for line in f: line = line.strip() date_string = line.split(',')[1] if not date_string in dates_to_use: continue symbol = line.split(',')[0] close = float(line.split(',')[5]) if not prices.has_key(symbol): prices[symbol] = {} prices[symbol][date_string] = close f.close() # delete cases without all of the dates in dates_to_use symbol_list = prices.keys() for symbol in symbol_list: date_list = prices[symbol].keys() if len(date_list) != len(dates_to_use): del(prices[symbol]) # generate price time series price_time_series = {} for symbol in prices.keys(): # actual prices price_time_series[symbol] = [prices[symbol][d] for d in dates_to_use] # change in prices #price_time_series[symbol] = [(prices[symbol][dates_to_use[n]] - prices[symbol][dates_to_use[n-1]]) / prices[symbol][dates_to_use[n-1]] for n in range(1, len(dates_to_use))] # calculate R new_symbol_list = sorted(price_time_series.keys()) R_dict = {} for i in range(0, len(new_symbol_list)): if not R_dict.has_key(i): R_dict[i] = {} for j in range(0, len(djia)): symbol_i = new_symbol_list[i] symbol_j = djia[j] # Pearson's R R = 1.0 if symbol_i != symbol_j: R = scipy.stats.pearsonr(price_time_series[symbol_i], price_time_series[symbol_j])[0] # record R R_dict[i][j] = R # create the feature matrix feature_matrix = np.zeros([len(new_symbol_list), len(djia)]) for i in R_dict.keys(): for j in R_dict[i].keys(): feature = R_dict[i][j] + 1. # range 0 to 2 feature_matrix[i][j] = feature # cluster using k-means W = scipy.cluster.vq.whiten(feature_matrix) K = scipy.cluster.vq.kmeans(W, number_of_groups_to_use) VQ = scipy.cluster.vq.vq(W, K[0]) groups = VQ[0] # report results f = open('output/groups.csv', 'w') f.write('symbol,group\n') for g, s in zip(groups, new_symbol_list): f.write(s + ',' + str(g) + '\n') f.close() # write the feature matrix to disk f = open('output/features.csv', 'w') f.write('symbol,' + ','.join(djia) + '\n') for i in R_dict.keys(): symbol_i = new_symbol_list[i] f.write(symbol_i + ',') values = [] for j in R_dict[i].keys(): values.append(str(R_dict[i][j])) f.write(','.join(values) + '\n') f.close()

# References

- http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.vq.whiten.html#scipy.cluster.vq.whiten
- http://homes.di.unimi.it/~valenti/SlideCorsi/MB0910/IntroClustering.pdf
- http://docs.scipy.org/doc/scipy-0.14.0/reference/cluster.vq.html
- http://badassdatascience.com/2014/12/31/clustering-stocks-1/