clustering stocks by price correlation (part 1)

I’ve been building my knowledge of clustering techniques to apply to genetic circuit engineering, and decided to try the same tools for stock price analysis. In this post I describe hierarchically clustering stocks by the pairwise correlation of their weekly closing prices, to see how well the derived groupings match an outside categorization by industry. In a later post I will try k-means clustering of the same source data.

Hierarchical Clustering

Agglomerative hierarchical clustering begins with a pool of single-node clusters that do not yet belong to any hierarchy. At each iteration, the most similar clusters are merged into a new cluster, while the relationships established by earlier merges are preserved. When only one cluster remains, the procedure stops [1].
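As a minimal sketch of this procedure (toy one-dimensional points, not the stock data used below), SciPy's `linkage` function performs the agglomeration directly:

```python
# toy example: agglomerative clustering of five 1-D points
import numpy as np
import scipy.cluster.hierarchy

points = np.array([[1.0], [1.1], [5.0], [5.1], [9.0]])

# each row of Z records one merge: the two clusters joined, the
# distance at which they merged, and the size of the new cluster
Z = scipy.cluster.hierarchy.linkage(points, method='average')
print(Z)
```

Each of the four merges joins the two nearest clusters, ending with a single cluster containing all five points.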

Reference [2] shows a useful image outlining when data may benefit from hierarchical clustering:

[Image: cluster structure types, from reference [2]]

Here we see that data containing an underlying hierarchy, or clusters of widely differing densities, are candidates for hierarchical clustering.

Method

I started with a list of daily closing stock prices for all the stocks listed in the New York Stock Exchange (NYSE), e.g.,

[Image: sample of the source daily closing price data]

I then extracted the closing price for every stock for every Tuesday between October 22, 2013 and October 28, 2014, so that we are sampling weekly and avoiding holidays. Stocks that were removed from the NYSE or that entered the NYSE during this period were discarded, thereby ensuring that each stock in the resulting data set had the same number of time points.

For every pair of stocks in the dataset, I then computed the Pearson correlation coefficient between the two weekly closing price time series, compiling these values into a similarity matrix describing the similarity between each pair of stocks. I then converted the similarity matrix into a distance matrix as required by my clustering algorithm. (I’m not sure how valid using the correlation coefficient is for autocorrelated time-series, need to check this later).
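To make the conversion concrete, here is a minimal sketch on two hypothetical price series (the values are made up): perfectly correlated series get distance 0 and perfectly anti-correlated series get distance 1.

```python
# sketch of the similarity-to-distance conversion on toy data
import scipy.stats

series_a = [10.0, 10.5, 11.0, 10.8, 11.5]
series_b = [20.0, 21.2, 22.0, 21.5, 23.0]

# Pearson correlation: +1 (move together) to -1 (move oppositely)
r = scipy.stats.pearsonr(series_a, series_b)[0]

# map similarity in [-1, 1] to distance in [0, 1]
distance = (1.0 - r) / 2.0
print(r, distance)
```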

Finally, I ran a hierarchical clustering algorithm on the distance matrix, and then flattened the results into 50 groups, producing the outcome discussed below.

Python code implementing these calculations appears at the end of this post.

Results

The resulting dendrogram looks like:

[Image: dendrogram of the hierarchical clustering by Pearson correlation]

I flattened the results into 50 groups, some containing many stocks and some containing only two. A histogram of the number of stocks per group is shown below. The median group membership was 21 stocks with a maximum of 779. This makes me suspect (though I’m not certain yet) that the data has widely varying densities and sizes as per the density image shown above under the “Hierarchical Clustering” heading.

[Image: histogram of the number of stocks per group for 50 groups]

A mapping of the stocks by industrial sector [3] showed that, contrary to my expectation, my groupings did not cluster well by sector. An example for the 20 stocks placed into group six by the algorithm demonstrates this result:

[Image: sectors assigned to the stocks in group six]

Possible Improvements

The lack of successful clustering by industrial sector suggests a few possibilities: First, use of the Pearson correlation coefficient as a distance metric might not be an appropriate measure of dissimilarity between stocks. Second, perhaps daily sampling instead of weekly sampling would have produced better results. Third, hierarchical clustering might not be the best means of clustering for this data; I’m going to try k-means clustering next. Finally, it is possible that stocks simply don’t naturally cluster by industrial sector as I expected them to.

Code

# import useful libraries
import pprint as pp
import scipy.cluster.hierarchy
import scipy.stats
import numpy as np
import matplotlib.pyplot as plt
import datetime

# specify the time points we want to sample and sort them
end_date = datetime.datetime(2014, 10, 28, 0, 0)
dates_to_use = [end_date]
dt = datetime.timedelta(days=-7)
for i in range(0, 53):
    dates_to_use.append(dates_to_use[-1] + dt)
dates_to_use = sorted(dates_to_use)
dates_to_use = [x.strftime('%Y-%m-%d') for x in dates_to_use]

# load the closing prices for the given dates into memory
prices = {}
f = open('../combined_data.csv')
for line in f:
    fields = line.strip().split(',')
    date_string = fields[1]
    if date_string not in dates_to_use:
        continue
    symbol = fields[0]
    close = float(fields[5])
    if symbol not in prices:
        prices[symbol] = {}
    prices[symbol][date_string] = close
f.close()

# delete cases without all of the dates in dates_to_use
symbol_list = list(prices.keys())
for symbol in symbol_list:
    if len(prices[symbol]) != len(dates_to_use):
        del prices[symbol]

# generate price time series
price_time_series = {}
for symbol in prices.keys():
    price_time_series[symbol] = [prices[symbol][d] for d in dates_to_use]

# calculate R
new_symbol_list = sorted(price_time_series.keys())
R_dict = {}
for i in range(len(new_symbol_list)):
    R_dict[i] = {}
    for j in range(i, len(new_symbol_list)):
        symbol_i = new_symbol_list[i]
        symbol_j = new_symbol_list[j]
        R = 1.0
        if symbol_i != symbol_j:
            R = scipy.stats.pearsonr(price_time_series[symbol_i], price_time_series[symbol_j])[0]
        R_dict[i][j] = R

# create the distance matrix
distance_matrix = np.zeros([len(new_symbol_list), len(new_symbol_list)])
for i in R_dict.keys():
    for j in R_dict[i].keys():
        distance = (1.0 - R_dict[i][j]) / 2.0   # R = 1 -> 0, R = -1 -> 1
        distance_matrix[i][j] = distance
        distance_matrix[j][i] = distance

# create the linkage; linkage() expects a condensed distance matrix,
# so convert the square matrix with squareform() first
from scipy.spatial.distance import squareform
Z = scipy.cluster.hierarchy.linkage(squareform(distance_matrix), method='average')

# function to set blank dendrogram labels
def llf(node_id):
    return ''

# plot
scipy.cluster.hierarchy.dendrogram(Z, leaf_label_func=llf)
plt.title('Hierarchical Clustering of Stocks by Closing Price Correlation')
plt.show()

# flatten
number_of_groups_to_use = [50]
for n in number_of_groups_to_use:
    f = open('output/' + str(n) + '.csv', 'w')
    f.write('symbol,group\n')
    T = scipy.cluster.hierarchy.fcluster(Z, n, criterion='maxclust')
    counts = {}
    for i, t in enumerate(T):
        if t not in counts:
            counts[t] = 0
        counts[t] += 1
        f.write(new_symbol_list[i] + ',' + str(t) + '\n')
    f.close()
    counts_list = list(counts.values())

    # output
    print(n)
    print(np.min(counts_list))
    print(np.median(counts_list))
    print(np.max(counts_list))
    plt.hist(counts_list)
    plt.title('Distribution of Number of Stocks per Group')
    plt.xlabel('Number of Stocks per Group')
    plt.ylabel('Frequency')
    plt.show()

References

  1. http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage
  2. http://homes.di.unimi.it/~valenti/SlideCorsi/MB0910/IntroClustering.pdf
  3. http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NYSE&render=download

Update (from the comments): I reran the analysis using daily prices, the Kolmogorov-Smirnov statistic as the distance metric, and 1000 groups. The results still did not cluster by industry.
