Picking Stocks by Graph Database (Part 2: Machine Learning)

In our last post, we demonstrated a graph database created to enable study of the stock market, particularly the study of causality relationships.

So how to proceed from there? At this stage we want to pick winning stocks, not write an academic paper, so our focus turns toward practical machine learning.

Source Data

We start with daily historical price information for every stock traded on the Nasdaq, NYSE, and AMEX exchanges from 2000 to late July 2017.

Finding the Trigger

Because the database contains Granger causality relationships between daily changes in volume in one stock and daily changes in adjusted close price for another stock, we searched the historical data for dates with a daily percent increase in volume of 500,000. We hypothesized that these signals would indicate a two-day increase in adjusted close price for some stocks in our database.
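
A minimal sketch of this kind of trigger search, assuming one stock's daily volume is available as a plain array ordered by date. The function name `volume_trigger_days` and the threshold used in the example are our own illustration, not the values behind the results in this post.

```python
import numpy as np

def volume_trigger_days(volume, threshold_pct):
    """Return the indices of days whose volume rose by more than
    threshold_pct percent relative to the previous day."""
    volume = np.asarray(volume, dtype=float)
    # pct_change[i] describes the move from day i to day i + 1
    pct_change = 100.0 * (volume[1:] - volume[:-1]) / volume[:-1]
    # Shift by one so the returned index is the day of the jump itself
    return np.flatnonzero(pct_change > threshold_pct) + 1

# Illustrative data only: two large jumps, on days 2 and 4
example_volume = [1000, 1100, 9000, 8000, 50000]
print(volume_trigger_days(example_volume, threshold_pct=400.0))
```

Days with zero volume on the previous day would need separate handling, since the percent change is undefined there.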


Given such a signal at day n-1 for stock X, we built a model to predict whether a substantial increase in adjusted closing price will occur between day n and day n+1 for stock Y, where the database shows that volume change in X Granger-causes adjusted close price change in Y two days later.

We used the following variables:

  • “Influence score” (more on this below)
  • Lags 1-6 of the percent change in adjusted close price for the stock being predicted
  • Percent of 52-week, 12-week, and 4-week highs for the stock being predicted
  • Weekday
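
The price-based variables and the weekday can be sketched as below. This is an illustration under assumptions of ours: the window lengths of 252, 60, and 20 trading days stand in for 52, 12, and 4 weeks, "percent of high" is expressed as a fraction of the window maximum, and `build_features` is a hypothetical helper name.

```python
import numpy as np
from datetime import date, timedelta

def build_features(adj_close, dates, t, n_lags=6, windows=(252, 60, 20)):
    """Feature list for day index t (requires t >= n_lags).

    Returns lags 1..n_lags of percent change, the close as a fraction
    of each trailing-window high, and the weekday (Monday = 0).
    """
    adj_close = np.asarray(adj_close, dtype=float)
    # pct[i] is the percent change from day i to day i + 1
    pct = 100.0 * np.diff(adj_close) / adj_close[:-1]
    # Lag 1 is the most recent change known at day t
    lags = [pct[t - k] for k in range(1, n_lags + 1)]
    highs = []
    for w in windows:
        start = max(0, t - w + 1)  # shorter history is used if needed
        highs.append(adj_close[t] / adj_close[start:t + 1].max())
    return lags + highs + [dates[t].weekday()]

# Illustrative data only
prices = [10, 11, 12, 11, 10, 11, 12, 13, 12]
days = [date(2017, 7, 3) + timedelta(days=i) for i in range(len(prices))]
print(build_features(prices, days, t=8))
```

For an SVM, the weekday would probably be better encoded as dummy variables than as a single integer; the sketch keeps it as an integer for brevity.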

“Influence Score”

The influence score captures information about the movements in volume in stocks where these movements Granger cause adjusted close price movement for the stock under consideration. For each influencing stock, we calculate:

  • Spearman’s R over 365 days between the adjusted close (of the stock under consideration) and the volume of the influencing stock. We lag the volume series to account for the Granger causality. Results that are not statistically significant are set to zero. The procedure gives us an estimate of the sign of the relationship, which Granger causality does not.
  • Log base-10 of the Granger causality p-value. This gives the strength of the relationship.
  • The last difference in volume for the influencing stock.
  • That stock’s score: (-1 * log(Granger causality p-value, 10)) * (Spearman’s R) * (Last Difference)
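
The per-stock score can be sketched as follows. Several details here are assumptions of ours rather than facts from the database: a lag of two days, a 0.05 significance cutoff, 365 rows standing in for 365 days, and two series that are already aligned by date and equal in length. `stock_score` is a hypothetical name.

```python
import math
import numpy as np
from scipy.stats import spearmanr

def stock_score(adj_close, infl_volume, granger_p, lag=2, window=365, alpha=0.05):
    """Score one influencing stock:
    -log10(Granger p-value) * Spearman's R * last difference in volume."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    price = np.asarray(adj_close, dtype=float)[-window:]
    volume = np.asarray(infl_volume, dtype=float)
    last_diff = volume[-1] - volume[-2]
    volume = volume[-window:]
    # Pair volume on day t with price on day t + lag
    rho, p_value = spearmanr(volume[:-lag], price[lag:])
    # Zero out non-significant correlations (this also catches NaN)
    if not (p_value < alpha):
        rho = 0.0
    return -math.log10(granger_p) * rho * last_diff

# Illustrative data only: price rises in step with lagged volume
vol = np.arange(1, 51)
px = 100.0 + np.arange(50)
print(stock_score(px, vol, granger_p=0.001))
```

The sign of the result comes from both the correlation and the direction of the last volume change, so a falling volume in a positively correlated stock yields a negative score.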

We collect all the scores for each influencing stock and compute the minimum, 25th percentile, 50th percentile, 75th percentile, and maximum. These values then are included in the statistical model. We also include the difference between the mean and median of this list and the length of this list.
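
Collapsing the list of scores into model inputs might look like this. It is a sketch; the dictionary keys and the name `summarize_scores` are ours.

```python
import numpy as np

def summarize_scores(scores):
    """Reduce a non-empty list of per-stock influence scores
    to seven summary features."""
    scores = np.asarray(scores, dtype=float)
    q = np.percentile(scores, [0, 25, 50, 75, 100])
    return {
        "min": q[0],
        "p25": q[1],
        "median": q[2],
        "p75": q[3],
        "max": q[4],
        "mean_minus_median": scores.mean() - q[2],
        "count": len(scores),
    }

# Illustrative scores only
print(summarize_scores([1, 2, 3, 4, 10]))
```

A stock with no influencing stocks produces an empty list, which this sketch does not handle; such cases would need a default or be excluded.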


To classify, we take the changes in adjusted closing price for stock Y between day n and n+1 and calculate the 95th percentile. We consider values above this threshold “good”. Similarly, we consider values below the 40th percentile “bad”, and discard the middle. This results in “bad” cases with fractional or no loss in money between days n and n+1, and “good” cases with gains of over 5%. (This is the model for buying stock; for the model for shorting stocks we use the 5th and 60th percentiles.)

We employ LIBSVM [1] to build a support vector machine model, and we tune its parameters using LIBSVM’s “grid.py” program.
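
For readers who prefer to stay in Python, a comparable search can be sketched with scikit-learn, whose `SVC` class is built on LIBSVM. This is a stand-in for grid.py, not the command we ran: the coarse powers-of-two grid, the feature scaling step, and the AUC scoring are all assumptions of the sketch.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def tune_svm(X, y, cv=5):
    """Cross-validated search over C and gamma for an RBF-kernel SVM."""
    pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = {
        # Coarse grid over powers of two (illustrative ranges)
        "svc__C": 2.0 ** np.arange(-5, 16, 4),
        "svc__gamma": 2.0 ** np.arange(-15, 4, 4),
    }
    search = GridSearchCV(pipe, grid, cv=cv, scoring="roc_auc")
    search.fit(X, y)
    return search

# Synthetic data only, to show the call
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
result = tune_svm(X_demo, y_demo)
print(result.best_params_, result.best_score_)
```

A finer grid around the best coarse values is a common second pass once the rough region is known.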

Feature Importance

To discover which features matter the most, we run random forest classification on the data and plot the relative feature importances:
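
A sketch of that ranking step using scikit-learn's random forest. The number of trees, the random seed, and the helper name `rank_features` are assumptions for illustration; the importances are the impurity-based ones that scikit-learn reports by default.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_features(X, y, names, n_estimators=200, seed=0):
    """Fit a random forest and return (name, importance) pairs,
    most important first."""
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    forest.fit(X, y)
    importances = forest.feature_importances_
    order = np.argsort(importances)[::-1]
    return [(names[i], float(importances[i])) for i in order]

# Synthetic data only: the label depends on feature "b" alone
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 1] > 0).astype(int)
for name, value in rank_features(X_demo, y_demo, ["a", "b", "c"]):
    print(name, round(value, 3))
```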

Machine Learning Results

We perform five-fold cross-validation with the SVM model 100 times and plot a histogram of the area under the ROC curve to test consistency (this is for the stock-shorting model).
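
The repeated cross-validation can be sketched with scikit-learn as below. We assume here that each of the 100 runs reshuffles the folds and contributes its mean AUC to the histogram; the SVM parameters shown are scikit-learn defaults, not our tuned values.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def repeated_cv_auc(X, y, n_repeats=100, n_splits=5, C=1.0, gamma="scale"):
    """Run shuffled k-fold cross-validation n_repeats times and
    return the mean ROC AUC of each run."""
    aucs = []
    for seed in range(n_repeats):
        # A new seed gives a different partition into folds each run
        folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        scores = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=folds, scoring="roc_auc")
        aucs.append(scores.mean())
    return np.array(aucs)

# Synthetic data only, with a small number of repeats to keep it fast
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = (X_demo[:, 0] - X_demo[:, 2] > 0).astype(int)
aucs = repeated_cv_auc(X_demo, y_demo, n_repeats=3)
counts, edges = np.histogram(aucs, bins=5)
print(aucs, counts)
```

A narrow histogram suggests the score does not depend much on how the data happen to be split; a wide one is a warning sign.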

A representative ROC curve from one of these cross-validation iterations:

We are not too happy with these results. However, they lean enough toward positive selection that we proceeded to train against the full data set, with the intention of trading on the model’s predictions.

Practical Results

We’ll test this with actual money soon and get back to you.

Web Interface and Automation

To enable practical use of the model, we wrote a web tool that reports daily (sorted) model scores:

We also plan to automate trading based on this model to remove the risk of “emotional” trading.

Ideas On How To Improve the Model

  1. Improve the method for combining multiple influencing stocks (in the causal network) into one “influence score”.
  2. Add a dummy variable indicating whether we are in the month of October, since October is a well-known month for stock market turbulence. We’d consider adding all the months but need to keep the number of variables down.
  3. But maybe instead of month, indicate quarter in the year with a categorical variable…
  4. …and whether we are at the beginning, middle, or end of the quarter.
  5. Add features related to the candlestick on the day prior to the trade.
  6. Add relative strength index.
  7. Add accumulation/distribution line.
  8. Maybe do something with the volume trend in the stock being predicted.


  1. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

Post Author: badassdatascience
