I’ve started working with the Twitter API to develop my own Twitter analytics tool chain. My goals are to figure out who the influencers in my subject areas are, how best to position my tweets, and so on. I could certainly pay for this service, but then I wouldn’t learn any new technical skills in the process!

Here I report on the first of these initiatives: using hierarchical clustering to identify which of my followers use hashtags most like I do. I detail the method and results for three similarity scoring strategies. The first strategy simply computes the proportion of hashtags in common between each pair of users. The second is like the first, but it only considers “original” tweets (no retweets). The third is weighted by both the number of hashtags in common and the frequency of those hashtags’ use; it too only considers “original” tweets. I present the overall results first, followed by method details.

Using the first method, I determined that @DaveKirtan’s tweets are most like mine, with common hashtags related to not only Big Data, data science, and machine learning but also bioinformatics, genomics, RNA-Seq, and synthetic biology.

User | Hashtags In Common With My Tweets (Method One) |
---|---|
DaveKirtan | bigdata, bioinformatics, biology, data, datascience, genomics, machinelearning, matplotlib, ploscompbio, python, rnaseq, science, statistics, syntheticbiology |

The second method expanded the group of users whose tweets most resemble mine. Common hashtags were primarily Big Data and data science related:

User | Hashtags In Common With My Tweets (Method Two) |
---|---|
DaveKirtan | bigdata, bioinformatics, biology, data, datascience, genomics, machinelearning, matplotlib, ploscompbio, python, rnaseq, science, statistics, syntheticbiology |
YvesSagaert | bigdata, data, datascience, econometrics, forecasting, python, r, sas, statistics |
affirmedsystems | bigdata, business, datascience, investing, java, machinelearning, math, matlab, python, r, risk, stocks |

The common hashtags from the third method also centered on Big Data and data science, though the user list was quite different. Since this scoring method considers frequency of hashtag use in addition to commonality, I think it is a stronger comparison metric (though I did not normalize by the number of hashtags each user has; see below for a more detailed discussion). The resulting dendrogram divides cleanly into two distinct groups. A casual review of the hashtags for the users placed in each group reveals that science and data science topics are represented in both groups, but that the hashtag “Big Data” is found in only one. “Big Data” is such a high-frequency tag in the source data that its presence or absence for a user could contribute strongly to the clustering results.

User | Hashtags In Common With My Tweets (Method Three) |
---|---|
EXADude | bias, bigdata, data, datascience, hadoop, java, machinelearning, nosql, python, r, statistics |
Genokey | bigdata, bioinformatics, business, data, datascience, genomics, healthcare, hr, machinelearning, rna, science |
LineshDave | bigdata, business, data, datascience, hadoop, internetofthings, java, leadership, machinelearning, marketing, python, risk, statistics |
MarkAllen_Tech | apachespark, bayesian, bigdata, data, datascience, hadoop, machinelearning, python |
Pedrodoc82 | bigdata, data, datascience, hadoop, java, leadership, machinelearning, marketing, matlab, python, sas, statistics |
SeeLifeAsData | bigdata, datascience, machinelearning, r |
YvesSagaert | bigdata, data, datascience, econometrics, forecasting, python, r, sas, statistics |
biconnections | amazon, bigdata, business, data, datascience, hadoop, healthcare, java, machinelearning, python |
garydata | apachespark, bigdata, data, datascience, hadoop, internetofthings, java, machinelearning, nosql, python, risk |
gunjan_amit | bigdata, data, datascience, hadoop, healthcare, machinelearning, marketing, python, r, science |
imxuwang | bigdata, datascience, hadoop, machinelearning, mapreduce, python, statistics |
isragaytan | apachespark, bigdata, datascience, healthcare, internetofthings, machinelearning, nosql, postgresql, python |
jitinkapila | bigdata, data, datascience, hadoop, healthcare, internetofthings, marketing, matlab, mongodb, nosql, python, r, statistics |

# Method

I used Python’s scipy.cluster.hierarchy module to perform the clustering (see attached code), with the distance matrices generated for each method as follows:

### Method #1:

The similarity between user i’s tweets and user j’s tweets is the number of distinct hashtags they have in common, divided by the number of user i’s distinct hashtags. Note that the denominator is different when user j is considered first, so the score is not symmetric; both directions are represented when constructing the distance matrix. The distance is one minus the similarity.
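
The directional score described above can be sketched as follows. This is a minimal illustration, not the attached code: the function names and the toy `user_tags` data are hypothetical, and treating a user with no hashtags as having zero similarity to everyone is my own assumption.

```python
import numpy as np

def directional_similarity(tags_i, tags_j):
    """Fraction of user i's distinct hashtags that user j also uses."""
    tags_i, tags_j = set(tags_i), set(tags_j)
    if not tags_i:
        return 0.0  # assumption: a user with no hashtags matches nobody
    return len(tags_i & tags_j) / len(tags_i)

def distance_matrix(user_tags):
    """Build the (generally asymmetric) matrix of 1 - similarity."""
    users = sorted(user_tags)
    dist = np.zeros((len(users), len(users)))
    for a, ui in enumerate(users):
        for b, uj in enumerate(users):
            if a != b:
                sim = directional_similarity(user_tags[ui], user_tags[uj])
                dist[a, b] = 1.0 - sim
    return users, dist

# Hypothetical toy data, not the real follower hashtags.
user_tags = {
    "me": {"bigdata", "python", "genomics", "rnaseq"},
    "alice": {"bigdata", "python"},
    "bob": {"marketing"},
}
users, dist = distance_matrix(user_tags)
```

In this toy example the entry for row "me", column "alice" differs from the entry for row "alice", column "me", because the two users have different numbers of distinct hashtags in the denominator.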

### Method #2:

This is exactly like method #1, except that I only considered original tweets (no retweets).
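
One way to restrict the hashtag sets to original tweets is sketched below. It assumes tweets are dictionaries shaped like Twitter REST API v1.1 payloads, where a retweet carries a `retweeted_status` field and hashtags sit under `entities.hashtags`. The helper names, the lowercasing step, and the sample tweets are hypothetical and may not match what the attached code does.

```python
def is_original(tweet):
    """True when the tweet is not a retweet (v1.1-style payload assumed)."""
    return "retweeted_status" not in tweet

def original_hashtags(tweets):
    """Collect distinct lowercase hashtags from a user's original tweets."""
    tags = set()
    for tweet in tweets:
        if not is_original(tweet):
            continue  # skip retweets entirely
        for tag in tweet.get("entities", {}).get("hashtags", []):
            tags.add(tag["text"].lower())  # assumption: case-insensitive tags
    return tags

# Hypothetical tweets, trimmed to the fields used above.
tweets = [
    {"text": "New post #Python #BigData",
     "entities": {"hashtags": [{"text": "Python"}, {"text": "BigData"}]}},
    {"text": "RT @someone: #marketing tips",
     "retweeted_status": {"id": 1},
     "entities": {"hashtags": [{"text": "marketing"}]}},
]
tags = original_hashtags(tweets)
```

The resulting sets can then be scored exactly as in method #1.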

### Method #3:

This method incorporates not just the number of hashtags in common, but also their frequency of use:

I created two empty vectors, one for user i and one for user j. For each hashtag the two users shared, I appended the number of times each user used it in original (non-retweet) tweets to that user’s vector. I then took the dot product of the two vectors as a raw similarity score and took its logarithm to penalize extremely close similarities. I added this value to the logarithm of the number of hashtags in common to produce the final similarity score. I assembled a similarity matrix containing each user pair’s score, setting the diagonal entries to one plus the maximum of the matrix’s off-diagonal values. To obtain the distance matrix, I subtracted the similarity matrix from its maximum value and normalized the result.
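
The steps above can be sketched as follows, ending with the clustering call. Several details are my assumptions rather than the post's: pairs with no shared hashtags score zero, "normalized" means dividing by the largest distance, the linkage method is "average", and the tree is cut into two clusters. The function names and the toy counts are hypothetical.

```python
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def pair_similarity(counts_i, counts_j):
    """log(dot product of shared-hashtag counts) + log(number shared)."""
    shared = sorted(set(counts_i) & set(counts_j))
    if not shared:
        return 0.0  # assumption: empty overlaps are scored as zero
    vec_i = np.array([counts_i[t] for t in shared], dtype=float)
    vec_j = np.array([counts_j[t] for t in shared], dtype=float)
    return float(np.log(vec_i @ vec_j) + np.log(len(shared)))

def distance_matrix(user_counts):
    """Symmetric distance matrix derived from the pairwise similarities."""
    users = sorted(user_counts)
    n = len(users)
    sim = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            score = pair_similarity(user_counts[users[a]], user_counts[users[b]])
            sim[a, b] = sim[b, a] = score
    # Diagonal: one plus the largest off-diagonal similarity.
    np.fill_diagonal(sim, sim.max() + 1.0)
    dist = sim.max() - sim
    dist /= dist.max()  # assumption: "normalized" means scaled to [0, 1]
    return users, dist

# Hypothetical per-user hashtag counts from original tweets.
user_counts = {
    "me": Counter({"bigdata": 10, "python": 4, "genomics": 3}),
    "alice": Counter({"bigdata": 8, "python": 2}),
    "bob": Counter({"genomics": 1, "marketing": 5}),
    "carol": Counter({"marketing": 7, "genomics": 2}),
}
users, dist = distance_matrix(user_counts)

# linkage expects a condensed distance vector, hence squareform.
tree = linkage(squareform(dist), method="average")  # assumption: average linkage
groups = fcluster(tree, t=2, criterion="maxclust")  # cut into two clusters
```

Because both the dot product and the shared-hashtag count are symmetric in i and j, this distance matrix is symmetric with a zero diagonal, which is what `squareform` requires before the result is passed to `linkage`.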

# Code

Code implementing the above described computations, used to produce the results shown above, is attached:
