DIY Twitter analytics (part 3: hashtag network)

I’ve been mathematically analyzing my Twitter feed to determine how best to position my tweets for maximum impact, and have been documenting the work on this blog. While I’ve not come to any brilliant conclusions yet, I’ve made progress. My first post on the subject described clustering my followers by their hashtag use to see whose tweets most resemble mine. My second post described correlation studies I conducted concerning the relationship between the number of followers a user has and the number of retweets and favorites they receive.

Here I describe a network (mathematical graph) approach I took to mapping my hashtag use. The idea was to determine which hashtags had the most influence. I only partially achieved this goal, as described below, but the work as it stands today is worth reporting on.

To build the network graph, I downloaded all my tweets and extracted their hashtags. Each hashtag became a node in the graph. If two hashtags appeared together in the same tweet, I connected their nodes with an edge. I then weighted each edge by the number of retweets and favorites received by the tweets in which that pair appeared. The resulting network graph looks like:


The image above can be downloaded and zoomed to read the individual node labels; I chose an image resolution high enough to make this possible.
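
The graph construction described above can be sketched roughly as follows. This is a minimal illustration and not the exact code behind the image: the tweet record layout (the "hashtags", "retweets", and "favorites" keys), the sample tweets, and the choice to sum retweets and favorites into a single edge weight are all assumptions made for the example.

```python
from itertools import combinations

import networkx as nx


def build_hashtag_graph(tweets):
    """Build an undirected hashtag co-occurrence graph.

    `tweets` is assumed to be a list of dicts with the hypothetical keys
    "hashtags", "retweets", and "favorites".
    """
    graph = nx.Graph()
    for tweet in tweets:
        # Normalize case and drop duplicate tags within a single tweet.
        tags = sorted({tag.lower() for tag in tweet["hashtags"]})
        graph.add_nodes_from(tags)

        # Assumption: the edge weight is retweets plus favorites, summed
        # over every tweet in which the two hashtags co-occur.
        engagement = tweet["retweets"] + tweet["favorites"]
        for tag_a, tag_b in combinations(tags, 2):
            if graph.has_edge(tag_a, tag_b):
                graph[tag_a][tag_b]["weight"] += engagement
            else:
                graph.add_edge(tag_a, tag_b, weight=engagement)
    return graph


# Made-up sample data, for illustration only.
sample_tweets = [
    {"hashtags": ["statistics", "datascience"], "retweets": 2, "favorites": 3},
    {"hashtags": ["statistics", "datascience", "python"], "retweets": 1, "favorites": 0},
    {"hashtags": ["anarchism"], "retweets": 0, "favorites": 1},
]

hashtag_graph = build_hashtag_graph(sample_tweets)
```

Note that a hashtag used alone in a tweet still becomes a node; it simply has no edges until it co-occurs with another tag.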

I then wanted to figure out the “most important” nodes in the graph. I could have simply counted how often each hashtag was used, but that would have ignored the information stored in the edges. So I applied “clique” analysis, which was originally used to determine key individuals in social networks. A clique is a set of nodes in which every pair is connected by an edge, and a maximal clique is one that cannot be extended by adding another node. I used the NetworkX Python library to count the maximal cliques that each node belongs to. I then sorted the nodes by that count, and regarded the top ten percent as the “most important”. These are shown in green in the image above, and are listed below:

Hashtag            Number of Maximal Cliques
statistics         18
datascience        18
science            17
bigdata            12
transgender        9
bioinformatics     9
python             8
genomics           8
survey             5
marketing          5
machinelearning    5
anarchism          5

As expected, “statistics”, “datascience”, and “science” come out on top. This is no surprise given the subject of most of my tweets. But I was surprised that “anarchism” showed up in the top ten percent as well. I don’t tweet much about anarchism, but I must have linked it to a wide variety of other hashtags when I did. This is why the clique analysis approach is more informative than a simple hashtag frequency count; I would have missed anarchism’s influence on my writing had I simply counted the tags.
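
For readers who want to try the clique count themselves, here is a minimal sketch using NetworkX's find_cliques function, which enumerates the maximal cliques of an undirected graph. The toy graph, the helper names (maximal_clique_counts, top_fraction), and the simple cutoff rule are assumptions for illustration; they are not the code or data behind the table above.

```python
from collections import Counter

import networkx as nx


def maximal_clique_counts(graph):
    """Count, for each node, the maximal cliques that contain it."""
    counts = Counter()
    for clique in nx.find_cliques(graph):  # yields each maximal clique as a list
        counts.update(clique)
    return counts


def top_fraction(counts, fraction=0.10):
    """Return the highest-ranked nodes, keeping roughly `fraction` of them.

    Assumption: ties at the cutoff are broken alphabetically rather than
    all being kept.
    """
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


# Toy graph: three tags that always appear together, plus one tag that is
# paired with several otherwise unrelated tags.
toy_graph = nx.Graph()
toy_graph.add_edges_from([
    ("statistics", "datascience"),
    ("statistics", "science"),
    ("datascience", "science"),
    ("anarchism", "statistics"),
    ("anarchism", "python"),
    ("anarchism", "survey"),
])

clique_counts = maximal_clique_counts(toy_graph)
most_important = top_fraction(clique_counts)
```

In this toy graph “anarchism” belongs to three maximal cliques while “science” belongs to only one, even though both have several connections. A tag that is paired with many otherwise unconnected tags collects a separate clique for each pairing, which is the effect described above.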

I mentioned above that I only partially achieved the goal of figuring out which hashtags carry the most influence. From the analysis reported here I know each hashtag’s influence relative to the other hashtags in my own writing. But I do not know which hashtags have the most impact on other users. I’ll have to think of a different approach to answer that question.

Post Author: badassdatascience
