I have written 299 blog posts in the last decade, roughly half on badassdatascience.com and half on g——.com. Produced most of the Badass Data Science content while publicly expressing as a man, and most of the g—— content as a woman. Some articles appear on both blogs—for example this one—and in the analysis described below I account for such duplication.
My speech therapist observed that I successfully employ feminine language in my recent video “radical forgiveness”. This led me to thinking: Has the language I use in my prose evolved as I blossomed into femininity? I detail my attempt to answer this question using mathematical analysis below.
I make two major assumptions in this analysis, assumptions I will address in future work:
First, I assume my writing skill remained constant throughout the last ten years. Not a great assumption in the long haul but necessary to simplify the math for this “back of the envelope” analysis.
Second, the two blogs cover different subjects, and the first one even contains source code on occasion. This may distort the clustering process described below. Again, ignoring this concern proves acceptable for this “quick-and-dirty” calculation to enable exploration of the problem domain.
I download each of my blog posts and then calculated the part of speech (POS) for each word in the post. After that I computed the frequency distribution of the POSs. I then performed hierarchical clustering using a similarity matrix defined by the dot product of each pair of posts’ POS use frequency distribution vectors. The resulting dendrogram looks like:
I recommend downloading the image to view it at full size.
Each vertical line represents a blog post, and the trees linking the vertical lines indicate the degree of similarity between any two blog posts. For example, in the above image, the cyan and magenta colored posts prove similar but the green and black posts diverge significantly in terms of their POS use frequency distributions. The asterisks indicate posts created after I started expressing publicly as a woman full-time. The colors divide the tree into sections that group similar blog posts. Please note that I chose the grouping threshold manually (but rationally).
By visually inspecting the density of these asterisks for the different color groups we derive an indication of how “feminine” or how “masculine” we might regard each group of blog posts. For example, we see sparse femininity in the green, yellow, and black groups; while we see enriched femininity in the cyan and purple group. The algorithm clearly found little distinction between the posts within the large red group, but even there we visually recognize sections of diminished femininity and sections of enhanced femininity.
So a linguistical difference between my pre- and post-transition writing appears to exist. But is it real? Can we conclude that my prose grew more feminine after my public transition? Not so fast! We must build a model that includes time as a variable to cancel out possible influence of improvement in my writing skill, and then test that model for significance. I’ll save this work for a later date.