Recently I’ve become extremely interested in survey analysis and, more broadly, the social consequences of survey-based decision making. So when a friend asked for help extracting business intelligence from a market research survey they conducted, I jumped at the opportunity to test out some ideas.
The analysis presented below details a use of hierarchical clustering to identify groups of similar responses to a ranking question.
The survey asked respondents to rank nine product features (anonymized here as A through I) according to what they most value:
My friend wanted to know how respondents “group” according to their answers: do they need one product that serves everyone, or more than one product to serve the distinct needs of respondent subsets? To figure this out, I devised a scoring method to measure response similarity, clustered the responses by similarity score, and then visually validated the clustering output using network graphs:
Organized the data into 259 vectors (one per respondent), with features reading left to right from most to least valued:
- [A, I, E, G, F, C, D, B, H]
- [E, C, I, G, B, H, A, D, F]
- [I, G, F, E, H, B, D, A, C]
- [B, E, A, I, F, D, C, G, H]
- and so on…
Compared each response to every other response, scoring each pair with a similarity metric to produce a similarity matrix (scoring method detailed below):
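This step can be sketched in Python (a minimal reconstruction, not the original code, assuming NumPy and some `score` function implementing the metric described below):

```python
import itertools

import numpy as np


def similarity_matrix(responses, score):
    """Build an n x n matrix of pairwise similarity scores.

    `responses` is a list of ranking vectors and `score` is any
    function mapping two rankings to a similarity value.
    """
    n = len(responses)
    sim = np.zeros((n, n))
    # Score every unordered pair once; the matrix is symmetric.
    for i, j in itertools.combinations(range(n), 2):
        sim[i, j] = sim[j, i] = score(responses[i], responses[j])
    # A response is assumed maximally similar to itself
    # (scores are normalized to [0, 1] later on).
    np.fill_diagonal(sim, 1.0)
    return sim
```

Any pairwise similarity function can be dropped in here, which made it easy to experiment with the weighting scheme described further down.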
Clustered the similarity matrix to form a hierarchical linkage model:
Flattened the hierarchical linkage model to assign each respondent to one of five groups:
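The linkage-and-flatten steps can be sketched with SciPy; the write-up doesn't name its tooling or linkage method, so the `average` linkage here is an assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_responses(sim, n_groups=5, method="average"):
    """Cluster a normalized similarity matrix into n_groups.

    Converts similarity to distance (1 - similarity), condenses the
    square matrix, builds the hierarchical linkage model, then
    flattens it into exactly n_groups cluster labels.
    """
    dist = 1.0 - np.asarray(sim)
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    Z = linkage(condensed, method=method)
    return fcluster(Z, t=n_groups, criterion="maxclust")
```

The `maxclust` criterion is what turns the dendrogram into a flat assignment of each respondent to one of five groups.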
To independently assess group assignment validity, I created a network graph of pairwise respondent similarity, with edges weighted by the scores computed above. I then applied the Fruchterman-Reingold layout algorithm to produce a 3D network graph display in which similar responses gravitate toward one another, and colored the nodes according to group assignment:
Here the edges were removed to improve visual clarity. Each node represents a survey response.
Since the three tightly condensed regions in the network graph largely agree with the color-coded grouping assigned by the clustering process (for the three most uniform groups), I concluded that the clustering operation’s results are valid.
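A sketch of how such a layout can be produced with NetworkX, whose `spring_layout` implements Fruchterman-Reingold (the library choice and parameters are my assumptions, not the original code):

```python
import networkx as nx


def layout_positions(sim, dim=3, seed=42):
    """Position responses with the Fruchterman-Reingold algorithm.

    Builds a graph whose edge weights are the pairwise similarity
    scores, so strongly similar responses are pulled together.
    """
    n = len(sim)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] > 0:
                G.add_edge(i, j, weight=sim[i][j])
    # dim=3 yields 3D coordinates; seed makes the layout repeatable.
    return nx.spring_layout(G, dim=dim, weight="weight", seed=seed)
```

The returned coordinates can then be scatter-plotted, with each node colored by its cluster label, to check whether the dense regions agree with the flattened groups.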
There exist a billion defensible scoring models for this application. Here is mine:
The similarity score for a pair of responses is computed as follows:
At = The number of selections in common within the top two selections.
Ab = The number of distinct selections within the top two selections.
Bt = The number of selections in common within the top three selections.
Bb = The number of distinct selections within the top three selections.
Ct = The number of selections in common within the top four selections.
Cb = The number of distinct selections within the top four selections.
Score = (2*At + 1.5*Bt + Ct) / (Ab + Bb + Cb)
After computing all the scores, I normalized them by their maximum value so that the scores vary between zero and one.
Essentially, this method is a variation of the Jaccard index, organized to strongly value similarity in the top two selections, less strongly value similarity in the top three selections, and somewhat value similarity in the top four selections.
The strategy reflects two ideas: that the order within the top two (or three, or four) selections is interchangeable, provided those selections sit next to each other in the overall ranking; and that survey-takers really don’t care about anything they did not rank in the top four.
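The score definition above translates directly into a short function; this is my reading of it as Python, not the original implementation:

```python
def pair_score(a, b):
    """Similarity score for two ranking vectors (most valued first).

    Overlap in the top two selections counts most, then the top
    three, then the top four; ranks below fourth are ignored.
    """
    def overlap(k):
        top_a, top_b = set(a[:k]), set(b[:k])
        common = len(top_a & top_b)    # selections in common
        distinct = len(top_a | top_b)  # distinct selections
        return common, distinct

    (At, Ab), (Bt, Bb), (Ct, Cb) = overlap(2), overlap(3), overlap(4)
    return (2 * At + 1.5 * Bt + Ct) / (Ab + Bb + Cb)
```

Because the denominator varies from pair to pair, the raw scores are then divided by the largest observed score to produce the normalized zero-to-one values described above.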
Recommendations reported to my friend:
Using the results from the analysis outlined above, I produced the following recommendations:
- Try to get E, B, and C into one product. If possible, get D and/or I in as well.
- If that doesn’t work, try to get E and B into one product, and C plus E into another product.
- If that doesn’t work, get E into one product and B into another.
- Don’t spend any time on G and H.