Overview of Robert C. McNamee’s Paper "Can’t See the Forest for the Leaves: Similarity and Distance Measures for Hierarchical Taxonomies with a Patent Classification Example"
This paper introduces to the management domain a methodology for calculating distance and similarity from hierarchical taxonomies.
Although my example is based on technology space and the patent classification system, it applies to any theoretically complex space that can be characterized by a hierarchical classification taxonomy, e.g., industry and SIC/NAICS codes, or culture and demographic variables.
When we examine a theoretical phenomenon, we should use constructs and measures that are as faithful to this view as possible.
I extracted the data from the USPTO website and classes combined document. Then I extracted the count of patents classified into each of the 150,000+ class/subclasses
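The counting step can be sketched roughly as follows. This is a minimal illustration, assuming the classification data has already been parsed into (patent, class/subclass) pairs; the records and the names `records` and `subclass_counts` are hypothetical and are not actual USPTO data.

```python
from collections import Counter

# Hypothetical (patent_id, "class/subclass") records; not real USPTO data.
records = [
    ("P1", "704/231"),
    ("P2", "704/231"),
    ("P3", "704/256"),
    ("P4", "382/100"),
]

# Count how many patents are classified into each class/subclass.
subclass_counts = Counter(code for _, code in records)
```

Each key of `subclass_counts` is one class/subclass and each value is the number of patents classified into it.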
This extension draws on established methods from fields like machine learning and information search and retrieval.
Interestingly, the original quote accurately assumes that subclasses with more classifications are more important within the technology space; however, these subclasses are actually less important in establishing the similarity of two patents.
Since this uses probability within the entire universe of patents, it effectively normalizes distance across all levels of analysis and across the whole universe; i.e., similarity within an industry will be quite high, whereas similarity between industries should be much lower.
The top right actually shows somewhat arbitrary results based on the non-704 classes in the sample; this does not describe the technology space within field 704 but rather something like the interplay of class 704 with other fields.
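The effect of using probabilities over the whole universe can be illustrated with a small sketch. It assumes a Resnik-style information-content measure from the information retrieval literature, where the similarity of two subclasses is the negative log probability of their deepest shared ancestor. The paper's exact formula is not given here, so this choice is an assumption, and the toy taxonomy, counts, and function names are hypothetical.

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
parent = {
    "ROOT": None,
    "704": "ROOT",
    "382": "ROOT",
    "704/231": "704",
    "704/256": "704",
    "382/100": "382",
}

# Hypothetical patent counts at the leaves (not real USPTO figures).
leaf_counts = {"704/231": 2, "704/256": 1, "382/100": 5}

def ancestors(node):
    """Return the path from node up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

# A node's count is the sum of the counts of all leaves beneath it.
counts = {n: 0 for n in parent}
for leaf, n in leaf_counts.items():
    for a in ancestors(leaf):
        counts[a] += n

total = counts["ROOT"]

def information_content(node):
    """Negative log of the probability that a patent falls under node."""
    return -math.log(counts[node] / total)

def similarity(a, b):
    """Information content of the deepest ancestor shared by a and b."""
    shared = set(ancestors(b))
    lca = next(n for n in ancestors(a) if n in shared)
    return information_content(lca)
```

In this toy example, `similarity("704/231", "704/256")` is higher than `similarity("704/231", "382/100")`, because the first pair shares class 704 while the second pair shares only the root. A crowded subclass also has lower information content than a sparse one, so it contributes less to establishing similarity.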