2. Agenda
Background
Five W’s of Clustering
• What, why, who, how, when
Is it really repeatable?
Questions
3. About Information Management Services (IMS)
Analytics
Lifecycle Mgmt.
Metadata Mgmt.
- Standards
- Best Practices
- User Needs
- Service Models
Taxonomy
Search
Dev.
5. About this talk…
Case study on how we are improving search and
browse by performing clustering exercises on your
search query data
Not rocket science
High-level overview
You can follow this method, with your own insights and
tweaks
You can kick this off next week at your work
6. What is clustering?
A process for organizing and analyzing search log
data that:
Is repeatable, low-cost, scalable, simple
Yields actionable results
Supports constant incremental improvement
to search
7. What’s clustering good for?
Ensure results for high-frequency queries
Improve Metadata and Taxonomy
Inform and validate decision making in site IA
Inform editorial/curatorial activities
Provide feedback for Search Suggestions
o Autosuggest, synonym lists, no-hits page suggestions
But more on this later...
8. So how do I cluster search queries?
A simple set of steps
Create query report
Cluster queries
Determine # queries to analyze
Analyze clusters
Draw conclusions and ACT
9. Step 1: Create a query report
We started with the site with the most traffic
• Upper-bound limit
• One year’s data by quarter
• Cut off tail at frequency < 10
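A minimal sketch of the query report step, assuming the search log has already been exported as a plain list of query strings (the function name, the sample log, and the export format are illustrative assumptions, not part of the original process). It counts each query and drops the tail below the frequency cutoff of 10 mentioned above.

```python
from collections import Counter

def build_query_report(queries, min_frequency=10):
    """Count raw queries and drop the long tail below min_frequency."""
    counts = Counter(q.strip() for q in queries if q.strip())
    # Keep only queries at or above the cutoff, most frequent first.
    return [(q, n) for q, n in counts.most_common() if n >= min_frequency]

# Hypothetical log entries, not real Working Knowledge data.
sample_log = ["innovation"] * 12 + ["leadership"] * 10 + ["zara case"] * 3
report = build_query_report(sample_log)
```

Running the same function once per quarter's export would give the "one year's data by quarter" view.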
10. Step 1: Create a query report
HBS Working Knowledge FY12 Use Snapshot
Overall Traffic
Page Views: 6,439,485
Visits: 3,635,746
Unique visitors: 2,734,620
On-site searches: 174,425
Views per Visit: 1.77
Local Search visit rate: 5%
Organic Search visit rate: 46%
12. Step 2 (cont’d): Three levels of clustering
Level | Method | Example
Narrow | Simple normalization | Eliminate grammatical, spelling, typos, and punctuation differences
Mid-level | Group by subject | management, finance, decision making
Broad | Group by facet | topic, name, date, content type
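The narrow level can be partly automated. The sketch below is one possible approach, assuming a report of (query, frequency) rows. It handles only case, punctuation, and whitespace differences; spelling and typo variants would still need human review or a separate spell-checking step. The function names and sample rows are hypothetical.

```python
import re
from collections import defaultdict

def normalize(query):
    """Narrow-level key: lowercase, drop punctuation, collapse whitespace."""
    key = query.lower()
    key = re.sub(r"[^\w\s]", " ", key)       # punctuation -> space
    return re.sub(r"\s+", " ", key).strip()  # collapse runs of whitespace

def cluster_narrow(report):
    """Group (query, frequency) rows whose normalized keys match."""
    clusters = defaultdict(list)
    for query, freq in report:
        clusters[normalize(query)].append((query, freq))
    return dict(clusters)

# Hypothetical rows for illustration only.
rows = [("Balanced Scorecard", 40), ("balanced  scorecard?", 12), ("ethics", 30)]
clusters = cluster_narrow(rows)
```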
13. Step 2 (cont’d): Levels Tasks Enabled
Columns: Level; Improve your base for query analysis; Ensure representation of major clusters on your site; Improve Metadata/Index/Taxonomy; Improve Search Suggestions
Narrow (simple): X X X
Mid-level (group by subject): X X X
Broad (group by facet): X X
19. Step 2 (cont’d): List of facets we used
Facet | Example
content type | case studies, cases, working papers, articles, newspaper
date | 2011, world in 2030
demographic characteristics | women, Gen Y, gender, baby boomers
event | economic crisis
format | podcast, video
geographic area | india, japan, mount everest
industry | global wine industry
job type/role | independent director, entrepreneur, ceo, phd economist
organization name | ikea, zara, toyota
person name | michael porter, kanter, sebenius
product name / brand name | ipad
product/commodity | coffee, wine, cement
topic | this covers the majority of keywords
work | faculty work, ex: publication name, title of a case
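One way to sketch broad, facet-level tagging is a hand-maintained lookup of known terms per facet. The lookup below is hypothetical and seeded only with a few examples from the facet list; a real one would be built up by a person reviewing the clusters. Queries that match nothing fall back to "topic", since topic covers the majority of keywords.

```python
# Hypothetical, hand-maintained lookup seeded with examples from the facet list.
FACET_TERMS = {
    "format": {"podcast", "video"},
    "geographic area": {"india", "japan", "mount everest"},
    "organization name": {"ikea", "zara", "toyota"},
    "product/commodity": {"coffee", "wine", "cement"},
}

def tag_facets(query, facet_terms=FACET_TERMS, default="topic"):
    """Return the sorted facets whose known terms appear in the query."""
    text = f" {query.lower()} "  # pad so terms match on whole words only
    found = {
        facet
        for facet, terms in facet_terms.items()
        if any(f" {term} " in text for term in terms)
    }
    return sorted(found) or [default]
```

A simple lookup like this cannot tell "wine" the commodity from "global wine industry" the industry, which is one reason facet assignment still needs human judgment.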
20. Step 3: Choose #clusters to analyze
Number of Clusters Analyzed | Analyze Top Hits | Improve Metadata/Taxonomy/Index | Supply Search Suggestions
50 | X | |
150 | X | X |
300+ | X | X | X
21. Small # Clusters can cover a lot of your data
Number of top clusters | % Total Queries
Top 20 clusters | 14
Top 30 clusters | 18
Top 50 clusters | 26
Top 100 clusters | 37
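Coverage figures like these can be computed from the cluster totals. The sketch below assumes a list of per-cluster frequencies and a separately known total query count (which includes the tail that was never clustered). The function name and the numbers are illustrative, not the Working Knowledge data.

```python
def coverage_of_top(cluster_freqs, top_n, total_queries):
    """Percent of all queries accounted for by the top_n largest clusters."""
    top = sorted(cluster_freqs, reverse=True)[:top_n]
    return 100.0 * sum(top) / total_queries

# Hypothetical cluster frequencies and total query count.
freqs = [50, 30, 10, 5, 5]
share = coverage_of_top(freqs, top_n=2, total_queries=200)
```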
22. Now you have your clusters…
What do you do with them?
TAKE ACTION!
23. Analyze Top (“Short Head”) Clusters
Clustering has created a condensed and reliable
list of your top search queries
Are they what you thought they would be?
Does the information on your site accurately
represent the top searches?
Are you fulfilling user needs?
24. Use your clusters: Improve Site Navigation
Examine the short head of clusters:
For each cluster, add up the frequencies
of queries
Reorder clusters by cumulative frequency
descending
Ensure top clusters are accounted for in your
navigation
Use cluster topics as browse/navigation
headers/footers for your website
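The steps above can be sketched as follows, assuming each query row already carries a cluster label assigned during clustering. The row layout, function names, and navigation labels are assumptions for illustration.

```python
from collections import Counter

def rank_clusters(rows):
    """rows: iterable of (cluster_label, query, frequency) tuples."""
    totals = Counter()
    for cluster, _query, freq in rows:
        totals[cluster] += freq      # add up the frequencies per cluster
    return totals.most_common()      # cumulative frequency, descending

def missing_from_navigation(ranked, nav_labels, top_n=10):
    """Top clusters that have no matching navigation label."""
    nav = {label.lower() for label in nav_labels}
    return [c for c, _ in ranked[:top_n] if c.lower() not in nav]

# Hypothetical rows and navigation labels.
rows = [
    ("innovation", "innovation", 500),
    ("innovation", "innovations", 100),
    ("ethics", "ethics", 300),
    ("ethics", "business ethics", 50),
]
ranked = rank_clusters(rows)
gaps = missing_from_navigation(ranked, ["Innovation", "Leadership"])
```

The exact-label comparison is deliberately naive; in practice a person would judge whether an existing navigation heading already covers a cluster.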
25. WK Top Clusters
Cluster | Frequency
innovation | 867
balanced scorecard | 794
leadership | 570
cases | 545
social media | 508
negotiation | 470
knowledge management | 457
ethics | 448
apple | 430
corporate social responsibility | 398
26. Use your clusters: Improve Taxonomy
• Missing categories in browse taxonomy
• “Balanced Scorecard”
• “Ethics”
• “Social media”
• Second-level topics in the WK context
29. Mid-level clustering:
Informs editorial /curatorial activities
“Featured Topics”
o What topics to highlight this week/month/year
o News items to focus on
o What research guides to create
o How to formulate queries for the topics
30. Use your clusters: Improve Synonym Handling
Clustered list provides synonyms for taxonomy
Requires human judgment and
standards/guidelines for synonyms – in our
case, synonyms are exact
Map to one "like term" in the search engine
Example:
Balanced Scorecard, BSC, Balanced score card
kaplan and norton -> Balanced Scorecard
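A minimal sketch of the exact-match mapping, using only the variants from the example above. The dictionary and function name are hypothetical; an actual search engine would have its own synonym file format, which is not shown here.

```python
# Variant -> preferred term, taken from the Balanced Scorecard example.
SYNONYMS = {
    "balanced scorecard": "Balanced Scorecard",
    "bsc": "Balanced Scorecard",
    "balanced score card": "Balanced Scorecard",
    "kaplan and norton": "Balanced Scorecard",
}

def preferred_term(query, synonyms=SYNONYMS):
    """Map an exact-match variant to its preferred term, else return the query unchanged."""
    return synonyms.get(query.strip().lower(), query)
```

Because matching is exact, a longer query such as "kaplan and norton article" is left unchanged, which fits the "synonyms are exact" guideline.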
32. Time Commitment
• 2 hours to 2 weeks
• Variables include:
• What kind of information you want to gather
• How broad or narrow you want your clusters
• How many queries you analyze
• In our case ~2 person-weeks
• We had Sophy Bishop
• Intern, MSLIS student
33. Results vs. Time Invested
Time Invested | Analyze top clusters | Update Taxonomy | Create New Metadata | Determine New Search Suggestions
2 Hours | X | X | |
6 Hours | X | X | X |
One Week | X | X | X | X
34. Next Steps: Autosuggest
Your top clusters probably make up a large
percentage of what people are looking for
o Use them to establish/supplement
auto-suggest!
Example: suggestions for “innovation”
o innovation and leadership
o disruptive innovation
o innovation management
o open innovation
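One possible way to derive such suggestions from the query report is to rank logged queries that contain the typed term by frequency. This is a sketch under that assumption, not the method actually used; the frequencies below are invented, and only the query strings come from the example above.

```python
def suggest(term, query_freqs, limit=4):
    """Return the most frequent logged queries that contain `term` as a word."""
    term = term.lower()
    matches = [
        (q, n) for q, n in query_freqs.items()
        if term in q.lower().split() and q.lower() != term
    ]
    matches.sort(key=lambda item: (-item[1], item[0]))  # frequency desc, then A-Z
    return [q for q, _ in matches[:limit]]

# Hypothetical frequencies; only the query strings come from the slide.
freqs = {
    "innovation": 300,
    "disruptive innovation": 40,
    "open innovation": 25,
    "innovation management": 30,
    "innovation and leadership": 55,
    "ethics": 90,
}
suggestions = suggest("innovation", freqs)
```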
35. Next Steps: New Access Structures
Needed an obvious way to search podcasts
o Put in best bets for now
A lot of people searching for article titles
o Considering simple interface/approach for select
field-specific search, e.g. “title”
Consider adding other facets to browse
taxonomy where we have entities tagged
o “company name”, “job type/class”, etc.
36. Next Steps
SEO Input
o Advise authors to use top cluster terms in Titles,
Abstracts, Keywords
o Report on clusters in our monthly analytics reports
to faculty (“Top search topics/subjects in May 2012
were…”; “Searchers found your works with the
following queries”)
Repeat process on other sites/content
37. Summary
Establish a plan/process, but be willing to tweak
as you go
Keep it very simple.
Play with your data – the more we played, the better
we understood what benefits could be realized by
levels of clustering and effort
Tuning process/results
o Build staging/working prototypes
o Repeat process on other sites
TAKE ACTION!