This is a nice high-level summary of Gumshoe, the enterprise search engine built by our group, which currently powers IBM intranet search. It was one of the SIGIR 2011 Industrial Track keynote talks.
Building Search Systems for the Enterprise
1. Building Search Systems for the Enterprise
IBM Research – Almaden
ACM SIGIR 2011
Beijing, China
Yunyao Li (on behalf of Shivakumar Vaithyanathan)
2. Outline
• Search for the Enterprise
• Programmable Search (overview)
• Backend Analytics
• Search Runtime
• Foundations and Principles
• Concluding Remarks
3. Experience at IBM Internal Search
• IBM deployed a commercially available search engine
– Implementing standard IR techniques
• Search quality went down over time to the point that search results were unacceptable!
– Success (≥ 1 relevant result): 14% on top-1, 23% on top-5, 34% on top-50! [Zhu et al., WWW’07]
• To the administrators managing the engine, exposed knobs were insufficient
So, they implemented various solutions…
4. Attempts to Improve Search
• Enhanced link analysis by incorporating external links to/from external WWW
• Creative hacks: added fake terms to documents & queries
– # terms per document determined by “popularity”: how much TF increase is required for the needed rank boost?
• Hard-coded custom results for the top 1200+ queries
Didn’t help… Quality went down!
Maintenance nightmare: heuristic needs to be updated upon each nontrivial change in term stats/ranking parameters
Even bigger nightmare! How to deal with continuously changing terminology?
5. What are the Problems?
• Query “Network Station Manager” → Thin Client Manager
– Product names change: continually changing terminology!
• Query “Paula Summa” → bring Paula Summa from employee directories
• Query “popcorn” → conference call!
– Domain-specific meaning!
• Query “per diem”
– Result 1: IBM Travel: Per Diem
– Result 2: IBM Travel: Per Diem Rates
– Result 3: IBM Travel: National perdiems
– …
– Result 25: IBM Travel: Per Diem Policy
– Domain-specific repetitions!
These problems are not specific to enterprise search… but:
6. The Enterprise Challenge!
• Continually changing terminology!
• Domain-specific meaning!
• Domain-specific repetitions!
Generic search solution that is customizable and maintainable in every domain
• Simple customization with reasonable effort!
• Ongoing search-quality management
Programmable Search
7. Outline
• Search for the Enterprise
• Programmable Search (overview)
• Backend Analytics
• Search Runtime
• Foundations and Principles
• Concluding Remarks
8. Programmable Search: Main Idea
• Goals:
– Transparency
• Know “precisely” why every result item is being brought back
• Understand how changes in content/intents affect search
– Maintainability and “Debuggability”
• Ranking logic is guided by explicit rules
• Properly react to changes in content/intents
• Building blocks:
– Deep analytics on documents
– Domain-specific analysis of queries
– Transparent customizable rule-driven ranking
[Diagram: backend analytics, interpretations, runtime rules]
9. Implementation Architecture
• Backend (backend analytics): Distributed Analytics Platform
– Crawling, information extraction, token generation (TG), indexing
• Index
• Index and rule update services
• Frontend: Search runtime (interpretations, runtime rules)
10. Outline
• Search for the Enterprise
• Programmable Search (overview)
• Backend Analytics
• Search Runtime
• Foundations and Principles
• Concluding Remarks
11. Backend Analytics: 3 Parts
• Local Analysis (per-page analysis)
• Global Analysis (cross-page analysis)
• Token Generation (TG)
→ index
12. Local Analysis
• Categorizing pages
– Label pages by custom categories
• IBM examples: HR, person, IT help, ISSI, sales information,
marketing, corporate standards, legal & IP-law, …
– Geo classification
• Associate documents with the relevant countries & regions
• Annotating pages
– Identify HomePage annotation for people, projects,
communities, …
Note on geo classification: simply knowing where a page is physically hosted is not enough (example: Czech Republic hosts all pages for IBM in Europe)
13. Homepage Identification
• Title Extraction: intranet page → titles → matching title patterns → dictionary match (employee directory)
– Example title: G J Chaitin Home Page
– Result: Home Page for G J Chaitin
• URL Extraction: intranet page → URLs → matching URL patterns
– Example URLs:
• http://w3.ibm.com/hr/idp/
• http://w3-03.ibm.com/isc/index.html
• http://chis.at.ibm.com/
– Result: Homepage for: idp, isc, chis
• … many more …
More details in [Zhu et al., WWW’07]
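The two extraction paths above can be sketched in Python as follows. This is only an illustration: the regular expressions, the function names `homepage_from_title` and `homepage_from_url`, and the one-entry directory are assumptions, not the rules used in the deployed system (the slide defers to [Zhu et al., WWW’07] for those).

```python
import re
from urllib.parse import urlparse

# Hypothetical title patterns: "<name> Home Page" and "Home Page for/of <name>".
TITLE_PATTERNS = [
    re.compile(r"^(?P<name>.+?)\s+home\s*page$", re.IGNORECASE),
    re.compile(r"^home\s*page\s+(?:for|of)\s+(?P<name>.+)$", re.IGNORECASE),
]


def homepage_from_title(title, directory):
    """Return the person name if the title looks like a personal home page
    and the name is found in the (lower-cased) employee directory."""
    for pattern in TITLE_PATTERNS:
        match = pattern.match(title.strip())
        if match:
            name = match.group("name").strip()
            if name.lower() in directory:
                return name
    return None


def homepage_from_url(url):
    """Return a candidate homepage key: the last path segment (ignoring a
    trailing index.* file), or the first host label when the path is empty."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    if segments and re.fullmatch(r"index\.\w+", segments[-1]):
        segments.pop()
    if segments:
        return segments[-1]
    host = parsed.hostname or ""
    return host.split(".")[0] or None
```

Applied to the examples on the slide, the title path yields "G J Chaitin" and the URL path yields idp, isc, and chis. The host-label fallback is deliberately naive and would need more URL patterns in practice.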
14. Role of Global Analysis
Among the 38 pages with the exact same title, which is the best for “Paula Summa”?
15. Token Generation (TG)
Annotated values → TG → Index content
• Person: Ching-Tien T. (Howard) Ho
– personNameTG: Ho Ching-Tien; Tien Ho; Ho, Tien; Howard Ho; Ching-Tien H.; ...
– nGramTG: Howard; Ho; Ching; Tien; ...
• Title: Global Technology Services
– acronymTG: gts
– spaceTG: GlobalTechnologyServices
– nGramTG: Global Technology Services; Global Technology; Technology Services; Global; Technology; ...
• …
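A minimal Python sketch of the Title token generators above, under stated assumptions: the function names, the lower-casing of index tokens, and the dictionary-based index layout are illustrative choices, and personNameTG is left out because its variant rules are not spelled out on the slide. The index is keyed by (annotation type, TG type), so each combination is stored separately.

```python
from collections import defaultdict


def ngram_tg(value):
    """All contiguous word n-grams of the annotated value."""
    words = value.split()
    return [" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)]


def space_tg(value):
    """The value with all whitespace removed."""
    return ["".join(value.split())]


def acronym_tg(value):
    """Lower-cased initials of the words in the value."""
    return ["".join(word[0] for word in value.split()).lower()]


# Which token generators apply to which annotation type (assumed mapping).
TOKEN_GENERATORS = {
    "Title": {"acronymTG": acronym_tg, "spaceTG": space_tg, "nGramTG": ngram_tg},
}


def build_index(annotations):
    """annotations: iterable of (page, annotation_type, annotated_value).
    Returns {(annotation_type, tg_name): {token: set_of_pages}}."""
    index = defaultdict(lambda: defaultdict(set))
    for page, annotation, value in annotations:
        for tg_name, tg in TOKEN_GENERATORS.get(annotation, {}).items():
            for token in tg(value):
                index[(annotation, tg_name)][token.lower()].add(page)
    return index
```

For the title "Global Technology Services", this produces "gts" under (Title, acronymTG), "globaltechnologyservices" under (Title, spaceTG), and every contiguous word n-gram under (Title, nGramTG).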
16. Outline
• Search for the Enterprise
• Programmable Search (overview)
• Backend Analytics
• Search Runtime
• Foundations and Principles
• Concluding Remarks
18. Runtime Flow in More Detail
• Phase 1: Query Semantics
– query → rewrite rules → queries → interpretations → partially ordered interpretations
• Phase 2: Relevance Ranking
– interpretations execution → partially ordered results → result aggregation → ordered results
• Phase 3: Result Construction
– grouping rules → ordered & grouped results → re-ranking rules → final results
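The three phases above can be read as a simple function pipeline. The Python sketch below is a toy under loud assumptions: the index is a plain dictionary from query string to an already-ranked page list (standing in for the real ranking), rules are ordinary callables, and all names are hypothetical.

```python
def phase1_query_semantics(query, rewrite_rules):
    """Produce the original query plus any variants the rewrite rules add.
    The index is not touched in this phase."""
    queries = [query]
    for rule in rewrite_rules:
        variant = rule(query)
        if variant and variant not in queries:
            queries.append(variant)
    return queries


def phase2_relevance_ranking(queries, index):
    """Run each query against the toy index and aggregate the results,
    keeping the first occurrence of each page."""
    seen, ordered = set(), []
    for q in queries:
        for page in index.get(q, []):
            if page not in seen:
                seen.add(page)
                ordered.append(page)
    return ordered


def phase3_result_construction(results, post_rules):
    """Apply administrator-supplied grouping and re-ranking rules in order."""
    for rule in post_rules:
        results = rule(results)
    return results


def search(query, rewrite_rules, index, post_rules):
    queries = phase1_query_semantics(query, rewrite_rules)
    ranked = phase2_relevance_ranking(queries, index)
    return phase3_result_construction(ranked, post_rules)
```

A usage example would pass one rewrite rule (for instance, replacing a trailing "installation" with "ISSI") and one post rule that promotes a chosen category; the order in which query variants are executed here is an arbitrary choice.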
19. Runtime Rules: Pattern-Action Language
Query pattern → Action
• Query pattern: pattern expression, matched against the keyword query
• Action: performed when the pattern matches
Examples:
• EQUALS [r=ibm|information|info] [d=COUNTRY]
– Queries matching: ibm germany; info india
– Possible action: rewrite into “[country] hr” (e.g., germany hr)
• ENDS_WITH installation
– Queries matching: acrobat installation; db2 on aix installation
– Possible action: replace installation with ISSI (e.g., acrobat ISSI)
• CONTAINS directions to [d=SITE]
– Queries matching: driving directions to almaden; directions to watson from jfk
– Possible action: pages of “siteserv” category should be ranked higher
• STARTS_WITH [d=PERSON]
– Queries matching: john kelly biography; steve mills announcement
– Possible action: group together pages that represent blog entries
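One way to read the pattern language above is to compile each pattern into a regular expression. The Python sketch below does that under assumptions that go beyond the slide: `[r=...]` is treated as a regular-expression alternative, `[d=NAME]` as a lookup in a named dictionary, and the dictionary contents and function names are made up for illustration.

```python
import re

# Toy dictionaries; the real ones would come from backend analytics.
DICTIONARIES = {
    "COUNTRY": ["germany", "india"],
    "SITE": ["almaden", "watson"],
    "PERSON": ["john kelly", "steve mills"],
}

ELEMENT = re.compile(r"\[(?P<kind>[rd])=(?P<body>[^\]]+)\]|(?P<word>\S+)")


def compile_pattern(mode, expression):
    """Translate a (mode, expression) query pattern into a compiled regex."""
    parts = []
    for m in ELEMENT.finditer(expression):
        if m.group("word"):                      # literal keyword
            parts.append(re.escape(m.group("word")))
        elif m.group("kind") == "r":             # regex alternative
            parts.append("(?:" + m.group("body") + ")")
        else:                                    # dictionary match, captured by name
            name = m.group("body")
            alternatives = "|".join(re.escape(e) for e in DICTIONARIES[name])
            parts.append("(?P<" + name + ">" + alternatives + ")")
    core = r"\s+".join(parts)
    anchored = {
        "EQUALS": "^" + core + "$",
        "STARTS_WITH": "^" + core + r"(?:\s|$)",
        "ENDS_WITH": r"(?:^|\s)" + core + "$",
        "CONTAINS": r"(?:^|\s)" + core + r"(?:\s|$)",
    }[mode]
    return re.compile(anchored)


# Two rewrite rules from the slide, as (pattern, action) pairs.
REWRITE_RULES = [
    (compile_pattern("EQUALS", "[r=ibm|information|info] [d=COUNTRY]"),
     lambda q, m: m.group("COUNTRY") + " hr"),
    (compile_pattern("ENDS_WITH", "installation"),
     lambda q, m: q[: -len("installation")] + "ISSI"),
]


def rewrite(query):
    """Apply the first matching rewrite rule; return the query unchanged otherwise."""
    q = query.lower().strip()
    for pattern, action in REWRITE_RULES:
        match = pattern.search(q)
        if match:
            return action(q, match)
    return q
```

With these rules, "ibm germany" is rewritten to "germany hr" and "acrobat installation" to "acrobat ISSI". First-match-wins is a simplification; the runtime flow on the previous slide suggests a query can yield several interpretations.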
21. What’s Best for Benefits?
The most important IBM page for benefits changes over time: currently it is netbenefits
23. Interpretations
Scenario: An IBM employee wants to download Lotus Symphony 1.3
Query: download symphony 1.3
Runtime interpretation: download symphony 1.3 → category=issi software=symphony 1.3
[Figure: runtime flow diagram]
24. Complex Rules
Query: java jim
Results: people with first name Jim
How can we avoid pages from people category?
25. Complex Rules
Query: java
java jim and not in person category
[Figure: runtime flow diagram]
27. Recall: Token Generation (TG)
Annotated values → TG → Index content
• Person: Ching-Tien T. (Howard) Ho
– Person + personNameTG: Ho Ching-Tien; Tien Ho; Ho, Tien; Howard Ho; Ching-Tien H.; ...
– Person + nGramTG: Howard; Ho; Ching; Tien; ...
• Title: Global Technology Services
– Title + acronymTG: gts
– Title + spaceTG: GlobalTechnologyServices
– Title + nGramTG: Global Technology Services; Global Technology; Technology Services; Global; Technology; ...
• …
28. Relevance buckets
Annotation + TG → Relevance Bucket
• Person + personNameTG
• Person + nGramTG (e.g., Howard; Ho; Ching; Tien; ...)
• Title + acronymTG
• Title + spaceTG (e.g., GlobalTechnologyServices)
• Title + nGramTG
• …
Properties:
• Buckets are ranked
– Based on annotation type
– Based on TG quality
• A page can belong to multiple buckets
• Within each bucket, ranking is by conventional IR
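The bucket properties above can be sketched as a two-level sort. In this Python illustration the bucket order is assumed to be the order in which the buckets are listed on the slide, a page that falls into several buckets is placed in its best one, and the IR score is any number where higher is better. None of these choices is confirmed by the slides.

```python
# Assumed bucket order, best first.
BUCKET_ORDER = [
    ("Person", "personNameTG"),
    ("Person", "nGramTG"),
    ("Title", "acronymTG"),
    ("Title", "spaceTG"),
    ("Title", "nGramTG"),
]
BUCKET_RANK = {bucket: rank for rank, bucket in enumerate(BUCKET_ORDER)}


def rank_by_buckets(hits):
    """hits: iterable of (page, bucket, ir_score).
    Orders pages by best bucket first, then by IR score within the bucket."""
    best = {}
    for page, bucket, score in hits:
        key = (BUCKET_RANK[bucket], -score)   # smaller key means better position
        if page not in best or key < best[page]:
            best[page] = key
    return sorted(best, key=best.get)
```

A page that matches only through Title + nGramTG therefore ranks below any page that matches through Person + personNameTG, regardless of how high its IR score is.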
31. Grouping Rules
• Grouping rules define how search results should be grouped together
• Search administrators can improve the diversity of search results (on the 1st page)
– Based on their familiarity with the data sources
Query pattern → Group pages of the same category
• per diem → travel, you-and-ibm
• ANY → ISSI, IT Help Central, Forum, Bluepedia, Media Library, …
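A grouping rule of the kind above can be sketched as collapsing all pages of a listed category into one group, placed where the first page of that category appeared. This Python sketch assumes results arrive as (page, category) pairs already in ranked order; the function name and the "finance" category used in the example are hypothetical.

```python
def apply_grouping(results, grouped_categories):
    """results: ranked list of (page, category).
    Returns a list of groups (lists of pages) in ranked order."""
    groups = []
    slot = {}  # category -> position of its group in `groups`
    for page, category in results:
        if category in grouped_categories:
            if category in slot:
                groups[slot[category]].append(page)
                continue
            slot[category] = len(groups)
        groups.append([page])
    return groups


ranked = [
    ("IBM Travel: Per Diem", "travel"),
    ("IBM Travel: Per Diem Rates", "travel"),
    ("Expense FAQ", "finance"),
    ("IBM Travel: National perdiems", "travel"),
]
grouped = apply_grouping(ranked, {"travel", "you-and-ibm"})
```

Here the three travel pages collapse into a single first group, which frees a slot on the first page for the remaining result.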
32. Flooding with Similar Pages
Need first page diversity
33. Grouping Rule to the Rescue
Query: per diem
Grouping rule: per diem → travel, you-and-ibm
[Figure: runtime flow diagram]
34. Re-ranking Rules
• Re-ranking rules adjust ranking of search results based on categories
• Example: search administrator specifies the important sources of “hot/current topics”
Hot topics → Rank these categories higher
• smarter planet, cloud computing, centennial, … → Bluepedia, News, About-IBM
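The hot-topics rule above can be sketched as a stable sort that moves boosted categories to the front while keeping the existing order otherwise. In this Python illustration the exact-match test on the query, the set names, and the (page, category) result shape are assumptions.

```python
HOT_TOPICS = {"smarter planet", "cloud computing", "centennial"}
BOOSTED_CATEGORIES = {"Bluepedia", "News", "About-IBM"}


def rerank_hot_topics(query, results):
    """results: ranked list of (page, category).
    For hot-topic queries, boosted categories move ahead of the rest;
    Python's sort is stable, so relative order inside each part is kept."""
    if query.lower().strip() not in HOT_TOPICS:
        return list(results)
    return sorted(results, key=lambda r: r[1] not in BOOSTED_CATEGORIES)
```

For the query "cloud computing", a News page ranked second and a Bluepedia page ranked fourth would move to positions one and two, with the other results following in their original order.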
35. Re-ranking Rule for Hot Topics
Hot topics → Rank these categories higher
• smarter planet, cloud computing, centennial, … → Bluepedia, News, About-IBM
[Screenshot callouts: Bluepedia; Technical News; Homepages of “About IBM”]
36. Re-ranking Rules for Person Queries
• [d=PERSON] → executive_corner, media_library, organization_chart, files
Query: Paula Summa
[Screenshot callouts: media_library; executive_corner]
[Figure: runtime flow diagram]
38. What Administrators Need…
Recap:
• Search administrators have major problems with an opaque search engine
• Programmable search provides
– Customization to the specific domain
– Ongoing search-quality management
Okay… but:
The proof of the pudding is in the eating!
The people in charge of search are not the SIGIR audience; they are IT admins; hence, all they can do are these hacks and hardcoding.
“It may be the case that a day before, Thin Client Manager meant something else; so, intents change overnight as well.”
So we have different types of tokenization applied to the different types of annotated items; for each annotation type and TG type, the result is stored in a separate part of the index. In a few slides, I will explain how we use that during runtime.
In phase 1, we manipulate the search query, add variants and so on, without touching the index. The result is a set of queries. Next, in phase 2, we run the queries against the index and apply ranking, by a combination of conventional IR and relevance buckets that I will describe shortly. In phase 3, we build the final result by invoking the grouping and re-ranking rules supplied by the admins.
This slide gives a more detailed view of the runtime flow. (Mouse click.) And this is where the three phases are. Next, I will discuss the different actions in the boxes here.