1) The document discusses the evolution of search engines and algorithms over time, from early concepts like Hilltop and PageRank to more modern techniques like RankBrain that use neural networks.
2) It also examines how search engines have incorporated personalization and contextualization, using implicit and explicit user data and feedback to better understand search intent and tailor results.
3) The studies summarized found that most users expect to find information within the first 2 minutes of searching, spend little time viewing individual results, and refine queries iteratively as their understanding develops.
4. This is what users really think, or don't, when looking for information.
5. Hilltop Algorithm
Quality of links more important than quantity of links
Segmentation of corpus into broad topics
Selection of authority sources within these topic areas
Hilltop was one of the first algorithms to introduce the concept of machine-mediated "authority" to combat human manipulation of results for commercial gain (using link-blast services, viral distribution of misleading links). It is used by all of the search engines in some way, shape, or form.
The beauty of Hilltop is that, unlike PageRank, it is query-specific and reinforces the relationship between the authority and the user's query. You don't have to be big or have a thousand links from auto-parts sites to be an "authority." Google's 2003 Florida update, rumored to contain Hilltop reasoning, caused a lot of sites with extraneous links to fall from their previously lofty placements.
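The expert-link idea above can be sketched in a few lines. This is an illustrative toy, not the published Hilltop algorithm; the topics, expert pages, and link data are all invented:

```python
# Minimal sketch of the Hilltop idea: pick "expert" pages per topic,
# then score target pages by how many topic-relevant experts link to
# them -- quality of links over quantity. All data here is invented.

EXPERTS = {  # topic -> expert pages and the targets they link to
    "autos": {"expertA": ["pageX", "pageY"], "expertB": ["pageX"]},
}

def hilltop_score(query_topic, page):
    """Count distinct experts on the query's topic linking to the page."""
    experts = EXPERTS.get(query_topic, {})
    return sum(1 for links in experts.values() if page in links)

print(hilltop_score("autos", "pageX"))  # -> 2
print(hilltop_score("autos", "pageY"))  # -> 1
```

Because the score is computed per query topic, a page with many off-topic links gains nothing, which is the point of the approach.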
Photo: Hilltop Hohenzollern Castle in Stuttgart
6. Topic Sensitive Ranking (2004)
Consolidation of Hyperlink-Induced Topic Search [HITS] and PageRank
Pre-query calculation of factors based on subset of corpus
Context of term use in document
Context of term use in history of queries
Context of term use by user submitting query
Computes PR based on a set of representational topics [augments PR with content analysis]
Topic derived from the Open Directory Project
Uses a set of ranking vectors: Pre-query selection of topics + at-query comparison of the similarity of
query to topics
Creator now a Senior Engineer at Google
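The two-phase scheme the bullets describe — pre-query, per-topic scores plus an at-query weighting by query-topic similarity — can be sketched roughly as follows. The topics, term sets, and per-topic PageRank values are invented stand-ins, and Jaccard similarity is used purely for illustration:

```python
# Hypothetical sketch of topic-sensitive ranking: pre-computed,
# per-topic PageRank scores are blended at query time, weighted by
# how similar the query is to each topic. All names/data are invented.

TOPIC_TERMS = {  # terms characterizing each topic (e.g., from a directory)
    "autos":  {"engine", "brake", "tire", "car"},
    "health": {"diet", "exercise", "medicine", "car"},  # topics may overlap
}

# Pre-query phase: per-topic PageRank score for each document (assumed given)
TOPIC_PAGERANK = {
    "autos":  {"doc1": 0.7, "doc2": 0.1},
    "health": {"doc1": 0.2, "doc2": 0.8},
}

def topic_similarity(query_terms, topic_terms):
    """Jaccard similarity between the query's terms and a topic's term set."""
    overlap = query_terms & topic_terms
    return len(overlap) / len(query_terms | topic_terms)

def rank(query):
    """At-query phase: blend each topic's PageRank vector, weighted by similarity."""
    terms = set(query.lower().split())
    scores = {}
    for topic, prs in TOPIC_PAGERANK.items():
        w = topic_similarity(terms, TOPIC_TERMS[topic])
        for doc, pr in prs.items():
            scores[doc] = scores.get(doc, 0.0) + w * pr
    return sorted(scores, key=scores.get, reverse=True)

print(rank("car engine repair"))  # -> ['doc1', 'doc2']
```

The expensive part (per-topic PageRank) happens before any query arrives; only the cheap similarity weighting runs at query time, which is what makes the pre-query/at-query split attractive.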
7. About content: quality and freshness
About agile: frequent iterations and small fixes
About UX: or so it seems (Vanessa Fox/Eric Enge: click-through, bounce rate, conversion)
8. The recently released RankBrain uses thought vectors to revise queries
http://www.seobythesea.com/2013/09/google-hummingbird-patent/
Comparison of search query to general population search behavior around query
terms
Revises query and submits both to search index
Confidence score
Relationship threshold
Adjacent context
Floating context
Results a consolidation of both queries
Entity = anything that can be tagged as being associated with certain documents, e.g.
stores, news sources, product models, authors, artists, people, places, things.
Query logs (this is why they took away keyword data – they do not want us to reverse-engineer
it as we have in the past)
User behavior information: user profile, access to documents seen as related to the
original document, amount of time on a domain associated with one or more entities.
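The revise-and-consolidate pattern described above can be sketched as follows. This is not Google's implementation; the synonym table, confidence scores, threshold, and toy index are all assumptions for illustration:

```python
# Illustrative sketch (not Google's implementation) of the pattern the
# notes describe: revise a query, submit both original and revision to
# the index, and consolidate results -- keeping the revision's results
# only when a confidence score clears a relationship threshold.

SYNONYMS = {"attorney": "lawyer"}          # stand-in for a learned relationship
CONFIDENCE = {("attorney", "lawyer"): 0.9}  # invented confidence scores
THRESHOLD = 0.5                             # invented relationship threshold

INDEX = {                                   # toy inverted index
    "attorney": ["docA"],
    "lawyer":   ["docB", "docC"],
}

def revise(query):
    """Return a revised query plus the weakest confidence among substitutions."""
    terms = query.split()
    revised = [SYNONYMS.get(t, t) for t in terms]
    conf = min(CONFIDENCE.get((t, r), 1.0) for t, r in zip(terms, revised))
    return " ".join(revised), conf

def search(query):
    revised, conf = revise(query)
    results = list(INDEX.get(query, []))
    if revised != query and conf >= THRESHOLD:
        # consolidation: append the revision's results, de-duplicated
        results += [d for d in INDEX.get(revised, []) if d not in results]
    return results

print(search("attorney"))  # -> ['docA', 'docB', 'docC']
```

The confidence gate is what keeps a weakly related revision from polluting results: below the threshold, only the original query's results are returned.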
11. In 2003, Google acquired personalization technology Kaltix and founder Sep Kamvar, who has been head of
Google personalization since. He defines personalization as a "product that can use information given by the user to
provide a tailored, more individualized experience."
Uses implicit collection methods (software agents, enhanced proxy servers, cookies, session IDs) and explicit
ones (HTML forms, explicit user-feedback interaction, as in early Google personalization's "More Like This",
information provided knowingly by the user). Explicit collection becomes more accurate as the user shares more
about query intent and interests, and has higher precision than implicit collection.
Query Refinement
System adds terms based on past information searches
Computes similarity between query and user model
Synonym replacement
Dynamic query suggestions - displayed as searcher enters query
Results Re-ranking
Sorted by user model
Sorted by Seen/Not Seen
Personalization of results set
Calculation of information from 3 sources
User: previous search patterns
Domain: countries, cultures, personalities
GeoPersonalization: location-based results
Metrics used for probability modeling on future searches
Active: user actions in time
Passive: user toolbar information (bookmarks), desktop information (files), IP location, cookies
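The results re-ranking step ("sorted by user model") might look something like this sketch, which models the user profile and each result as term-frequency vectors and sorts results by cosine similarity to the profile. All profile and document data are invented:

```python
import math

# Hedged sketch of re-ranking against a user model: the profile (built
# from previous search patterns) and each result are term-frequency
# vectors; results are sorted by cosine similarity to the profile.

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    terms = set(a) | set(b)
    dot = sum(a.get(t, 0) * b.get(t, 0) for t in terms)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

user_model = {"python": 3, "search": 2}      # from previous search patterns
results = {
    "docX": {"python": 1, "tutorial": 1},
    "docY": {"gardening": 2},
}

reranked = sorted(results, key=lambda d: cosine(results[d], user_model),
                  reverse=True)
print(reranked)  # -> ['docX', 'docY']
```

The same similarity function can drive query refinement as well: candidate expansion terms whose vectors sit close to the user model are better bets than globally popular ones.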
14. Reconciling Information-Seeking Behavior With Search User Interfaces for the Web
(2006)
Users don’t always know what they want until they see results = gradual refinement
Search refinement principles
• Different interfaces (or at least different forms of interaction) should be available to
match different search goals.
• The interface should facilitate the selection of appropriate contexts for the search.
• The interface should support the iterative nature of the search task.
16. Real Time Search User Behavior: Jansen, Campbell, Gregg (April 2010)
The most frequent query accounted for 0.003% of the query set. Less than 8% of the
terms were unique.
More than 44% of the queries contained one term, 30% contained two terms, and
nearly 26% contained three terms or more. The average query length was 2.32 terms,
which is in line with that of traditional Web search. Moving to the term level of
analysis, there were 2,331,072 total terms used in all queries in the data set, with
3,477,163 total term pairs. There were 175,403 unique terms (7.5%) and 442,713
unique term pairs (12.7%), in line with Web search [3].
18. Miles Kehoe did a great post on LinkedIn with specifics if you need them
https://www.linkedin.com/pulse/solving-problems-enterprise-search-miles-kehoe?trk=hb_ntf_MEGAPHONE_ARTICLE_POST
Perform an audit
Get data
Test security
19. Image courtesy of https://almanac2010.wordpress.com/spiritual-new-supernatural/
and “What’s So Funny About Science” Sidney Harris (1977)
20. Daniel Tunkelang: Director of Engineering, Search, LinkedIn; Tech Lead, Local Search,
Google; Chief Scientist, Endeca
• Communicate with Users
• Entity detection is crucial
• Queries vary in difficulty. Recognize and adapt.
21. How Many Results Per Page? A Study of SERP Size, Search Behavior and User
Experience; Kelly & Azzopardi
(i) trust bias, where users trust the search engine to deliver the most relevant item
first, i.e., following the probability ranking principle [27], and (ii) quality bias, where
the behavior depends on the quality of the retrieval system. They concluded users are
more likely to click on highly ranked documents and that quality influences click
behavior, such that if the relevance of the items retrieved decreases, users click on
items that are less relevant, on average.
22. Optimizing Enterprise Search by Automatically Relating User Context to Textual
Document Content: Reischold, Kerschbaumer, Fliedl
Similar roles = similar searches
Role term vector for role rank (use limited number of user profiles to build)
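A minimal sketch of the role-term-vector idea, assuming profiles are simple role-plus-terms records: one term vector is built per role from a small sample of user profiles, and documents are scored by overlap with the role's vector. Role names and profile data are invented:

```python
from collections import Counter

# Sketch of the "role term vector" idea: similar roles issue similar
# searches, so build one term vector per organizational role from a
# limited number of user profiles, then use overlap with the role's
# vector as a ranking signal. All names and data here are invented.

profiles = {  # user -> (role, terms drawn from that user's searches)
    "alice": ("engineer", ["api", "deploy", "api"]),
    "bob":   ("engineer", ["deploy", "build"]),
    "carol": ("sales",    ["quota", "lead"]),
}

def role_vectors(profiles):
    """Aggregate term frequencies per role from the sampled profiles."""
    vectors = {}
    for role, terms in profiles.values():
        vectors.setdefault(role, Counter()).update(terms)
    return vectors

def role_rank(doc_terms, role, vectors):
    """Score a document by term overlap with the searcher's role vector."""
    vec = vectors.get(role, Counter())
    return sum(vec[t] for t in doc_terms)

vecs = role_vectors(profiles)
print(role_rank(["api", "deploy"], "engineer", vecs))  # -> 4
```

Building the vector from a limited sample of profiles, as the notes suggest, keeps the cost low while still capturing what a role typically searches for.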
23. Is Enterprise Search Useful At All? Lessons Learned From Studying User Behavior:
Stocker, Zoier, et al.
25. How Many Results Per Page? A Study of SERP Size, Search Behavior and User
Experience; Kelly & Azzopardi
26. Cost and Benefit Analysis of Mediated Enterprise Search: Wu, Thom, et al.
The Time Savings Times Salary (TSTS) methodology is most suitable for assessing
direct labor cost and benefit: time saved multiplied by worker value (salary) is used to calculate ROI.
Findings: Our case study has shown that the insurance company would get
substantial benefit by investing in relevance judgments. Since the cost for assessing a
query is fixed, the more a query is searched, the more benefit the company would
gain.
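A hedged worked example of the TSTS arithmetic, with all figures invented for illustration:

```python
# Worked example of the Time Savings Times Salary (TSTS) idea:
# benefit = time saved per search * number of searches * worker cost
# per hour, compared against the fixed cost of assessing a query.
# Every number below is an invented assumption, not data from the paper.

seconds_saved_per_search = 30         # assumed improvement from relevance judgments
searches_per_year = 10_000            # assumed frequency of the query
hourly_cost = 60.0                    # assumed fully loaded worker cost, $/hour

benefit = seconds_saved_per_search / 3600 * searches_per_year * hourly_cost
assessment_cost = 500.0               # assumed fixed cost of judging the query

roi = (benefit - assessment_cost) / assessment_cost
print(f"benefit=${benefit:.0f}, ROI={roi:.0%}")
```

Because the assessment cost is fixed while the benefit scales with search frequency, the ROI grows with every additional time the query is run, which is exactly the paper's point.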