WSDM 2011 - Nicolaas Matthijs and Filip Radlinski
1. Personalizing Web Search using Long Term Browsing History. Nicolaas Matthijs (University of Cambridge, UK), Filip Radlinski (Microsoft, Vancouver). WSDM 2011, 10/02/2011
Talking about the paper (say title), which was done as part of my Master's thesis at the University of Cambridge, supervised by Filip from Microsoft Research.
A search for "IR" is a short, ambiguous query. For the search engine it looks the same for every user, even though the information need is different:
=> Physicist: more likely to be interested in InfraRed
=> Attendee of this conference: more likely to be interested in Information Retrieval
=> Stock broker: more likely to be interested in stock information for International Rectifier
All are presented the same ranking => not optimal.
Personalization should place more emphasis on each user's interests.
There is quite a lot of research on personalized web search, but in general we see 2 different approaches. PClick is the best clickthrough-based approach that we found, and we compare against it. Teevan is the best profile-based approach that we found, and we compare against it as well.
3 major goals:
- Improve personalization
- Improve evaluation
- Create a tool that people can use
Search personalization is a 2-step process: the first step is extracting the user's interests, the second is re-ranking the search results. The user is represented by the following things; the last 2 can be trivially extracted from the browsing history, while the user profile has to be learned (a small sketch of this representation follows).
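A minimal sketch of what such a user representation might look like. The learned term profile is described on the next slides; the other two fields (visited URLs and past queries) are my assumption about the items that come straight from the browsing history:

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """Learned from the browsing history: one weight per term."""
    term_weights: dict[str, float] = field(default_factory=dict)

@dataclass
class UserRepresentation:
    """Assumed composition: the learned profile plus two items read
    directly off the browsing history."""
    profile: UserProfile
    visited_urls: dict[str, int] = field(default_factory=dict)  # URL -> visit count
    past_queries: list[str] = field(default_factory=list)
```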
We use the structure encapsulated in the HTML code: title, metadata description, full text, metadata keywords, extracted terms, noun phrases. We specify how important each data source is --> we limited ourselves to giving each data source a weight of 0, 1, or a relative weight.
Term filtering: WordNet: include only terms with a given set of PoS tags. N-Gram: only include terms that appear more than a given number of times on the web. (A rough extraction and filtering sketch follows.)
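A rough sketch of how terms could be pulled from each HTML data source and filtered. The use of BeautifulSoup, the helper names, and the exact filter inputs are illustrative assumptions, not the actual implementation:

```python
import re
from bs4 import BeautifulSoup  # assumption: any HTML parser would do

# Assumed per-source weights: 0 (ignore), 1, or a relative weight.
SOURCE_WEIGHTS = {"title": 1.0, "meta_description": 1.0,
                  "meta_keywords": 0.0, "full_text": 1.0}

def extract_sources(html: str) -> dict[str, str]:
    """Split a page into the data sources listed on the slide."""
    soup = BeautifulSoup(html, "html.parser")
    meta = {m.get("name", "").lower(): m.get("content", "")
            for m in soup.find_all("meta")}
    return {
        "title": soup.title.get_text() if soup.title else "",
        "meta_description": meta.get("description", ""),
        "meta_keywords": meta.get("keywords", ""),
        "full_text": soup.get_text(" "),
    }

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

def keep_term(term: str, wordnet_pos: dict[str, str], allowed_pos: set[str],
              ngram_count: dict[str, int], min_web_count: int) -> bool:
    """WordNet/PoS filter plus N-Gram web-frequency filter (both are
    assumptions about how the filters on the slide were realized)."""
    return (wordnet_pos.get(term) in allowed_pos
            and ngram_count.get(term, 0) > min_web_count)
```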
Calculate a weight for each term. The frequency vector holds the number of occurrences of the term in each of the data sources.
TF weighting: dot product of the weight vector and the frequency vector.
TF-IDF: divide by the log of the document frequency. Normally the document frequency would be calculated from the browsing history, but a word that shows up a lot in your browsing history may actually be exactly what is relevant about you relative to all the information on the internet --> so we used the Google N-Gram counts for the document frequency instead.
pBM25: N = number of documents on the internet (derived from the Google N-Gram corpus), n_ti = number of documents containing the term (also from the N-Gram data), R = number of documents in the browsing history, r_ti = number of documents in the browsing history that contain the term.
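A small sketch of the three weighting schemes as I read them off the slide. The pBM25 expression below is the standard Robertson/Sparck-Jones relevance weight, which I assume is what "pBM25" refers to here; the df > 1 guard is an implementation assumption:

```python
import math

def tf_weight(freq_vector: dict[str, float], source_weights: dict[str, float]) -> float:
    """TF: dot product of the per-source weight vector and the frequency vector."""
    return sum(source_weights.get(src, 0.0) * f for src, f in freq_vector.items())

def tfidf_weight(tf: float, df: int) -> float:
    """TF-IDF as on the slide: TF divided by the log of the document frequency,
    with DF taken from the Google N-Gram counts (assumes df > 1)."""
    return tf / math.log(df)

def pbm25_weight(r_ti: int, n_ti: int, R: int, N: int) -> float:
    """Assumed pBM25 form: the Robertson/Sparck-Jones relevance weight, with
    the browsing history playing the role of the relevant document set.
    N and n_ti come from the Google N-Gram data; R and r_ti from the history."""
    return math.log(((r_ti + 0.5) * (N - n_ti - R + r_ti + 0.5)) /
                    ((n_ti - r_ti + 0.5) * (R - r_ti + 0.5)))
```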
2nd step: re-rank the results given the user profile. As previously shown, re-ranking snippets is just as good as re-ranking the full documents, since snippets are less noisy and more keyword-focused; it also makes for a more realistic implementation.
The score is an indication of how relevant the result is for the current user.
Matching: sum over all snippet terms of the frequency of the term in the snippet times the weight of the term.
Unique matching: ignore multiple occurrences of the same term.
Language model: probability of the snippet given the user profile.
Extra weight for previously visited pages: an extension of the PClick concept.
(A sketch of these scoring functions follows.)
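A rough sketch of the scoring variants named above. The smoothing in the language-model score, the shape of the result records, and the size of the visited-page bonus are assumptions, not the paper's exact implementation:

```python
import math
from collections import Counter

def matching_score(snippet_terms: list[str], profile: dict[str, float]) -> float:
    """Matching: sum over snippet terms of term frequency times profile weight."""
    counts = Counter(snippet_terms)
    return sum(freq * profile.get(t, 0.0) for t, freq in counts.items())

def unique_matching_score(snippet_terms: list[str], profile: dict[str, float]) -> float:
    """Unique matching: each distinct snippet term counted once."""
    return sum(profile.get(t, 0.0) for t in set(snippet_terms))

def lm_score(snippet_terms: list[str], profile: dict[str, float]) -> float:
    """Language model: log-probability of the snippet given the profile,
    with add-one-style smoothing (an assumption)."""
    total = sum(profile.values()) + len(profile) or 1.0
    return sum(math.log((profile.get(t, 0.0) + 1.0) / total) for t in snippet_terms)

def rerank(results: list[dict], profile: dict[str, float],
           visited_urls: set[str], visit_bonus: float = 1.0) -> list[dict]:
    """Re-rank results by score; previously visited URLs get an extra boost
    (extension of the PClick idea; the bonus size is an assumption).
    Each result is assumed to carry 'url' and 'snippet_terms' fields."""
    def score(r: dict) -> float:
        s = matching_score(r["snippet_terms"], profile)
        if r["url"] in visited_urls:
            s += visit_bonus
        return s
    return sorted(results, key=score, reverse=True)
```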
Evaluation is difficult --> we want to show how the personalization impacts day-to-day search activity. The first step is an offline relevance-judgment exercise in which we try to come up with parameter configurations that work well. The second step is a large-scale online evaluation to check how well those parameter configurations generalize to unseen users and browsing histories, and whether they make a difference in real life.
We chose implicit data collection => we don't want to require additional user actions. A unique identifier is generated for every user => anonymous. On every page visit it would store the URL / length of the HTML / duration of the visit / time and date, except for secure HTTPS pages. The records are stored in a database => the server would then fetch the actual HTML. (A minimal sketch of the per-visit record follows.)
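A minimal sketch of the per-visit record implied above; the field names are assumptions:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PageVisit:
    # Logged for every non-HTTPS page view; user_id is the anonymous identifier.
    user_id: str
    url: str
    html_length: int        # length of the HTML
    visit_duration_s: float # duration of the visit
    visited_at: datetime    # time and date
```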
Relevance judgments: Not Relevant (0), Relevant (1), Very Relevant (2). Normalized Discounted Cumulative Gain (NDCG) is used as the rank quality score.
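For reference, a short sketch of NDCG over the 0/1/2 judgments; whether the paper used this plain graded gain or the 2^rel - 1 variant is not stated here, so the plain form is an assumption:

```python
import math

def dcg(rels: list[int]) -> float:
    """Discounted cumulative gain over graded relevance labels (0, 1, 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

def ndcg(rels: list[int]) -> float:
    """NDCG: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# e.g. ndcg([2, 0, 1]) < ndcg([2, 1, 0]) == 1.0
```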
MaxNDCG = approach that yielded the highest average NDCG score (0.568 vs. 0.506)
MaxQuer = approach that improved the highest number of queries (52 out of 72)
MaxBestParam = obtained by greedily selecting each parameter in a given order
MaxNoRank = best approach that doesn't take the Google ranking into account --> interesting that we were able to find an approach that outperformed Google on its own; later we found that this was probably a case of overfitting the training data, as it didn't generalize in the online evaluation.
Using the entire list of words performed considerably worse
Interleaved evaluation: present a single ranking that interleaves the 2 rankings --> users' clicks tell us which one is of higher quality. (A sketch of one common interleaving scheme follows.)
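A sketch of one common interleaving scheme, Team-Draft interleaving; this slide does not say which exact variant was used, so treat it as illustrative:

```python
import random

def team_draft_interleave(ranking_a: list[str], ranking_b: list[str]) -> tuple[list[str], dict[str, str]]:
    """Build a single interleaved list and remember which 'team' contributed
    each result, so clicks can later be credited to ranking A or ranking B."""
    interleaved: list[str] = []
    team: dict[str, str] = {}
    picked: set[str] = set()
    picks_a = picks_b = 0
    all_docs = set(ranking_a) | set(ranking_b)
    while len(picked) < len(all_docs):
        # The side with fewer picks goes next; ties are broken by a coin flip.
        a_turn = picks_a < picks_b or (picks_a == picks_b and random.random() < 0.5)
        source, label = (ranking_a, "A") if a_turn else (ranking_b, "B")
        candidate = next((d for d in source if d not in picked), None)
        if candidate is None:
            # This side is exhausted; let the other side finish the list.
            source, label = (ranking_b, "B") if a_turn else (ranking_a, "A")
            candidate = next((d for d in source if d not in picked), None)
            if candidate is None:
                break
        picked.add(candidate)
        interleaved.append(candidate)
        team[candidate] = label
        if label == "A":
            picks_a += 1
        else:
            picks_b += 1
    return interleaved, team

# Clicks on results credited to team A vs. team B indicate which ranking wins.
```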