This document discusses query understanding and how it can help improve search relevance. It addresses problems related to precision such as stemming proper nouns, not respecting phrases, ambiguous keywords, and nomenclature mismatches. Problems related to recall discussed include missing data, missing structured data, and nomenclature mismatches. The document outlines both easy and harder solutions to these problems including using human mappings, existing data sources, heuristics, and machine learning. It emphasizes that query understanding focuses on user intent and can help achieve better search relevance.
Hi, my name is Gio. I’m here to talk about Query Understanding. Before we get started let me tell you a little bit about myself.
I started off my career working in the finance industry for a subsidiary of Standard & Poor's. I spent most of my time there building a database, real-time processing system, and search engine for globally sourced financial documents. My life was endless data gymnastics.
It was interesting and challenging work, but surprisingly, finance didn’t really make me feel all warm and fuzzy inside so at some point I decided to make a break for it.
I had the good fortune of spending 6.5 years at Etsy. I mostly focused on search and data-driven product development. I led the Search Experience Team, which was responsible for the search user experience from the ground up, including stuff like autosuggest, faceting, result treatments, taxonomy, mobile UX, etc. Most recently I led the Search Ranking team, which was responsible for the fairness and effectiveness of Etsy's core search algorithm.
I left Etsy earlier this year to co-found Related Works with some close friends. We're a small consulting shop that's helping businesses, probably like yours, unpack their data and search problems to ultimately improve their bottom line.
Let me give you some idea of what we’re going to talk about.
First we’re going to talk about why we would consider using query understanding techniques.
Then we’ll define query understanding and set a foundation for the approach.
We’ll dig into the nitty gritty by looking at some user problems.
Before talking about what’s at stake when you do Query Understanding work.
Why should you care about Query Understanding? Well...what are we trying to do? What's our goal?
You are probably at this conference/meetup because there’s some search experience back home you’d love to make better. As designers of information retrieval systems, I would posit that our goal is relevance, which is vague and has probably had many many definitions over the years. Let’s get more specific.
We start with a user that has some specific information need. Maybe it's a student writing a report about Pluto. They'd like to understand why Pluto is no longer a planet.
The universe, or our system, has information that might satisfy this need. Our challenge is to bridge this divide and connect the user with the information they need.
Typically the user’s information need takes the form of a query or keywords, for instance, the student might type the word “pluto” in the search box on Google.
The information retrieval system represents information as documents. In Google’s case, a document is a website. The goal is to connect our student with the most useful websites with the keyword pluto.
Relevance is how well the documents returned satisfy the user's information need. Were they able to figure out why Pluto is no longer a planet? Measuring this is hard, but pretend it's not for the sake of this exercise.
Query Understanding is a tool we can use to achieve relevance.
So the next obvious question is - if not QU, then what else? How can we make search engines relevant? We’ll examine two alternatives as motivating context.
The first alternative is what I’ll call statistical approaches to relevance.
Given a document, you’ll do some math, and come up with a figure that tries to quantify how good of a fit this document is for the query in question, which you can use to sort the results. The key idea is to use the demographics of your corpus to help you understand what’s important.
The most famous incarnation of statistical relevance has been with us since the 70s. I’m sure you all know TFIDF. It’s by no means the only form of statistical or probabilistic relevance. BM25, infogain, there are tons. But TFIDF is surprisingly effective and very understandable, so it’s useful for this demonstration.
Here’s my grossly simplified version. You won’t find the words in square brackets on wikipedia, I’ve added them for effect/clarity.
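In code, the same idea looks roughly like this: a toy TFIDF scorer in Python. The three-document corpus is made up for illustration, and this is a sketch of the classic formula, not anyone's production scorer.

```python
import math

# Toy corpus: three tiny "documents" as token lists.
corpus = {
    "doc1": "pluto is no longer a planet".split(),
    "doc2": "pluto the disney dog".split(),
    "doc3": "the planet mars".split(),
}

def tfidf(term, doc_id):
    doc = corpus[doc_id]
    # [term frequency] how often the term appears in this document
    tf = doc.count(term) / len(doc)
    # [document frequency] how many documents mention the term at all
    df = sum(1 for d in corpus.values() if term in d)
    # rare terms get a boost; ubiquitous terms get discounted
    idf = math.log(len(corpus) / df)
    return tf * idf
```

Notice how "mars" outscores "the" in doc3: the corpus demographics tell us "the" is everywhere, so it carries less signal.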
The trouble with statistical approaches, like TFIDF, is that they are generally hyper-focused on documents. But what if your documents are misleading you?
What’s more, where’s the user in this equation? You could argue that the user provides the terms. But is that sufficient?
Let’s look at an example.
Hmm. It looks like Flipkart doesn’t really have any fanny packs.
Or do they?
Document-focused approaches generally assume that users and documents are using the same language, and it turns out that’s almost never true. Your content creators are probably experts in their field. They probably use very precise, arcane nomenclature. Demographically they may be very different than your users, and thus lack the same shared cultural context. Your content creators may not even be human!
A popular alternative to statistical relevance is machine-learned relevance, or learning2rank.
Learning2rank frames relevance as a supervised learning problem. The simplest formulation of this task is called a point-wise model and it looks something like this.
Given the document and optionally a query or some context, the model will score that document. You can use that score to sort the results.
You train the model by showing it tons of examples. For instance, your examples might look like a document, a query, and a yes/no answer for whether or not that document is relevant for that query.
You can have humans hand-label examples, or often folks will harvest user behavior to produce them.
But Learning2rank suffers from the same pathologies as any machine-learned approach.
I recently gave a talk that basically covers all three of these in harrowing detail. You can find it on my website: giokincade.com.
But there’s a much more insidious reason vanilla learning2rank can let you down. I call it the poisoned result-set problem.
Even if you do a reasonable job of ordering the results, the presence of one or two viscerally irrelevant items can completely ruin the experience for your users. If you show folks dresses when you are searching for dress shirts, they will lose faith in your search engine. Which means they will try to hack around it by crafting complex queries, or they will just leave.
You can think of the problem of removing irrelevant results as akin to filtering email spam. We want to get rid of the junk.
Yes, I religiously inbox zero. I know, I’m an asshole.
Nobody uses ranking models to filter spam.
The trouble is, ranking is not the same thing as filtering out irrelevant results. They are fundamentally two different tasks, which will need different strategies, different models, and different evaluation metrics. If you try to accomplish both with one model, you will probably do neither very well.
It's easy to fall prey to the temptation of just "trusting the models" to solve your relevance problems. I have personally wasted a lot of time, and seen teams waste a lot of time, making sacrifices at this altar.
Having understood why other common strategies for achieving relevance can fall short, we arrive at the star of the show.
Query Understanding focuses on queries, rather than documents.
More precisely, I would define Query Understanding as trying to discern the intent of a searcher. As we learned earlier, language, and therefore queries, are not necessarily sufficient to produce good results.
The holy grail of QU looks something like this. The user gives you plain text. You are able to discern what all the pieces are and what they mean.
Extra points if you can turn this into truly structured data by mapping these pieces to entities that your search engine understands.
If you zoom out a bit, you can think of Query Understanding as a process by which you map queries onto a set of facts you understand. When the user says “dress shirt”, we know they are interested in a type of shirt with a vertical set of buttons. We know that shirts are a type of clothing. So you can think of these facts as expressing relationships - they naturally make a graph. A knowledge graph.
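As a sketch of that idea, the is-a facts could be as simple as a lookup table you can walk. The entity names here are just illustrative.

```python
# is_a facts as a tiny lookup: entity -> the concept it's a kind of.
IS_A = {
    "dress shirt": "shirt",
    "shirt": "clothing",
    "dress": "clothing",
}

def ancestors(entity):
    """Walk up the is_a chain: everything this entity is a kind of."""
    chain = []
    while entity in IS_A:
        entity = IS_A[entity]
        chain.append(entity)
    return chain
```

So a query mapped to "dress shirt" also tells you the user wants a shirt, and more broadly, clothing.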
If you are mapping documents onto the same set of facts, you’ve built a bridge from your corpus to your users, based on knowledge. If you only return documents that correspond to the facts, you’ve achieved a baseline of relevance.
Before you go back to your office and tell your boss that I suggested that the solution to all your problems is a generalized knowledge graph service written in Rust, hold up.
Knowledge graphs are a real technique you can use to power query understanding, but they are also a good metaphor we can use to wrap our heads around the query understanding problem. At this point I’m just using KGs as an illustration. While KGs are useful, I think the vast majority of folks can solve their problems more simply.
So yeah, Knowledge graphs as a metaphor. Roll with me for a minute.
Now, the tricky thing about knowledge is that there’s an awful lot of it.
Google’s KG encoded 70 billion facts as of October 2016. That’s probably a tiny fraction of human understanding, and they’ve been working on it for something like a decade. Chances are good that you don’t have the resources to even chip away at that figure.
Which puts us in an interesting predicament. The way to make search smarter is to encode facts or semantic knowledge. But there's so much of it that it's impractical for us to encode even a fraction of the knowledge out there. We have to make tradeoffs. Even Google has to prioritize.
Chances are good your distribution of queries looks something like this. There’s some head of queries like “harry potter” that are issued very frequently and make up a ton of your traffic. Then there’s this long tail of queries that are only ever issued a handful of times, like “ithaca is gorgeous t-shirt red”.
A natural place to start is by looking at the head of this distribution, and encoding the minimum set of facts necessary for those queries. What I mean by that is do whatever it takes for your search engine to actually understand them and achieve some baseline of relevance, and we’ll talk about that in more detail for the remainder of this talk.
After you’ve tackled the head, I would suggest trying to find the situations where users aren’t successful. Like this fanny pack example.
Regardless of how you proceed, I think the most important thing is you should be using data as a guide to figure out what problems to solve, and thus what knowledge to encode in your search engine. I think this is sort of foundational to the query understanding approach.
So for the bulk of this talk, we’re going to talk about what sorts of problems you might run into as you closely scrutinize your query logs, and how you can use query understanding to solve them.
We’re going to look at two categories of problems - precision and recall. I could show you some math or a confusion matrix, but you know, that can get confusing.
I like to think of precision problems as cases when you are getting a ton of irrelevant results, and recall problems cases where you are simply not getting enough results.
We’re going to look at 3 examples of each, and talk about some possible solutions from a QU perspective.
Let's talk about precision problems. You can usually find these by finding cases where there are low click rates, or just abandoned searches altogether.
It's late 2016 and you just got home from the Beyonce concert. You are so hopped up on Lemonade you start shopping for the perfect Bey-inspired accessories. Except when you search for "formation", as in "formation world tour", you get medium format cameras.
Searching for Beyonce doesn't help. Two rows down, the results are dominated by "beyond the beach" apparel.
You make one last ditch effort searching for Solange.
Congratulations, you've just discovered the Knowles family Bermuda Triangle. The problem is we have some very aggressive stemming happening.
The deeper problem is that our search engine can’t distinguish between normal words that should be stemmed, and proper nouns that should be searched for as-is.
The obvious solution is to just make exceptions when you run into proper nouns, i.e. add “beyonce” to the “do not stem list”.
Am I seriously suggesting you manually compile a list of proper nouns that shouldn’t be stemmed?
https://gist.github.com/giokincade/72bb449309b2bea9314c95acf26f1d78
Yes. Look, your users don't care how smart you are. It doesn't matter how high-tech your solution is. I'm fairly certain most of y'all can go back and review the head queries in your search engine and compile a reasonable list of proper nouns that shouldn't be stemmed.
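Consulting that list at analysis time is trivial. Here's a minimal Python sketch: the stem() function is a toy suffix-stripper standing in for whatever real stemmer your engine uses (e.g. Porter), and the protected set is the hypothetical list you'd compile.

```python
# Hypothetical hand-compiled "do not stem" list of proper nouns.
PROTECTED = {"beyonce", "solange", "formation"}

def stem(token):
    # toy suffix-stripper standing in for a real stemmer
    for suffix in ("ation", "ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(token):
    token = token.lower()
    # proper nouns pass through untouched; everything else gets stemmed
    return token if token in PROTECTED else stem(token)
```

With the list in place, "formation" stays "formation" instead of collapsing to "form" and colliding with medium format cameras.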
The next step in sophistication is to harvest a list from some data. Maybe you have an “Artist” field in your database you can grab the unique values from.
Or maybe you can harvest an external source of the data. For instance, Wikipedia has a list of R&B artists you could use since it seems that your search engine is having a tough time with them.
https://en.wikipedia.org/wiki/List_of_R%26B_musicians
The most sophisticated solution is to use a part-of-speech tagger or a named-entity-recognition system. You don’t necessarily have to train your own - there are a bunch of trained models and services out there you can leverage.
Sounds easy in principle, but most NLP systems and models are trained on English prose and come to depend heavily on the nuances of language that's organized into grammatically correct sentences, like, you know, capitalizing the proper nouns. Here's Google's NLP API failing miserably at figuring out that Beyonce is a thing.
https://cloud.google.com/natural-language/
Rather than trying to use an existing POS tagger on your queries, it might be more successful to run your documents, which tend to contain longer, more natural language, through it, use that to generate a list of candidate entities, and possibly have a human review those to make sure they make sense.
Let’s look back at this problem - we have t-shirt dresses showing up for the query “dress shirt”.
The trouble here is that the user intends “dress shirt” as a phrase but our search engine doesn’t understand that. Another name for phrases is n-grams.
Again the easiest solution is to just create a list of phrases that should always be respected.
The next solution is to use the data in your query logs and corpus and some simple heuristics to identify potential phrases.
For example, you could use pointwise mutual information (PMI), a measure of how related two concepts or events are. The idea is pretty simple: measure the probability of "dress shirt" appearing together in your query logs, divide that by the product of the probabilities of each word appearing independently, and take the log of that ratio. The larger the PMI, the more likely the tokens should be treated as a phrase.
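A sketch of that PMI calculation over a query log. The log here is made up; a real one would have counts attached rather than one row per issued query.

```python
import math
from collections import Counter

# Hypothetical query log.
queries = [
    "dress shirt", "dress shirt blue", "red dress", "shirt", "dress",
    "dress shirt slim", "summer dress", "dress shirt white",
]

unigrams, bigrams = Counter(), Counter()
for q in queries:
    tokens = q.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

total_uni = sum(unigrams.values())
total_bi = sum(bigrams.values())

def pmi(a, b):
    # log of P(a, b together) over P(a) * P(b) independently
    p_ab = bigrams[(a, b)] / total_bi
    p_a = unigrams[a] / total_uni
    p_b = unigrams[b] / total_uni
    return math.log(p_ab / (p_a * p_b))
```

Positive PMI means the pair co-occurs more than chance would predict, which is evidence for treating it as a phrase.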
One technique that's very popular is to check if a phrase is a thing on Wikipedia. Some of the most effective heuristic approaches to the problem combine this with statistical co-occurrence figures like PMI.
The most sophisticated approach uses machine learning.
You can conceptualize this as a binary classification problem. Given two tokens, should we treat them as a phrase? Yes/No?
Often folks will take some of the heuristics that we described earlier, like PMI or presence on Wikipedia, and use them as model features.
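Here's a hand-wavy sketch of that setup: a logistic scorer over a PMI feature and a (hypothetical) Wikipedia-title lookup. The weights and bias are made up for illustration; in practice you'd fit them with logistic regression on labeled (token pair, is-phrase) examples.

```python
import math

# Hypothetical lookup of known Wikipedia titles.
WIKIPEDIA_TITLES = {"dress shirt", "fanny pack"}

# Made-up weights; in practice learned from labeled examples.
WEIGHTS = {"pmi": 0.8, "in_wikipedia": 2.0}
BIAS = -2.5

def is_phrase(a, b, pmi_score):
    feats = {
        "pmi": pmi_score,
        "in_wikipedia": 1.0 if f"{a} {b}" in WIKIPEDIA_TITLES else 0.0,
    }
    score = BIAS + sum(WEIGHTS[k] * v for k, v in feats.items())
    return 1 / (1 + math.exp(-score)) > 0.5  # sigmoid, then threshold
```

"dress shirt" clears the threshold because both features fire; "red dress" has decent PMI but no Wikipedia hit, so it doesn't.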
If you're looking for research on the topic, you'll find that the literature refers to this problem as query segmentation, because it conceptualizes the task in the negative - we're not deciding if two tokens are a phrase, we're deciding if they should be separated. I think that's tremendously confusing, but it makes sense when you think about this as a step in the direction of that holy grail we talked about earlier. First you understand what the pieces are, then you decide what they mean.
You can start as simply as logistic regression, but sequence models like Conditional Random Fields (CRFs) or Recurrent Neural Networks (RNNs) are also popular. This is a diagram of a linear chain CRF, where we are tagging each token as either beginning a segment, in green, or inside of a segment, in pink.
For our last precision problem, I searched for dress on Etsy. You’ll notice that the highlighted result is not a dress - it’s a pattern that you can print out and use to sew a dress.
One thing you may not realize about Etsy is that you can find both finished goods - like handmade dresses, but you can also find crafting supplies that you can use to make your dresses, clothing, and all sorts of other DIY projects.
The trouble is the keyword "dress" is mentioned by items of both types. In the abstract, it's ambiguous. But you and I both know that when the user searched for "dress", they probably wanted to see the finished variety.
There are many solutions to this problem, but one is called query classification.
Query classification just means taking a query and mapping it to some categorization scheme.
Usually we're talking about fairly broad categories to start. Most search engines naturally have a few different categories people often call "document types" (which was until recently a concept enshrined in Elasticsearch). Audio vs Video. News vs Websites, etc. These types usually have different schemas, target audiences, use-cases, context, etc. So it's natural to start by classifying queries into these broad buckets.
But it's not uncommon for folks to use a much more granular set of classes that are often arranged in a hierarchical taxonomy. So for instance, rather than classifying things as finished goods or craft supplies, you could literally classify them as "clothing", "womens", "dresses".
If the broader buckets can solve your problem, I would start there. The more granular you get, the harder the task gets. In this case, just separating finished goods from craft supplies will work just fine.
Again the easiest way to get started is to leverage your domain expertise to compile a list of mappings from queries to categories. This can take the form of literal mappings or heuristic rules like “only queries that include fabrics or patterns should be treated as craft supplies”.
The next step in sophistication is harnessing some simple heuristics for query categories.
Even though the results for “dress” includes a bunch of craft supplies, if you look at what people click on, I bet they are all finished goods. You can use this intuition to calculate a probability, though I would caution that you probably only wanna do this for queries with a reasonable amount of traffic and clicks.
Lexical similarity is another useful heuristic. If a query is an exact match for some category name, or very close to it, there’s a good chance that’s the category the user is interested in.
And finally, you can of course, treat this as a multi-class classification problem.
So you show a model a bunch of examples of queries, and it decides which buckets to place the query in.
A natural extension of the probability ideas we just talked about is to use a Naive Bayes classifier. Instead of talking about the probability of a class given a query, you’ll combine the probability of a class given each token in a query.
This is a simple way to get started and it’s probably a reasonable base-line, and it’s easily interpretable.
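A from-scratch sketch of that token-level Naive Bayes idea, with hypothetical labeled queries and add-one smoothing so unseen tokens don't zero out a class.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled queries.
training = [
    ("red dress", "finished_goods"),
    ("summer dress", "finished_goods"),
    ("dress shirt", "finished_goods"),
    ("dress pattern", "craft_supplies"),
    ("cotton fabric", "craft_supplies"),
    ("sewing pattern", "craft_supplies"),
]

class_counts = Counter()
token_counts = defaultdict(Counter)
vocab = set()
for query, label in training:
    class_counts[label] += 1
    for tok in query.split():
        token_counts[label][tok] += 1
        vocab.add(tok)

def classify(query):
    scores = {}
    for label in class_counts:
        # log P(class) + sum over tokens of log P(token | class)
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(token_counts[label].values())
        for tok in query.split():
            score += math.log(
                (token_counts[label][tok] + 1) / (total + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)
```

Because the per-token log probabilities are right there, you can always inspect exactly why a query landed in a bucket - that's the interpretability win.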
The state of the art gets much more sophisticated. In this example researchers used a convolutional neural network to turn word embeddings from word2vec into query embeddings, and then used that as input to a tree-based classifier. Interesting stuff, but suffice to say I don’t think you need this level of sophistication to get somewhere.
http://people.cs.pitt.edu/~hashemi/papers/QRUMS2016_slides.pdf
Ok, let’s think about some recall problems. The way to get started finding these is just look for cases where there are few results.
Now let’s go back to our fanny pack example.
The problem here is that the words the user knows aren't the words your inventory uses. There's a nomenclature mismatch.
If we think back to our Knowledge graph metaphor, what we need to do is encode the idea that there is a thing called fanny pack, and it’s another name for a bum bag. I.e. it’s a synonym.
I’m fairly certain all of you folks have had to deal with synonyms before, and the most straightforward solution is to just manually add synonyms to your system.
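At its simplest, that's a hand-compiled map consulted at query time. The entries here are hypothetical, and a real engine would usually do this inside its analysis chain rather than by string replacement, but the idea is the same.

```python
# Hand-compiled synonym map (hypothetical entries).
SYNONYMS = {"fanny pack": ["bum bag", "waist bag"]}

def expand(query):
    # return the original query plus one variant per synonym
    variants = [query]
    for term, alternates in SYNONYMS.items():
        if term in query:
            variants += [query.replace(term, alt) for alt in alternates]
    return variants
```

Now a search for "red fanny pack" also looks for "red bum bag", and Flipkart's inventory is findable again.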
The next thing you can do is leverage existing synonym data.
In this case wikipedia was more useful, but you get the idea.
For instance, you might try using word2vec on your corpus or queries and then find words that are close to each other in that embedding space. It’s hard to see but in this case we’re looking at the word “probable” and the closest neighbor is “likely” which is a decent synonym.
High quality synonym extraction is a tremendously hard problem and I don’t foresee you’ll have a ton of success here without human intervention. So this might be another instance where you can leverage semi-supervised techniques like word2vec on your corpus to discover synonyms or abbreviations that are specific to your domain, and then have a human review them to ensure they are high quality.
https://gist.github.com/giokincade/45b68d9042849f30d525a1884d3e54fd
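The nearest-neighbor lookup behind this can be sketched with toy vectors. In practice the vectors would come from word2vec trained on your corpus or query logs; the numbers here are made up so the example is self-contained.

```python
import math

# Toy embeddings -- in practice these come from word2vec.
vectors = {
    "probable": [0.9, 0.1, 0.2],
    "likely":   [0.85, 0.15, 0.25],
    "dress":    [0.1, 0.9, 0.3],
    "shirt":    [0.15, 0.8, 0.35],
    "nascar":   [0.2, 0.1, 0.95],
}

def cosine(a, b):
    def norm(v):
        return math.sqrt(sum(x * x for x in v))
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))

def nearest(word, k=1):
    # rank every other word by cosine similarity to this one
    scored = sorted(
        ((cosine(vectors[word], vec), other)
         for other, vec in vectors.items() if other != word),
        reverse=True,
    )
    return [w for _, w in scored[:k]]
```

The candidates this surfaces are exactly what you'd hand to a human reviewer.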
Now for our last recall example, I turned to the Digital Public Library of America, which aggregates digital collections from libraries all over the US. Here I'm searching for "red nascar"; there are very few results, and almost none of them are red.
I realized after the fact that this example must paint me as some weird Beyonce-loving Trump voter. I swear I have never watched NASCAR; I have no idea how I happened on this search.
Anyway, there's plenty of NASCAR stuff on the site. The first few results are even red.
Could this guy get any redder?
So this highlights a problem that receives very little attention in talks and papers, but that I’ve found to be tremendously critical and common. Your user is searching based on some facet or attribute, and your documents simply don’t have that information. In this case, we don’t know which of our photographs are red.
If you think back to our knowledge graph diagram, the trouble is that even if we understand color queries, we don’t understand the color of our documents, so we can’t build this bridge.
A lot of folks have documents that are created by humans - ecommerce shops, blogs, newspapers, etc. If you fall into that category, just ask them to provide this information when they upload content into your system. Often you’ll get plenty of data just by emailing folks and asking them to update their content, if you tell them it will affect search.
Depending on the information that you need, there might be some fairly straightforward heuristics for extracting decent structured data.
If you're looking for color information, there are a bunch of well-known algorithms, or pre-trained models, for extracting palettes from images. This still leaves the task of mapping RGB values to human-readable strings, which can be non-trivial depending on how specific you want to get, but you get the idea.
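Mapping an extracted RGB value to a name can be as simple as nearest-distance matching against a small hand-picked palette. The palette values here are made up; a real system might use the full set of CSS named colors, and a perceptual color space would beat raw RGB distance.

```python
# Small hand-picked palette (values are illustrative).
PALETTE = {
    "red":    (220, 20, 60),
    "blue":   (30, 60, 200),
    "green":  (30, 160, 60),
    "yellow": (240, 220, 40),
    "black":  (20, 20, 20),
    "white":  (245, 245, 245),
}

def color_name(rgb):
    # nearest palette entry by squared Euclidean distance in RGB space
    r, g, b = rgb
    def dist(name):
        pr, pg, pb = PALETTE[name]
        return (r - pr) ** 2 + (g - pg) ** 2 + (b - pb) ** 2
    return min(PALETTE, key=dist)
```

Run that over the dominant palette colors of each photograph and suddenly "red nascar" has something to match against.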
Algolia actually has a pretty chill command-line utility you can try as well:
https://github.com/algolia/color-extractor
Sometimes the information you need is actually available but it’s often buried in unstructured data. In this example from Etsy you can see sizing information buried in tags. A gentle application of regex can get you pretty far here.
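A sketch of that regex approach, with hypothetical tag strings and a small, made-up size vocabulary:

```python
import re

# Hypothetical tags with sizing buried in free text.
tags = ["vintage dress size 8", "blue shirt size xl", "cotton fabric"]

# match "size" followed by a letter size or a 1-2 digit number
SIZE_PATTERN = re.compile(
    r"\bsize\s+(xxl|xl|xs|s|m|l|\d{1,2})\b", re.IGNORECASE
)

def extract_sizes(tag_list):
    return [m.group(1).upper() for tag in tag_list
            for m in SIZE_PATTERN.finditer(tag)]
```

Anchoring on the word "size" keeps random numbers in titles from polluting the extracted field.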
And of course, the hardest path is to train some models to extract the metadata for you.
CNNs are a natural choice for image processing - so you might train a CNN to place images into one of a number of color buckets, though this task in particular is sort of beneath their expressive power. You probably want to start by transfer learning from popular pre-trained models like Google's Inception v3.
By now, I hope you’ve noticed some patterns.
The easiest and most accurate thing to do is ask humans to solve your problems. Whether it’s hand-tuning synonyms, building a list of proper nouns, or deciding on the best color tags for an image, humans are remarkably effective. This is where you should start.
Humans are great for helping the head. And remember that a lot of the queries in the long-tail are variations of the head queries. So some human contributions, like synonyms and entities, will help there.
The next step in sophistication is using data to solve your query understanding problems. This family of techniques is usually fairly effective, and has the benefit of being easy to understand and implement.
And obviously, the most sophisticated approaches leverage machine learning.
But you can’t have humans review every query in your system, and they are unlikely to craft heuristics for interpreting queries that will scale. You’ll usually need data-driven schemes or ML to start having some impact on the long-tail.
Typically you'll see a recurring pattern for how human and AI systems interact in the query understanding process: using data or ML to produce candidate pieces of knowledge to add to your system, and then asking experts to review those candidates. So for instance you might use a POS tagger to identify entities, or word2vec to identify potential synonyms.
Now before we wrap up, it’s important for you to realize that Query Understanding is a high stakes game.
If I search for “clutch” and I meant the fashion variety, but you show me car parts, that’s a catastrophic failure. This means you need to have high confidence in your interpretations. You probably want to tune your models for precision over recall.
Which brings me to the intersection of Query Understanding and the user experience. Thus far we’ve been hyperfocused on results. Those are important. But you can use QU to improve your user experience as well, which is tremendously useful because the stakes can be much lower.
If we have some inkling that your query is about women’s hand bags, for instance, but we’re not 100% sure, we can recommend that category in auto suggest. If you get this wrong, you haven’t completely ruined the user’s experience. If you get it right, chances are good the user will select it and get better results.
And really, there's a whole spectrum of options available to you in terms of how you leverage query understanding. You could start with adding stuff to autosuggest, adjusting faceting, or suggesting a refinement. On the high-confidence end of the spectrum, you could give a strong preference to results that match your query understanding results without necessarily restricting the result set.
You can imagine a similar spectrum for recall problems.
I hope this has been helpful. This is really just the tip of the iceberg. There’s so much more I’d love to talk about.
What I really want you to take away is that QU is about achieving a baseline of relevance.
Query Understanding techniques accomplish that by trying to discern user intent. Focus less on documents. Focus more on queries.
You get there by closely scrutinizing your data, and figuring out what knowledge your search engine needs to understand.
One last shameless plug before I go. The name of my company is Related Works. We consult and advise companies to help them with their search and data problems. If that sounds interesting to you, please get in touch. I'm @giokincade on Twitter, Medium, and LinkedIn. There's my email. If you wanna know more about us visit www.relatedworks.io, or check out our blog on Medium. I'm in the middle of a long series about autosuggest that may be interesting to you.