Hear from Lucidworks Senior Solutions Consultant Ted Sullivan about how you can leverage Apache Solr and Lucidworks Fusion to improve semantic awareness of your search applications.
3. Building Better Search Applications
Search is about Technology & Language
• These are difficult but also different problems
• Solving the “language problem” requires that we understand
how language is used in search
• We understand language at the semantic level - where
“meaning” or intent lives
• Search Engines deal with language at the syntactic level
• Most problems relating to search quality stem from this basic
“disconnect” – the “what” vs “what words” dichotomy
4. Technology – Horizontal Concerns
Search applications share these requirements with other
information retrieval systems
• Performance – returning results in HTT (Human Tolerable Time)
• Scalability – being able to search “billions and billions” of documents,
serving thousands or tens of thousands of users at a time
• Reliability – fault tolerance, fail-over, redundancy
• Maintainability – easy to upgrade, search index can be kept current
in the face of rapidly changing content.
• Usability – user experience (UI and UX) is critical to success.
Mobile technology is a game changer here!
5. Language – Vertical Concerns
These requirements are more specific to search systems.
• Accuracy – returning the “correct” results.
• Precision – few false positives
• Recall – few false negatives
• Relevance – returning the “best” results at the top
Returning the wrong results very fast is not
necessarily a good thing. Returning too many
results can affect performance.
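Precision and recall can be made concrete with a small calculation. The sketch below assumes we already have relevance judgments for one query; the document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs the engine returned; relevant: IDs judged correct.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: share of returned documents that are relevant (few false positives).
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were returned (few false negatives).
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result set: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in the collection.
p, r = precision_recall(["d1", "d2", "d3", "d9"],
                        ["d1", "d2", "d3", "d4", "d5", "d6"])
print(p, r)  # 0.75 0.5
```

Relevance is a separate question: it concerns the order of the returned documents, which this calculation ignores.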
6. Time flies like an arrow
Fruit flies like a banana
Our mental image for the second sentence depends
on how we “parse” it. It depends on what the subject
noun or noun phrase is.
7. The subject can be “fruit” or “fruit flies”. This
decision changes the verb which is either “flies”
or “like” respectively.
[Fruit] flies like a banana (subject “fruit”, verb “flies”)
[Fruit flies] like a banana (subject “fruit flies”, verb “like”)
8. We can do this because we know that both “fruit” and
“fruit flies” represent single concepts – even though
“fruit flies” is two words – i.e. a “noun phrase”.
[Fruit] flies like a banana
[Fruit flies] like a banana
9. Search algorithms
and semantics
Tokenization plus vector mathematics
(TF/IDF or one of its cousins) – “bag-of-words”
Algorithmic tweaks – enhanced bag-of-words:
1. Some fields are more relevant than others
2. Hitting on more terms in the query is better than
hitting on fewer (token scores are summed)
3. The nearer the query terms are to each other in the
document the better – same order as query is best
4. Getting 0 results provides no feedback – OR is safer
than AND (bullet 2 already gives us a “fuzzy” AND)
Problem: Search engines don’t
understand semantics
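To see why bag-of-words scoring misses meaning, here is a minimal sketch of a simplified TF-IDF scorer. It is not Lucene's actual similarity formula, and the three documents are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical three-document collection.
DOCS = {
    "d1": "fruit flies like a banana",
    "d2": "time flies like an arrow",
    "d3": "a banana is a fruit",
}


def tokenize(text):
    """Split text into lowercase tokens (no stemming, no phrase detection)."""
    return text.lower().split()


def tfidf_scores(query, docs):
    """Score every document by summing tf * idf over the query tokens."""
    bags = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(bags)
    scores = {}
    for doc_id, bag in bags.items():
        total = 0.0
        for term in tokenize(query):
            tf = bag[term]  # term frequency in this document
            df = sum(1 for other in bags.values() if term in other)
            if tf:
                total += tf * math.log(1 + n_docs / df)  # token scores are summed
        scores[doc_id] = total
    return scores


scores = tfidf_scores("fruit flies", DOCS)
print(scores)
```

The document about insects (d1) ranks first only because it contains both tokens. The document about time (d2) and the one about fruit (d3) receive the same non-zero score, since each token is scored independently and the scorer has no idea that “fruit flies” is one concept.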
10. Better Search: Detecting Noun Phrases
Can algorithms be used to detect noun phrases?
Yes, but not perfectly and may need too much
CPU at query-time
Another way is to use knowledge bases – a lot of
extra work, but in some cases – we already have
one - the search index itself!
11. Better Search: Detecting Noun Phrases
The basic technique is called “autophrasing” –
recognizing when more than one word
represents just one thing.
Autophrasing – uses an extra knowledge-base
file “autophrases.txt”
Query Autofiltering – uses the phrases that are
stored as metadata values in the index.
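The idea behind autophrasing can be sketched in a few lines. This is an illustration of the technique (longest known phrase becomes one token), not the actual Lucene token filter; the phrase list stands in for the contents of autophrases.txt.

```python
# Stand-in for the contents of an autophrases.txt file.
PHRASES = ["new york", "new york city", "new york state", "big apple", "empire state"]


def autophrase(tokens, phrases):
    """Replace known multi-word phrases with single underscore-joined tokens.

    The longest match wins, so "new york city" is preferred over "new york".
    """
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word phrases.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in phrase_set:
                out.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here: emit the single token unchanged.
            out.append(tokens[i].lower())
            i += 1
    return out


print(autophrase("I heart the big apple".split(), PHRASES))
# ['i', 'heart', 'the', 'big_apple']
```

Once “big apple” is a single token, a query for “apple” alone no longer matches it by accident.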
12. Multi-term Synonym Problem
This subject was inspired by an old JIRA ticket: LUCENE-1622
“if multi-word synonyms are indexed together with the original
token stream (at overlapping positions), then a query for a partial
synonym sequence (e.g., ‘big’ in the synonym ‘big apple’ for
‘new york city’) causes the document to match”
(or “apple”, which will hit on my blog post if you crawl lucidworks.com!)
13. Sausagization
From Mike McCandless's blog Changing Bits: “Lucene's TokenStreams are actually graphs!”
• This means certain phrase queries should match but don't (e.g.: "hotspot is down"), and other phrase
queries shouldn't match but do (e.g.: "fast hotspot fi").
• Other cases do work correctly (e.g.: "fast hotspot"). We refer to this "lossy serialization" as sausagization,
because the incoming graph is unexpectedly turned from a correct word lattice into an incorrect sausage.
• This limitation is challenging to fix: it requires changing the index format (and Codec APIs) to store an
additional int position length per position, and then fixing positional queries to respect this value.
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
14. Multi-term Synonym Demo
autophrases.txt
new york
new york state
empire state
new york city
new york new york
big apple
ny ny
city of new york
state of new york
ny state
synonyms.txt
new_york => new_york_state, new_york_city, big_apple, new_york_new_york, ny_ny, nyc, empire_state, ny_state, state_of_new_york
new_york_state, empire_state, ny_state, state_of_new_york
new_york_city, big_apple, new_york_new_york, ny_ny, nyc, city_of_new_york
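Because autophrasing turns each phrase into one token, the synonym rules above become plain single-token mappings. The sketch below reads those three rules under a simplified interpretation of the Solr synonym file format (explicit `=>` mappings plus comma-separated equivalence groups); it is not Solr's own parser.

```python
RULES = [
    "new_york => new_york_state, new_york_city, big_apple, new_york_new_york, "
    "ny_ny, nyc, empire_state, ny_state, state_of_new_york",
    "new_york_state, empire_state, ny_state, state_of_new_york",
    "new_york_city, big_apple, new_york_new_york, ny_ny, nyc, city_of_new_york",
]


def split_terms(text):
    """Split a comma-separated list into trimmed, non-empty terms."""
    return [term.strip() for term in text.split(",") if term.strip()]


def parse_synonyms(lines):
    """Build a token -> expansion-list map from synonym rules."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=>" in line:
            # Explicit mapping: left-hand terms are replaced by right-hand terms.
            left, right = line.split("=>", 1)
            sources, targets = split_terms(left), split_terms(right)
        else:
            # Equivalence group: every term expands to the whole group.
            sources = targets = split_terms(line)
        for source in sources:
            expansion = mapping.setdefault(source, [])
            for target in targets:
                if target not in expansion:
                    expansion.append(target)
    return mapping


SYNONYMS = parse_synonyms(RULES)
print(SYNONYMS["empire_state"])
# ['new_york_state', 'empire_state', 'ny_state', 'state_of_new_york']
```

The ambiguous token new_york expands to both the state and the city terms, while empire_state stays within the state group and big_apple within the city group.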
15. Multi-term Synonym Demo
This document is about new york state.
This document is about new york city.
There is a lot going on in NYC.
I heart the big apple.
The empire state is a great state.
New York, New York is a hellova town.
I am a native of the great state of New York.
Queries shown: New York, New York City, New York State
Request handlers: /select, /autophrase
16. Multi-term Synonym Demo
This document is about new york state.
This document is about new york city.
There is a lot going on in NYC.
I heart the big apple.
The empire state is a great state.
New York, New York is a hellova town.
I am a native of the great state of New York.
Query shown: Empire State
Request handlers: /select, /autophrase
17. Query Autofiltering
Content Tagging and Intelligent Query
Filtering. Using the search index itself
as the knowledge source:
(Diagram labels: Content, Content Tagging, Search Index, Query, Auto Filtering, The Answer)
18. Lucene FieldCache “In Action”
Standard “Inverted Index” (Lucene itself):
• Show all documents that have this term value in this field
• Used to get initial set of search result IDs
Uninverted or Forward Index (FieldCache):
• Show all term values that have been indexed in this field
• Can look up the term value for a doc ID
• Used to facet and get display values for doc IDs.
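The two directions of lookup can be illustrated with plain dictionaries. This is a conceptual sketch only; it does not use or imitate the Lucene FieldCache API, and the documents and field names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical indexed documents: doc ID -> string field values.
DOCS = {
    1: {"color": "red", "product_type": "socks"},
    2: {"color": "white", "product_type": "dress shirt"},
    3: {"color": "red", "product_type": "dress shirt"},
}


def build_inverted(docs, field):
    """term value -> set of doc IDs (the question an inverted index answers)."""
    inverted = defaultdict(set)
    for doc_id, fields in docs.items():
        inverted[fields[field]].add(doc_id)
    return inverted


def build_forward(docs, field):
    """doc ID -> term value (the question an uninverted view answers)."""
    return {doc_id: fields[field] for doc_id, fields in docs.items()}


color_inverted = build_inverted(DOCS, "color")
type_forward = build_forward(DOCS, "product_type")

# Inverted index: which documents have color "red"?
hits = sorted(color_inverted["red"])
# Forward index: facet the hits by product type.
facet_counts = Counter(type_forward[doc_id] for doc_id in hits)
# Forward index: all values ever indexed in the field.
all_values = sorted(set(type_forward.values()))
print(hits, dict(facet_counts), all_values)
```

The last lookup, listing every value indexed in a field, is the one that makes the index usable as a knowledge source.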
19. Query Autofiltering Implementation
Use Lucene FieldCache to build a map of field values
to field names (of string fields)
Add synonym mappings from synonyms.txt and
stemming to this value(s) -> field(s) map
Use this map to discover noun phrases in the query
that correspond to field values in the index – longest
contiguous phrase wins
Build filter or boost queries based on these
discovered mappings
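These steps can be sketched as follows. This is a minimal illustration, not the actual Solr search component: the documents are hypothetical, `naive_stem` is a crude stand-in for a real stemmer, and each value maps to a single field (a later field silently replaces an earlier one) rather than to several.

```python
# Hypothetical indexed documents with string fields.
DOCS = [
    {"brand": "Red Lion", "color": "red", "product_type": "socks"},
    {"color": "white", "product_type": "dress shirt"},
    {"color": "red", "product_type": "Lounge Chair"},
]
# Hypothetical synonym mappings: alias -> indexed value.
SYNONYMS = {"scarlet": "red", "chaise lounge": "lounge chair"}


def naive_stem(token):
    """Crude plural stripping; a stand-in for a real stemmer."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def normalize(text):
    """Lowercase, split and stem text into a tuple usable as a map key."""
    return tuple(naive_stem(t) for t in text.lower().split())


def build_value_map(docs, string_fields, synonyms):
    """Map normalized field values (and their synonyms) to (field, value)."""
    value_map = {}
    for fields in docs:
        for field in string_fields:
            if field in fields:
                value_map[normalize(fields[field])] = (field, fields[field])
    for alias, canonical in synonyms.items():
        key = normalize(canonical)
        if key in value_map:
            value_map[normalize(alias)] = value_map[key]
    return value_map


def autofilter(query, value_map):
    """Find field values in the query; the longest contiguous phrase wins."""
    tokens = list(normalize(query))
    matches, leftover, i = [], [], 0
    while i < len(tokens):
        for n in range(len(tokens) - i, 0, -1):
            hit = value_map.get(tuple(tokens[i:i + n]))
            if hit:
                matches.append(hit)
                i += n
                break
        else:
            leftover.append(tokens[i])  # token matches no field value
            i += 1
    return matches, leftover


VALUE_MAP = build_value_map(DOCS, ["brand", "color", "product_type"], SYNONYMS)
print(autofilter("Red Lion socks", VALUE_MAP))
# ([('brand', 'Red Lion'), ('product_type', 'socks')], [])
```

Because the longest phrase wins, “Red Lion” is recognized as a brand before “red” can be taken as a color.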
22. Query Autofiltering – Basic Behavior
q = red socks -> fq=color:red&fq=product_type:socks
or bq=(color:red AND product_type:socks)^20
q = Red Lion socks -> fq=brand:"Red Lion"&fq=product_type:socks
q = scarlet Chaise Lounge -> color:red AND product_type:"Lounge Chair"
q = white dress shirts -> color:white AND product_type:"dress shirt"
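Once field/value pairs have been discovered, turning them into filter or boost queries is mostly string building. The helper names below are hypothetical, and the sketch quotes multi-word values but does not escape other special query characters.

```python
def field_clause(field, value):
    """Format one field:value clause, quoting multi-word values."""
    return f'{field}:"{value}"' if " " in value else f"{field}:{value}"


def as_filter_queries(matches):
    """One fq per discovered mapping: strict, results must match every filter."""
    return [field_clause(field, value) for field, value in matches]


def as_boost_query(matches, boost=20):
    """One bq over all mappings: matching documents rank higher, others remain."""
    clauses = " AND ".join(field_clause(field, value) for field, value in matches)
    return f"({clauses})^{boost}"


matches = [("brand", "Red Lion"), ("product_type", "socks")]
print(as_filter_queries(matches))  # ['brand:"Red Lion"', 'product_type:socks']
print(as_boost_query(matches))     # (brand:"Red Lion" AND product_type:socks)^20
```

Filtering favors precision, while boosting keeps recall and only reorders the results.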
23. Dealing With “Unstructured” Text
This term ITSELF is evidence that we think of language as
unstructured when we know that it actually is not. It HAS to have
structure or we couldn’t communicate very well.
“The Lady Is A Tramp” vs “Lady And The Tramp”
Dealing with unstructured text means better handling of phrases.
Little words – like “if” – can have big meaning!
24. Classification Technologies
Machine Learning
• Automated vs Semi-Automated
Natural Language Processing (NLP)
• Parts Of Speech
Taxonomy / Ontology
• Relationships
• Handles Phrases naturally
• Knows what is what and what is related to what!
25. Ontologies Designed for Search
Category Nodes – ‘parent’ nodes
that can have child nodes,
including:
• Sub Categories
• Evidence Nodes
Evidence Nodes – tend to be leaf
nodes (with no children) and contain
key terms (synonyms)
• May contain “rules”, e.g. (if contains term a and
term b but not term c)
• Evidence Nodes can have more than one
category node parent
Hits on Evidence Nodes add to the cumulative score of a Category Node.
Scores can be diluted as they traverse the graph – so that the nearest
category gets the strongest ‘vote’.
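The voting scheme can be sketched as a walk up the graph with a decay factor. The ontology fragment, node names and the 0.5 decay below are assumptions chosen for illustration; matching is a crude substring count, and the graph is assumed to be acyclic.

```python
from collections import defaultdict

# Hypothetical ontology fragment: node -> parent category nodes.
PARENTS = {
    "Investment Banks": ["Financial Services"],
    "Commercial Banks": ["Financial Services"],
    "ev_investment_bank": ["Investment Banks"],
    "ev_bank": ["Investment Banks", "Commercial Banks"],  # two parents
}
# Evidence nodes and their key terms.
EVIDENCE_TERMS = {
    "ev_investment_bank": ["investment bank", "underwriting"],
    "ev_bank": ["bank"],
}


def score_categories(text, decay=0.5):
    """Accumulate evidence hits into category scores, diluted per level."""
    text = text.lower()
    scores = defaultdict(float)
    for node, terms in EVIDENCE_TERMS.items():
        hits = sum(text.count(term) for term in terms)
        if not hits:
            continue
        # The nearest categories receive the full vote.
        frontier = [(parent, float(hits)) for parent in PARENTS.get(node, [])]
        while frontier:
            category, vote = frontier.pop()
            scores[category] += vote
            # Each step further up the graph dilutes the vote.
            frontier.extend((p, vote * decay) for p in PARENTS.get(category, []))
    return dict(scores)


scores = score_categories("The investment bank handled the underwriting.")
print(scores)
```

For this sentence the nearest category, Investment Banks, scores 3.0, while the more distant Financial Services category scores 2.0 and Commercial Banks 1.0.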
26. Fortune 100 Companies
Energy
Financial Services
• Investment Banks
• Commercial Banks
Health Care
• Health Insurance
• HMO
• Medical Devices
• Pharmaceuticals
Hospitality
Manufacturing
• Aircraft
• Automobiles
• Electrical Equipment
Corporations
• US
• British
• Chinese
• French
• German
• Japanese
• Russian
• +
28. The Basic Search “Use Case”
Traditional - Brief display – snippeting,
hyperlinks and paging
• Faceted Navigation
• Highlighting
• Need To RETHINK for Mobile!!!
Query Formulation
–> Result Inspection
–> Query Refinement
29. Shortening The Loop
Query Suggestion (aka autocomplete,
typeahead)
• “Predictive” search
• Single field restriction
Recommendation
• Query – result – click – store – aggregate
• Boosting results or Suggesting queries
Best Bets (Query Elevation) – i.e. Punting
• Spotlighting
• Making it dynamic
Faceting
• Takes advantage of classification tagging
• Can be used to generate multi-field
phrases for suggestion
Inferential Search
• “I’m Feeling Lucky”
• Query Autofiltering
30. Enhanced Search: Pipelines
Document and Query Pre-Processing
Internal to Solr:
• Update Request Processor
• Data Import Handler (DIH)
• Search Component Chain
Big Data = Big Problem
or just a Big Opportunity:
• Hadoop – Solr
• Spark – Solr
• Morphlines
External to Solr:
• Custom ETL + SolrJ Integration
• Apache UIMA *
• DIH Client (SOLR-7188)
• Lucidworks Fusion
• Modular Informatic Designs framework
(coming soon to Open Source?)
31. Index Pipelines – Good Ole ETL + ______
Annotations!
Subject - Verb - Object
Entity Extractors – Identify Subject
and Object (noun phrases)
Annotations – mark locations of
entities in document
Discover Facts from Semantic Patterns
• $Person joined $Company
• $Drug is used to treat $Disease
• $Company acquired $Company
• $Person wrote $Song
Watson used IBM’s (now Apache’s) UIMA
(+40,000 PC’s)
Jeopardy is a “guess the subject given the object
and verb, posed as a question” game
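A pattern such as “$Person joined $Company” can be matched against entity annotations with very little code. The sketch below uses a hypothetical `Annotation` type and made-up entities; it is not the UIMA API, and it only matches when the verb is the sole text between two adjacent entities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Marks the location of an entity in the document text."""
    start: int
    end: int
    entity_type: str


def annotate(text, surface, entity_type):
    """Stand-in for an entity extractor: annotate the first occurrence."""
    start = text.index(surface)
    return Annotation(start, start + len(surface), entity_type)


def extract_facts(text, annotations, subject_type, verb, object_type):
    """Return (subject, verb, object) triples matching $Subject verb $Object."""
    facts = []
    ordered = sorted(annotations, key=lambda a: a.start)
    for left, right in zip(ordered, ordered[1:]):
        between = text[left.end:right.start].strip().lower()
        if (left.entity_type == subject_type
                and right.entity_type == object_type
                and between == verb):
            facts.append((text[left.start:left.end], verb,
                          text[right.start:right.end]))
    return facts


text = "Alice joined Acme last year."
annotations = [annotate(text, "Alice", "Person"), annotate(text, "Acme", "Company")]
facts = extract_facts(text, annotations, "Person", "joined", "Company")
print(facts)  # [('Alice', 'joined', 'Acme')]
```

The annotations carry the noun phrases, so the pattern itself only has to look at the verb between them.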
32. Who Needs Query Pipelines?
Who, What, Where, When:
• Security Filtering - Entitlements
• Dynamic Boost Block based on Preferences, Search History
• Geo Filtering – IP to geolocation
• Content Spotlighting based on time, place and search history
• Query Introspection – Infer User Intent
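A query pipeline can be pictured as a chain of small functions that each adjust the request before it reaches the search engine. This is plain Python for illustration, not the stage API of Fusion or Solr; the `acl` and `brand` fields and the user record are hypothetical.

```python
def security_stage(request, user):
    """Security filtering: restrict results to the user's entitlements."""
    groups = " OR ".join(sorted(user["groups"]))
    request["fq"].append(f"acl:({groups})")
    return request


def preference_boost_stage(request, user):
    """Dynamic boost block based on stored user preferences."""
    for brand in user.get("preferred_brands", []):
        request["bq"].append(f'brand:"{brand}"^5')
    return request


def run_query_pipeline(q, user, stages):
    """Pass the request through each stage in order."""
    request = {"q": q, "fq": [], "bq": []}
    for stage in stages:
        request = stage(request, user)
    return request


user = {"groups": ["staff", "sales"], "preferred_brands": ["Red Lion"]}
request = run_query_pipeline("red socks", user,
                             [security_stage, preference_boost_stage])
print(request)
```

The user typed only “red socks”; who is asking determined the rest of the request.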
33. Lucidworks Fusion: Pipelines Proliferate
Documents and Queries are dynamic Metadata Objects
• PipelineDocument and QueryRequestAndResponse, respectively
Lots of Stages – more coming with every release
• Metadata -> metadata – lookup, clone, map, join
• Content -> metadata – extract, transform, classify
Index Pipelines: One-Way
Query Pipelines: Round-Trip
• Both pre- and post-Query filtering opportunities
(Diagram: Connector or Query -> Stage -> Stage -> Stage -> Stage -> Solr Cloud)
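The two kinds of index stage, metadata to metadata and content to metadata, can be sketched as functions over a dictionary. This is a conceptual illustration in plain Python, not Fusion's PipelineDocument or stage API; the field names and the keyword classifier are hypothetical.

```python
def map_field(source, target):
    """Metadata -> metadata stage: rename a field."""
    def stage(doc):
        if source in doc:
            doc[target] = doc.pop(source)
        return doc
    return stage


def classify_by_keyword(field, keyword, tag):
    """Content -> metadata stage: tag the document if the text mentions a keyword."""
    def stage(doc):
        if keyword in doc.get(field, "").lower():
            doc["tags"] = doc.get("tags", []) + [tag]
        return doc
    return stage


def run_index_pipeline(doc, stages):
    """One-way pipeline: the document flows through each stage, then to the index."""
    doc = dict(doc)  # work on a copy so the caller's document is untouched
    for stage in stages:
        doc = stage(doc)
    return doc


stages = [
    map_field("body_t", "text"),
    classify_by_keyword("text", "big apple", "new_york_city"),
]
indexed = run_index_pipeline({"body_t": "I heart the big apple."}, stages)
print(indexed)
# {'text': 'I heart the big apple.', 'tags': ['new_york_city']}
```

A query pipeline would use the same chaining, with further stages applied to the response on the way back.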