A personalized search experience depends on search curation: robust text interpretation, data enrichment, relevancy tuning, and recommendations. Language identification and entity identification are crucial to all of these.
For teams working on search applications, advanced language packages allow them to achieve greater recall without sacrificing precision.
Join us for a guided tour of our new Advanced Linguistics packages, available in Fusion thanks to the technology partnership between Lucidworks and Basis Technology.
We’ll explore the application of language identification and entity extraction in the context of search, along with practical examples of personalizing search and enhancing entity extraction.
In this webinar, we’ll cover:
-How Fusion uses the Rosette Basic Linguistics and Entity Extraction packages
-Tips for improving language identification and treatment as well as data enrichment for personalization
-Speech2 demo modeling Active Recommendation
-How to use Rosette’s packages with Fusion Pipelines to build custom entities for specific domain use cases
Featuring:
-Radu Miclaus, Director of Product, AI and Cloud, Lucidworks
-Robert Lucarini, Senior Software Engineer, Lucidworks
-Nick Belanger, Solutions Engineer, Basis Technology
2. 2
Today’s Speakers
Radu Miclaus
Director of Product, AI and Cloud
Lucidworks
Robert Lucarini
Senior Software Engineer
Lucidworks
Nick Belanger
Solutions Engineer
Basis Technology
3. 3
Agenda
Challenges with Languages in Search Applications
How Fusion uses Rosette to address these Challenges
Deeper dive into Entities Customization
4. 4
Personalization through Search Experience
[Diagram: Documents → Search Curation → Personalization. Search curation covers Text Interpretation, Data Enrichment, Relevancy Tuning (“Exactly what I am searching for”), and Recommendations (“Guide me to other interesting things”).]
5. 5
Challenges with Languages in Search Applications
• LANGUAGE IDENTIFICATION
• CHARACTER NORMALIZATION
• GREATER RECALL WITHOUT LOSING PRECISION
• METADATA EXTRACTION/ENTITIES/FACETS/FILTERS
8. 8
Lemmatization
What is it?
Associates the different forms of a word
with a single base form (child/children;
beau/belle/beaux/belles). This is an
alternative to stemming, which strips
endings and so associates words that
merely look alike (arsen|ic -- arsen|al).
Why it matters
Important for European languages,
where adjective agreement in
gender/number and verb conjugation
create multiple word forms.
Associating the forms of a single word
increases search recall.
Impact on search
Increases recall of relevant results,
especially for European languages.
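The contrast between lemmatization and stemming described above can be sketched with a toy example. This is only an illustration, not the Rosette implementation: the `LEMMAS` table, the `SUFFIXES` list, and both function names are hypothetical, and a real lemmatizer relies on a full lexicon and part-of-speech information.

```python
# Toy lemma table (hypothetical); it maps inflected forms to a base form.
LEMMAS = {
    "children": "child",
    "belle": "beau",
    "beaux": "beau",
    "belles": "beau",
}

# Endings removed by the naive stemmer (hypothetical list).
SUFFIXES = ("al", "ic", "es", "s")


def lemmatize(word: str) -> str:
    """Return the base form of a word if it is in the lemma table."""
    word = word.lower()
    return LEMMAS.get(word, word)


def naive_stem(word: str) -> str:
    """Strip the first matching ending, ignoring meaning entirely."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Lemmatization groups forms of the same word.
for w in ("child", "children", "beau", "belle", "beaux", "belles"):
    print(w, "->", lemmatize(w))

# Stemming conflates unrelated words that merely share a prefix.
for w in ("arsenic", "arsenal"):
    print(w, "->", naive_stem(w))
```

In this sketch the stemmer maps both "arsenic" and "arsenal" to "arsen", while the lemmatizer leaves them distinct and groups only true forms of the same word.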
French examples:
9. 9
Tokenization
What is it?
Divides sentences into words
for languages written without
spaces between words.
Why it matters
The bigram method, a common
alternative, ignores meaning and
essentially does substring
matching of one or two
characters. Chinese is highly
ambiguous: any one character
could be a single word, but
often isn’t.
Impact on search
Greater precision of Chinese,
Japanese, Korean searches.
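To make the difference concrete, here is a minimal sketch that compares overlapping bigrams with a dictionary-based segmenter. Forward maximum matching is used only as a simple stand-in; it is not claimed to be what Rosette does, and the three-entry `LEXICON` is a hypothetical toy.

```python
# Toy lexicon (hypothetical): "I", "love", "Beijing".
LEXICON = {"我", "爱", "北京"}


def bigrams(text: str) -> list[str]:
    """Overlapping two-character substrings, with no notion of words."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def max_match(text: str, lexicon: set[str]) -> list[str]:
    """Forward maximum matching: take the longest lexicon entry at each position."""
    max_len = max(len(entry) for entry in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


sentence = "我爱北京"  # "I love Beijing"
print(bigrams(sentence))
print(max_match(sentence, LEXICON))
```

The bigram output includes "爱北", a substring that spans a word boundary and can match unrelated text, while the dictionary-based segmentation yields only real words.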
10. 10
Chinese Script Conversion
What is it?
Converts records or queries
between simplified and
traditional Chinese.
Why it matters
Without conversion, a user has to
search twice, once in traditional
and once in simplified Chinese,
to cover all Chinese documents.
Impact on search
A single query searches both
simplified and traditional
Chinese documents
simultaneously, and results
appear in the user’s preferred
script.
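One way to picture the single-query behavior is to normalize both documents and queries to one script before matching. The sketch below assumes a tiny hand-written character table (`T2S`) and a substring search; both are hypothetical simplifications, and real conversion is not a pure one-to-one character mapping.

```python
# Tiny traditional-to-simplified table (hypothetical subset).
T2S = {"國": "国", "學": "学", "語": "语", "漢": "汉", "書": "书"}
TRAD_TO_SIMP = str.maketrans(T2S)


def to_simplified(text: str) -> str:
    """Map known traditional characters to their simplified forms."""
    return text.translate(TRAD_TO_SIMP)


def search(query: str, docs: dict[int, str]) -> list[int]:
    """Match in a single normalized script so one query covers both scripts."""
    normalized_query = to_simplified(query)
    return [
        doc_id
        for doc_id, text in docs.items()
        if normalized_query in to_simplified(text)
    ]


DOCS = {
    1: "漢語",  # traditional
    2: "汉语",  # simplified
    3: "中國",  # traditional, different term
}

print(search("漢語", DOCS))
print(search("汉语", DOCS))
```

Either form of the query returns documents 1 and 2, because matching happens on the normalized text rather than the script the user typed.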
11. 11
Decompounding
What is it?
Splits compound nouns.
Why it matters
A search for a compound word
like Jugendarbeitslosigkeit
(German: “youth unemployment”)
misses results where the two
concepts (“youth” and
“unemployment”) are separated
(“20% more youth were
unemployed this month.”).
Impact on search
Greater recall of German, Dutch,
Korean searches.
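A minimal sketch of dictionary-based decompounding follows, under the assumption that a compound is a sequence of lexicon words optionally joined by a linking "s". The `LEXICON`, `LINKERS`, and function names are hypothetical, and production decompounders handle far more linking elements and ambiguity.

```python
# Toy German lexicon (hypothetical).
LEXICON = {"jugend", "arbeitslosigkeit", "arbeit", "amt"}
# Linking elements allowed between parts: none, or the German "s".
LINKERS = ("", "s")


def decompound(word: str, lexicon: set[str] = LEXICON) -> list[str]:
    """Split a compound into lexicon words; return [word] if no full split exists."""
    word = word.lower()

    def split(rest: str):
        if not rest:
            return []
        # Prefer the longest known head at each step.
        for end in range(len(rest), 0, -1):
            head = rest[:end]
            if head not in lexicon:
                continue
            tail = rest[end:]
            for linker in LINKERS:
                if tail.startswith(linker):
                    parts = split(tail[len(linker):])
                    if parts is not None:
                        return [head] + parts
        return None

    parts = split(word)
    return parts if parts else [word]


def index_terms(word: str) -> list[str]:
    """Index the compound itself plus its parts so either form can match."""
    parts = decompound(word)
    lowered = word.lower()
    return [lowered] + [p for p in parts if p != lowered]


print(decompound("Jugendarbeitslosigkeit"))
print(decompound("Arbeitsamt"))
print(index_terms("Jugendarbeitslosigkeit"))
```

Indexing the parts alongside the compound is one way a query for the compound could also reach documents that mention the two concepts separately.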
German examples:
12. 12
Named Entity Recognition (NER)
What is it?
Adds structure to your
unstructured, multilingual text by
automatically identifying people,
organizations, locations, dates,
products, and much more.
Why it matters
Filter results for the ones
containing the entities most
pertinent to your search.
Impact on search
More quickly refine your search,
remove noise, and increase
search relevance.
14. 14
Rosette Enhancing Fusion
Solr and Fusion
- Solr support for multilingual tokenization
- 35 languages supported
- 7 entities supported with OpenNLP integration
Solr/Fusion/Rosette
Base Linguistics:
- 32 supported languages
- Sentence tagging
- Tokenization
- Lemmatization
- Part-of-speech tagging
- Decompounding
- Chinese/Japanese readings
Rosette Entity Extractor:
- 21 supported languages
- 29 entity types and 450+ sub-types detected
15. 15
Rosette enhances Fusion’s capabilities to enrich data for search and
personalization. Beyond language interpretation, robust entity extraction
improves search by supplying values for facets.
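As an illustration of how extracted entities can feed facets, the sketch below counts entity values per type across documents and filters by a selected value. The documents, entity labels, and function names are all hypothetical; in practice the entities would come from the extraction stage and the facets from the search engine.

```python
from collections import Counter, defaultdict

# Hypothetical documents with entities already extracted as (type, value) pairs.
DOCS = [
    {"id": 1, "entities": [("ORGANIZATION", "Acme Corp"), ("LOCATION", "Paris")]},
    {"id": 2, "entities": [("ORGANIZATION", "Globex"), ("LOCATION", "Paris")]},
    {"id": 3, "entities": [("ORGANIZATION", "Acme Corp")]},
]


def facet_counts(docs):
    """Count, per entity type, how many documents mention each value."""
    facets = defaultdict(Counter)
    for doc in docs:
        # Use a set so a value repeated within one document counts once.
        for entity_type, value in set(doc["entities"]):
            facets[entity_type][value] += 1
    return facets


def filter_by_entity(docs, entity_type, value):
    """Return the ids of documents that contain the selected entity."""
    return [doc["id"] for doc in docs if (entity_type, value) in doc["entities"]]


facets = facet_counts(DOCS)
print(dict(facets["LOCATION"]))
print(filter_by_entity(DOCS, "ORGANIZATION", "Acme Corp"))
```

The counts correspond to what a facet sidebar would display, and the filter corresponds to a user clicking one of its values.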
20. BASIS TECHNOLOGY
The Rosette Entity Extraction Workflow.
The Rosette Entity Extractor:
● comes with expertly crafted models.
● can extract 18 different kinds of entities in more than 20 different languages.
● is made with high-quality data.
● is curated by our dedicated data team.
● is backed by 25 years of NLP expertise.
21. BASIS TECHNOLOGY
The Rosette Entity Extraction Workflow.
● Machine-learned or deep-learned statistical models that identify entities based on context
● A high-performance gazetteer that is dynamically updatable
● Rule-based extraction using regex-style patterns
22. BASIS TECHNOLOGY
Configuration and Customization.
Configuration:
● Quick and easy
● Leverages pre-defined capabilities
● Primarily file manipulation
Customization:
● Drastically change REx capabilities
● Allows for truly custom approaches
● More time-intensive
23. BASIS TECHNOLOGY
Configuration: Gazetteer and Regex.
Gazetteer
● Easy to create/modify/maintain
● Create lists of entities to extract
● Great when set is limited/defined
● Accept and reject
Regex
● Match any pattern, simple or complex
● Extract all entities following a pattern
● Requires technical resources
● Accept and reject
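The gazetteer and regex options can be pictured with a small sketch. It assumes that "accept" means always extract the listed names or patterns and "reject" means suppress a match that would otherwise be extracted; that reading, the entity types, the lists, and the `extract` function are all hypothetical and do not reproduce Rosette's file formats or API.

```python
import re

# Accept gazetteer: fixed names to extract, grouped by entity type (hypothetical).
ACCEPT_GAZETTEER = {"PRODUCT": ["Fusion", "Rosette"]}

# Accept regex: extract everything that follows a pattern (hypothetical ticket ids).
ACCEPT_PATTERNS = {"TICKET": re.compile(r"\b[A-Z]{2,10}-\d+\b")}

# Reject list: matches to suppress even though a rule above would extract them.
REJECT = {("TICKET", "UTF-8")}


def extract(text: str) -> list[tuple[str, str]]:
    """Return (entity_type, matched_text) pairs in order of appearance."""
    found = []
    for entity_type, names in ACCEPT_GAZETTEER.items():
        for name in names:
            pattern = re.compile(r"\b" + re.escape(name) + r"\b")
            for match in pattern.finditer(text):
                found.append((match.start(), entity_type, match.group()))
    for entity_type, pattern in ACCEPT_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), entity_type, match.group()))
    found.sort()
    return [(t, s) for _, t, s in found if (t, s) not in REJECT]


text = "Fusion ticket SEARCH-1234 covers UTF-8 handling in Rosette."
print(extract(text))
```

Here "UTF-8" matches the ticket pattern but is suppressed by the reject list, which is the kind of noise a reject rule is meant to remove.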
24. BASIS TECHNOLOGY
Customization: Model Training and Custom Processors.
Model Training
● Customize the ML models directly
● Train on your genre of text
● Teach it to recognize new entities
● Requires training process
Custom Processors
● Execute custom code in a sandbox
● Validation, redaction, transformation
● Create more complex extraction rules
● Accept and reject
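As a rough picture of what a custom processor might do, the sketch below validates candidate entities and redacts the ones that pass. It assumes entities arrive as dictionaries with `type`, `text`, `start`, and `end` keys and uses a Luhn checksum as the validation step; the entity shape, the `CREDIT_CARD` type, and the function names are hypothetical, and sandboxing is not shown.

```python
def luhn_valid(number: str) -> bool:
    """Validate a card-like number with the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:
        return False
    total = 0
    for i, digit in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def process(text: str, entities: list[dict]) -> tuple[str, list[dict]]:
    """Reject card entities that fail validation, then redact the rest."""
    kept = [
        e for e in entities
        if e["type"] != "CREDIT_CARD" or luhn_valid(e["text"])
    ]
    redacted = text
    # Replace from right to left so earlier offsets stay valid.
    for e in sorted(kept, key=lambda e: e["start"], reverse=True):
        if e["type"] == "CREDIT_CARD":
            redacted = redacted[:e["start"]] + "[REDACTED]" + redacted[e["end"]:]
    return redacted, kept


def make_entity(text: str, entity_type: str, surface: str) -> dict:
    """Helper for the example: locate a surface string and build an entity."""
    start = text.find(surface)
    return {"type": entity_type, "text": surface, "start": start, "end": start + len(surface)}


text = "Card 4111 1111 1111 1111 on file; ref 4111 1111 1111 1112."
candidates = [
    make_entity(text, "CREDIT_CARD", "4111 1111 1111 1111"),
    make_entity(text, "CREDIT_CARD", "4111 1111 1111 1112"),
]
redacted, kept = process(text, candidates)
print(redacted)
```

The second number fails the checksum, so it is rejected as an entity and left in the text, while the valid number is redacted.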
25. 25
Take Away
● Text Interpretation and Enrichment are Crucial to Personalization
● Having robust language and entity support technology is essential for text interpretation and enrichment
● The Fusion and Rosette technology stacks are now integrated to provide the best of AI-Powered Search and AI-Powered Linguistics.
● Visit the BasisTech Booth at Activate