AI-Powered Linguistics and Search with Fusion and Rosette

1
AI-Powered Linguistics & Search
ROSETTE FOR FUSION

2
Today’s Speakers
Radu Miclaus
Director of Product, AI and Cloud
Lucidworks
Robert Lucarini
Senior Software Engineer
Lucidworks
Nick Belanger
Solutions Engineer
Basis Technology

3
Agenda
Challenges with Languages in Search Applications
How Fusion uses Rosette to address these Challenges
Deeper dive into Entities Customization

4
Personalization through Search Experience
Documents → Search → Curation → Personalization
Text Interpretation
Data Enrichment
Relevancy Tuning
Exactly what I am searching for
Guide me to other interesting things
Recommendations

5
Challenges with Languages in Search Applications
• LANGUAGE IDENTIFICATION
• CHARACTER NORMALIZATION
• GREATER RECALL WITHOUT LOSING PRECISION
• METADATA EXTRACTION/ENTITIES/FACETS/FILTERS

6
Fusion + Rosette
Best-in-Class Search using Best-in-Class Linguistics

7
Boosting Global Search Quality with Rosette
Essential Elements of Multilingual Search

8
Lemmatization
What is it?
Associates the inflected forms of a word with its dictionary form (child/children; beau/belle/beaux/belles). This is an alternative to stemming, which removes endings and so associates words that merely look alike (arsen|ic and arsen|al).
Why it matters
Important for European languages, where adjective agreement in gender/number and verb conjugation create multiple word forms. Associating the forms of a single word increases search recall.
Impact on search
Increases recall of relevant results, especially for European languages.
[Figure: French examples]
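The contrast between lemmatization and stemming can be sketched in a few lines of Python. This is a toy illustration only: the `LEMMAS` table and the suffix list are hypothetical stand-ins, and neither function reflects how Rosette works internally.

```python
# Toy lemma dictionary (hypothetical data, for illustration only).
LEMMAS = {
    "children": "child",
    "belle": "beau",
    "beaux": "beau",
    "belles": "beau",
}


def lemmatize(word):
    """Map a word form to its dictionary form; unknown words map to themselves."""
    w = word.lower()
    return LEMMAS.get(w, w)


def naive_stem(word, suffixes=("ic", "al", "s")):
    """Strip the first matching suffix, with no regard for meaning."""
    w = word.lower()
    for suffix in suffixes:
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w


# Lemmatization groups true variants of one word.
same_lemma = lemmatize("belles") == lemmatize("beau")

# Stemming conflates unrelated words that share a prefix.
false_match = naive_stem("arsenic") == naive_stem("arsenal")
```

Here `same_lemma` is true because both forms resolve to "beau", while `false_match` is also true because the stemmer reduces both "arsenic" and "arsenal" to "arsen", which is the kind of precision loss the slide describes.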

9
Tokenization
What is it?
Divides sentences into words for languages written without spaces between words.
Why it matters
The alternative, the bigram method, ignores meaning and essentially does substring matching on one or two characters. Chinese is highly ambiguous: any one character could be a single word, but often isn’t.
Impact on search
Greater precision for Chinese, Japanese, and Korean searches.
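To see why bigram indexing costs precision, the sketch below compares it with a simple dictionary-based segmenter. It uses forward maximum matching over a tiny hypothetical lexicon, which is a textbook technique chosen for brevity and not a description of Rosette's tokenizer.

```python
def bigrams(text):
    """Index every overlapping two-character substring."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def max_match(text, lexicon):
    """Forward maximum matching: take the longest lexicon word at each position.

    Falls back to a single character when no lexicon word matches.
    """
    max_len = max(len(word) for word in lexicon)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical lexicon: Tokyo, Tokyo Metropolis, Kyoto, governor.
LEXICON = {"東京", "東京都", "京都", "知事"}
TEXT = "東京都知事"  # "Governor of Tokyo"

bigram_tokens = bigrams(TEXT)
word_tokens = max_match(TEXT, LEXICON)
```

A query for 京都 (Kyoto) matches the bigram index, because 京都 happens to be a substring of 東京都, but it does not match the word tokens, which are 東京都 and 知事.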

10
Chinese Script Conversion
What is it?
Converts records or queries between simplified and traditional Chinese.
Why it matters
Without conversion, a user must search twice, once in traditional and once in simplified Chinese, to cover all Chinese documents.
Impact on search
With one query, you can search both simplified and traditional Chinese documents simultaneously and see results in your preferred script.
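The idea of searching both scripts with one query can be sketched by normalizing documents and queries to a single script before matching. The four-character mapping below is a hypothetical toy table; real conversion is not strictly one-to-one and needs context, which is why a linguistics library is used in practice.

```python
# Toy traditional-to-simplified table (a tiny illustrative subset).
T2S = str.maketrans({"國": "国", "語": "语", "學": "学", "漢": "汉"})


def to_simplified(text):
    """Normalize text to simplified characters using the toy table."""
    return text.translate(T2S)


def search(query, documents):
    """Return documents containing the query, comparing in normalized form."""
    normalized_query = to_simplified(query)
    return [doc for doc in documents if normalized_query in to_simplified(doc)]


DOCUMENTS = ["我學漢語", "我学汉语"]  # same sentence in traditional and simplified

# One query, in either script, finds both documents.
hits_from_traditional = search("漢語", DOCUMENTS)
hits_from_simplified = search("汉语", DOCUMENTS)
```

The original document text is returned untouched, so results can still be displayed in whichever script they were written in.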

11
Decompounding
What is it?
Splits compound words into their component words.
Why it matters
A search for a compound word like Jugendarbeitslosigkeit (German: “youth unemployment”) misses results where the two concepts (“youth” and “unemployment”) are separated, as in “20% more youth were unemployed this month.”
Impact on search
Greater recall for German, Dutch, and Korean searches.
[Figure: German examples]
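A minimal sketch of dictionary-based decompounding follows. The lexicon is hypothetical and the recursive splitter ignores linking elements and ambiguity, so it only illustrates the concept, not Rosette's implementation.

```python
def decompound(word, lexicon):
    """Split a compound into known lexicon words.

    Returns the word itself if it is a lexicon entry, a list of components
    if a full split exists, or an empty list if no split is found.
    """
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest possible head first.
    for i in range(len(word) - 1, 0, -1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = decompound(tail, lexicon)
            if rest:
                return [head] + rest
    return []


# Hypothetical lexicon: "youth" and "unemployment".
LEXICON = {"jugend", "arbeitslosigkeit"}

parts = decompound("Jugendarbeitslosigkeit", LEXICON)
```

Indexing the components alongside the full compound lets a query for either concept retrieve the document.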

12
Named Entity Recognition (NER)
What is it?
Adds structure to your unstructured, multilingual text by automatically identifying people, organizations, locations, dates, products, and much more.
Why it matters
Lets you filter results down to those containing the entities most pertinent to your search.
Impact on search
Refine your search more quickly, remove noise, and increase search relevance.

13
How Does Fusion Use Rosette?

14
Rosette Enhancing Fusion
Solr and Fusion
- Solr support for multilingual tokenization
- 35 languages supported
- 7 entities supported with OpenNLP integration
Solr/Fusion/Rosette
Base Linguistics:
- 32 supported languages
- Sentence tagging
- Tokenization
- Lemmatization
- Part-of-speech tagging
- Decompounding
- Chinese/Japanese readings
Rosette Entity Extractor:
- 21 supported languages
- 29 entity types and 450+ sub-types detected

15
Rosette enhances Fusion’s capabilities to enrich data for search and personalization. Besides language interpretation, robust Entity Extraction can enhance Search through the use of Facets.
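How extracted entities become facets can be sketched as follows. The documents, field names, and entity values are hypothetical sample data, and the functions are plain Python rather than Fusion or Solr API calls.

```python
from collections import Counter

# Hypothetical documents already enriched with entity annotations.
DOCS = [
    {"id": 1, "entities": {"ORGANIZATION": ["Lucidworks"], "LOCATION": ["Boston"]}},
    {"id": 2, "entities": {"ORGANIZATION": ["Lucidworks", "Basis Technology"]}},
    {"id": 3, "entities": {"LOCATION": ["Paris"]}},
]


def facet_counts(docs, entity_type):
    """Count how many documents mention each entity of the given type."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc["entities"].get(entity_type, [])))
    return counts


def filter_by_entity(docs, entity_type, value):
    """Keep only documents that mention the selected facet value."""
    return [doc for doc in docs if value in doc["entities"].get(entity_type, [])]


org_facet = facet_counts(DOCS, "ORGANIZATION")
lucidworks_docs = filter_by_entity(DOCS, "ORGANIZATION", "Lucidworks")
```

The facet counts drive the list of filter choices shown to the user, and selecting one narrows the result set to documents carrying that entity.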

17
Entity Extraction Workflow
REX engine for Entity Extraction and Fusion Pipelines

18
Fusion 5 Sample Architecture

19
Deeper Dive
Entities Customization

20
BASIS TECHNOLOGY
The Rosette Entity Extraction Workflow.
The Rosette Entity Extractor:
● comes with expertly crafted models.
● can extract 18 different kinds of entities in more than 20 different languages.
● is made with high-quality data.
● is curated by our dedicated data team.
● is backed by 25 years of NLP expertise.

21
The Rosette Entity Extraction Workflow.
● Machine-learned or deep-learned statistical models that identify entities based on context
● A high-performance gazetteer that is dynamically updatable
● Rule-based extraction using regex-style patterns

22
Configuration and Customization.
Configuration:
● Quick and easy
● Leverages pre-defined capabilities
● Primarily file manipulation
Customization:
● Drastically change REX capabilities
● Allows for truly custom approaches
● More time-intensive

23
Configuration: Gazetteer and Regex.
Gazetteer
● Easy to create/modify/maintain
● Create lists of entities to extract
● Great when the set of entities is limited and well defined
● Supports both accept lists and reject lists
Regex
● Match any pattern, simple or complex
● Extract all entities following a pattern
● Requires technical resources

24
Customization: Model Training and Custom Processors.
Model Training
● Customize the ML models directly
● Train on your genre of text
● Teach it to recognize new entities
● Requires training process
Custom Processors
● Execute custom code in a sandbox
● Validation, redaction, transformation
● Create more complex extraction rules
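What a custom processor might do for validation and redaction can be sketched as a post-processing function over extracted entities. The entity dictionaries, the `CREDIT_CARD` type name, and the `process` function are assumptions made for illustration; they do not represent the extractor's actual processor interface.

```python
def luhn_valid(number):
    """Check a digit string with the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, digit in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def process(entities):
    """Drop invalid card numbers and mask valid ones; pass other entities through."""
    output = []
    for entity in entities:
        if entity["type"] == "CREDIT_CARD":
            mention = entity["mention"]
            if not luhn_valid(mention):
                continue  # validation: discard false positives
            masked = "*" * (len(mention) - 4) + mention[-4:]
            entity = {**entity, "mention": masked}  # redaction
        output.append(entity)
    return output


extracted = [
    {"type": "CREDIT_CARD", "mention": "4111111111111111"},
    {"type": "CREDIT_CARD", "mention": "4111111111111112"},
    {"type": "ORGANIZATION", "mention": "Lucidworks"},
]
processed = process(extracted)
```

The first card number passes the checksum and is masked, the second fails and is discarded, and the organization passes through unchanged.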

25
Take Away
● Text Interpretation and Enrichment are Crucial to Personalization
● Robust language and entity support technology is essential for text interpretation and enrichment
● The Fusion and Rosette technology stacks are now integrated to provide the best of AI-Powered Search and AI-Powered Linguistics
● Visit the BasisTech Booth at Activate
