Rapid
Solr Schema
Development
Alexandre Rafalovitch (@arafalov)
Apache Solr Committer
Montreal Solr/ML meetup May 2018
Phone directory - content
Names, often from multiple cultures
Addresses
Phone numbers
Company/Group
Locations
Other fun data
I use https://www.fakenamegenerator.com/ for demos
 Can generate bulk entries in csv, tab-separated, sql, etc
 Many fields, languages, regions
 Warning: comes with an – invisible – byte order mark
Slide 2
Today's exploration
Solr 7.3 (latest)
The smallest learning schema/configuration required
Rapid schema evolution workflow
Free-form and fielded user entry
Dealing with multiple languages
Dealing with alternative name spellings
Searching phone numbers by any-length suffix
Configuring Solr to simplify API interface
(Bonus points) Fit into a 40-minute presentation!
Slide 3
Today's dataset
http://www.fakenamegenerator.com/ - Bulk request (20000 identities) – Free and configurable!
Name sets: American, Arabic, Australian, Chinese, French, Hispanic, Polish, Russian, Russian (Cyrillic), Thai
Countries: Australia, Canada, France, Poland, Spain, United Kingdom, United States
Age range: 19 - 85 years old
Gender: 50% male, 50% female
Fields:
id,Gender,NameSet,Title,GivenName,MiddleInitial,Surname,StreetAddress,City,StateFull,ZipCode,CountryFull,EmailAddress,Username,TelephoneNumber,TelephoneCountryCode,Birthday,Age,TropicalZodiac,Color,Occupation,Company,BloodType,Kilograms,Centimeters,GUID,Latitude,Longitude
Renamed first field (Number) to id to fit Solr's naming convention
Removed BOM (in Vim, :set nobomb)
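For those who prefer a script to Vim, the two clean-up steps above (drop the BOM, rename Number to id) could be sketched roughly like this. The function name and the tiny sample are made up for illustration; only the Number and id column names come from the slide.

```python
import csv
import io


def clean_csv(raw: bytes) -> str:
    """Strip a leading UTF-8 BOM and rename the first header column to 'id'.

    Hypothetical helper: assumes a UTF-8, comma-separated export.
    """
    # 'utf-8-sig' drops the BOM if present and is a no-op otherwise.
    text = raw.decode("utf-8-sig")
    rows = list(csv.reader(io.StringIO(text, newline="")))
    # Only rename when the header really starts with 'Number'.
    if rows and rows[0] and rows[0][0] == "Number":
        rows[0][0] = "id"
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Made-up two-line sample with the invisible BOM in front.
sample = b"\xef\xbb\xbfNumber,Gender,GivenName\n1,male,John\n"
print(clean_csv(sample))
```

The whole file is read into memory, which should be fine for a 20000-row export but is not meant for very large files.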
Slide 4
First try – Solr's built-in schema
bin/solr start – standalone (non-clustered) server with no initial collections
bin/solr create -c demo1 – uses default configset, with 'schemaless' mode, not for production
Starts with 4 fields (id, _text_, _version_, _root_)
Auto-creates the rest on first occurrence
bin/post -c demo1 ../dataset.csv
auto-detect content type from extension
can bulk upload files
see techproducts shipped example
bin/solr start -e techproducts
For one file, can also do via Admin UI
DEMO
Slide 5
Schemaless schema – lessons learned
Imported 1 record
Failed on the second one, because ZipCode was detected as a number
Can fix that by explicit configuration and rebuilding – see films example
(example/films/README.txt)
Other issues
Dual fields for text and string
Everything multivalued – because "just in case" – No sorting, API is messier, etc
Many large files
managed-schema: 546 lines (without comments)
solrconfig.xml: 1364 lines (with comments)
Plus another 42 configuration files, mostly language stopwords
Homework to get this working – not enough time today
Slide 6
Learning schema
managed-schema: start from nearly nothing – add as needed
solrconfig.xml: start from nearly all defaults – Most definitely NOT production ready
Not SolrCloud ready – add those as you scale
No extra field types – add as you need them
How small can we go?!?
Based on exploration done for my presentation at Lucene/Solr Revolution 2016
https://www.slideshare.net/arafalov/rebuilding-solr-6-examples-layer-by-layer-lucenesolrrevolution-2016 (slides and video)
https://github.com/arafalov/solr-deconstructing-films-example - repo
A bit out of date – schemaless mode was tuned since
Today's version uses the latest Solr features
https://github.com/arafalov/solr-presentation-2018-may/commits/master (changes commit-by-commit)
Slide 7
Learning schema – managed-schema
<?xml version="1.0" encoding="UTF-8"?>
<schema name="smallest-config" version="1.6">
<field name="id" type="string" required="true" indexed="true" stored="true" />
<field name="_text_" type="text_basic" multiValued="true" indexed="true"
stored="false" docValues="false"/>
<dynamicField name="*" type="text_basic" indexed="true" stored="true"/>
<copyField source="*" dest="_text_"/>
<uniqueKey>id</uniqueKey>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>
<fieldType name="text_basic" class="solr.SortableTextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
</schema>
Slide 8
Learning schema – solrconfig.xml
<?xml version="1.0" encoding="UTF-8" ?>
<config>
<luceneMatchVersion>7.3.0</luceneMatchVersion>
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="df">_text_</str>
<str name="echoParams">all</str>
</lst>
</requestHandler>
</config>
Slide 9
2 files, 33 lines combined, including blanks – but Will It Blend Search?
bin/solr create -c tinydir -d ../configs/smallest/ - provide custom config files to the collection
bin/post -c tinydir ../dataset.csv – Remember the BOM and renaming column Number->id
Does it search?
General search?
Case-insensitive search?
Range search: Centimeters:[* TO 99]
Fielded search?
Facet?
Sort?
Are ids preserved?
Are individual fields easy to work with (fl, etc)?
DEMO
Learning schema – create and index
Slide 10
It works! And ready to start being used from other parts of the project
Do NOT expose Solr directly to the Internet. Not until you are a Solr Wizard, the Gray.
managed-schema file has NOT changed – because of dynamicField
Still 21 lines
Would still keep the comments
Would still preserve field/type definitions
Will change on first AdminUI/API modification – gets rewritten
What else? Actual search-engine tuning!
Special cases
Numerics – e.g. for Range search
Spatial search – e.g. for Mapping/distance ranking
Multivalued fields
Dates
Special parsing (e.g. names/surnames)
Useful telephone number search
Relevancy tuning!
Learning schema - conclusion
Slide 11
Several possibilities
Admin UI
Delete schema field
Add schema field with new definition
Reindex
Sometimes causes docValue-related exception, have to rebuild collection from scratch
Schema API (Admin UI uses a subset of it)
See: https://lucene.apache.org/solr/guide/7_3/schema-api.html
Also has Replace a Field
Also has Add/Delete Field Type
Great to use programmatically or with something like Postman (https://www.getpostman.com/)
Edit schema/solrconfig.xml directly and reload the collection
Not recommended for production, but OK with a single server/single developer
Remember to edit the actual schema, not the original config one
◦ Check "Instance" location in Admin UI, in collections' Overview screen
Remember that in SolrCloud mode, the config files are NOT on disk (they are in ZooKeeper).
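As a rough illustration of the Schema API route listed above, here is a sketch that only builds the JSON body for one command and prints it instead of sending it. The helper name, the localhost URL and the choice of the Age field with a pint type are assumptions for illustration; pint would have to exist in the schema first.

```python
import json


def schema_command(command: str, name: str, field_type: str, **attrs) -> str:
    """Build the JSON body for a single-field Schema API command.

    'command' is expected to be something like 'add-field' or
    'replace-field'; this hypothetical helper does not validate it.
    """
    field = {"name": name, "type": field_type}
    field.update(attrs)
    return json.dumps({command: field})


# Assumed local server and the 'tinydir' collection; nothing is sent here.
url = "http://localhost:8983/solr/tinydir/schema"
body = schema_command("add-field", "Age", "pint", indexed=True, stored=True)
print("POST", url)
print(body)
```

The sketch uses add-field on the assumption that a field matched only by the dynamicField rule has no explicit definition yet; replace-field would be the command for a field that is already declared.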
Evolving schema
Slide 12
Numeric fields
 Age – int
 Centimeters (height?) – int
 Kilograms – float
Copy missing field types (pint, pfloat) from solr-7.3.0/server/solr/configsets/_default/conf/managed-schema
Map numeric fields explicitly
Delete existing content, because the storage needs change radically
 bin/post -c tinydir -format solr -d "<delete><query>*:*</query></delete>"
Reload the core in Admin UI's Core Admin (menu is different in SolrCloud mode)
Index again
 bin/post -c tinydir ../dataset.csv
New queries
 facet=true&facet.range=Age&facet.range.start=0&facet.range.end=200&facet.range.gap=10
 Centimeters:[* TO 99] (again)
DEMO
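Why Centimeters:[* TO 99] is worth trying again: while the field is text, its terms are compared as strings, so "150" sorts before "99". A simplified sketch of the difference, with made-up heights (this is not Solr's actual range implementation):

```python
def text_range_match(term: str, upper: str) -> bool:
    """Upper-bound check on string terms: plain lexicographic comparison."""
    return term <= upper


def int_range_match(value: int, upper: int) -> bool:
    """Upper-bound check on real numbers."""
    return value <= upper


# Made-up sample heights in centimeters.
heights = [57, 99, 150, 172, 203]

as_text = [h for h in heights if text_range_match(str(h), "99")]
as_int = [h for h in heights if int_range_match(h, 99)]

print("text field:", as_text)  # every height, because '1' and '2' sort before '9'
print("int field: ", as_int)   # only 57 and 99
```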
Evolving schema – add numeric fields
Slide 13
Solr supports extensive spatial search
https://lucene.apache.org/solr/guide/7_3/spatial-search.html
bounding-box with different shapes (circles, polygons, etc)
distance limiting or boosting
different options with different functionalities
LatLonPointSpatialField
SpatialRecursivePrefixTreeFieldType
BBoxField
All require combined Lat Lon coordinates (lat,lon)
We are providing separate Latitude and Longitude fields – need to merge them with a comma
Let's copy a field type and create a field:
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" geo="true"
distErrPct="0.025" maxDistErr="0.001" distanceUnits="kilometers" />
<field name="location" type="location_rpt" indexed="true" stored="true" />
Remember to reload – no need to delete, as it is a new field
Next, need to also give merge instructions with an Update Request Processor
Evolving schema – spatial search
Slide 14
Update Request Processors
Deal with the data before it touches the schema
Can do pre-processing magic with many, many processors
See: https://lucene.apache.org/solr/guide/7_3/update-request-processors.html
See: http://www.solr-start.com/info/update-request-processors/ (mine)
Some are more magical than others and have shortcuts, e.g. TemplateUpdateProcessorFactory
All can be configured with chains in solrconfig.xml and apply explicitly or by default
That's how the schemaless mode works (default chain in solrconfig.xml of _default configset)
Also check the way dates are parsed in it, search for parse-date – can be used standalone
IgnoreFieldUpdateProcessorFactory could be useful to drop fields we don't want Solr to process at all
(including in collect-all _text_ field)
Let's reindex everything using the template to populate the new field:
bin/post -c tinydir -params "processor=template&template.field=location:{Latitude},{Longitude}" ../dataset.csv
Query:
q=*:*&rows=1&
fq={!geofilt sfield=location}&
pt=45.493444, -73.558154&d=100&
facet=on&facet.field=City&facet.mincount=1
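What the template.field parameter above asks for can be imitated in a few lines: take the target field name before the colon and fill each {Field} placeholder from the document. This is only a sketch of the idea, not the TemplateUpdateProcessorFactory code, and the sample document is made up.

```python
import re


def apply_template(template: str, doc: dict) -> tuple:
    """Expand 'target:pattern', replacing each {Name} with doc['Name']."""
    target, _, pattern = template.partition(":")
    value = re.sub(r"\{([^}]+)\}", lambda m: str(doc[m.group(1)]), pattern)
    return target, value


# Made-up document with separate coordinate fields, as in the CSV.
doc = {"GivenName": "John", "Latitude": "45.493444", "Longitude": "-73.558154"}

field, value = apply_template("location:{Latitude},{Longitude}", doc)
doc[field] = value
print(doc["location"])  # the combined "lat,lon" string the spatial field needs
```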
DEMO
Evolving schema – URPs
Slide 15
Search for John and look at the phone numbers (q=John&fl=TelephoneNumber):
03.99.56.91.63
(08) 9435 3911
79 196 65 43
306-724-3986
Can we search that?
TelephoneNumber:3911 – yes
TelephoneNumber:"65 43" – sort of (need to quote or know these are together)
TelephoneNumber:3986 – sort of: some at the end, some at middle
Use Case: Just search the last digits (suffix) regardless of formatting
We have MANY analyzers, tokenizers, and character and token filters to help us with it
https://lucene.apache.org/solr/guide/7_3/understanding-analyzers-tokenizers-and-filters.html
http://www.solr-start.com/info/analyzers/ (mine)
Evolving schema – phone numbers
Slide 16
Let's define a super-custom field type:
<fieldType name="phone" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])"
replacement="" replace="all"/>
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])"
replacement="" replace="all"/>
<filter class="solr.ReverseStringFilterFactory"/>
</analyzer>
</fieldType>
Notice
Asymmetric analyzers
Reversing the string so that the trailing digits become the leading ones (make sure the reversal is symmetric, i.e. applied at both index and query time!)
Edge n-grams (3-30 character substrings) - makes the index larger, but the search very fast
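To see what the two analyzer chains produce, here is a small imitation: keep digits only, reverse, and (index side only) emit edge n-grams of 3 to 30 characters. It mimics the behaviour described above under those assumptions and is not Solr code; the function names are made up.

```python
import re


def digits_reversed(text: str) -> str:
    """Keyword token -> strip non-digits -> reverse (shared by both chains)."""
    return re.sub(r"[^0-9]", "", text)[::-1]


def index_tokens(phone: str, min_gram: int = 3, max_gram: int = 30) -> set:
    """Index side: edge n-grams taken from the front of the reversed digits."""
    rev = digits_reversed(phone)
    return {rev[:n] for n in range(min_gram, min(max_gram, len(rev)) + 1)}


def matches(phone: str, query: str) -> bool:
    """Query side: one reversed-digits token, looked up among the index terms."""
    return digits_reversed(query) in index_tokens(phone)


print(sorted(index_tokens("306-724-3986"), key=len)[:3])
print(matches("306-724-3986", "3986"))   # suffix match
print(matches("79 196 65 43", "65 43"))  # formatting no longer matters
print(matches("39 86 12 34", "3986"))    # digits not at the end: no match
```

Queries shorter than the minimum gram size (three digits) find nothing in this sketch, which is the price of keeping the index smaller.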
Evolving schema – digits-only type
Slide 17
Remap TelephoneNumber to it
<field name="TelephoneNumber" type="phone"
indexed="true" stored="true" />
And reindex (don't forget our 'speed hack' for now):
bin/post -c tinydir -params "processor=template&template.field=location:{Latitude},{Longitude}" ../dataset.csv
Check terms in Admin UI Schema screen and do our test searches
TelephoneNumber:3911
TelephoneNumber:"65 43"
TelephoneNumber:6543
TelephoneNumber:3986
DEMO
Evolving schema – digits-only type - cont
Slide 18
Many languages have accents on letters
Frédéric, Thérèse, Jérôme
Many users can't be bothered to type them
Sometimes, they don't even know how to type them
Łódź, Kędzierzyn-Koźle
Can we just ignore the accents when we search?
Several ways, but let's use the simplest by inserting a filter into the text_basic type definition
<filter class="solr.ASCIIFoldingFilterFactory" />
Before the LowerCaseFilterFactory
Reload the collection and reindex – because the filter is symmetric (affects indexing)
Search without accents, general or fielded
Lodz, Frederic, Therese, GivenName:jerome
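A rough approximation of what accent folding plus lowercasing does to these test names, using Unicode decomposition. This is an assumption-laden stand-in: ASCIIFoldingFilterFactory uses its own mapping table, and letters such as Ł have no Unicode decomposition, so the sketch maps them by hand.

```python
import unicodedata

# Letters without a canonical decomposition need an explicit mapping
# (hypothetical, deliberately tiny table for this demo).
EXTRA = str.maketrans({"Ł": "L", "ł": "l"})


def fold(text: str) -> str:
    """Approximate accent folding followed by lowercasing."""
    decomposed = unicodedata.normalize("NFKD", text.translate(EXTRA))
    # Drop the combining marks that the decomposition split off.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


for name in ["Frédéric", "Thérèse", "Jérôme", "Łódź", "Kędzierzyn-Koźle"]:
    print(name, "->", fold(name))
```

Because the same folding runs on the query, a user typing Lodz or Łódź ends up with the same term.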
DEMO
Evolving schema – collapsing accents
Slide 19
What are similar names to 'Alexandre':
q=GivenName:Alexandre~2&
facet=on&facet.field=GivenName&facet.mincount=1
Alexander, Alexandra, Alexandrin, Leixandre, Alexandre, Alexandrie
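The ~2 in the query above means "within two edits". A plain Levenshtein distance is enough to see why these names come back; note that Lucene's fuzzy matching may also count a swap of two adjacent letters as a single edit, which this sketch does not do. The name Andre is added here only as a non-matching example.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


names = ["Alexander", "Alexandra", "Alexandrin", "Leixandre",
         "Alexandre", "Alexandrie", "Andre"]
for name in names:
    d = edit_distance("alexandre", name.lower())
    print(f"{name:11} {d} {'match' if d <= 2 else 'no match'}")
```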
We can't ask the user to enter arcane Solr syntax
Let's do a phonetic search instead
Bunch of different ways, each with its own tradeoffs
PhoneticFilterFactory, BeiderMorseFilterFactory, DaitchMokotoffSoundexFilterFactory,
DoubleMetaphoneFilterFactory,....
https://lucene.apache.org/solr/guide/7_3/phonetic-matching.html
Best to have one - or several - separate Field Type definitions with a copy field
Makes it easy to experiment
Allows triggering them at different times (e.g. in advanced search, but not the general one)
Allows tuning them for relevancy by assigning different weights
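To get a feel for what a phonetic filter stores instead of the original token, here is a sketch of classic American Soundex. It is a swapped-in, much simpler algorithm than Double Metaphone or Beider-Morse, chosen only because it fits on a slide, and it handles plain A-Z letters only.

```python
# Consonant groups of classic American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code; non A-Z characters are ignored."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    last = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != last:
            result += code
        last = code  # a vowel resets 'last', so a repeated code counts again
        if len(result) == 4:
            break
    return result.ljust(4, "0")


for name in ["Alexandre", "Alexander", "Alexandra", "Robert", "Rupert"]:
    print(name, soundex(name))
```

Names that collapse to the same code match each other, which is also why a phonetic field tends to work better as a separate, lower-weighted copy than as a replacement for the original field.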
Evolving schema – Names and Surnames
Slide 20
How do we actually search multiple fields at once?
We've been using the default 'lucene' query parser so far on either _text_ or specific field
Solr has MANY parsers
General: "lucene", DisMax, Extended DisMax (edismax)
Specialized: Block Join, Boolean, Boost, Collapsing, Complex Phrase, Field, Filters, Function, Function Range,
Graph, Join, Learning to Rank, .....
 https://lucene.apache.org/solr/guide/7_3/other-parsers.html
We already used Spatial geofilt query parser: fq={!geofilt sfield=location}
edismax allows searching against multiple fields, with different weights, boosts, ties, minimum-match specifications, etc
Choose with defType=edismax or {!edismax param=value param=value}search_string
Let's search for "George Brown" against (qf) "GivenName Surname Company StreetAddress City"
and display same fields only
DEMO
Try using http://splainer.io/ to review the results
Try with qf=GivenName^5 Surname^5 Company StreetAddress City
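The parameters add up quickly, so building the URL in code avoids hand-escaping spaces and carets. A minimal sketch, assuming a local Solr and the tinydir collection; the helper name is made up and nothing is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def edismax_url(base: str, q: str, qf: list, fl: list) -> str:
    """Build a /select URL that switches the query parser to edismax."""
    params = {
        "defType": "edismax",
        "q": q,
        "qf": " ".join(qf),  # query fields, optionally with ^boost
        "fl": " ".join(fl),  # fields to display
    }
    return base + "?" + urlencode(params)


# Assumed local server; adjust host and collection as needed.
base = "http://localhost:8983/solr/tinydir/select"
fields = ["GivenName", "Surname", "Company", "StreetAddress", "City"]
boosted = ["GivenName^5", "Surname^5", "Company", "StreetAddress", "City"]

url = edismax_url(base, "George Brown", boosted, fields)
print(url)
print(parse_qs(urlsplit(url).query)["qf"][0])
```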
Side-trip into eDisMax and query parsers
Slide 21
Result: 149 records, but the matches are scattered all over the field values
Enter RELEVANCY
Recall – did we find all documents?
Precision – did we find just the documents we needed
Recall and Precision – fight. Perfect recall is q=*:* ......
Ranking – First hit is very important, ones after that less so (not always)
Side note: Field sorting destroys ranking.
We were optimizing Recall
Dump everything into _text_ and let search sort it out
Optimizing for Precision may seem easy too
Under eDisMax, set mm=100%
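The two measures are easy to put into numbers. A tiny worked example with made-up document ids shows why "return everything" has perfect recall and poor precision, while a strict query can flip that around.

```python
def precision_recall(returned, relevant) -> tuple:
    """Precision = hits / returned, recall = hits / relevant."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up numbers: 100 documents in the index, 4 of them actually wanted.
relevant = {1, 2, 3, 4}

print(precision_recall(range(1, 101), relevant))  # like q=*:* : finds all, mostly noise
print(precision_recall({1, 2}, relevant))         # strict query: clean, but misses half
```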
DEMO
eDisMax exploration continues
Slide 22
It is a business decision what Precision and Recall mean for your use case
Often "find more just in case" and focus on "ranking better" is the right approach
Try
qf=GivenName^5 Surname^5 Company StreetAddress City (no mm)
qf=GivenName^5 Surname^5 Company StreetAddress City and mm=100%
qf=GivenName^5 Surname^5 _text_ and mm=100%
DEMO in Splainer
Relevancy business case for our names (GivenName, Surname)
UPPER/lower case does not matter
Exact spelling (with accents) matches best – new Field Type needed (actually original text_basic...)
Accent-free spelling matches next – existing text_basic and therefore dynamic field match is fine
Phonetic spelling matches lowest (but higher than fallback _text_ field) – new Field Type needed
eDisMax for ranking
Slide 23
<fieldType name="text_exact" class="solr.SortableTextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
</analyzer>
</fieldType>
<field name="GivenName_exact" type="text_exact" indexed="true" stored="false"/>
<field name="Surname_exact" type="text_exact" indexed="true" stored="false"/>
<field name="GivenName_ph" type="text_phonetic" indexed="true" stored="false"/>
<field name="Surname_ph" type="text_phonetic" indexed="true" stored="false"/>
<copyField source="GivenName" dest="GivenName_exact"/>
<copyField source="GivenName" dest="GivenName_ph"/>
<copyField source="Surname" dest="Surname_exact"/>
<copyField source="Surname" dest="Surname_ph"/>
Multiple fields for same content
Slide 24
Our test cases
Frédéric, Thérèse, Jérôme
Check different analysis in Admin UI's Analysis screen
Can choose fields or field types from drop-down, use types as we have dynamic fields
Can also test analysis vs search and highlight the matches
Test search with Admin UI and Splainer with eDisMax enabled and Thérèse against different sets of Query Fields (qf)
Default search (qf=_text_)
GivenName
GivenName _text_
GivenName^10 _text_
GivenName_exact^15 GivenName^10 GivenName_ph^5 _text_
DEMO
Testing multiple representations
Slide 25
Original search URL: http://...:8983/solr/tinydir/select?defType=edismax&fl=.....
The good parameter set:
defType=edismax
qf=GivenName_exact^15 GivenName^10 GivenName_ph^5 _text_
fl=GivenName Surname Company StreetAddress City CountryFull
Lock it in a dedicated request handler in solrconfig.xml
<requestHandler name="/namesearch" class="solr.SearchHandler">
<lst name="defaults">
<str name="df">_text_</str>
<str name="echoParams">all</str>
<str name="defType">edismax</str>
<str name="qf">GivenName_exact^15 GivenName^10 GivenName_ph^5 _text_</str>
<str name="fl">GivenName Surname Company StreetAddress City CountryFull</str>
</lst>
</requestHandler>
Now: http://...:8983/solr/tinydir/namesearch?q=Thérèse
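What the handler's defaults section buys the client can be modelled as a simple merge: the defaults apply unless the request supplies the same parameter. A sketch of that rule, with the defaults copied from the handler definition above (the function name is made up):

```python
# Defaults as configured for /namesearch above.
HANDLER_DEFAULTS = {
    "df": "_text_",
    "echoParams": "all",
    "defType": "edismax",
    "qf": "GivenName_exact^15 GivenName^10 GivenName_ph^5 _text_",
    "fl": "GivenName Surname Company StreetAddress City CountryFull",
}


def effective_params(defaults: dict, request: dict) -> dict:
    """Request parameters win over handler defaults; the rest is inherited."""
    merged = dict(defaults)
    merged.update(request)
    return merged


# The client only has to send the query...
print(effective_params(HANDLER_DEFAULTS, {"q": "Thérèse"}))
# ...but can still override a default when needed.
print(effective_params(HANDLER_DEFAULTS, {"q": "Thérèse", "fl": "id"})["fl"])
```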
DEMO
Simplify API usage
Slide 26
Based on previous work with Thai language: https://github.com/arafalov/solr-thai-test
Needs ICU libraries in solrconfig.xml
 <lib path="../../../contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-7.3.0.jar" />
<lib path="../../../contrib/analysis-extras/lib/icu4j-59.1.jar" />
Field, type, and copyField definition in managed-schema:
<fieldType name="text_ru_en" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUTransformFilterFactory" id="ru-en" />
<filter class="solr.BeiderMorseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.BeiderMorseFilterFactory" />
</analyzer>
</fieldType>
<field name="GivenName_ruen" type="text_ru_en" indexed="true" stored="false"/>
<copyField source="GivenName" dest="GivenName_ruen"/>
Reload, reindex
Search
 GivenName:Zahar
 GivenName_ruen:Zahar
And BOOM!
Bonus magic
Slide 27
DNT_Corporate presentation know about usDynamic Netsoft
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 

Último (20)

SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 

Rapid Solr Schema Development (Phone directory)

  • 1. Rapid Solr Schema Development Alexandre Rafalovitch (@arafalov) Apache Solr Committer Montreal Solr/ML meetup May 2018
  • 2. Phone directory - content Names, often from multiple cultures Addresses Phone numbers Company/Group Locations Other fun data I use https://www.fakenamegenerator.com/ for demos Can generate bulk entries in csv, tab-separated, sql, etc Many fields, languages, regions Warning: comes with an – invisible – byte order mark Slide 2
  • 3. Today's exploration Solr 7.3 (latest) The smallest learning schema/configuration required Rapid schema evolution workflow Free-form and fielded user entry Dealing with multiple languages Dealing with alternative name spellings Searching phone numbers by any-length suffix Configuring Solr to simplify API interface (Bonus points) Fit into 40 minutes presentation! Slide 3
  • 4. Today's dataset http://www.fakenamegenerator.com/ - Bulk request (20000 identities) – Free and configurable! Name sets: American, Arabic, Australian, Chinese, French, Hispanic, Polish, Russian, Russian (Cyrillic), Thai Countries: Australia, Canada, France, Poland, Spain, United Kingdom, United States Age range: 19 - 85 years old Gender: 50% male, 50% female Fields: id,Gender,NameSet,Title,GivenName,MiddleInitial,Surname,StreetAddress,City,StateFull,ZipCode,CountryFull,EmailAddress,Username,TelephoneNumber,TelephoneCountryCode,Birthday,Age,TropicalZodiac,Color,Occupation,Company,BloodType,Kilograms,Centimeters,GUID,Latitude,Longitude Renamed first field (Number) to id to fit Solr's naming convention Removed BOM (in Vim, :set nobomb) Slide 4
  • 5. First try – Solr's built-in schema bin/solr start – standalone (non-clustered) server with no initial collections bin/solr create -c demo1 – uses default configset, with 'schemaless' mode, not for production Starts with 4 fields (id, _text_, _version_, _root_) Auto-creates the rest on first occurrence bin/post -c demo1 ../dataset.csv auto-detect content type from extension can bulk upload files see techproducts shipped example bin/solr start -e techproducts For one file, can also do via Admin UI DEMO Slide 5
  • 6. Schemaless schema – lessons learned Imported 1 record Failed on the second one, because ZipCode was detected as a number Can fix that by explicit configuration and rebuilding – see films example (example/films/README.txt) Other issues Dual fields for text and string Everything multivalued – because "just in case" – No sorting, API is messier, etc Many large files managed-schema: 546 lines (without comments) solrconfig.xml: 1364 lines (with comments) Plus another 42 configuration files, mostly language stopwords Homework to get this working – not enough time today Slide 6
  • 7. Learning schema managed-schema: start from nearly nothing – add as needed solrconfig.xml: start from nearly all defaults – Most definitely NOT production ready Not SolrCloud ready – add those as you scale No extra field types – add as you need them How small can we go?!? Based on exploration done for my presentation at Lucene/Solr Revolution 2016 https://www.slideshare.net/arafalov/rebuilding-solr-6-examples-layer-by-layer-lucenesolrrevolution- 2016 (slides and video) https://github.com/arafalov/solr-deconstructing-films-example - repo A bit out of date – schemaless mode was tuned since Today's version uses latest Solr feature https://github.com/arafalov/solr-presentation-2018-may/commits/master (changes commit- by-commit) Slide 7
  • 8. Learning schema – managed-schema <?xml version="1.0" encoding="UTF-8"?> <schema name="smallest-config" version="1.6"> <field name="id" type="string" required="true" indexed="true" stored="true" /> <field name="_text_" type="text_basic" multiValued="true" indexed="true" stored="false" docValues="false"/> <dynamicField name="*" type="text_basic" indexed="true" stored="true"/> <copyField source="*" dest="_text_"/> <uniqueKey>id</uniqueKey> <fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/> <fieldType name="text_basic" class="solr.SortableTextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> </schema> Slide 8
  • 9. Learning schema – solrconfig.xml <?xml version="1.0" encoding="UTF-8" ?> <config> <luceneMatchVersion>7.3.0</luceneMatchVersion> <requestHandler name="/select" class="solr.SearchHandler"> <lst name="defaults"> <str name="df">_text_</str> <str name="echoParams">all</str> </lst> </requestHandler> </config> Slide 9
  • 10. 2 files, 33 lines combined, including blanks – but Will It Blend Search? bin/solr create -c tinydir -d ../configs/smallest/ - provide custom config files to the collection bin/post -c tinydir ../dataset.csv – Remember the BOM and renaming column Number->id Does it search? General search? Case-insensitive search? Range search: Centimeters:[* TO 99] Fielded search? Facet? Sort? Are ids preserved? Are individual fields easy to work with (fl, etc)? DEMO Learning schema – create and index Slide 10
  • 11. It works! And ready to start being used from other parts of the project Do NOT expose Solr directly to the Internet. Not until you are a Solr Wizard, the Gray. managed-schema file has NOT changed – because of dynamicField Still 21 lines Would still keep the comments Would still preserve field/type definitions Will change on first AdminUI/API modification – gets rewritten What else? Actual search-engine tuning! Special cases Numerics – e.g. for Range search Spatial search – e.g. for Mapping/distance ranking Multivalued fields Dates Special parsing (e.g. names/surnames) Useful telephone number search Relevancy tuning! Learning schema - conclusion Slide 11
  • 12. Several possibilities Admin UI Delete schema field Add schema field with new definition Reindex Sometimes causes docValue-related exception, have to rebuild collection from scratch Schema API (Admin UI uses a subset of it) See: https://lucene.apache.org/solr/guide/7_3/schema-api.html Also has Replace a Field Also has Add/Delete Field Type Great to use programmatically or with something like Postman (https://www.getpostman.com/) Edit schema/solrconfig.xml directly and reload the collection Not recommended for production, but OK with a single server/single developer Remember to edit the actual schema, not the original config one ◦ Check "Instance" location in Admin UI, in the collection's Overview screen Remember that in SolrCloud mode, the config files are NOT on disk (they are in ZooKeeper). Evolving schema Slide 12
  • 13. Numeric fields Age – int Centimeters (height?) – int Kilograms – float Copy missing field types (pint, pfloat) from solr-7.3.0/server/solr/configsets/_default/conf/managed-schema Map numeric fields explicitly Delete content due to radical storage needs change bin/post -c tinydir -format solr -d "<delete><query>*:*</query></delete>" Reload the core in Admin UI's Core Admin (menu is different in SolrCloud mode) Index again bin/post -c tinydir ../dataset.csv New queries facet=true&facet.range=Age&facet.range.start=0&facet.range.end=200&facet.range.gap=10 Centimeters:[* TO 99] (again) DEMO Evolving schema – add numeric fields Slide 13
  • 14. Solr supports extensive spatial search https://lucene.apache.org/solr/guide/7_3/spatial-search.html bounding-box with different shapes (circles, polygons, etc) distance limiting or boosting different options with different functionalities LatLonPointSpatialField SpatialRecursivePrefixTreeFieldType BBoxField All require combined Lat Lon coordinates (lat,lon) We are providing separate Latitude and Longitude fields – need to merge them with a comma Let's copy a field type and create a field: <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" geo="true" distErrPct="0.025" maxDistErr="0.001" distanceUnits="kilometers" /> <field name="location" type="location_rpt" indexed="true" stored="true" /> Remember to reload – no need to delete, as it is a new field Next, need to also give merge instructions with an Update Request Processor Evolving schema – spatial search Slide 14
  • 15. Update Request Processors Deal with the data before it touches the schema Can do pre-processing magic with many, many processors See: https://lucene.apache.org/solr/guide/7_3/update-request-processors.html See: http://www.solr-start.com/info/update-request-processors/ (mine) Some are more magical than others and have shortcuts, e.g. TemplateUpdateProcessorFactory All can be configured with chains in solrconfig.xml and applied explicitly or by default That's how the schemaless mode works (default chain in solrconfig.xml of _default configset) Also check the way dates are parsed in it, search for parse-date – can be used standalone IgnoreFieldUpdateProcessorFactory could be useful to drop fields we don't want Solr to process at all (including in collect-all _text_ field) Let's reindex everything using the template to populate the new field: bin/post -c tinydir -params "processor=template&template.field=location:{Latitude},{Longitude}" ../dataset.csv Query: q=*:*&rows=1& fq={!geofilt sfield=location}& pt=45.493444, -73.558154&d=100& facet=on&facet.field=City&facet.mincount=1 DEMO Evolving schema – URPs Slide 15
  • 16. Search for John and look at the phone numbers (q=John&fl=TelephoneNumber): 03.99.56.91.63 (08) 9435 3911 79 196 65 43 306-724-3986 Can we search that? TelephoneNumber:3911 – yes TelephoneNumber:"65 43" – sort of (need to quote or know these are together) TelephoneNumber:3986 – sort of: some at the end, some at middle Use Case: Just search the last digits (suffix) regardless of formatting We have MANY analyzers, tokenizers, and character and token filters to help us with it https://lucene.apache.org/solr/guide/7_3/understanding-analyzers-tokenizers-and-filters.html http://www.solr-start.com/info/analyzers/ (mine) Evolving schema – phone numbers Slide 16
  • 17. Let's define a super-custom field type: <fieldType name="phone" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.KeywordTokenizerFactory" /> <filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])" replacement="" replace="all"/> <filter class="solr.ReverseStringFilterFactory"/> <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.KeywordTokenizerFactory" /> <filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])" replacement="" replace="all"/> <filter class="solr.ReverseStringFilterFactory"/> </analyzer> </fieldType> Notice Asymmetric analyzers Reversing the string so that the ending digits become the starting digits (make sure that's symmetric, i.e. done at both index and query time!) Edge n-grams (3-30 character prefixes of the reversed string) - makes the index larger, but the search very fast Evolving schema – digits-only type Slide 17
  • 18. Remap TelephoneNumber to it <field name="TelephoneNumber" type="phone" indexed="true" stored="true" /> And reindex (don't forget our 'speed hack' for now): bin/post -c tinydir -params "processor=template&template.field=location:{Latitude},{Longitude}" ../dataset.csv Check terms in Admin UI Schema screen and do our test searches TelephoneNumber:3911 TelephoneNumber:"65 43" TelephoneNumber:6543 TelephoneNumber:3986 DEMO Evolving schema – digits-only type - cont Slide 18
  • 19. Many languages have accents on letters Frédéric, Thérèse, Jérôme Many users can't be bothered to type them Sometimes, they don't even know how to type them Łódź, Kędzierzyn-Koźle Can we just ignore the accents when we search? Several ways, but let's use the simplest by inserting a filter into the text_basic type definition <filter class="solr.ASCIIFoldingFilterFactory" /> Before the LowerCaseFilterFactory Reload the collection and reindex – because the filter is symmetric (affects indexing) Search without accents, general or fielded Lodz, Frederic, Therese, GivenName:jerome DEMO Evolving schema – collapsing accents Slide 19
  • 20. What are similar names to 'Alexandre': q=GivenName:Alexandre~2& facet=on&facet.field=GivenName&facet.mincount=1 Alexander, Alexandra, Alexandrin, Leixandre, Alexandre, Alexandrie We can't ask the user to enter arcane Solr syntax Let's do a phonetic search instead Bunch of different ways, each with its own tradeoffs PhoneticFilterFactory, BeiderMorseFilterFactory, DaitchMokotoffSoundexFilterFactory, DoubleMetaphoneFilterFactory,.... https://lucene.apache.org/solr/guide/7_3/phonetic-matching.html Best to have one - or several - separate Field Type definitions with a copy field Allows us to experiment Allows us to trigger them at different times (e.g. in advanced search, but not general one) Allows us to tune them for relevancy by assigning different weights Evolving schema – Names and Surnames Slide 20
  • 21. How do we actually search multiple fields at once? We've been using the default 'lucene' query parser so far on either _text_ or specific field Solr has MANY parsers General: "lucene", DisMax, Extended DisMax (edismax) Specialized: Block Join, Boolean, Boost, Collapsing, Complex Phrase, Field, Filters, Function, Function Range, Graph, Join, Learning to Rank, ..... https://lucene.apache.org/solr/guide/7_3/other-parsers.html We already used Spatial geofilt query parser: fq={!geofilt sfield=location} edismax allows searching against multiple fields, with different weights, boosts, ties, minimum-match specifications, etc Choose with defType=edismax or {!edismax param=value param=value}search_string Let's search for "George Brown" against (qf) "GivenName Surname Company StreetAddress City" and display same fields only DEMO Try using http://splainer.io/ to review the results Try with qf=GivenName^5 Surname^5 Company StreetAddress City Side-trip into eDisMax and query parsers Slide 21
  • 22. Result: 149 records, but all over the field values Enter RELEVANCY Recall – did we find all documents? Precision – did we find just the documents we needed Recall and Precision – fight. Perfect recall is q=*:* ...... Ranking – First hit is very important, ones after that less so (not always) Side note: Field sorting destroys ranking. We were optimizing Recall Dump everything into _text_ and let search sort it out Optimizing for Precision may seem easy too Under eDisMax, set mm=100% DEMO eDisMax exploration continues Slide 22
  • 23. It is a business decision what Precision and Recall mean for your use case Often "find more just in case" and focus on "ranking better" is the right approach Try qf=GivenName^5 Surname^5 Company StreetAddress City (no mm) qf=GivenName^5 Surname^5 Company StreetAddress City and mm=100% qf=GivenName^5 Surname^5 _text_ and mm=100% DEMO in Splainer Relevancy business case for our names (GivenName, Surname) UPPER/lower case does not matter Exact spelling (with accents) matches best – new Field Type needed (actually original text_basic...) Accent-free spelling matches next – existing text_basic and therefore dynamic field match is fine Phonetic spelling matches lowest (but higher than fallback _text_ field) – new Field Type needed eDisMax for ranking Slide 23
  • 24. <fieldType name="text_exact" class="solr.SortableTextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> <fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/> </analyzer> </fieldType> <field name="GivenName_exact" type="text_exact" indexed="true" stored="false"/> <field name="Surname_exact" type="text_exact" indexed="true" stored="false"/> <field name="GivenName_ph" type="text_phonetic" indexed="true" stored="false"/> <field name="Surname_ph" type="text_phonetic" indexed="true" stored="false"/> <copyField source="GivenName" dest="GivenName_exact"/> <copyField source="GivenName" dest="GivenName_ph"/> <copyField source="Surname" dest="Surname_exact"/> <copyField source="Surname" dest="Surname_ph"/> Multiple fields for same content Slide 24
  • 25. Our test cases Frédéric, Thérèse, Jérôme Check different analysis in Admin UI's Analysis screen Can choose fields or field types from drop-down, use types as we have dynamic fields Can also test analysis vs search and highlight the matches Test search with Admin UI and Splainer with eDisMax enabled and Thérèse against different set of Query Fields (qf) Default search (qf=_text_) GivenName GivenName _text_ GivenName^10 _text_ GivenName_exact^15 GivenName^10 GivenName_ph^5 _text_ DEMO Testing multiple representations Slide 25
  • 26. Original search URL: http://...:8983/solr/tinydir/select?defType=edismax&fl=..... The good parameter set: defType=edismax qf=GivenName_exact^15 GivenName^10 GivenName_ph^5 _text_ fl=GivenName Surname Company StreetAddress City CountryFull Lock it in a dedicated request handler in solrconfig.xml <requestHandler name="/namesearch" class="solr.SearchHandler"> <lst name="defaults"> <str name="df">_text_</str> <str name="echoParams">all</str> <str name="defType">edismax</str> <str name="qf">GivenName_exact^15 GivenName^10 GivenName_ph^5 _text_</str> <str name="fl">GivenName Surname Company StreetAddress City CountryFull</str> </lst> </requestHandler> Now: http://...:8983/solr/tinydir/namesearch?q=Thérèse DEMO Simplify API usage Slide 26
  • 27. Based on previous work with Thai language: https://github.com/arafalov/solr-thai-test Needs ICU libraries in solrconfig.xml <lib path="../../../contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-7.3.0.jar" /> <lib path="../../../contrib/analysis-extras/lib/icu4j-59.1.jar" /> Field, type, and copyField definition in managed-schema: <fieldType name="text_ru_en" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.ICUTokenizerFactory"/> <filter class="solr.ICUTransformFilterFactory" id="ru-en" /> <filter class="solr.BeiderMorseFilterFactory" /> </analyzer> <analyzer type="query"> <tokenizer class="solr.StandardTokenizerFactory" /> <filter class="solr.LowerCaseFilterFactory" /> <filter class="solr.BeiderMorseFilterFactory" /> </analyzer> </fieldType> <field name="GivenName_ruen" type="text_ru_en" indexed="true" stored="false"/> <copyField source="GivenName" dest="GivenName_ruen"/> Reload, reindex Search GivenName:Zahar GivenName_ruen:Zahar And BOOM! Bonus magic Slide 27
  • 28. Rapid Solr Schema Development Alexandre Rafalovitch (@arafalov) Apache Solr Committer Montreal Solr/ML meetup May 2018
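
The digits-only phone type from slides 17 and 18 can be easier to reason about with a small simulation outside Solr. The sketch below is plain Python, not Solr code: the function names (index_terms, query_term, matches) are made up for illustration, and it only approximates what the KeywordTokenizer, PatternReplaceFilter, ReverseStringFilter and EdgeNGramFilter chain is described as doing on the slides (strip non-digits, reverse, emit 3 to 30 character prefixes at index time only).

```python
import re


def index_terms(phone, min_gram=3, max_gram=30):
    """Approximate the index-time analyzer: digits only, reversed, edge n-grams."""
    digits = re.sub(r"[^0-9]", "", phone)   # PatternReplaceFilter: drop non-digits
    reversed_digits = digits[::-1]          # ReverseStringFilter
    longest = min(max_gram, len(reversed_digits))
    # EdgeNGramFilter: every prefix of the reversed string, min_gram..max_gram long
    return [reversed_digits[:n] for n in range(min_gram, longest + 1)]


def query_term(text):
    """Approximate the query-time analyzer: digits only, reversed, no n-grams."""
    return re.sub(r"[^0-9]", "", text)[::-1]


def matches(phone, query):
    """A query matches when its single term equals one of the indexed terms."""
    return query_term(query) in index_terms(phone)


if __name__ == "__main__":
    for phone in ["03.99.56.91.63", "(08) 9435 3911", "79 196 65 43", "306-724-3986"]:
        print(phone, "->", index_terms(phone)[:3], "...")
    print(matches("79 196 65 43", "65 43"), matches("79 196 65 43", "6543"))
```

Because the query side is reversed but not n-grammed, a suffix of any length from 3 digits up matches regardless of how the stored number was punctuated, which is the asymmetry the slide points out.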

Editor's notes

  1. Line 205-206 facet=true&facet.range=Age&facet.range.start=0&facet.range.end=200&facet.range.gap=10
  2. http://localhost:8983/solr/tinydir/select?rows=1&d=100&facet.field=City&facet=on&fq={!geofilt%20sfield=location}&pt=45.493444,%20-73.558154&q=*:*&facet.mincount=1
  3. TelephoneNumber:3911 – yes TelephoneNumber:"65 43" – sort of (need to quote or know these are together) TelephoneNumber:3986
  4. Frédéric, Thérèse, Jérôme Łódź, Kędzierzyn-Koźle
  5. Thérèse
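
The spatial query in note 2 is easy to get wrong when typed by hand because of the braces and spaces in the fq parameter. As a rough sketch, assuming a local Solr at http://localhost:8983 with the tinydir collection from the slides, the URL could be assembled with Python's standard library so that the encoding is handled automatically. The helper name geofilt_url is hypothetical; the parameter names (q, rows, fq, pt, d, facet, facet.field, facet.mincount) are the ones used in the talk.

```python
from urllib.parse import urlencode


def geofilt_url(base, lat, lon, distance_km, sfield="location"):
    """Build a /select URL that filters documents within distance_km of a point."""
    params = [
        ("q", "*:*"),
        ("rows", "1"),
        ("fq", "{!geofilt sfield=%s}" % sfield),  # spatial filter via local params
        ("pt", "%s,%s" % (lat, lon)),             # centre point as "lat,lon"
        ("d", str(distance_km)),                  # radius, in the field's distanceUnits
        ("facet", "on"),
        ("facet.field", "City"),
        ("facet.mincount", "1"),
    ]
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    # Coordinates are passed as strings so they appear exactly as written.
    print(geofilt_url("http://localhost:8983/solr/tinydir/select",
                      "45.493444", "-73.558154", 100))
```

The same approach works for the range-facet query in note 1; only the parameter list changes.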