SlideShare una empresa de Scribd logo
1 de 31
Descargar para leer sin conexión
Introduction to Apache Solr
Andrew Jackson
UK Web Archive Technical Lead
www.bl.uk 2
Web Archive Overall Architecture
www.bl.uk 3
Understanding Your Use Case(s)
• Full text search, right?
– Yes, but there are many variations and choices to make.
• Work with users to understand their information needs:
– Are they looking for…
• Particular (archived) web resources?
• Resources on a particular issue or subject?
• Evidence of trends over time?
– What aspects of the content do they consider important?
– What kind of outputs do they want?
www.bl.uk 4
Working With Historians…
• JISC AADDA Project:
– Initial index and UI of the 1996-2010 data
– Great learning experience and feedback
– http://domaindarkarchive.blogspot.co.uk/
• AHRC ‘Big Data’ Project:
– Second iteration of index and UI
– Bursary holders reports coming soon
– http://buddah.projects.history.ac.uk/
• Interested in trends and reflections of society
– Who links to who/what, over time?
www.bl.uk 5
Apache Solr & Lucene
• Apache Lucene:
– A Java library for full text indexes
• Apache Solr:
– A web service and API that exposes Lucene functionality in a
as a document database
– Supports SolrCloud mode for distributed searches
• See also:
– Elasticsearch (also built around Lucene)
– We ‘chose’ Solr before Elasticsearch existed
– http://solr-vs-elasticsearch.com/
www.bl.uk 6
Example: Indexing Quotes
• Quotes to be indexed:
– “To do is to be.” - Jean-Paul Sartre
– “To be is to do.” - Socrates
– “Do be do be do.” - Frank Sinatra
• Goals:
– Index the quotation for full-text search.
• e.g. Show me all quotes that contain “to be”.
– Index the author for faceted search.
• e.g. Show me all quotes by “Frank Sinatra”.
www.bl.uk 7
Lucene’s Inverted Indexes
www.bl.uk 8
Solr as a Document Database
• Solr Indexes/Stores & Retrieves:
– Documents
composed of:
• Multiple Fields
each of which has a defined:
– Field Type
such as ‘text’, ‘string’, ‘int’, etc.
• The queries you can support depend on on many
parameters, but the fields and their types are the most
critical factors.
– See Overview of Documents, Fields, and Schema Design
www.bl.uk 9
The Quotes As Solr Documents
• Our Documents contain three fields:
– ‘id’ field of type ‘string’
– ‘text’ field of type ‘text_general’
– ‘author’ field, of type ‘string’
• Example Documents:
– id: “1”, text: “To do is to be.”, author: “Jean-Paul Sartre”
– id: “2”, text: “To be is to do.”, author: “Socrates”
– id: “3”, text: “Do be do be do.”, author: “Frank Sinatra”
www.bl.uk 10
Solr Update Flow
www.bl.uk 11
Analyzing The Text Field
• Analyzing the text on document 1:
– Input: “To do is to be.”, type = ‘text_general’
– Standard Tokeniser:
• ‘To’ ‘be’ ‘is’ ‘to’ ‘do’
– Lower Case Filter:
• ‘to’ ‘be’ ‘is’ ‘to’ ‘do’
• Adding the tokens to the index:
– ‘be’ => id:1
– ‘do’ => id:1
– …
www.bl.uk 12
Analyzing The Author Field
• Analyzing the author on document 1:
– Input: “Jean-Paul Sartre”, type = ‘string’
– Strings are stored as is.
• Adding the tokens to the index:
– ‘Jean-Paul Sartre’ => id:1
www.bl.uk 13
Solr Query Flow
www.bl.uk 14
Query for text:“To be”
• Uses the same analyser
as the indexer:
– “To be?”
– ST: “To” “be”
– LCF: “to” “be”
• Returns
documents:
– 1
– 2
www.bl.uk 15
Solr’s Built-in UI
www.bl.uk 16
Solr Overall Flow
www.bl.uk 17
Choice: Ignore ‘stop words’?
• Removes common words, unrelated to subject/topic
– Input: “To do is to be”
– Standard Tokeniser:
• ‘To’ ‘be’ ‘is’ ‘to’ ‘do’
– Stop Words Filter (stopwords_en.txt):
• ‘do’
– Lower Case Filter:
• ‘do’
• Cannot support phrase search
– e.g. searching for “to be”
www.bl.uk 18
Choice: Stemming?
• Attempts to group concepts together:
– "fishing", "fished”, "fisher" => "fish"
– "argue", "argued", "argues", "arguing”, "argus” => "argu"
• Sometimes confused:
– "axes” => "axe”, or ”axis”?
• Better at grouping related items together
• Makes precise phrase searching difficult
www.bl.uk 19
So Many Choices…
• Lots of text indexing options to tune:
– Punctuation and tokenization:
• is www.google.com one or three tokens?
– Stop word filter (“the” => “”)
– Lower case filter (“This” => “this”)
– Stemming (choice of algorithms too)
– Keywords (excepted from stemming)
– Synonyms (“TV” => “Television”)
– Possessive Filter (“Blair’s” => “Blair”)
– …and many more Tokenizers and Filters.
www.bl.uk 20
Even More Choices: Query Features
• As well as full-text search variations, we have
– Query parsers and features:
• Proximity, wildcards, term frequencies, relevance…
– Faceted search
– Numeric or Date values and range queries
– Geographic data and spatial search
– Snippets/fragments and highlighting
– Spell checking i.e. ‘Did you mean …?’
– MoreLikeThis
– Clustering
www.bl.uk 21
How to get started?
• Experimenting with the UKWA stack:
– Indexing:
• webarchive-discovery
– User Interfaces:
• Drupal Sarnia
• Shine (Play Framework, by UKWA)
• See https://github.com/ukwa/webarchive-
discovery/wiki/Front-ends
www.bl.uk 22
The webarchive-discovery system
• The webarchive-discovery codebase is an indexing stack
that reflects our (UKWA) use cases
– Contains our choices, reflects our progress so far
– Turns ARC or WARC records into Solr Documents
– Highly robust against (W)ARC data quality problems
• Adds custom fields for web archiving
– Text extracted using Apache Tika
– Various other analysis features
• Workshop sessions will use our setup
– but this is only a starting point…
www.bl.uk 23
Features: Basic Metadata Fields
• From the file system:
– The source (W)ARC filename and offset
• From the WARC record:
– URL, host, domain, public suffix
– Crawl date(s)
• From the HTTP headers:
– Content length
– Content type (as served)
– Server software IDs
www.bl.uk 24
Features: Payload Analysis
• Binary hash, embedded metadata
• Format and preservation risk analysis:
– Apache Tika & DROID format and encoding ID
– Notes parse errors to spot access problems
– Apache Preflight PDF risk analysis
– XML root namespace
– Format signature generation tricks
• HTML links, elements used, licence/rights URL
• Image properties, dominant colours, face detection
www.bl.uk 25
Features: Text Analysis
• Text extraction from binary formats
• ‘Fuzzy’ hash (ssdeep) of text
– for similarity analysis
• Natural language detection
• UK postcode extraction and geo-indexing
• Experimental language analysis:
– Simplistic sentiment analysis
– Stanford NLP named entity extraction
– Initial GATE NLP analyser
www.bl.uk 26
Command-line Indexing Architecture
www.bl.uk 27
Hadoop Indexing Architecture
www.bl.uk 28
Scaling Solr
• We are operating outside Solr’s sweet spot:
– General recommendation is RAM = Index Size
– We have a 15TB index. That’s a lot of RAM.
• e.g. from this email
– “100 million documents [and 16-32GB] per node”
– “it's quite the fool's errand for average developers to try to
replicate the "heroic efforts" of the few.”
• So how to scale up?
www.bl.uk 29
Basic Index Performance Scaling
• One Query:
– Single-threaded binary search
– Seek-and-read speed is critical, not CPU
• Add RAID/SAN?
– More IOPS can support more concurrent queries
– BUT each query is no faster
• Want faster queries?
– Use SSD, and/or
– More RAM to cache more disk, and/or
– Split the data into more shards (on independent media)
www.bl.uk 30
Sharding & SolrCloud
• For > ~100 million documents, use shards
– More, smaller independent shards == faster search
• Shard generation:
– SolrCloud ‘Live’ shards
• We use Solr’s standard sharding
• Randomly distributes records
• Supports updates to records
– Manual sharding
• e.g. ‘static’ shards generated from files
• As used by the Danish web archive (see later today)
www.bl.uk 31
Next Steps
• Prototype, Prototype, Prototype
– Expect to re-index
– Expect to iterate your front and back end systems
– Seek real user feedback
• Benchmark, Benchmark, Benchmark
– More on scaling issues and benchmarking this afternoon
• Work Together
– Share use cases, indexing tactics
– Share system specs, benchmarks
– Share code where appropriate

Más contenido relacionado

La actualidad más candente

20130310 solr tuorial
20130310 solr tuorial20130310 solr tuorial
20130310 solr tuorialChris Huang
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBertrand Delacretaz
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesRahul Jain
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and SolrGrant Ingersoll
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solrsagar chaturvedi
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Get the most out of Solr search with PHP
Get the most out of Solr search with PHPGet the most out of Solr search with PHP
Get the most out of Solr search with PHPPaul Borgermans
 
Retrieving Information From Solr
Retrieving Information From SolrRetrieving Information From Solr
Retrieving Information From SolrRamzi Alqrainy
 
New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache SolrEdureka!
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes WorkshopErik Hatcher
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchRafał Kuć
 
Building your own search engine with Apache Solr
Building your own search engine with Apache SolrBuilding your own search engine with Apache Solr
Building your own search engine with Apache SolrBiogeeks
 
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy SokolenkoProvectus
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development TutorialErik Hatcher
 
State-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache SolrState-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache Solrguest432cd6
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and SparkLucidworks
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5israelekpo
 

La actualidad más candente (20)

20130310 solr tuorial
20130310 solr tuorial20130310 solr tuorial
20130310 solr tuorial
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solr
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Get the most out of Solr search with PHP
Get the most out of Solr search with PHPGet the most out of Solr search with PHP
Get the most out of Solr search with PHP
 
Retrieving Information From Solr
Retrieving Information From SolrRetrieving Information From Solr
Retrieving Information From Solr
 
Solr Recipes
Solr RecipesSolr Recipes
Solr Recipes
 
New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache Solr
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearch
 
Building your own search engine with Apache Solr
Building your own search engine with Apache SolrBuilding your own search engine with Apache Solr
Building your own search engine with Apache Solr
 
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy Sokolenko
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
 
Apache Solr
Apache SolrApache Solr
Apache Solr
 
State-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache SolrState-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache Solr
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5
 
Solr Architecture
Solr ArchitectureSolr Architecture
Solr Architecture
 

Destacado

Introduction to Apache Solr.
Introduction to Apache Solr.Introduction to Apache Solr.
Introduction to Apache Solr.ashish0x90
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrChristos Manios
 
An Introduction to Solr
An Introduction to SolrAn Introduction to Solr
An Introduction to Solrtomhill
 
Introduction to SolrCloud
Introduction to SolrCloudIntroduction to SolrCloud
Introduction to SolrCloudVarun Thacker
 
Building a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engineBuilding a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engineTrey Grainger
 
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)Alexandre Rafalovitch
 
Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014
Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014
Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014francelabs
 
Apprendre Solr en deux heures
Apprendre Solr en deux heuresApprendre Solr en deux heures
Apprendre Solr en deux heuresSaïd Radhouani
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsJulien Nioche
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platformmteutelink
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...hannonhill
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01David Smiley
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-endgagravarr
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache TikaPaolo Mottadelli
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Manish kumar
 
Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutchsebastian_nagel
 

Destacado (20)

Introduction to Apache Solr.
Introduction to Apache Solr.Introduction to Apache Solr.
Introduction to Apache Solr.
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
An Introduction to Solr
An Introduction to SolrAn Introduction to Solr
An Introduction to Solr
 
Solr Presentation
Solr PresentationSolr Presentation
Solr Presentation
 
Introduction to SolrCloud
Introduction to SolrCloudIntroduction to SolrCloud
Introduction to SolrCloud
 
Building a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engineBuilding a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engine
 
Solr formation Sparna
Solr formation SparnaSolr formation Sparna
Solr formation Sparna
 
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
 
Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014
Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014
Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014
 
Apprendre Solr en deux heures
Apprendre Solr en deux heuresApprendre Solr en deux heures
Apprendre Solr en deux heures
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache Tika
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
ProjectHub
ProjectHubProjectHub
ProjectHub
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutch
 

Similar a Introduction to Apache Solr

Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Roxanne Missingham
 
Why do you consider to adopt Koha Open Source Integrated Library System for y...
Why do you consider to adopt Koha Open Source Integrated Library System for y...Why do you consider to adopt Koha Open Source Integrated Library System for y...
Why do you consider to adopt Koha Open Source Integrated Library System for y...Md. Zahid Hossain Shoeb
 
Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015Roy Russo
 
Everything you always wanted to know about WorldCat (but were afraid to ask) ...
Everything you always wanted to know about WorldCat (but were afraid to ask) ...Everything you always wanted to know about WorldCat (but were afraid to ask) ...
Everything you always wanted to know about WorldCat (but were afraid to ask) ...CILIP MDG
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCLucidworks (Archived)
 
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...Lucidworks
 
Computer Science Library Training
Computer Science Library TrainingComputer Science Library Training
Computer Science Library Trainingpvhead123
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBSean Laurent
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16MLconf
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksLucidworks
 
SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.Julian Hyde
 
ELK stack introduction
ELK stack introduction ELK stack introduction
ELK stack introduction abenyeung1
 
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Ontico
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrJake Mannix
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkJake Mannix
 
SPARQL in the Semantic Web
SPARQL in the Semantic WebSPARQL in the Semantic Web
SPARQL in the Semantic WebJan Beeck
 
What do you want to discover today? / Janet Aucock, University of St Andrews
What do you want to discover today? / Janet Aucock, University of St AndrewsWhat do you want to discover today? / Janet Aucock, University of St Andrews
What do you want to discover today? / Janet Aucock, University of St AndrewsCIGScotland
 

Similar a Introduction to Apache Solr (20)

IIPC GA 2014 Solr
IIPC GA 2014 SolrIIPC GA 2014 Solr
IIPC GA 2014 Solr
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
 
Why do you consider to adopt Koha Open Source Integrated Library System for y...
Why do you consider to adopt Koha Open Source Integrated Library System for y...Why do you consider to adopt Koha Open Source Integrated Library System for y...
Why do you consider to adopt Koha Open Source Integrated Library System for y...
 
Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015
 
Everything you always wanted to know about WorldCat (but were afraid to ask) ...
Everything you always wanted to know about WorldCat (but were afraid to ask) ...Everything you always wanted to know about WorldCat (but were afraid to ask) ...
Everything you always wanted to know about WorldCat (but were afraid to ask) ...
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
 
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
 
Computer Science Library Training
Computer Science Library TrainingComputer Science Library Training
Computer Science Library Training
 
Internet content as research data
Internet content as research dataInternet content as research data
Internet content as research data
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
 
SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.
 
ELK stack introduction
ELK stack introduction ELK stack introduction
ELK stack introduction
 
Google Dorks
Google DorksGoogle Dorks
Google Dorks
 
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+Solr
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
SPARQL in the Semantic Web
SPARQL in the Semantic WebSPARQL in the Semantic Web
SPARQL in the Semantic Web
 
What do you want to discover today? / Janet Aucock, University of St Andrews
What do you want to discover today? / Janet Aucock, University of St AndrewsWhat do you want to discover today? / Janet Aucock, University of St Andrews
What do you want to discover today? / Janet Aucock, University of St Andrews
 

Más de Andy Jackson

The 'Digital Object Types' Issue
The 'Digital Object Types' IssueThe 'Digital Object Types' Issue
The 'Digital Object Types' IssueAndy Jackson
 
Ten years of the UK web archive: what have we saved?
Ten years of the UK web archive: what have we saved?Ten years of the UK web archive: what have we saved?
Ten years of the UK web archive: what have we saved?Andy Jackson
 
Seeing In The Dark: Discovery and data-mining of restricted web archives
Seeing In The Dark: Discovery and data-mining of restricted web archivesSeeing In The Dark: Discovery and data-mining of restricted web archives
Seeing In The Dark: Discovery and data-mining of restricted web archivesAndy Jackson
 
Digging into the Web Archive at the British Library 2014-11-27
Digging into the Web Archive at the British Library 2014-11-27Digging into the Web Archive at the British Library 2014-11-27
Digging into the Web Archive at the British Library 2014-11-27Andy Jackson
 
Unified characterisation, please
Unified characterisation, pleaseUnified characterisation, please
Unified characterisation, pleaseAndy Jackson
 
Formats Over Time: Exploring UK Web History
Formats Over Time: Exploring UK Web HistoryFormats Over Time: Exploring UK Web History
Formats Over Time: Exploring UK Web HistoryAndy Jackson
 

Más de Andy Jackson (6)

The 'Digital Object Types' Issue
The 'Digital Object Types' IssueThe 'Digital Object Types' Issue
The 'Digital Object Types' Issue
 
Ten years of the UK web archive: what have we saved?
Ten years of the UK web archive: what have we saved?Ten years of the UK web archive: what have we saved?
Ten years of the UK web archive: what have we saved?
 
Seeing In The Dark: Discovery and data-mining of restricted web archives
Seeing In The Dark: Discovery and data-mining of restricted web archivesSeeing In The Dark: Discovery and data-mining of restricted web archives
Seeing In The Dark: Discovery and data-mining of restricted web archives
 
Digging into the Web Archive at the British Library 2014-11-27
Digging into the Web Archive at the British Library 2014-11-27Digging into the Web Archive at the British Library 2014-11-27
Digging into the Web Archive at the British Library 2014-11-27
 
Unified characterisation, please
Unified characterisation, pleaseUnified characterisation, please
Unified characterisation, please
 
Formats Over Time: Exploring UK Web History
Formats Over Time: Exploring UK Web HistoryFormats Over Time: Exploring UK Web History
Formats Over Time: Exploring UK Web History
 

Último

Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 

Último (20)

Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 

Introduction to Apache Solr

  • 1. Introduction to Apache Solr Andrew Jackson UK Web Archive Technical Lead
  • 2. www.bl.uk 2 Web Archive Overall Architecture
  • 3. www.bl.uk 3 Understanding Your Use Case(s) • Full text search, right? – Yes, but there are many variations and choices to make. • Work with users to understand their information needs: – Are they looking for… • Particular (archived) web resources? • Resources on a particular issue or subject? • Evidence of trends over time? – What aspects of the content do they consider important? – What kind of outputs do they want?
  • 4. www.bl.uk 4 Working With Historians… • JISC AADDA Project: – Initial index and UI of the 1996-2010 data – Great learning experience and feedback – http://domaindarkarchive.blogspot.co.uk/ • AHRC ‘Big Data’ Project: – Second iteration of index and UI – Bursary holders reports coming soon – http://buddah.projects.history.ac.uk/ • Interested in trends and reflections of society – Who links to who/what, over time?
  • 5. www.bl.uk 5 Apache Solr & Lucene • Apache Lucene: – A Java library for full text indexes • Apache Solr: – A web service and API that exposes Lucene functionality in a as a document database – Supports SolrCloud mode for distributed searches • See also: – Elasticsearch (also built around Lucene) – We ‘chose’ Solr before Elasticsearch existed – http://solr-vs-elasticsearch.com/
  • 6. www.bl.uk 6 Example: Indexing Quotes • Quotes to be indexed: – “To do is to be.” - Jean-Paul Sartre – “To be is to do.” - Socrates – “Do be do be do.” - Frank Sinatra • Goals: – Index the quotation for full-text search. • e.g. Show me all quotes that contain “to be”. – Index the author for faceted search. • e.g. Show me all quotes by “Frank Sinatra”.
  • 8. www.bl.uk 8 Solr as a Document Database • Solr Indexes/Stores & Retrieves: – Documents composed of: • Multiple Fields each of which has a defined: – Field Type such as ‘text’, ‘string’, ‘int’, etc. • The queries you can support depend on on many parameters, but the fields and their types are the most critical factors. – See Overview of Documents, Fields, and Schema Design
  • 9. www.bl.uk 9 The Quotes As Solr Documents • Our Documents contain three fields: – ‘id’ field of type ‘string’ – ‘text’ field of type ‘text_general’ – ‘author’ field, of type ‘string’ • Example Documents: – id: “1”, text: “To do is to be.”, author: “Jean-Paul Sartre” – id: “2”, text: “To be is to do.”, author: “Socrates” – id: “3”, text: “Do be do be do.”, author: “Frank Sinatra”
  • 11. www.bl.uk 11 Analyzing The Text Field • Analyzing the text on document 1: – Input: “To do is to be.”, type = ‘text_general’ – Standard Tokeniser: • ‘To’ ‘be’ ‘is’ ‘to’ ‘do’ – Lower Case Filter: • ‘to’ ‘be’ ‘is’ ‘to’ ‘do’ • Adding the tokens to the index: – ‘be’ => id:1 – ‘do’ => id:1 – …
  • 12. www.bl.uk 12 Analyzing The Author Field • Analyzing the author on document 1: – Input: “Jean-Paul Sartre”, type = ‘string’ – Strings are stored as is. • Adding the tokens to the index: – ‘Jean-Paul Sartre’ => id:1
  • 14. www.bl.uk 14 Query for text:“To be” • Uses the same analyser as the indexer: – “To be?” – ST: “To” “be” – LCF: “to” “be” • Returns documents: – 1 – 2
  • 17. www.bl.uk 17 Choice: Ignore ‘stop words’? • Removes common words, unrelated to subject/topic – Input: “To do is to be” – Standard Tokeniser: • ‘To’ ‘be’ ‘is’ ‘to’ ‘do’ – Stop Words Filter (stopwords_en.txt): • ‘do’ – Lower Case Filter: • ‘do’ • Cannot support phrase search – e.g. searching for “to be”
  • 18. www.bl.uk 18 Choice: Stemming? • Attempts to group concepts together: – "fishing", "fished”, "fisher" => "fish" – "argue", "argued", "argues", "arguing”, "argus” => "argu" • Sometimes confused: – "axes” => "axe”, or ”axis”? • Better at grouping related items together • Makes precise phrase searching difficult
  • 19. www.bl.uk 19 So Many Choices… • Lots of text indexing options to tune: – Punctuation and tokenization: • is www.google.com one or three tokens? – Stop word filter (“the” => “”) – Lower case filter (“This” => “this”) – Stemming (choice of algorithms too) – Keywords (excepted from stemming) – Synonyms (“TV” => “Television”) – Possessive Filter (“Blair’s” => “Blair”) – …and many more Tokenizers and Filters.
  • 20. www.bl.uk 20 Even More Choices: Query Features • As well as full-text search variations, we have – Query parsers and features: • Proximity, wildcards, term frequencies, relevance… – Faceted search – Numeric or Date values and range queries – Geographic data and spatial search – Snippets/fragments and highlighting – Spell checking i.e. ‘Did you mean …?’ – MoreLikeThis – Clustering
  • 21. www.bl.uk 21 How to get started? • Experimenting with the UKWA stack: – Indexing: • webarchive-discovery – User Interfaces: • Drupal Sarnia • Shine (Play Framework, by UKWA) • See https://github.com/ukwa/webarchive- discovery/wiki/Front-ends
  • 22. www.bl.uk 22 The webarchive-discovery system • The webarchive-discovery codebase is an indexing stack that reflects our (UKWA) use cases – Contains our choices, reflects our progress so far – Turns ARC or WARC records into Solr Documents – Highly robust against (W)ARC data quality problems • Adds custom fields for web archiving – Text extracted using Apache Tika – Various other analysis features • Workshop sessions will use our setup – but this is only a starting point…
  • 23. www.bl.uk 23 Features: Basic Metadata Fields • From the file system: – The source (W)ARC filename and offset • From the WARC record: – URL, host, domain, public suffix – Crawl date(s) • From the HTTP headers: – Content length – Content type (as served) – Server software IDs
  • 24. www.bl.uk 24 Features: Payload Analysis • Binary hash, embedded metadata • Format and preservation risk analysis: – Apache Tika & DROID format and encoding ID – Notes parse errors to spot access problems – Apache Preflight PDF risk analysis – XML root namespace – Format signature generation tricks • HTML links, elements used, licence/rights URL • Image properties, dominant colours, face detection
  • 25. www.bl.uk 25 Features: Text Analysis • Text extraction from binary formats • ‘Fuzzy’ hash (ssdeep) of text – for similarity analysis • Natural language detection • UK postcode extraction and geo-indexing • Experimental language analysis: – Simplistic sentiment analysis – Stanford NLP named entity extraction – Initial GATE NLP analyser
  • 28. www.bl.uk 28 Scaling Solr • We are operating outside Solr’s sweet spot: – General recommendation is RAM = Index Size – We have a 15TB index. That’s a lot of RAM. • e.g. from this email – “100 million documents [and 16-32GB] per node” – “it's quite the fool's errand for average developers to try to replicate the "heroic efforts" of the few.” • So how to scale up?
  • 29. www.bl.uk 29 Basic Index Performance Scaling • One Query: – Single-threaded binary search – Seek-and-read speed is critical, not CPU • Add RAID/SAN? – More IOPS can support more concurrent queries – BUT each query is no faster • Want faster queries? – Use SSD, and/or – More RAM to cache more disk, and/or – Split the data into more shards (on independent media)
  • 30. www.bl.uk 30 Sharding & SolrCloud • For > ~100 million documents, use shards – More, smaller independent shards == faster search • Shard generation: – SolrCloud ‘Live’ shards • We use Solr’s standard sharding • Randomly distributes records • Supports updates to records – Manual sharding • e.g. ‘static’ shards generated from files • As used by the Danish web archive (see later today)
  • 31. www.bl.uk 31 Next Steps • Prototype, Prototype, Prototype – Expect to re-index – Expect to iterate your front and back end systems – Seek real user feedback • Benchmark, Benchmark, Benchmark – More on scaling issues and benchmarking this afternoon • Work Together – Share use cases, indexing tactics – Share system specs, benchmarks – Share code where appropriate