Search engines in the industry: a use case

Different interests
● researchers / engineers look for high
precision and recall
● editors / writers are concerned about
matching of queries and results
● marketers want to change / adapt results

Designing a search engine
● functional requirements
○ search
■ keywords, boolean retrieval, natural language
○ indexing
■ data sources
■ data types
○ administration
■ manage scoring / boosting functions

● architectural requirements
○ resiliency
○ scalability
○ no downtime
○ work with existing infrastructure
○ platforms
○ migrating from legacy systems
○ talk to other systems

● performance requirements
○ search
■ queries per second
■ time per search request
○ index
■ documents per second
■ time per indexing request
○ SLA?
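
A minimal sketch of how measured search latency and throughput could be checked against such requirements. The target values, the function names and the sample measurements are assumptions for illustration, not figures from the project.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def check_sla(latencies_ms, window_seconds, max_p95_ms, min_qps):
    """Compare measured latency and throughput with hypothetical targets."""
    qps = len(latencies_ms) / window_seconds
    p95 = percentile(latencies_ms, 95)
    return {"qps": qps, "p95_ms": p95, "ok": p95 <= max_p95_ms and qps >= min_qps}

# Made-up measurements: 10 search requests observed over 2 seconds.
sample = [12, 15, 11, 40, 13, 14, 250, 16, 12, 18]
report = check_sla(sample, window_seconds=2, max_p95_ms=200, min_qps=4)
```

The same check applies to indexing by feeding it per-request indexing times instead of search times.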

● search engine performance requirements
○ recall percentile thresholds
○ precision percentile thresholds
○ minimize empty results
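
The quality measures above can be sketched as follows, assuming a set of queries with manually judged relevant documents is available. The document ids and function names are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Precision and recall of one result list against judged relevant docs."""
    returned_set, relevant_set = set(returned), set(relevant)
    hits = len(returned_set & relevant_set)
    precision = hits / len(returned_set) if returned_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

def empty_result_rate(results_per_query):
    """Fraction of queries that returned no results at all."""
    empty = sum(1 for results in results_per_query if not results)
    return empty / len(results_per_query)

# Hypothetical judged query: 4 documents returned, 3 known to be relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
empty_rate = empty_result_rate([["d1"], [], ["d2", "d3"], []])
```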

Data
● often mostly unknown
○ published vs unpublished / to be written documents
● almost always unmanageable
○ cannot decide when
■ it’ll be ready
■ it’ll have to be indexed
■ it’ll have to be searchable
● heterogeneous
○ different writers, languages, topics, styles, etc.

Project
● ~50M heterogeneous documents
● Migrating from old commercial solution to
Apache Solr
● Google-like search
● Targeted search for different types of
contents

Advanced capabilities
● Smart understanding of queries
● Smart suggestion of queries
● Suggestion of similar / important contents
● Automatic classification of contents

Responsibilities
● architecture analysis and design
○ scaling under high load
● continuous definition of algorithms for
indexing and searching
● system maintenance

Skills required
● basics of information retrieval
● a bit of distributed systems
● some natural language processing
● some machine learning

Architecture analysis and design
● Shape up a prototype architecture
○ separate machines for indexing and search
○ multiple load balanced machines for searching
○ define indexing and search algorithms
● Evaluate architecture
○ stress tests (performance)
○ quality tests (accuracy)
● Iterate
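
A rough sketch of the prototype layout: indexing requests go to a dedicated machine, search requests are spread round-robin over several machines. The host names and the `SearchCluster` class are hypothetical; a real deployment would normally rely on an existing load balancer.

```python
from itertools import cycle

class SearchCluster:
    """Routes requests: one indexing node, several load-balanced search nodes."""

    def __init__(self, indexer, searchers):
        self.indexer = indexer
        self._searchers = cycle(searchers)  # endless round-robin iterator

    def route(self, request_kind):
        """Return the node that should handle the request."""
        if request_kind == "index":
            return self.indexer
        return next(self._searchers)

cluster = SearchCluster("indexer-1", ["search-1", "search-2"])
targets = [cluster.route("search") for _ in range(3)]
```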

Architecture analysis and design
● analyze existing documents
○ avg size
○ language
○ topics, style, etc.
● analyze existing query logs
○ avg response time
○ avg length (how many terms does it take to specify a query?)
○ avg queries per second
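
The query log figures above could be computed along these lines. The log format (timestamp in seconds, query text, response time in milliseconds) is an assumption, and the entries are made up.

```python
from statistics import mean

def summarize_query_log(entries):
    """Summarize (timestamp_s, query, response_ms) tuples from a query log."""
    timestamps = [ts for ts, _, _ in entries]
    span = max(timestamps) - min(timestamps)
    return {
        "avg_response_ms": mean(ms for _, _, ms in entries),
        "avg_terms": mean(len(query.split()) for _, query, _ in entries),
        # Fall back to the raw count if all entries share one timestamp.
        "avg_qps": len(entries) / span if span > 0 else float(len(entries)),
    }

log = [
    (0.0, "solr migration", 20),
    (1.0, "nfs", 30),
    (2.0, "lucene index on nfs", 40),
    (4.0, "boosting", 30),
]
summary = summarize_query_log(log)
```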

Most time spent on
● testing how documents get indexed
● testing how user queries get transformed into
platform-specific queries
● tweaking indexing algorithms
● tweaking search algorithms
● tweaking ranking
● platform optimization for scalability
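
As an illustration of turning a user query into a platform-specific one, here is a minimal sketch that builds a query string in Lucene query parser syntax. The field names and boosts are assumptions, and user-typed words such as AND or OR are not treated specially here.

```python
import re

# Characters with special meaning in the Lucene query parser syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')

def escape_term(term):
    """Backslash-escape query syntax characters in a user-supplied term."""
    return _SPECIAL.sub(r"\\\1", term)

def to_lucene_query(user_input, fields):
    """Require every term in each field; fields maps name -> boost."""
    terms = [escape_term(t) for t in user_input.split()]
    if not terms:
        return "*:*"  # match-all query
    clause = " AND ".join(terms)
    parts = [f"{name}:({clause})^{boost:g}" for name, boost in fields.items()]
    return " OR ".join(parts)

query = to_lucene_query("nfs lock", {"title": 2, "body": 1})
```

Testing this step means checking that the generated string matches what the platform is expected to receive for a representative set of logged queries.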

Challenges
● Architecture constraints
● Performance
● Diverging stakeholder concerns
● Dynamically scaling search

Sample architecture constraint #1
● Data storage has to be on NFS
● Lucene is IO intensive
● NFS makes it slower
● Concurrent reads and writes make it error-prone

Sample architecture constraint #2
● Change search engine
● Systems talking to the SE need to switch
API
● Only in the long run
● In the short run an adapter layer for old APIs
on new APIs has to be developed
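
A minimal sketch of such an adapter layer. Both interfaces are hypothetical: the old API is assumed to expose `find(keywords, max_hits)` returning `docId` / `headline` records, and `NewEngineClient` is only a stand-in for the new engine's client.

```python
class NewEngineClient:
    """Stand-in for the new search engine client (hypothetical interface)."""

    def __init__(self, docs):
        self._docs = docs

    def search(self, q, rows):
        terms = q.lower().split()
        matches = [d for d in self._docs
                   if all(t in d["title"].lower() for t in terms)]
        return {"docs": matches[:rows]}

class LegacyApiAdapter:
    """Exposes the old API shape on top of the new client."""

    def __init__(self, client):
        self._client = client

    def find(self, keywords, max_hits=10):
        response = self._client.search(q=keywords, rows=max_hits)
        # Convert new-style records back into the old record format.
        return [{"docId": d["id"], "headline": d["title"]}
                for d in response["docs"]]

client = NewEngineClient([
    {"id": "1", "title": "Migrating to Solr"},
    {"id": "2", "title": "NFS tuning"},
])
legacy_hits = LegacyApiAdapter(client).find("solr")
```

Existing systems keep calling `find` unchanged, at the cost of a conversion on every request.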

Indexing performance
● Most of the indexing time is spent converting
data from the old (indexing) format to the new
(indexing) format
● The adapter layer between old and new API
becomes the bottleneck
● Time to switch to the new API natively

Diverging concerns
● Article authors check that the search engine
handles their writing exactly, expecting perfect
recall and precision
○ so a lot of time is spent on adjusting ranking
● Marketers want to be able to override
ranking and promote something they want to sell
○ ranking algorithm gets bypassed
● Need flexible algorithms
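
One way to keep ranking flexible is to separate relevance scoring from configurable boosts and pinned results. This is only a sketch under assumed names: the `type` field, the boost table and the pinned list are illustrative, not the project's actual mechanism.

```python
def rank(results, boosts=None, pinned=None):
    """Order results by boosted score, with pinned ids forced to the top."""
    boosts = boosts or {}
    pinned = pinned or []
    scored = sorted(
        results,
        key=lambda r: r["score"] * boosts.get(r["type"], 1.0),
        reverse=True,
    )
    by_id = {r["id"]: r for r in results}
    head = [by_id[i] for i in pinned if i in by_id]
    tail = [r for r in scored if r["id"] not in pinned]
    return [r["id"] for r in head + tail]

results = [
    {"id": "a", "type": "article", "score": 2.0},
    {"id": "b", "type": "product", "score": 1.5},
    {"id": "c", "type": "article", "score": 1.0},
]
plain = rank(results)
boosted = rank(results, boosts={"product": 2.0})
promoted = rank(results, pinned=["c"])
```

Authors' concerns are addressed by tuning scores and boosts, while marketers get an explicit override that leaves the underlying ranking untouched.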

Scale dynamically
● Search engine must not break, even
under high peaks of load
● Such peaks are often unpredictable
● Need a fast way to add more computing
power
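
A back-of-the-envelope sketch for sizing the search tier under load. The per-machine capacity, headroom and minimum replica count are assumptions that would come from stress tests.

```python
import math

def replicas_needed(current_qps, qps_per_replica, headroom=0.3, min_replicas=2):
    """Number of search machines needed for the current load plus headroom."""
    target = current_qps * (1 + headroom)
    return max(min_replicas, math.ceil(target / qps_per_replica))

normal = replicas_needed(10, qps_per_replica=50)
peak = replicas_needed(400, qps_per_replica=50, headroom=0.5)
```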

Takeaways
● small iterations (no waterfalls!)
○ analyze portion of data / queries
○ change search / index algorithms
○ test, involve stakeholders
○ requires the ability to reindex quickly
● look at data (documents, query logs)

More from Tommaso Teofili