3. Different interests
● researchers / engineers look for high
precision and recall
● editors / writers are concerned about
matching of queries and results
● marketers want to change / adapt results
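As a reminder of what "precision and recall" measure for a single query, here is a minimal sketch. The document ids and relevance judgments are invented for illustration; in practice the relevant set would come from human judgments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant results returned / all results returned
    recall    = relevant results returned / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 results returned, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
p, r = precision_recall(
    ["d1", "d2", "d3", "d4"],
    ["d1", "d2", "d3", "d7", "d8", "d9"],
)
print(p, r)  # 0.75 0.5
```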
4. Designing a search engine
● functional requirements
○ search
■ keywords, boolean retrieval, natural language
○ indexing
■ data sources
■ data types
○ administration
■ manage scoring / boosting functions
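To make "keywords" and "boolean retrieval" concrete, the sketch below builds a toy inverted index and answers AND queries over it. It is a teaching example only (whitespace tokenization, no scoring); it is not how Solr or Lucene implement retrieval, and the sample documents are made up.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Boolean AND: return ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical documents.
docs = {
    1: "apache solr search",
    2: "search engine design",
    3: "solr index design",
}
index = build_index(docs)
print(search_and(index, "solr design"))  # {3}
```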
5. Designing a search engine
● architectural requirements
○ resiliency
○ scalability
○ no downtime
○ work with existing infrastructure
○ platforms
○ migrating from legacy systems
○ talk to other systems
6. Designing a search engine
● performance requirements
○ search
■ queries per second
■ time per search request
○ index
■ documents per second
■ time per indexing request
○ SLA?
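The search-side numbers above can be collected with a very small harness. This is a rough sketch, assuming a hypothetical `search_fn` callable that wraps whatever client sends one query to the engine; it runs queries sequentially, so it measures single-client throughput rather than behavior under concurrent load.

```python
import time


def measure(search_fn, queries):
    """Run queries one after another and report throughput and latency."""
    latencies = []
    start = time.perf_counter()
    for query in queries:
        t0 = time.perf_counter()
        search_fn(query)  # search_fn is a placeholder for a real client call
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "queries_per_second": len(queries) / elapsed if elapsed > 0 else float("inf"),
        "avg_seconds_per_request": sum(latencies) / len(latencies) if latencies else 0.0,
        "max_seconds_per_request": max(latencies, default=0.0),
    }


# Usage with a dummy search function standing in for the engine.
stats = measure(lambda q: q.upper(), ["solr", "lucene", "nfs"])
print(stats)
```

The same shape works for the indexing side by passing a function that submits one document instead of one query.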
8. Data
● often mostly unknown
○ published vs unpublished / to be written documents
● almost always unmanageable
○ cannot decide when
■ it’ll be ready
■ it’ll have to be indexed
■ it’ll have to be searchable
● heterogeneous
○ different writers, languages, topics, styles, etc.
10. Project
● ~50M heterogeneous documents
● Migrating from old commercial solution to
Apache Solr
● Google-like search
● Targeted search for different types of
contents
11. Advanced capabilities
● Smart understanding of queries
● Smart suggestion of queries
● Suggestion of similar / important contents
● Automatic classification of contents
12. Responsibilities
● architecture analysis and design
○ scaling under high load
● continuous definition of algorithms for
indexing and searching
● system maintenance
13. Skills required
● basics of information retrieval
● a bit of distributed systems
● some natural language processing
● some machine learning
14. Architecture analysis and design
● Shape up a prototype architecture
○ separate machines for indexing and search
○ multiple load balanced machines for searching
○ define indexing and search algorithms
● Evaluate architecture
○ stress tests (performance)
○ quality tests (accuracy)
● Iterate
15. Architecture analysis and design
● analyze existing documents
○ avg size
○ language
○ topics, style, etc.
● analyze existing query logs
○ avg response time
○ avg length (how many words does it take to specify a query?)
○ avg queries per second
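The query-log statistics listed above are cheap to compute once the log is parsed. The sketch below assumes a hypothetical log format of `(timestamp_seconds, query, response_ms)` tuples; a real log would need its own parser, and query length is approximated here as the number of whitespace-separated terms.

```python
def analyze_query_log(entries):
    """entries: list of (timestamp_seconds, query, response_ms) tuples."""
    if not entries:
        return {}
    timestamps = [t for t, _, _ in entries]
    span = max(timestamps) - min(timestamps)
    count = len(entries)
    return {
        "avg_response_ms": sum(ms for _, _, ms in entries) / count,
        "avg_query_terms": sum(len(q.split()) for _, q, _ in entries) / count,
        # If all entries share one timestamp, fall back to the raw count.
        "avg_queries_per_second": count / span if span > 0 else float(count),
    }


# Hypothetical log entries.
log = [
    (0, "solr", 40),
    (1, "apache solr search", 60),
    (4, "nfs lucene", 80),
]
print(analyze_query_log(log))
# {'avg_response_ms': 60.0, 'avg_query_terms': 2.0, 'avg_queries_per_second': 0.75}
```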
16. Most time spent on
● testing how documents get indexed
● testing how user queries get transformed into
platform-specific queries
● tweaking indexing algorithms
● tweaking search algorithms
● tweaking ranking
● platform optimization for scalability
18. Sample architecture constraint #1
● Data storage has to be on NFS
● Lucene is IO intensive
● NFS makes it slower
● Concurrent reads and writes make it error-prone
19. Sample architecture constraint #2
● Change search engine
● Systems talking to the SE need to switch
API
● Only in the long run
● In the short run an adapter layer for old APIs
on new APIs has to be developed
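An adapter layer of this kind can be sketched as a thin class that keeps the old call signature and delegates to the new client. Everything named here is hypothetical: `NewSearchClient`, `LegacySearchAdapter`, the `find(query_string, max_hits)` signature and the response shapes are invented for illustration and are not the API of the old commercial product or of Solr.

```python
class NewSearchClient:
    """Stand-in for the new engine's client (hypothetical interface)."""

    def __init__(self, docs):
        self._docs = docs

    def search(self, q, rows=10):
        # Naive substring match, just enough to have something to adapt.
        hits = [d for d in self._docs if q.lower() in d["title"].lower()]
        return {"numFound": len(hits), "docs": hits[:rows]}


class LegacySearchAdapter:
    """Exposes the (assumed) old call signature on top of the new client."""

    def __init__(self, new_client):
        self._client = new_client

    def find(self, query_string, max_hits):
        response = self._client.search(q=query_string, rows=max_hits)
        # The old API is assumed to return a list of (id, title) tuples.
        return [(d["id"], d["title"]) for d in response["docs"]]


client = NewSearchClient([
    {"id": 1, "title": "Solr basics"},
    {"id": 2, "title": "Scaling Solr"},
    {"id": 3, "title": "NFS pitfalls"},
])
legacy = LegacySearchAdapter(client)
print(legacy.find("solr", 1))  # [(1, 'Solr basics')]
```

Callers keep using `find` unchanged in the short run; each one can later be moved to the new client directly, at which point the adapter can be retired.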
20. Indexing performance
● Most of the indexing time is spent converting
data from the old (indexing) format to the new
(indexing) format
● The adapter layer between the old and new APIs
becomes the bottleneck
● Time to switch to the new API natively
21. Diverging concerns
● Article authors check that the search engine
handles their writings exactly, expecting perfect
recall and precision
○ so a lot of time is spent on adjusting ranking
● Marketers want to be able to override
ranking and promote something they want to sell
○ ranking algorithm gets bypassed
● Need flexible algorithms
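One way to picture a "flexible" ranking is a function that takes the engine's base scores and then applies editorial boosts and marketing pins as separate, optional inputs. This is only a sketch under that assumption; the function name, the boost multipliers and the pinning rule are hypothetical and not taken from the project described here.

```python
def rank(results, boosts=None, pinned=None):
    """Order results by boosted score, with pinned documents placed first.

    results: list of (doc_id, base_score) pairs from the engine
    boosts:  optional {doc_id: multiplier} applied to the base score
    pinned:  optional list of doc_ids forced to the top, in the given order
    """
    boosts = boosts or {}
    pinned = pinned or []
    scored = {doc_id: score * boosts.get(doc_id, 1.0) for doc_id, score in results}
    ordered = sorted(scored, key=lambda d: scored[d], reverse=True)
    # Only pin documents that actually matched the query.
    pinned_present = [d for d in pinned if d in scored]
    return pinned_present + [d for d in ordered if d not in pinned_present]


results = [("a", 2.0), ("b", 1.5), ("c", 1.0)]
print(rank(results))                      # ['a', 'b', 'c']
print(rank(results, boosts={"c": 3.0}))   # ['c', 'a', 'b']
print(rank(results, pinned=["b"]))        # ['b', 'a', 'c']
```

Keeping boosts and pins outside the base scoring means authors' relevance tuning and marketers' overrides can change independently.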
22. Scale dynamically
● Search engine must not break, even
under high peaks of load
● Such peaks are often unpredictable
● Need a fast way to add more computing
power
24. Takeaways
● small iterations (no waterfalls!)
○ analyze portion of data / queries
○ change search / index algorithms
○ test, involve stakeholders
○ forces you to be able to reindex quickly
● look at data (documents, query logs)