Simon Lia-Jonassen
UiO, 20/04/15
Efficient Query Processing in Web Search Engines
Outline
¨  Introduction and Motivation
¨  Query Processing and Optimizations
¨  Caching
¨  Parallel and Distributed Query Processing
¨  Research and Evaluation Methodology
Search Engines
ü  A4 page is ~.005 cm thick. 3000 drawers x 304.8 cm/drawer x 200 pages/cm
gives 182.9 million pages. A4 page fits 350 – 400 words.
ü  Indexed Web contains at least 4.57 billion pages. An avg. page is 478 words.
ü  Google has more than 1 billion global visitors and annually spends billions of USD
on data centers.
Popular Science, August 1937.
©pro.clubic.com
Web-Search in a Nutshell
¨  Query processing basics
¤  Given an indexed collection of textual documents D and a textual query q,
find the k documents with largest similarity score.
¨  Some similarity model alternatives
¤  Boolean model.
¤  Vector space models, e.g., TFxIDF, Okapi BM25.
¤  Term proximity models.
¤  Link analysis models, e.g., PageRank, HITS, Salsa.
¤  Bayesian and language models.
¤  Machine learned models.
[Architecture diagram: Web → Crawling → Documents → Indexing → Index → Query Processing → User]
Effectiveness (quality and relevance)
¨  Boolean, not ranked
¤  Precision – how many selected items are relevant.
¤  Recall – how many relevant items are selected.
¤  F1 – weighted harmonic mean of precision and recall.
¨  Boolean, ranked
¤  Precision and recall at 5, 10, …k.
¤  Average Precision and Mean Average Precision.
¨  Graded, ranked
¤  Cumulative Gain (CG), Discounted CG (DCG) and Normalized DCG.
¤  Mean Reciprocal Rank (MRR).
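To make the ranked metrics above concrete, here is a small illustrative sketch (not from the slides) computing precision@k, Average Precision and nDCG for a single query; the document ids and judgements are made up.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision@i over the ranks i where a relevant item appears."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def ndcg(ranked, grades, k):
    """Normalized DCG with the common log2 discount; grades maps doc -> gain."""
    dcg = sum(grades.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: five documents returned for one query.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d5"}             # binary judgements
grades = {"d1": 3, "d3": 2, "d5": 1}      # graded judgements
print(precision_at_k(ranked, relevant, 5))   # 0.6
print(average_precision(ranked, relevant))   # (1/1 + 2/3 + 3/5) / 3
print(ndcg(ranked, grades, 5))
```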
Efficiency (volume, speed and cost)
¨  Index and Collection Size
¤  Number of documents and document size.
¤  Index granularity, size and freshness.
¨  Infrastructure Cost
¤  Number of machines, memory/disk size, network speed, etc.
¤  Resource utilization and load balancing.
¨  Queries and Time
¤  Latency – time per query.
¤  Throughput – queries per time unit.
¤  Query volume (offline) and arrival rate (online).
¤  Query degradation rate (online).
¤  Result freshness (cached results).
Cambazoglu and Baeza-Yates
Query Processing with Inverted Index
Query Processing with Inverted Index
We could use k1 = 1.2, b = 0.75, k3 = 8, but note that
the IDFs would be quite strange in this particular example.
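As an illustration of the vector-space scoring mentioned above, here is a minimal Okapi BM25 sketch using the canonical parameter values (k1 = 1.2, b = 0.75); the query-side k3 saturation term is left out since short web queries rarely repeat terms, and all names and statistics below are made up.

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs,
               k1=1.2, b=0.75):
    """Okapi BM25 for one document: sum over query terms of IDF * TF-saturation."""
    score = 0.0
    for t in query_terms:
        if t not in doc_tf or t not in df:
            continue
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
        tf = doc_tf[t]
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score

# Toy collection statistics.
df = {"web": 2, "search": 1}           # document frequencies
doc_tf = {"web": 3, "search": 1}       # term frequencies in one document
print(bm25_score(["web", "search"], doc_tf, doc_len=100,
                 avg_doc_len=120, df=df, n_docs=3))
```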
Posting Lists (doc. ids and frequencies)
¨  Usually represented as one or two integer lists ordered by document id.
¤  Alternatives:
n  bitmaps instead of document ids, weights instead of frequencies, or impact ordered lists.
¨  Document ids replaced by differences (gaps or deltas) and global ids can
be remapped/reordered to minimize resulting deltas.
¨  Usually accessed via an iterator (get and next) and an auxiliary index/set
of pointers can be used to skip forward in the inverted file.
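A minimal sketch of the access pattern described above: document ids stored as gaps, an iterator exposing next(), and a small pointer index used to skip forward. The class names, the skip interval of four postings and the in-memory layout are illustrative assumptions, not the exact layout used in the thesis.

```python
import bisect

class PostingList:
    """Doc ids stored as gaps (deltas); a sparse skip index every 4th posting."""
    def __init__(self, doc_ids, freqs):
        self.gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
        self.freqs = list(freqs)
        self.skip_ids = doc_ids[::4]                  # absolute ids at skip points
        self.skip_pos = list(range(0, len(doc_ids), 4))

class PostingIterator:
    def __init__(self, plist):
        self.plist, self.pos, self.doc_id = plist, -1, None

    def next(self):
        self.pos += 1
        if self.pos >= len(self.plist.gaps):
            self.doc_id = None                        # list exhausted
        elif self.pos == 0:
            self.doc_id = self.plist.gaps[0]          # first gap is the absolute id
        else:
            self.doc_id += self.plist.gaps[self.pos]  # add the delta
        return self.doc_id

    def skip_to(self, target):
        """Advance to the first document id >= target, using the skip index."""
        if self.pos < 0:
            self.next()                               # position on the first posting
        i = bisect.bisect_right(self.plist.skip_ids, target) - 1
        if i >= 0 and self.plist.skip_pos[i] > self.pos:
            self.pos, self.doc_id = self.plist.skip_pos[i], self.plist.skip_ids[i]
        while self.doc_id is not None and self.doc_id < target:
            self.next()
        return self.doc_id

it = PostingIterator(PostingList([3, 7, 12, 25, 31, 46, 58, 90], [1] * 8))
print(it.skip_to(57))    # 58 (jumps to the skip point at doc 31, then scans forward)
```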
Posting List Compression
¨  Bit-Aligned Methods
¤  E.g., Unary, Elias, Golomb, Rice, Interpolative, etc.
Ø  Space efficient but require less efficient bit arithmetic.
¨  Byte-Aligned Methods
¤  E.g., VByte (VB)
1 :1000 0001
13 :1000 1101
133 :0000 0001 1000 0101
1337 :0000 1010 1011 1001
Ø  Fast but less space efficient.
Posting List Compression
¨  Word-Aligned Methods
¤  E.g., Simple9 (left) and NewPFoR (right)
Ø  Branch-free and superscalar, can be tuned for compression/speed.
Ø  In the rest, we consider NewPFoR with chunks of 128 deltas followed by 128 frequencies.
Ø  Alternatives: 32/64 or variable size chunks.
Example: Posting List Layout with Skips
Query Evaluation and Pruning
¨  Disjunctive (OR) and Conjunctive (AND) Evaluation
¤  OR – The documents that match any query term will be considered.
n  Slower, lower precision, higher recall.
¤  AND – Only the documents that match all query terms will be considered.
n  Faster, higher precision, lower recall.
¨  Static Pruning – aka Index Pruning
¤  Term Pruning – remove less important terms from the index.
¤  Document Pruning – remove less important documents from the index.
¨  Dynamic Pruning – aka Early Exit Optimization
¤  Avoid scoring postings for some of the candidates under evaluation.
ü  Safe pruning – guarantees the same result as a full evaluation (score, rank or set safe).
ü  Unsafe pruning – does not guarantee the same result as a full evaluation.
Term-At-A-Time Evaluation (TAAT)
¨  Score all postings in the same list and merge with the partial result set
(accumulators).
ü  Need to iterate only one posting list at a time, but store all possible candidates.
¨  Some optimizations
¤  AND – Evaluate the query in conjunctive mode, starting with the shortest posting list.
¤  Quit/Continue – Stop adding new accumulators when a target size is reached.
¤  Lester – Dynamically scale a frequency threshold based on the accumulator set size.
¤  MaxScore – Ignore postings and remove candidates for documents that are guaranteed
not to be in the result set (need to track the k best candidates seen so far).
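A minimal TAAT sketch with a hash map of accumulators and none of the pruning optimizations above; postings are represented as (doc_id, score contribution) pairs for brevity.

```python
import heapq
from collections import defaultdict

def taat_topk(posting_lists, k):
    accumulators = defaultdict(float)
    # Merge one list at a time, shortest lists first.
    for postings in sorted(posting_lists, key=len):
        for doc_id, contribution in postings:
            accumulators[doc_id] += contribution
    return heapq.nlargest(k, accumulators.items(), key=lambda x: x[1])

lists = [
    [(1, 0.5), (3, 0.25), (7, 1.0)],          # postings for term "a"
    [(3, 0.5), (7, 0.25), (9, 0.125)],        # postings for term "b"
]
print(taat_topk(lists, k=2))                  # [(7, 1.25), (3, 0.75)]
```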
Document-At-A-Time Evaluation (DAAT)
¨  Score all postings for the same document and add to the result head.
ü  Need to iterate all posting lists at once, but store only the k best candidates.
¨  Some optimizations
¤  AND – Don’t score unless all pointers align, skip to the largest document id.
¤  MaxScore – Stop candidate evaluation when it is guaranteed not to be in the result set.
¤  WAND – Similar to MaxScore but a bit different (see the next slide).
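A minimal DAAT sketch without pruning: the lists are traversed in parallel in document-id order, one document is fully scored at a time, and only the k best candidates are kept in a min-heap.

```python
import heapq

def daat_topk(posting_lists, k):
    # Each list is [(doc_id, contribution), ...] sorted by doc_id.
    cursors = [0] * len(posting_lists)
    heap = []                                    # min-heap of (score, doc_id)
    while True:
        current = [pl[c][0] for pl, c in zip(posting_lists, cursors) if c < len(pl)]
        if not current:
            break
        doc = min(current)                       # next candidate document
        score = 0.0
        for i, pl in enumerate(posting_lists):
            if cursors[i] < len(pl) and pl[cursors[i]][0] == doc:
                score += pl[cursors[i]][1]
                cursors[i] += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, reverse=True)

lists = [
    [(1, 0.5), (3, 0.25), (7, 1.0)],
    [(3, 0.5), (7, 0.25), (9, 0.125)],
]
print(daat_topk(lists, k=2))                     # [(1.25, 7), (0.75, 3)]
```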
DAAT MaxScore and WAND
¨  MaxScore
¤  Evaluate lists in maxScore order.
¤  Don’t advance optional term pointers.
n  Must match at least one term with maxAg >= thres.
¤  Discard when score + maxAg < thres.
¨  WAND
¤  Evaluate lists in the current document id order.
¤  Pivot by first pointer with maxAg >= thres.
¤  If first pointer and pivot are equal, evaluate.
Otherwise, skip using the pivot.
ü  Both methods can be improved with
block-wise maximum scores.
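A sketch of just the pivot-selection step of WAND described above; in the full algorithm the lagging lists are then skipped to the pivot, or the pivot document is scored if the first list already points at it. The scores and the threshold here are made up.

```python
def find_pivot(cursors, threshold):
    """cursors: (current_doc_id, term_max_score) per query term.
    Returns the pivot doc id, or None if no remaining document can reach
    the current top-k threshold."""
    upper_bound = 0.0
    for doc_id, max_score in sorted(cursors):   # current doc-id order
        upper_bound += max_score                # accumulated maxAg
        if upper_bound >= threshold:
            return doc_id
    return None

# Terms currently positioned at docs 12, 40 and 41, with per-term upper bounds.
print(find_pivot([(12, 0.5), (40, 0.8), (41, 1.2)], threshold=1.0))   # 40
```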
Post-processing and Multistep Processing
¨  Traditional post-processing
¤  Generate snippets.
¤  Retrieve page screenshots.
¤  Result clustering and/or faceting.
¨  Two-phase processing
1.  Use a simple ranking model to generate a pool of candidates.
2.  Refine the results using a more expensive model.
n  E.g., machine learned model, phrase matching, etc.
¨  Query refinement
¤  Suggest another query based on the found results.
Caching
¨  Two-level (Saraiva et al.)
¤  Result cache and posting list cache.
¨  Three-level (Long and Suel)
¤  Adds intersection cache.
¨  Five-level (Ozcan et al.)
¤  Adds document cache and score cache.
¨  Other alternatives
¤  Projections instead of intersections.
¤  Blocks instead of posting lists.
¨  Static, dynamic and hybrid caches
¤  Static – rebuilt once in a while.
¤  Dynamic – updated on each access.
¤  Static-dynamic – splits the cache into a static part
and a dynamic part.
Cache Replacement Strategies
¨  Some of the most common methods
¤  LRU – put newly accessed entries into a queue, discard from the tail.
¤  LFU – each entry has a counter, discard entries with lowest count.
n  Other alternatives: Two-Stage LRU, LRU-2, 2Q, Multi-Queue (MQ),
Frequency-Based Replacement (FBR), Adaptive Replacement Cache (ARC).
¤  Cost – evict entries with the lowest processing cost.
¤  Landlord
n  assign H=cost/size on insertion or hit
n  cost = 1 gives most hits, cost = proc. time gives best avg. latency.
n  evict the one with minH and subtract it from remaining H.
n  Improvement: use H + alpha * remainingH on the first renewal,
and alpha’ on the second and subsequent renewals.
http://cis.poly.edu/cs912/indexcomp.pdf
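A sketch of the Landlord policy as summarized above: credit H = cost/size on insertion or hit, evict the minimum-H entry and charge that amount as "rent" to the rest. The eager subtraction and the data layout are illustrative simplifications.

```python
class LandlordCache:
    def __init__(self, capacity):
        self.capacity, self.used = capacity, 0
        self.entries = {}                        # key -> [value, size, cost, H]

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        entry[3] = entry[2] / entry[1]           # renew credit H = cost/size on a hit
        return entry[0]

    def put(self, key, value, size, cost=1.0):
        # cost = 1 favours hit rate; cost = processing time favours avg. latency.
        while self.used + size > self.capacity and self.entries:
            victim = min(self.entries, key=lambda k: self.entries[k][3])
            min_h = self.entries[victim][3]
            self.used -= self.entries.pop(victim)[1]
            for entry in self.entries.values():  # "pay rent": subtract the evicted H
                entry[3] -= min_h
        self.entries[key] = [value, size, cost, cost / size]
        self.used += size
```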
Freshness with Infinite Result Cache
¨  Motivation
¤  We can cache results infinitely but they will become outdated after a while.
¨  A solution and the improvements
1.  Set a TTL to invalidate the results after a while.
n  Always rerun the expired queries on request.
2.  Refresh cached results when there is spare capacity.
n  May improve average result freshness.
3.  Proactively refresh expensive queries that
are likely to be expired at peak-load hours.
n  May improves result freshness, degradation and
latency, but determining such queries is really hard.
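A sketch of option 1 above: cache entries carry a timestamp and expire after a fixed TTL, so an expired or missing query is simply re-executed on request. The backend callable stands in for the real query processor.

```python
import time

class TTLResultCache:
    def __init__(self, ttl_seconds, backend):
        self.ttl = ttl_seconds
        self.backend = backend                   # callable: query -> results
        self.store = {}                          # query -> (results, timestamp)

    def get(self, query):
        cached = self.store.get(query)
        if cached and time.time() - cached[1] < self.ttl:
            return cached[0]                     # fresh enough: serve from cache
        results = self.backend(query)            # expired or missing: rerun
        self.store[query] = (results, time.time())
        return results

cache = TTLResultCache(ttl_seconds=3600, backend=lambda q: ["doc1", "doc2"])
print(cache.get("web search"))
```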
Online and Offline Query Processing
¨  Offline
¤  All queries are available at the beginning.
¤  Query throughput must be maximized, individual query latency is unimportant.
¤  Specific optimizations
n  Query reordering, clairvoyant intersection and posting list caching, etc.
¨  Online
¤  Queries arrive at different times and the volume varies during the day.
¤  Throughput must be sustained and individual query latency minimized.
¤  Specific optimizations
n  Degradation, result prefetching, micro-batching, etc.
Parallel Query Processing
¨  Motivation
¤  Modern CPUs are multi-core and we would like to utilize this in the best possible way.
¨  Inter-query concurrency
¤  Assign different queries to different cores.
ü  Improves throughput, affects latency.
¨  Intra-query concurrency
¤  Assign different blocks to different cores.
ü  Improves latency, affects throughput.
¨  Some of the main issues
¤  Memory wall (I/O bottleneck).
¤  CPU cache related issues
n  Coherence, conflicts, affinity, etc.
¤  Amdahl’s and Gustafson's laws.
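A toy illustration of inter-query concurrency with a worker pool (different queries go to different workers). Note that for CPU-bound scoring in CPython a process pool would be needed to sidestep the GIL; process_query is just a placeholder for the real per-query evaluation.

```python
from concurrent.futures import ThreadPoolExecutor

def process_query(query):
    # Placeholder for the actual evaluation (e.g., DAAT over the index).
    return query, ["doc1", "doc2"]

queries = ["cheap flights", "bm25 tuning", "trondheim weather"]
with ThreadPoolExecutor(max_workers=4) as pool:
    for query, results in pool.map(process_query, queries):
        print(query, results)
```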
Distributed Query Processing
¨  Partitioning and Replication
ü  Add more partitions to scale with the collection size.
ü  Add more replicas to scale with the query volume.
¨  Clustering
ü  Group similar documents in clusters.
ü  Build a different index for each cluster.
¨  Tiering
ü  Split the collection into several parts, e.g.
n  1 mil important documents.
n  10 mil less important documents.
n  100 mil not really important documents.
ü  Decide when a query should to fall through.
Risvik et al.
Partitioned Query Processing
¨  Document-wise partitioning
¤  Each node has a subset of documents.
¤  Each query is processed by all of the nodes in
parallel, one of the nodes merges the results.
ü  Pros: Simple, fast and scalable.
ü  Cons: Each query is processed by all of the nodes.
ü  Cons: Large number of posting lists to be processed.
¨  Term-wise partitioning
¤  Each node has a subset of terms.
¤  Posting lists are fetched and sent to a node
that processes all of them.
ü  Pros: Only a few nodes are involved.
ü  Pros: Possibility for inter-query concurrency.
ü  Pros: Only one posting list for each term.
ü  Cons: Only one node does all processing.
ü  Cons: Network load and load balancing are critical.
Partitioned Query Processing
¨  Pipelined query processing (on top of TP)
¤  Route a bundle from one node to next, process on
each node and extract the results on the last one.
ü  Pros: The work is dived among the nodes.
ü  Pros: The network load is reduced.
ü  Cons: Sequential dependency and load balancing.
¤  Possible improvements
n  Do early exit to reduce the amount of transferred data.
n  Apply fragment based or semi-pipelined processing to reduce query latency.
n  Optimize term assignment to improve network load and load balancing.
¨  Other techniques
¤  Hybrid partitioning – divide posting lists into chunks and distribute the chunks.
¤  2D partitioning – apply both document- and term-wise partitioning (m x n).
Multi-Site Query Processing
¨  Motivation
¤  Search engines spanning across multiple datacenters open a range of new possibilities.
¨  Some of the techniques
¤  User-to-center assignment – choose the nearest datacenter.
¤  Corpus partitioning – cluster and assign documents geographically.
¤  Partial replication – replicate only certain documents to other datacenters.
¤  Reduce costs by forwarding requests to a remote datacenter at peak.
Experimental Research
¨  How to publish a paper (simplified version)
1.  Get a good and novel idea or hypothesis.
ü  Related work, previous observations or the good old gut feeling.
2.  Write the paper.
ü  Abstract, introduction, motivation, related work, contributions, experiments, results, conclusions.
3.  Do the experiments and fill the missing parts.
ü  Need at least a TREC document collection, query set and a strong baseline.
ü  GOV2 + Efficiency Track 03/04; Clueweb09 + Million Query Track 09; Clueweb12 + Web Track 13.
4.  Submit the paper to a good conference.
ü  WWW, SIGIR, WSDM, CIKM, ECIR, SPIRE, WISE, …
5.  Get accepted, go to the conference, enjoy life.
ü  Novelty + Presentation + Luck = Success!
Evaluation by Analytical Modeling
¨  Example: “Modeling Static Caching in Web Search Engines”
¤  Provide an equation that quantifies the performance.
¤  Obtain parameter values from a real implementation / micro-benchmark / literature.
¤  Compare the alternatives or find the variable in question.
¨  Pros
ü  Almost no hardware cost.
ü  Easy to evaluate new models or revise the results.
ü  Numbers can be obtained from the previous work or taken from a thin air.
¨  Cons
ü  Hard to model things like a multi-core CPU or a JVM.
Evaluation by Implementation
¨  Example: “Intra-query concurrent pipelined processing”
¤  Write a minimalistic version of the system in question (i.e., query processor).
¤  Obtain necessary test data and run the experiments.
¤  Gather the results and compare the alternatives.
¨  Pros
ü  Very realistic.
ü  Great detail and insight level.
¨  Cons
ü  Quite hard and time consuming.
ü  Hard to scale (need hardware/data).
ü  Implementation details and quality matter.
ü  There’s always one more bug™.
Evaluation by Simulation
¨  Example: “Prefetching Query Results and its Impact on Search Engines”
¤  Write a simple program simulating the system (hint: event-driven simulator).
¤  Obtain parameter values from a real implementation / micro-benchmark / literature.
¤  Run the simulator, gather the results and compare the alternatives.
¨  Pros
ü  Less time consuming than implementation,
gives more details/realism than modeling.
ü  Can simulate really large systems.
¨  Cons
ü  Impossible to write a realistic
simulator.
ü  Impossible to write bug-free
code even here ;)
Acknowledgements
¨  For details and references see Chapter 2 of my PhD thesis:
Ø  “Efficient Query Processing in Distributed Search Engines” – http://goo.gl/vDNGGb
¨  Also a great tutorial by Cambazoglu and Baeza-Yates:
Ø  “Scalability And Efficiency Challenges In Large-Scale Web Search Engines” – http://goo.gl/oyWDqU
¨  A very good book:
Ø  “Information Retrieval: Implementing and Evaluating Search Engines” – Büttcher et al., The MIT Press, 2010
¨  Any publication co-authored or even cited by these guys:
Ø  Ricardo Baeza-Yates, B. Barla Cambazoglu, Mauricio Marin, Alistair Moffat,
Fabrizio Silvestri, Torsten Suel, Justin Zobel (in alphabetic order).
Thank you!
¨  Questions?
phdcomics.com
simon.jonassen@gmail.com
