Search @twitter
Michael Busch
@michibusch
michael@twitter.com
buschmi@apache.org

1
Search @twitter
Agenda
‣ Introduction
- Search Architecture
- Inverted Index 101
- Realtime Posting Lists

2
Introduction

3
Introduction

Twitter has more than 230 million
monthly active users.

4
Introduction

500 million tweets are sent per day.

5
Introduction

More than 300 billion tweets have been
sent since company founding in 2006.

6
Introduction

Tweets-per-second world record:
33,388 TPS.

7
Introduction

More than 2 billion search queries per
day.

8
Introduction
2008

Twitter acquires Summize (MySQL-based RT search engine)

2009
2010

Modified Lucene (Earlybird) ships and replaces MySQL indexes

2011

New Earlybird features: image/video search; index compression;
efficient relevance search in time-sorted index

2012
2013
2014

Tweet archive search on SSD with vanilla Lucene
New RT posting list format that supports arbitrary document
lengths, but keeps performance optimizations for tweets
9
Realtime Search @twitter
Agenda
- Introduction
‣ Search Architecture
- Inverted Index 101
- Realtime Posting Lists

14
Search Architecture

15
Search Architecture
[Diagram: RT stream -> raw tweets -> Analyzer/Partitioner -> analyzed tweets -> RT index (Earlybird). Tweet archive (HDFS) -> raw tweets -> Mapreduce Analyzer -> analyzed tweets -> Archive index. Blender receives search requests and searches both indexes. Legend: writes, searches.]
16
Search Architecture
Analyzer/
Partitioner

• Pre-processes Tweets for indexing
• Analyzing (tokenization/normalization) of text
• Geo-coding, URL expansion, etc.
• Hash partitioning

17
Search Architecture
RT index (Earlybird)

• Modified Lucene index implementation optimized for realtime search
• IndexWriter buffer is searchable (no need to flush to allow searching)
• In-memory
• Hash-partitioned, static layout

19
Cluster layout
[Diagram: a grid of Earlybird instances, organized along three dimensions]

• n hash partitions (docId % n)
• Replicas
• Timeslices: the writable timeslice and the complete timeslices
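
The hash partitioning rule on the cluster layout slides (docId % n) can be sketched in a few lines. This is only an illustration: the function names `partition_for` and `route` are hypothetical, and the sketch ignores replicas and timeslices.

```python
def partition_for(doc_id, num_partitions):
    """Return the hash partition for a document, using docId % n."""
    return doc_id % num_partitions


def route(doc_ids, num_partitions):
    """Group doc IDs by the partition that would index them."""
    partitions = [[] for _ in range(num_partitions)]
    for doc_id in doc_ids:
        partitions[partition_for(doc_id, num_partitions)].append(doc_id)
    return partitions


if __name__ == "__main__":
    print(route([5, 15, 9000, 9002, 100000, 100090], 4))
```

Because the layout is static, the same modulo rule tells a query frontend which partition holds a given document.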

23
Search Architecture
Mapreduce
Analyzer

• Daily jobs that process raw tweets
• Analyzes text
• Aggregates metadata and signals

26
Search Architecture
Archive index

• Standard Lucene (4.4) indexes
• Reverse time-sorted (new to old)
• Cluster layout similar to realtime search cluster

28
Search Architecture
Archive index

• Two tiers: In-memory and on SSD
• In-memory index: Contains a small number of the best tweets of all time
• SSD index: Much bigger index with more tweets, lower max. QPS, limited by SSD IOPS. Only needs to be queried if the in-memory index did not yield enough results
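
The two-tier lookup can be sketched as a simple fallback. This is a minimal sketch under stated assumptions: `search_archive` and the two tier callables are hypothetical names, each tier is assumed to take a query and a maximum hit count, and duplicate handling between tiers is left out.

```python
def search_archive(query, wanted, in_memory_tier, ssd_tier):
    """Query the in-memory tier first; touch the SSD tier only if needed."""
    hits = list(in_memory_tier(query, wanted))
    if len(hits) < wanted:
        # Not enough results: fall back to the bigger, slower SSD tier.
        hits += list(ssd_tier(query, wanted - len(hits)))
    return hits[:wanted]


if __name__ == "__main__":
    memory = lambda q, k: ["best-1", "best-2"][:k]
    ssd = lambda q, k: ["old-1", "old-2", "old-3"][:k]
    print(search_archive("lucene", 2, memory, ssd))
    print(search_archive("lucene", 4, memory, ssd))
```

Keeping the SSD tier out of the common path is what protects it from the query load, since its throughput is bounded by SSD IOPS.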

31
Search Architecture
Blender

• Blender is our Thrift service aggregator
• Queries multiple Earlybirds, merges results
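
The merge step can be illustrated with a k-way merge. This is a sketch only, not Blender's actual logic: it assumes each Earlybird returns tweet IDs already sorted newest first (largest ID first), and `merge_results` is a hypothetical name.

```python
import heapq
from itertools import islice


def merge_results(per_partition_hits, limit):
    """Merge descending-sorted hit lists and keep the first `limit` hits."""
    merged = heapq.merge(*per_partition_hits, reverse=True)
    return list(islice(merged, limit))


if __name__ == "__main__":
    print(merge_results([[9, 5, 1], [8, 7, 2], [6]], 4))
```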
33
Search Architecture
[Diagram: components shown are Tweets, Analyzer/Partitioner, RT index (Earlybird), queue, Updates (deletes/engagement, e.g. retweets/favs), HDFS, Mapreduce Analyzer, Archive index, Blender and search requests. Legend: writes, searches.]
35
Realtime Search @twitter
Agenda
- Introduction
- Search Architecture
‣ Inverted Index 101
- Realtime Posting Lists

36
Inverted Index 101

37
Inverted Index 101
1  The old night keeper keeps the keep in the town
2  In the big old house in the big old gown.
3  The house in the town had the big old keep
4  Where the old night keeper never did sleep.
5  The night keeper keeps the keep in the night
6  And keeps in the dark and sleeps in the light.

Table with 6 documents

Example from:
Justin Zobel, Alistair Moffat,
Inverted files for text search engines,
ACM Computing Surveys (CSUR)
v.38 n.2, p.6-es, 2006

38
Inverted Index 101

term     freq   posting list
and      1      <6>
big      2      <2> <3>
dark     1      <6>
did      1      <4>
gown     1      <2>
had      1      <3>
house    2      <2> <3>
in       5      <1> <2> <3> <5> <6>
keep     3      <1> <3> <5>
keeper   3      <1> <4> <5>
keeps    3      <1> <5> <6>
light    1      <6>
never    1      <4>
night    3      <1> <4> <5>
old      4      <1> <2> <3> <4>
sleep    1      <4>
sleeps   1      <6>
the      6      <1> <2> <3> <4> <5> <6>
town     2      <1> <3>
where    1      <4>

Dictionary and posting lists
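
The dictionary and posting lists for the six example documents can be rebuilt with a short script. This is a minimal sketch: the tokenizer (lowercase, letters only) and the names `DOCS` and `build_inverted_index` are assumptions made for illustration, not the analysis chain used at Twitter.

```python
import re
from collections import defaultdict

DOCS = {
    1: "The old night keeper keeps the keep in the town",
    2: "In the big old house in the big old gown.",
    3: "The house in the town had the big old keep",
    4: "Where the old night keeper never did sleep.",
    5: "The night keeper keeps the keep in the night",
    6: "And keeps in the dark and sleeps in the light.",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # A set, so each document appears once per term.
        for term in set(re.findall(r"[a-z]+", docs[doc_id].lower())):
            index[term].append(doc_id)
    return dict(index)


if __name__ == "__main__":
    index = build_inverted_index(DOCS)
    for term in sorted(index):
        print(term, len(index[term]), index[term])
```

Here "freq" is the number of documents containing the term, which equals the length of its posting list.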
39
Inverted Index 101
Query: keeper

The dictionary entry for keeper has freq 3 and posting list <1> <4> <5>, so the query matches documents 1, 4 and 5.
41
Posting list encoding
Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090

Delta encoding: 5, 10, 8985, 2, 90998, 90

VInt compression:
• 5 is encoded as 00000101. Values 0 <= delta <= 127 need one byte
• 8985 is encoded as 11000110 00011001. Values 128 <= delta <= 16383 need two bytes
• First bit indicates whether next byte belongs to the same value
• Variable number of bytes - a VInt-encoded posting cannot be written as a primitive Java type; therefore it cannot be written atomically
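
The two steps can be reproduced with a short sketch. It follows the byte layout drawn on the slide (most significant 7-bit group first, high bit set when another byte follows); Lucene's own `writeVInt` emits the low-order group first, so the byte order here is an illustration of the slide rather than Lucene's on-disk format. The function names are hypothetical.

```python
def delta_encode(doc_ids):
    """Replace each doc ID by its difference to the previous one."""
    deltas, previous = [], 0
    for doc_id in doc_ids:
        deltas.append(doc_id - previous)
        previous = doc_id
    return deltas


def vint_encode(value):
    """Encode a non-negative int in 7-bit groups, most significant first."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()
    # Set the continuation bit on every byte except the last one.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def vint_decode_all(data):
    """Decode a concatenation of VInts back into a list of ints."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if not byte & 0x80:  # last byte of this value
            values.append(current)
            current = 0
    return values


if __name__ == "__main__":
    deltas = delta_encode([5, 15, 9000, 9002, 100000, 100090])
    encoded = b"".join(vint_encode(d) for d in deltas)
    print(deltas, [format(b, "08b") for b in encoded])
```

Decoding has to start at the first byte and accumulate the deltas, which is why a reader cannot jump into the middle of such a list.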

47
Posting list encoding
Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090

Delta encoding: 5, 10, 8985, 2, 90998, 90

[Figure: arrow marking the read direction over the delta-encoded list]

• Each posting depends on previous one; decoding only possible in old-to-new direction
• With recency ranking (new-to-old) no early termination is possible

48
Posting list encoding
• By default Lucene uses a combination of delta encoding and VInt
compression
• VInts are expensive to decode
• Problem 1: How to traverse posting lists backwards?
• Problem 2: How to write a posting atomically?

49
Realtime Search @twitter
Agenda
- Introduction
- Search Architecture
- Inverted Index 101
‣ Realtime Posting Lists

50
Realtime Posting Lists

51
Posting list encoding in Earlybird v1
int (32 bits):
• docID: 24 bits, max. 16.7M
• textPosition: 8 bits, max. 255

• Tweet text can only have 140 chars
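
Packing both fields into one 32-bit value can be sketched as follows. The field order (docID in the upper 24 bits, textPosition in the lower 8) is an assumption based on the order in which the slide lists the fields, and the function names are hypothetical.

```python
POSITION_BITS = 8
MAX_DOC_ID = (1 << 24) - 1              # 16,777,215, the "16.7M" on the slide
MAX_POSITION = (1 << POSITION_BITS) - 1  # 255


def pack_posting(doc_id, text_position):
    """Pack docID (24 bits) and textPosition (8 bits) into one 32-bit value."""
    if not 0 <= doc_id <= MAX_DOC_ID:
        raise ValueError("docID does not fit into 24 bits")
    if not 0 <= text_position <= MAX_POSITION:
        raise ValueError("textPosition does not fit into 8 bits")
    return (doc_id << POSITION_BITS) | text_position


def unpack_posting(posting):
    """Split a packed posting back into (docID, textPosition)."""
    return posting >> POSITION_BITS, posting & MAX_POSITION


if __name__ == "__main__":
    posting = pack_posting(100090, 17)
    print(posting, unpack_posting(posting))
```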

52
Posting list encoding in Earlybird v1
Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090

Earlybird encoding: 5, 15, 9000, 9002, 100000, 100090 (absolute doc IDs, no deltas)

[Figure: arrow marking the read direction over the posting list]

Early query termination
E.g. if 3 results are requested: Here we can terminate after reading 3
postings
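
Early termination over absolute doc IDs can be sketched as a backwards scan that stops once enough results are collected. The sketch assumes a flat list ordered old to new and a hypothetical function name; it ignores scoring and filtering.

```python
def newest_first(postings, limit):
    """Read postings from the newest end and stop after `limit` results."""
    results = []
    if limit <= 0:
        return results
    for i in range(len(postings) - 1, -1, -1):
        results.append(postings[i])
        if len(results) == limit:
            break  # early termination: older postings are never read
    return results


if __name__ == "__main__":
    print(newest_first([5, 15, 9000, 9002, 100000, 100090], 3))
```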

54
Inverted index components
• Dictionary: parallel arrays, with a pointer to the most recently indexed posting for a term
• Posting list storage: ?

55
Posting lists storage - Objectives
• Store many single-linked lists of different lengths space-efficiently
• The number of Java objects should be independent of the number of lists or
number of items in the lists
• Every item should be a possible entry point into the lists for iterators, i.e.
items should not be dependent on other items (e.g. no delta encoding)
• Append and read possible by multiple threads in a lock-free fashion (single
append thread, multiple reader threads)
• Traversal in backwards order

57
Memory management
• 4 int[] pools, each built from 32K int[] blocks
• Each pool can be grown individually by adding 32K blocks
• For simplicity we can forget about the blocks for now and think of the pools as continuous, unbounded int[] arrays
• Small total number of Java objects (each 32K block is one object)

60
Memory management
slice size: 2^11, 2^7, 2^4, 2^1 (one size per pool)

• Slices can be allocated in each pool
• Each pool has a different, but fixed slice size

61
Adding and appending to a list
[Diagram: the four pools with slice sizes 2^11, 2^7, 2^4 and 2^1; legend: available, allocated, current list]

• Store the first two postings in a slice of size 2^1
• When first slice is full, allocate another one in second pool
• Allocate a slice on each level as list grows
• On uppermost level one list can own multiple slices
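
The growth pattern can be sketched with plain Python lists standing in for the int[] pools. This is a simplified illustration, not Earlybird's implementation: the class names are hypothetical, the 32K blocks and the pointers linking slices are left out, and the `slices` list exists only so the demo can show which level each slice came from.

```python
SLICE_SIZES = [2**1, 2**4, 2**7, 2**11]  # slice size per pool


class SlicePools:
    """Four growable pools; each hands out fixed-size slices."""

    def __init__(self):
        self.pools = [[] for _ in SLICE_SIZES]

    def allocate(self, level):
        """Reserve one slice in the pool for `level`; return its start offset."""
        start = len(self.pools[level])
        self.pools[level].extend([0] * SLICE_SIZES[level])
        return start


class PostingList:
    """One list that climbs a level whenever its current slice is full."""

    def __init__(self, pools):
        self.pools = pools
        self.level = 0
        self.start = pools.allocate(0)
        self.used = 0
        self.slices = [(0, self.start)]  # demo bookkeeping only

    def append(self, posting):
        if self.used == SLICE_SIZES[self.level]:
            # Slice is full: move up one level, or stay on the top level.
            self.level = min(self.level + 1, len(SLICE_SIZES) - 1)
            self.start = self.pools.allocate(self.level)
            self.used = 0
            self.slices.append((self.level, self.start))
        self.pools.pools[self.level][self.start + self.used] = posting
        self.used += 1


if __name__ == "__main__":
    plist = PostingList(SlicePools())
    for posting in range(2195):
        plist.append(posting)
    print(plist.slices)
```

Short lists stay in the small slices, so rare terms waste little space, while long lists spend most of their life in the 2^11 slices.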

66
Posting list format v1
int (32 bits):
• docID: 24 bits, max. 16.7M
• textPosition: 8 bits, max. 255

• Tweet text can only have 140 chars

67
Addressing items
• Use 32 bit (int) pointers to address any item in any list unambiguously:

int (32 bits):
• poolIndex: 2 bits, 0-3
• sliceIndex: 19-29 bits, depends on pool
• offset in slice: 1-11 bits, depends on pool

• Nice symmetry: Postings and address pointers both fit into a 32 bit int
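
The pointer layout can be sketched as bit packing. The field widths come from the slide (2 bits for the pool, then 30 bits shared between slice index and offset, with the offset width equal to log2 of the pool's slice size); the field order and the function names are assumptions for illustration.

```python
OFFSET_BITS = [1, 4, 7, 11]  # log2 of the slice sizes 2^1, 2^4, 2^7, 2^11


def encode_pointer(pool, slice_index, offset):
    """Pack (pool, slice index, offset in slice) into a 32-bit pointer."""
    offset_bits = OFFSET_BITS[pool]
    slice_bits = 30 - offset_bits
    if not 0 <= slice_index < (1 << slice_bits):
        raise ValueError("slice index out of range for this pool")
    if not 0 <= offset < (1 << offset_bits):
        raise ValueError("offset out of range for this pool")
    return (pool << 30) | (slice_index << offset_bits) | offset


def decode_pointer(pointer):
    """Unpack a 32-bit pointer into (pool, slice index, offset in slice)."""
    pool = pointer >> 30
    offset_bits = OFFSET_BITS[pool]
    slice_index = (pointer >> offset_bits) & ((1 << (30 - offset_bits)) - 1)
    offset = pointer & ((1 << offset_bits) - 1)
    return pool, slice_index, offset


if __name__ == "__main__":
    pointer = encode_pointer(3, 12345, 2047)
    print(pointer, decode_pointer(pointer))
```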

68
Linking the slices
[Diagram: the four pools with slice sizes 2^11, 2^7, 2^4 and 2^1; legend: available, allocated, current list]

• Dictionary (parallel arrays): pointer to the last posting indexed for a term

70
Posting list encoding - Summary
• ints can be written atomically in Java
• Backwards traversal easy on absolute docIDs (not deltas)
• Every posting is a possible entry point for a searcher
• Skipping can be done without additional data structures as binary search,
though there are better approaches (skip lists)
• Repeating docIDs if a term occurs multiple times in the same document only
works for small docs
• Max. segment size: 2^24 = 16.7M tweets
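
Skipping by binary search can be sketched over a sorted list of absolute doc IDs. The sketch assumes a flat list for simplicity, whereas the real postings live in slices spread over several pools; `skip_to` is a hypothetical name.

```python
from bisect import bisect_left


def skip_to(doc_ids, target):
    """Return the index of the first doc ID >= target, or None if none exists."""
    i = bisect_left(doc_ids, target)
    return i if i < len(doc_ids) else None


if __name__ == "__main__":
    postings = [5, 15, 9000, 9002, 100000, 100090]
    print(skip_to(postings, 9001))
```

This works only because every posting stores an absolute doc ID; with delta encoding the values in the middle of the list could not be compared directly.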

71
New posting list encoding
• Objectives:
• 32 bit positions and variable-length payloads
• Store term frequency (TF) instead of repeating docIDs
• Keep:
• Concurrency model
• Space-efficiency for short documents
• Performance

72
New posting list encoding
[Diagram: two parallel streams. One holds DocID, termFreq entries with a fixed length for each posting (32 bits); the other holds Position, Payload entries of variable length.]

• Store TF instead of repeating the same DocID
• Store DocID/TF pairs separately from position/payloads
• Find a way to synchronously decode the two streams without storing a
pointer for each posting (expensive)

79
New posting list encoding

• Idea: Use an embedded skip list as periodical “synchronization points”
• Keeps memory overhead for pointers low and improves search performance

80
New posting list encoding
[Diagram: the four pools with slice sizes 2^11, 2^7, 2^4 and 2^1; legend: available, allocated, current list]

81
New posting list encoding

Slice header

• Header contains:
• Back-pointer to previous slice (as before)
• Skip list
• Slice id

82
New posting list encoding
int (32 bits):
• docID: 24 bits, max. 16.7M
• textPosition: 8 bits, max. 255

• Observation: Most tweets don’t need all 8 bits for text position
• Idea: Use the position “inlining” approach for short documents, but support
Lucene’s 32-bit positions and variable length payloads

83
New posting list encoding
int (32 bits):
• docID: 24 bits, max. 16.7M
• textPosition or termFreq: 7 bits, max. 127
• flag: 1 bit (0=textPosition, 1=termFreq)

As a storage optimization, the text position is stored with the docID if:
o termFreq == 1 (term occurs once only in the doc) AND
o textPosition <= 127 AND
o Posting has no payload AND
o Posting is not at a skip point of the docID posting list (see later).
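
The inlining rule can be sketched as follows. The four conditions are taken from the slide; the bit order (docID, then the 7-bit value, then the flag in the lowest bit), the function names, and the choice to raise an error when termFreq exceeds 7 bits are assumptions, since the slides do not say how larger values are handled.

```python
def can_inline_position(term_freq, text_position, has_payload, at_skip_point):
    """Apply the four conditions for storing the position with the docID."""
    return (term_freq == 1 and text_position <= 127
            and not has_payload and not at_skip_point)


def pack_posting(doc_id, term_freq, text_position,
                 has_payload=False, at_skip_point=False):
    """Pack docID plus either an inlined position or the term frequency."""
    if not 0 <= doc_id < (1 << 24):
        raise ValueError("docID does not fit into 24 bits")
    if can_inline_position(term_freq, text_position, has_payload, at_skip_point):
        value, flag = text_position, 0
    else:
        if term_freq > 127:
            raise ValueError("termFreq above 127 is not covered by this sketch")
        value, flag = term_freq, 1
    return (doc_id << 8) | (value << 1) | flag


def unpack_posting(posting):
    """Return (docID, value, is_term_freq)."""
    return posting >> 8, (posting >> 1) & 0x7F, bool(posting & 1)


if __name__ == "__main__":
    print(unpack_posting(pack_posting(5, 1, 17)))
    print(unpack_posting(pack_posting(5, 3, 17)))
```

When the flag marks a term frequency, the positions for that posting would be read from the separate position/payload stream.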

84
New posting list encoding - Summary
• Support for 32 bit positions and arbitrary length payloads stored in separate
data structure
• Performance and space consumption very similar compared to previous
encoding for tweet search
• Skip lists used for speed and synchronization points
• For short documents positions can still be inlined

85
Questions?
Michael Busch
@michibusch
michael@twitter.com
buschmi@apache.org

Previous talk: http://vimeo.com/31195040
86

Más contenido relacionado

La actualidad más candente

What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy SokolenkoProvectus
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsOpenSource Connections
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
Twitter Search Architecture
Twitter Search Architecture Twitter Search Architecture
Twitter Search Architecture Ramez Al-Fayez
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Adrien Grand
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsLucidworks
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneSwapnil & Patil
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingabial
 
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...Lucidworks
 
Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialeckilucenerevolution
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 

La actualidad más candente (20)

What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
Elasticsearch speed is key
Elasticsearch speed is keyElasticsearch speed is key
Elasticsearch speed is key
 
Search Lucene
Search LuceneSearch Lucene
Search Lucene
 
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy Sokolenko
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Twitter Search Architecture
Twitter Search Architecture Twitter Search Architecture
Twitter Search Architecture
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
 
Flexible Indexing in Lucene 4.0
Flexible Indexing in Lucene 4.0Flexible Indexing in Lucene 4.0
Flexible Indexing in Lucene 4.0
 
Lucece Indexing
Lucece IndexingLucece Indexing
Lucece Indexing
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
 
Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 

Destacado

Real-time systems at Twitter (Velocity 2012)
Real-time systems at Twitter (Velocity 2012)Real-time systems at Twitter (Velocity 2012)
Real-time systems at Twitter (Velocity 2012)Raffi Krikorian
 
Adapting Alax Solr to Compare different sets of documents - Joan Codina
Adapting Alax Solr to Compare different sets of documents - Joan CodinaAdapting Alax Solr to Compare different sets of documents - Joan Codina
Adapting Alax Solr to Compare different sets of documents - Joan Codinalucenerevolution
 
Personalizing Search at LinkedIn
Personalizing Search at LinkedInPersonalizing Search at LinkedIn
Personalizing Search at LinkedInViet Ha-Thuc
 
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...Lucidworks
 
Ektron 8.5 RC - Search
Ektron 8.5 RC - SearchEktron 8.5 RC - Search
Ektron 8.5 RC - SearchBillCavaUs
 
Events, Signals, and Recommendations
Events, Signals, and RecommendationsEvents, Signals, and Recommendations
Events, Signals, and RecommendationsLucidworks
 
Solr JDBC - Lucene/Solr Revolution 2016
Solr JDBC - Lucene/Solr Revolution 2016Solr JDBC - Lucene/Solr Revolution 2016
Solr JDBC - Lucene/Solr Revolution 2016Kevin Risden
 
Netflix Global Search - Lucene Revolution
Netflix Global Search - Lucene RevolutionNetflix Global Search - Lucene Revolution
Netflix Global Search - Lucene Revolutionivan provalov
 
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...Lucidworks
 
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Lucidworks
 
Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Airbnb Search Architecture: Presented by Maxim Charkov, AirbnbAirbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Airbnb Search Architecture: Presented by Maxim Charkov, AirbnbLucidworks
 
Big & Personal: the data and the models behind Netflix recommendations by Xa...
 Big & Personal: the data and the models behind Netflix recommendations by Xa... Big & Personal: the data and the models behind Netflix recommendations by Xa...
Big & Personal: the data and the models behind Netflix recommendations by Xa...BigMine
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Sourceaspyker
 
Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applicationsJoey Echeverria
 
AWS re:Invent 2016: Searching Inside Video at Petabyte Scale Using Spot (WIN307)
AWS re:Invent 2016: Searching Inside Video at Petabyte Scale Using Spot (WIN307)AWS re:Invent 2016: Searching Inside Video at Petabyte Scale Using Spot (WIN307)
AWS re:Invent 2016: Searching Inside Video at Petabyte Scale Using Spot (WIN307)Amazon Web Services
 

Destacado (16)

Real-time systems at Twitter (Velocity 2012)
Real-time systems at Twitter (Velocity 2012)Real-time systems at Twitter (Velocity 2012)
Real-time systems at Twitter (Velocity 2012)
 
Adapting Alax Solr to Compare different sets of documents - Joan Codina
Adapting Alax Solr to Compare different sets of documents - Joan CodinaAdapting Alax Solr to Compare different sets of documents - Joan Codina
Adapting Alax Solr to Compare different sets of documents - Joan Codina
 
Personalizing Search at LinkedIn
Personalizing Search at LinkedInPersonalizing Search at LinkedIn
Personalizing Search at LinkedIn
 
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
 
Ektron 8.5 RC - Search
Ektron 8.5 RC - SearchEktron 8.5 RC - Search
Ektron 8.5 RC - Search
 
Streaming ETL for All
Streaming ETL for AllStreaming ETL for All
Streaming ETL for All
 
Search at Twitter

  • 2. Search @twitter. Agenda: ‣ Introduction; Search Architecture; Inverted Index 101; Realtime Posting Lists
  • 4. Introduction: Twitter has more than 230 million monthly active users.
  • 5. Introduction: 500 million tweets are sent per day.
  • 6. Introduction: More than 300 billion tweets have been sent since the company was founded in 2006.
  • 8. Introduction: More than 2 billion search queries per day.
  • 9. Introduction (timeline, 2008 to 2014). 2008: Twitter acquires Summize (MySQL-based RT search engine). 2010: Modified Lucene (Earlybird) ships and replaces MySQL indexes. 2011: New Earlybird features: image/video search; index compression; efficient relevance search in time-sorted index. 2012 to 2014: Tweet archive search on SSD with vanilla Lucene; new RT posting list format that supports arbitrary document lengths, but keeps performance optimizations for tweets.
  • 14. Realtime Search @twitter. Agenda: Introduction; ‣ Search Architecture; Inverted Index 101; Realtime Posting Lists
  • 16. Search Architecture (diagram). Realtime path: RT stream → raw tweets → Analyzer/Partitioner → analyzed tweets → RT index (Earlybird). Archive path: Tweet archive (HDFS) → raw tweets → Mapreduce Analyzer → analyzed tweets → Archive index. Blender receives search requests and searches both indexes. Legend: writes, searches.
  • 17. Search Architecture: Analyzer/Partitioner • Pre-processes tweets for indexing • Analyzes text (tokenization/normalization) • Geo-coding, URL expansion, etc. • Hash partitioning
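
The Analyzer/Partitioner steps on the slide above can be illustrated with a minimal Python sketch. It is not Twitter's implementation: the token pattern, the function names (`analyze`, `partition_for`, `preprocess`), and the dictionary layout are assumptions made for illustration, and geo-coding and URL expansion are left out. The partition rule uses the `docId % n` scheme shown on the cluster layout slide.

```python
import re

# Hypothetical token pattern: words, with an optional leading '#' or '@'
# so that hashtags and mentions survive as single tokens.
TOKEN_RE = re.compile(r"[#@]?\w+")


def analyze(text):
    """Tokenize and normalize (lowercase) the text of one tweet."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def partition_for(doc_id, num_partitions):
    """Hash partitioning as on the cluster layout slide: docId % n."""
    return doc_id % num_partitions


def preprocess(doc_id, text, num_partitions):
    """Produce an 'analyzed tweet' record ready to be sent to one partition."""
    return {
        "id": doc_id,
        "tokens": analyze(text),
        "partition": partition_for(doc_id, num_partitions),
    }
```

For example, `preprocess(10, "Search at Twitter", 4)` yields the tokens `search`, `at`, `twitter` and assigns the tweet to partition 2.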
  • 19. Search Architecture: RT index (Earlybird) • Modified Lucene index implementation optimized for realtime search • IndexWriter buffer is searchable (no need to flush to allow searching) • In-memory • Hash-partitioned, static layout
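
To make "the write buffer is searchable without a flush" concrete, here is a small single-threaded Python sketch. It is only an illustration of the idea, not Earlybird's data structure: the class `RealtimeIndex`, its methods, and the "publish the highest visible document id last" rule are assumptions chosen to keep the example short.

```python
from collections import defaultdict


class RealtimeIndex:
    """Toy in-memory inverted index whose buffer is searchable immediately."""

    def __init__(self):
        self._postings = defaultdict(list)  # term -> doc ids, oldest first
        self._max_doc = -1                  # highest doc id visible to searches

    def add(self, tokens):
        """Index one document and make it visible without any flush step."""
        doc_id = self._max_doc + 1
        for term in set(tokens):
            self._postings[term].append(doc_id)
        # Publishing the new maximum last means a search never sees a
        # partially indexed document.
        self._max_doc = doc_id
        return doc_id

    def search(self, term, limit=10):
        """Return up to `limit` matching doc ids, newest first."""
        visible = self._max_doc
        hits = []
        for doc_id in reversed(self._postings.get(term, [])):
            if doc_id <= visible:
                hits.append(doc_id)
                if len(hits) == limit:
                    break
        return hits
```

Because doc ids are assigned in arrival order, walking a posting list backwards returns the newest tweets first, which matches the time-sorted behaviour the deck describes.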
  • 21. Cluster layout (diagram): n hash partitions (docId % n); each partition is served by several Earlybird replicas.
  • 22. Cluster layout (diagram): n hash partitions (docId % n), multiple timeslices, and several Earlybird replicas for each partition and timeslice.
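
The cluster layout can be read as a grid of (timeslice, partition) cells, each backed by several replicas. The Python sketch below shows one plausible way to route writes and searches over such a grid. The `Cluster` class and both routing policies (a write goes to every replica of its cell; a search picks one replica per cell) are assumptions for illustration; the slides only state the `docId % n` partitioning rule.

```python
import random


class Cluster:
    """Toy model of a grid of Earlybird nodes: timeslice x partition x replica."""

    def __init__(self, num_partitions, num_timeslices, num_replicas):
        self.num_partitions = num_partitions
        self.num_timeslices = num_timeslices
        self.num_replicas = num_replicas

    def partition_of(self, doc_id):
        """Hash partitioning from the slide: docId % n."""
        return doc_id % self.num_partitions

    def write_targets(self, doc_id, timeslice):
        """Assumed policy: every replica of the owning cell gets the write."""
        partition = self.partition_of(doc_id)
        return [(timeslice, partition, r) for r in range(self.num_replicas)]

    def search_targets(self, rng=random):
        """Assumed policy: one randomly chosen replica per cell.

        Documents are spread over partitions by id, not by term, so a
        query has to visit every partition.
        """
        return [
            (t, p, rng.randrange(self.num_replicas))
            for t in range(self.num_timeslices)
            for p in range(self.num_partitions)
        ]
```

With `Cluster(3, 2, 2)`, document 7 in timeslice 1 is written to both replicas of partition 1, while a search fans out to 6 nodes, one per cell.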
  • 26. Search Architecture: Mapreduce Analyzer • Daily jobs that process raw tweets • Analyzes text • Aggregates metadata and signals
  • 28. Search Architecture: Archive index • Standard Lucene (4.4) indexes • Reverse time-sorted (new to old) • Cluster layout similar to realtime search cluster
  • 29. Search Architecture: Archive index • Two tiers: in-memory index and SSD index
  • 30. Search Architecture: Archive index • In-memory index: contains a small number of the best tweets of all time
  • 31. Search Architecture: Archive index • SSD index: much bigger index with more tweets and a lower maximum QPS, limited by SSD IOPS; only needs to be queried if the in-memory index did not yield enough results
  • 33. Search Architecture: Blender • Blender is our Thrift service aggregator • Queries multiple Earlybirds, merges results
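The merge step Blender performs can be sketched as a k-way merge. This is only an illustration under stated assumptions: each partition is assumed to return doc IDs sorted newest first, a larger doc ID is assumed to mean a newer tweet, and `merge_results` is a hypothetical name, not part of Blender's Thrift interface.

```python
import heapq
from itertools import islice


def merge_results(partition_results, k):
    """Merge per-partition hit lists (each sorted newest first) into the top k."""
    merged = heapq.merge(*partition_results, reverse=True)
    return list(islice(merged, k))


hits = merge_results([[100090, 9002], [100000, 9000], [15, 5]], k=3)
print(hits)  # [100090, 100000, 9002]
```

Since `heapq.merge` is lazy, only as many hits as requested are pulled from the per-partition lists.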
  • 35. Search Architecture [Diagram: the same architecture with an update path added. Tweets enter through the Analyzer/Partitioner and HDFS as before; in addition, a queue carries updates (deletes and engagement, e.g. retweets/favs). Blender receives search requests and searches the RT index (Earlybird) and the Archive index.]
  • 36. Realtime Search @twitter Agenda - Introduction - Search Architecture ‣ Inverted Index 101 - Realtime Posting Lists
  • 38. Inverted Index 101 • Table with 6 documents:
      1  The old night keeper keeps the keep in the town
      2  In the big old house in the big old gown.
      3  The house in the town had the big old keep
      4  Where the old night keeper never did sleep.
      5  The night keeper keeps the keep in the night
      6  And keeps in the dark and sleeps in the light.
    Example from: Justin Zobel, Alistair Moffat, Inverted files for text search engines, ACM Computing Surveys (CSUR) v.38 n.2, p.6-es, 2006
  • 39. Inverted Index 101 • Dictionary and posting lists for the 6 documents above:
      term    freq  postings
      and     1     <6>
      big     2     <2> <3>
      dark    1     <6>
      did     1     <4>
      gown    1     <2>
      had     1     <3>
      house   2     <2> <3>
      in      5     <1> <2> <3> <5> <6>
      keep    3     <1> <3> <5>
      keeper  3     <1> <4> <5>
      keeps   3     <1> <5> <6>
      light   1     <6>
      never   1     <4>
      night   3     <1> <4> <5>
      old     4     <1> <2> <3> <4>
      sleep   1     <4>
      sleeps  1     <6>
      the     6     <1> <2> <3> <4> <5> <6>
      town    2     <1> <3>
      where   1     <4>
  • 40. Inverted Index 101 • Query: keeper • The dictionary entry for keeper (freq 3) points to the posting list <1> <4> <5>
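The dictionary and posting lists for the six example documents can be rebuilt with a few lines of Python. This is a minimal sketch: the tokenizer (lowercase, letters only) and the name `build_index` are assumptions made for the illustration, not the analysis chain used at Twitter.

```python
import re
from collections import defaultdict

DOCS = {
    1: "The old night keeper keeps the keep in the town",
    2: "In the big old house in the big old gown.",
    3: "The house in the town had the big old keep",
    4: "Where the old night keeper never did sleep.",
    5: "The night keeper keeps the keep in the night",
    6: "And keeps in the dark and sleeps in the light.",
}


def build_index(docs):
    """Map each lowercased term to the ascending list of doc IDs containing it."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # set() so that a term repeated within one document yields one posting
        for term in set(re.findall(r"[a-z]+", docs[doc_id].lower())):
            postings[term].append(doc_id)
    return dict(postings)


index = build_index(DOCS)
print(index["keeper"])       # [1, 4, 5]
print(len(index["in"]))      # 5, the "freq" column for the term "in"
```

Answering the query "keeper" is then a single dictionary lookup followed by a walk over that term's posting list.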
  • 42. Posting list encoding • Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
  • 43. Posting list encoding • Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 • Delta encoding: 5, 10, 8985, 2, 90998, 90
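Delta encoding stores each doc ID as the gap to its predecessor. A minimal Python sketch using the slide's numbers (the function names are hypothetical):

```python
from itertools import accumulate


def delta_encode(doc_ids):
    """Replace each ascending doc ID by its gap to the previous one."""
    previous = 0
    deltas = []
    for doc_id in doc_ids:
        deltas.append(doc_id - previous)
        previous = doc_id
    return deltas


def delta_decode(deltas):
    """Rebuild doc IDs with a running sum, starting at the oldest posting."""
    return list(accumulate(deltas))


deltas = delta_encode([5, 15, 9000, 9002, 100000, 100090])
print(deltas)                # [5, 10, 8985, 2, 90998, 90]
print(delta_decode(deltas))  # [5, 15, 9000, 9002, 100000, 100090]
```

The running sum in `delta_decode` is why a delta-encoded list can only be read from the oldest posting forward.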
  • 44. Posting list encoding • Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 • Delta encoding: 5, 10, 8985, 2, 90998, 90 • VInt compression of 5: 00000101 • Values 0 <= delta <= 127 need one byte
  • 45. Posting list encoding • Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 • Delta encoding: 5, 10, 8985, 2, 90998, 90 • VInt compression of 8985: 11000110 00011001 • Values 128 <= delta <= 16383 need two bytes (two bytes carry 14 payload bits)
  • 46. Posting list encoding • Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 • Delta encoding: 5, 10, 8985, 2, 90998, 90 • VInt compression of 8985: 11000110 00011001 • First bit of each byte indicates whether the next byte belongs to the same value
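The variable-length scheme can be sketched in Python as below. The sketch follows the byte order drawn on the slide, with the most significant 7-bit group first; as far as I know, Lucene's own `writeVInt` emits the low-order group first, so its bytes for 8985 would appear in the opposite order while the byte counts stay the same. The function names are hypothetical.

```python
def encode_vint(value):
    """Encode a non-negative int in 7-bit groups, most significant group first.

    The top bit of every byte except the last is set to signal that
    another byte of the same value follows.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def decode_vint(data, pos=0):
    """Decode one value starting at pos; return (value, next position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos


for delta in (5, 8985, 90998):
    print(delta, " ".join(f"{b:08b}" for b in encode_vint(delta)))
```

A value occupies one to five bytes for a 32-bit range, so a posting has no fixed width and cannot be stored with a single primitive write.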
  • 47. Posting list encoding • Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 • Delta encoding: 5, 10, 8985, 2, 90998, 90 • VInt compression of 8985: 11000110 00011001 • Variable number of bytes: a VInt-encoded posting cannot be written as a primitive Java type; therefore it cannot be written atomically
  • 48. Posting list encoding • Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 • Delta encoding: 5, 10, 8985, 2, 90998, 90 • Read direction: old to new • Each posting depends on the previous one; decoding is only possible in old-to-new direction • With recency ranking (new-to-old) no early termination is possible
  • 49. Posting list encoding • By default Lucene uses a combination of delta encoding and VInt compression • VInts are expensive to decode • Problem 1: How to traverse posting lists backwards? • Problem 2: How to write a posting atomically?
  • 50. Realtime Search @twitter Agenda - Introduction - Search Architecture - Inverted Index 101 ‣ Realtime Posting Lists
  • 52. Posting list encoding in Earlybird v1 • One int (32 bits) per posting: docID (24 bits, max. 16.7M) and textPosition (8 bits, max. 255) • Tweet text can only have 140 chars
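Packing a posting into one 32-bit value can be sketched as follows. The slide gives the field widths but not their order; placing the docID in the upper 24 bits is an assumption of this sketch, and the function names are hypothetical. Python integers are unbounded, so the packed value is treated here as an unsigned 32-bit quantity, whereas a Java int would hold the same bits as a signed value.

```python
DOC_ID_BITS = 24
POSITION_BITS = 8
MAX_DOC_ID = (1 << DOC_ID_BITS) - 1      # 16,777,215 (about 16.7M)
MAX_POSITION = (1 << POSITION_BITS) - 1  # 255


def pack_posting(doc_id, position):
    """Pack docID (assumed upper 24 bits) and textPosition (lower 8 bits)."""
    if not 0 <= doc_id <= MAX_DOC_ID:
        raise ValueError("docID does not fit into 24 bits")
    if not 0 <= position <= MAX_POSITION:
        raise ValueError("textPosition does not fit into 8 bits")
    return (doc_id << POSITION_BITS) | position


def unpack_posting(posting):
    """Return (docID, textPosition) from a packed 32-bit posting."""
    return posting >> POSITION_BITS, posting & MAX_POSITION


packed = pack_posting(100090, 17)
print(unpack_posting(packed))  # (100090, 17)
```

Because every posting is a complete, fixed-width value, it can be written with one store and read without looking at its neighbours.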
  • 53. Posting list encoding in Earlybird v1 • Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 • Earlybird encoding (absolute doc IDs): 5, 15, 9000, 9002, 100000, 100090 • Read direction: new to old
  • 54. Early query termination • Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 • Earlybird encoding (absolute doc IDs): 5, 15, 9000, 9002, 100000, 100090 • Read direction: new to old • E.g. if 3 results are requested, we can terminate after reading 3 postings
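Early termination on a list of absolute doc IDs can be sketched like this. The Python list stands in for the real posting storage, and `newest_first` is a hypothetical name.

```python
def newest_first(postings, limit):
    """Collect up to `limit` doc IDs, walking the list from newest to oldest.

    Returns the results and the number of postings that had to be read.
    """
    results = []
    postings_read = 0
    for doc_id in reversed(postings):
        postings_read += 1
        results.append(doc_id)
        if len(results) == limit:
            break  # early termination: older postings are never touched
    return results, postings_read


print(newest_first([5, 15, 9000, 9002, 100000, 100090], 3))
# ([100090, 100000, 9002], 3)
```

With delta encoding the same query would first have to decode the whole list from the oldest posting before the three newest ones are known.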
  • 55. Inverted index components • Dictionary • Parallel arrays: pointer to the most recently indexed posting for a term • Posting list storage: ?
  • 57. Posting lists storage - Objectives • Store many singly linked lists of different lengths space-efficiently • The number of Java objects should be independent of the number of lists or the number of items in the lists • Every item should be a possible entry point into the lists for iterators, i.e. items should not depend on other items (e.g. no delta encoding) • Append and read possible by multiple threads in a lock-free fashion (single append thread, multiple reader threads) • Traversal in backwards order
  • 59. Memory management • 4 int[] pools, each made of blocks (one block = one 32K int[]) • Each pool can be grown individually by adding 32K blocks
  • 60. Memory management • 4 int[] pools • For simplicity we can forget about the blocks for now and think of the pools as contiguous, unbounded int[] arrays • Small total number of Java objects (each 32K block is one object)
  • 61. Memory management • Slice sizes of the four pools: 2^11, 2^7, 2^4, 2^1 • Slices can be allocated in each pool • Each pool has a different, but fixed slice size
  • 62. Adding and appending to a list [Diagram: the four pools with slice sizes 2^11, 2^7, 2^4, 2^1; legend: available, allocated, current list]
  • 63. Adding and appending to a list • Store the first two postings in a slice of the 2^1 pool
  • 64. Adding and appending to a list • When the first slice is full, allocate another one in the second pool
  • 65. Adding and appending to a list • Allocate a slice on each level as the list grows
  • 66. Adding and appending to a list • On the uppermost level one list can own multiple slices
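The growth policy described on these slides can be modelled with a toy class. This sketch only tracks which pool each new slice comes from; it uses ordinary Python lists instead of shared int[] pools, and it ignores whatever space the real format spends on linking slices. `PostingList` and `pools_used` are hypothetical names.

```python
SLICE_SIZES = [2**1, 2**4, 2**7, 2**11]  # one fixed slice size per pool


class PostingList:
    """Toy list that allocates one slice per pool level as it grows."""

    def __init__(self):
        self.slices = []  # (pool_index, postings) in allocation order

    def append(self, posting):
        if self._current_slice_is_full():
            # Move up one pool per new slice; once the last pool is
            # reached, keep allocating further slices there.
            pool = min(len(self.slices), len(SLICE_SIZES) - 1)
            self.slices.append((pool, []))
        self.slices[-1][1].append(posting)

    def _current_slice_is_full(self):
        if not self.slices:
            return True
        pool, postings = self.slices[-1]
        return len(postings) == SLICE_SIZES[pool]

    def pools_used(self):
        return [pool for pool, _ in self.slices]


plist = PostingList()
for doc_id in range(2 + 16 + 128 + 2048 + 1):
    plist.append(doc_id)
print(plist.pools_used())  # [0, 1, 2, 3, 3]
```

Short lists, which are the common case for rare terms, therefore waste at most a few ints, while long lists quickly reach the large slices.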
  • 67. Posting list format v1 • One int (32 bits) per posting: docID (24 bits, max. 16.7M) and textPosition (8 bits, max. 255) • Tweet text can only have 140 chars
  • 68. Addressing items • Use 32 bit (int) pointers to address any item in any list unambiguously • Pointer layout, int (32 bits): poolIndex (2 bits, values 0-3), sliceIndex (19-29 bits, depends on pool), offset in slice (1-11 bits, depends on pool) • Nice symmetry: postings and address pointers both fit into a 32 bit int
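The pointer layout can be sketched in Python as below. The field widths come from the slide (the offset needs 1, 4, 7, or 11 bits for slice sizes 2^1, 2^4, 2^7, 2^11, leaving 29 to 19 bits for the slice index); the order of the fields inside the int, with poolIndex in the top two bits, is an assumption, and the function names are hypothetical.

```python
OFFSET_BITS = [1, 4, 7, 11]  # log2 of the slice sizes 2^1, 2^4, 2^7, 2^11
POOL_BITS = 2
PAYLOAD_BITS = 32 - POOL_BITS  # bits shared by sliceIndex and offset


def encode_pointer(pool, slice_index, offset):
    """Pack (poolIndex, sliceIndex, offset in slice) into 32 bits."""
    offset_bits = OFFSET_BITS[pool]
    slice_bits = PAYLOAD_BITS - offset_bits
    if not 0 <= slice_index < (1 << slice_bits):
        raise ValueError("sliceIndex out of range for this pool")
    if not 0 <= offset < (1 << offset_bits):
        raise ValueError("offset out of range for this pool")
    return (pool << PAYLOAD_BITS) | (slice_index << offset_bits) | offset


def decode_pointer(pointer):
    """Return (poolIndex, sliceIndex, offset); the pool decides the split."""
    pool = pointer >> PAYLOAD_BITS
    offset_bits = OFFSET_BITS[pool]
    slice_index = (pointer & ((1 << PAYLOAD_BITS) - 1)) >> offset_bits
    offset = pointer & ((1 << offset_bits) - 1)
    return pool, slice_index, offset


print(decode_pointer(encode_pointer(3, 5, 2047)))  # (3, 5, 2047)
```

Reading the pool index first tells the decoder how many of the remaining 30 bits belong to the offset, so no extra metadata is needed.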
  • 69. Linking the slices [Diagram: the four pools with slice sizes 2^11, 2^7, 2^4, 2^1; legend: available, allocated, current list]
  • 70. Linking the slices • Dictionary • Parallel arrays: pointer to the last posting indexed for a term
  • 71. Posting list encoding - Summary • ints can be written atomically in Java • Backwards traversal is easy on absolute docIDs (not deltas) • Every posting is a possible entry point for a searcher • Skipping can be done without additional data structures by binary search, though there are better approaches (skip lists) • Repeating docIDs if a term occurs multiple times in the same document only works for small docs • Max. segment size: 2^24 = 16.7M tweets
  • 72. New posting list encoding • Objectives: 32 bit positions and variable-length payloads; store term frequency (TF) instead of repeating docIDs • Keep: concurrency model; space-efficiency for short documents; performance
  • 73. New posting list encoding • Each posting has two parts: DocID, termFreq and Position, Payload
  • 74. New posting list encoding • DocID, termFreq: fixed length for each posting
  • 75. New posting list encoding • Position, Payload: variable length
  • 77. New posting list encoding [Diagram: two parallel streams, one holding the DocID, termFreq entries and one holding the Position, Payload entries]
  • 78. New posting list encoding • Store TF instead of repeating the same DocID • Store DocID/TF pairs separately from position/payloads • Find a way to synchronously decode the two streams without storing a pointer for each posting (expensive)
  • 79. New posting list encoding • DocID, termFreq stream: fixed length for each posting (32 bits)
  • 80. New posting list encoding • Idea: Use an embedded skip list as periodic “synchronization points” • Keeps memory overhead for pointers low and improves search performance
  • 81. New posting list encoding [Diagram: the four pools with slice sizes 2^11, 2^7, 2^4, 2^1; legend: available, allocated, current list]
  • 82. New posting list encoding: Slice header • Header contains: back-pointer to previous slice (as before); skip list; slice id
  • 83. New posting list encoding • v1 posting, int (32 bits): docID (24 bits, max. 16.7M) and textPosition (8 bits, max. 255) • Observation: Most tweets don’t need all 8 bits for text position • Idea: Use the position “inlining” approach for short documents, but support Lucene’s 32-bit positions and variable length payloads
  • 84. New posting list encoding • Posting, int (32 bits): docID (24 bits, max. 16.7M), textPosition or termFreq (7 bits, max. 127), flag (1 bit: 0=textPosition, 1=termFreq) • As a storage optimization, the text position is stored with the docID if: termFreq == 1 (term occurs only once in the doc) AND textPosition <= 127 AND the posting has no payload AND the posting is not at a skip point of the docID posting list (see later).
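The inlining decision can be sketched in Python as follows. The four conditions are taken from the slide; the bit order (docID in the upper 24 bits, the 7-bit value next, the flag in the lowest bit) is an assumption, as are the function names. The slide does not say how a termFreq above 127 is handled, so this sketch simply rejects it.

```python
FLAG_TEXT_POSITION = 0
FLAG_TERM_FREQ = 1


def encode_doc_posting(doc_id, term_freq, first_position, has_payload, at_skip_point):
    """Pack docID with either an inlined textPosition or the termFreq."""
    if not 0 <= doc_id < (1 << 24):
        raise ValueError("docID does not fit into 24 bits")
    can_inline = (
        term_freq == 1
        and 0 <= first_position <= 127
        and not has_payload
        and not at_skip_point
    )
    if can_inline:
        return (doc_id << 8) | (first_position << 1) | FLAG_TEXT_POSITION
    if not 0 < term_freq <= 127:
        raise ValueError("termFreq overflow is not covered by this sketch")
    return (doc_id << 8) | (term_freq << 1) | FLAG_TERM_FREQ


def decode_doc_posting(posting):
    """Return (docID, kind, value) where kind names the 7-bit field."""
    kind = "termFreq" if posting & 1 else "textPosition"
    return posting >> 8, kind, (posting >> 1) & 0x7F


print(decode_doc_posting(encode_doc_posting(100090, 1, 17, False, False)))
# (100090, 'textPosition', 17)
print(decode_doc_posting(encode_doc_posting(100090, 3, 17, False, False)))
# (100090, 'termFreq', 3)
```

When the flag says termFreq, a reader would fetch positions and payloads from the separate stream; when it says textPosition, the posting is complete on its own, which keeps the common tweet-sized case as compact as in v1.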
  • 85. New posting list encoding - Summary • Support for 32 bit positions and arbitrary length payloads stored in separate data structure • Performance and space consumption very similar compared to previous encoding for tweet search • Skip lists used for speed and synchronization points • For short documents positions can still be inlined