Pairwise Document Similarity in Large Collections with MapReduce

Tamer Elsayed, Jimmy Lin, and Douglas Oard

Niveda Krishnamoorthy

• Pairwise similarity
• MapReduce framework
• Proposed algorithm
  • Inverted index construction
  • Pairwise document similarity calculation
• Results

• PubMed – “More like this”
• Similar blog posts
• Google – Similar pages

• Framework that supports distributed computing on clusters of computers
• Introduced by Google in 2004
• Map step
• Reduce step
• Combine step (optional)
• Applications
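The map, combine, and reduce steps listed above can be sketched as a small in-memory runner. This is only an illustration of the data flow, not Hadoop's API: `run_mapreduce`, `emit_sales`, `total`, and the sales records are hypothetical names and data chosen for the example.

```python
from collections import defaultdict


def _group_and_apply(pairs, func):
    """Group (key, value) pairs by key, then call func once per key in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    output = []
    for key in sorted(grouped):
        output.extend(func(key, grouped[key]))
    return output


def run_mapreduce(records, mapper, reducer, combiner=None):
    """Run one job over (key, value) input records, entirely in memory."""
    intermediate = []
    for key, value in records:
        # Map step: each input record yields zero or more (key, value) pairs.
        pairs = list(mapper(key, value))
        if combiner is not None:
            # Optional combine step: pre-aggregate a single mapper's output.
            pairs = _group_and_apply(pairs, combiner)
        intermediate.extend(pairs)
    # Shuffle and sort by key, then run the reduce step once per key.
    return _group_and_apply(intermediate, reducer)


def emit_sales(_log_name, rows):
    # Hypothetical mapper: pass (region, amount) rows through unchanged.
    yield from rows


def total(region, amounts):
    # Hypothetical reducer (also usable as combiner): sum amounts per region.
    yield region, sum(amounts)


records = [
    ("log1", [("north", 3), ("south", 5), ("north", 4)]),
    ("log2", [("south", 1)]),
]
totals = run_mapreduce(records, emit_sales, total, combiner=total)
print(totals)  # [('north', 7), ('south', 6)]
```

Summation is associative and commutative, so the same function can serve as combiner and reducer; that does not hold for every reduce function.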

• Consider two files:
  File 1: Hello World Bye World
  File 2: Hello Hadoop Goodbye Hadoop
• Word count output:
  Hello, 2
  World, 2
  Bye, 1
  Hadoop, 2
  Goodbye, 1

Map 1 (file 1):
  Hello → <Hello,1>
  World → <World,1>
  Bye → <Bye,1>
  World → <World,1>
Map 2 (file 2):
  Hello → <Hello,1>
  Hadoop → <Hadoop,1>
  Goodbye → <Goodbye,1>
  Hadoop → <Hadoop,1>

Shuffle & sort groups the eight <word,1> pairs by word; each group goes to one reducer:
  <Hello,(1,1)> → Reduce 1 → Hello, 2
  <World,(1,1)> → Reduce 2 → World, 2
  <Bye,(1)> → Reduce 3 → Bye, 1
  <Hadoop,(1,1)> → Reduce 4 → Hadoop, 2
  <Goodbye,(1)> → Reduce 5 → Goodbye, 1
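The word count walkthrough above can be reproduced with a few lines of plain Python. This is a single-process sketch of the map, shuffle, and reduce stages; the function names are made up for the example and the file contents are the two files from the slide.

```python
from collections import defaultdict

files = {
    "file1": "Hello World Bye World",
    "file2": "Hello Hadoop Goodbye Hadoop",
}


def map_words(text):
    """Map step: emit a <word, 1> pair for every word occurrence."""
    return [(word, 1) for word in text.split()]


def shuffle(pairs):
    """Shuffle step: collect all values that share the same word."""
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    return groups


def reduce_counts(word, ones):
    """Reduce step: add up the ones collected for a single word."""
    return word, sum(ones)


mapped = [pair for text in files.values() for pair in map_words(text)]
word_counts = dict(reduce_counts(w, ones) for w, ones in shuffle(mapped).items())
print(word_counts)
# {'Hello': 2, 'World': 2, 'Bye': 1, 'Hadoop': 2, 'Goodbye': 1}
```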

MapReduce algorithm (scalable and efficient)
• Inverted index computation
• Pairwise similarity

Inverted index construction, map step (one map call per document):
  Map 1, document 1 (A A B C): <A,(d1,2)>, <B,(d1,1)>, <C,(d1,1)>
  Map 2, document 2 (B D D): <B,(d2,1)>, <D,(d2,2)>
  Map 3, document 3 (A B B E): <A,(d3,1)>, <B,(d3,2)>, <E,(d3,1)>

Shuffle & sort groups the postings by term; each reducer writes one postings list:
  Reduce 1: <A,[(d1,2),(d3,1)]>
  Reduce 2: <B,[(d1,1),(d2,1),(d3,2)]>
  Reduce 3: <C,[(d1,1)]>
  Reduce 4: <D,[(d2,2)]>
  Reduce 5: <E,[(d3,1)]>
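The inverted index example above can be sketched in Python as follows. The three toy documents and their IDs come from the slide; the function names are hypothetical, and the shuffle is simulated with a dictionary rather than a cluster.

```python
from collections import Counter, defaultdict

docs = {
    "d1": ["A", "A", "B", "C"],
    "d2": ["B", "D", "D"],
    "d3": ["A", "B", "B", "E"],
}


def map_document(doc_id, terms):
    """Map step: emit <term, (doc_id, term frequency)> for each distinct term."""
    for term, tf in Counter(terms).items():
        yield term, (doc_id, tf)


def reduce_postings(term, postings):
    """Reduce step: gather one term's postings into a list sorted by doc ID."""
    return term, sorted(postings)


def build_inverted_index(docs):
    grouped = defaultdict(list)  # stands in for shuffle & sort
    for doc_id, terms in docs.items():
        for term, posting in map_document(doc_id, terms):
            grouped[term].append(posting)
    return dict(reduce_postings(t, p) for t, p in sorted(grouped.items()))


index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, postings)
# A [('d1', 2), ('d3', 1)]
# B [('d1', 1), ('d2', 1), ('d3', 2)]
# C [('d1', 1)]
# D [('d2', 2)]
# E [('d3', 1)]
```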

• Group by document ID, not pairs

• Golomb’s compression for postings
• Individual postings
• List of postings
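The slide does not say exactly what is Golomb-coded or with which parameter. As a hedged illustration, the sketch below applies standard Golomb coding to the gaps between sorted integer document IDs, a common way to compress a postings list. The parameter `m`, the integer IDs, and the function names are assumptions made for the example.

```python
def golomb_encode(n, m):
    """Golomb code of a non-negative integer n with parameter m >= 1, as a bit string."""
    if n < 0 or m < 1:
        raise ValueError("need n >= 0 and m >= 1")
    q, r = divmod(n, m)
    bits = "1" * q + "0"        # quotient in unary: q ones, then a zero
    b = (m - 1).bit_length()    # ceil(log2(m)); 0 when m == 1
    if b == 0:
        return bits             # m == 1: the remainder is always 0
    cutoff = (1 << b) - m       # remainders below this get the shorter code
    if r < cutoff:
        bits += format(r, "b").zfill(b - 1)
    else:
        bits += format(r + cutoff, "b").zfill(b)
    return bits


def encode_postings(doc_ids, m):
    """Encode a sorted list of integer doc IDs as Golomb-coded gaps."""
    codes, previous = [], 0
    for doc_id in doc_ids:
        codes.append(golomb_encode(doc_id - previous, m))
        previous = doc_id
    return codes


print(golomb_encode(9, 4))              # 11001
print(golomb_encode(7, 3))              # 11010
print(encode_postings([3, 7, 16], 4))   # ['011', '1000', '11001']
```

Small gaps get short codes, so terms that occur in many documents, whose postings have small gaps, compress well.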

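Pairwise similarity, map step (one map call per postings list; each pair of postings emits the product of its two weights):
  Map 1: <A,[(d1,2),(d3,1)]> → <(d1,d3),2>
  Map 2: <B,[(d1,1),(d2,1),(d3,2)]> → <(d1,d2),1>, <(d2,d3),2>, <(d1,d3),2>
  <C,[(d1,1)]>, <D,[(d2,2)]>, <E,[(d3,1)]>: a single posting each, so no pairs are emitted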

Shuffle & sort groups the partial products by document pair; each reducer sums them:
  Reduce 1: <(d1,d2),[1]> → <(d1,d2),1>
  Reduce 2: <(d2,d3),[2]> → <(d2,d3),2>
  Reduce 3: <(d1,d3),[2,2]> → <(d1,d3),4>
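The pairwise similarity job above can be sketched in Python, starting from the inverted index of the toy example. Raw term frequencies stand in for term weights, as in the diagram; the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}


def map_postings(term, postings):
    """Map step: for every pair of documents sharing the term, emit the weight product."""
    for (doc_a, w_a), (doc_b, w_b) in combinations(sorted(postings), 2):
        yield (doc_a, doc_b), w_a * w_b


def reduce_pair(pair, products):
    """Reduce step: sum the per-term contributions for one document pair."""
    return pair, sum(products)


def pairwise_similarity(index):
    grouped = defaultdict(list)  # stands in for shuffle & sort
    for term, postings in index.items():
        for pair, product in map_postings(term, postings):
            grouped[pair].append(product)
    return dict(reduce_pair(p, v) for p, v in sorted(grouped.items()))


scores = pairwise_similarity(index)
print(scores)
# {('d1', 'd2'): 1, ('d1', 'd3'): 4, ('d2', 'd3'): 2}
```

Terms with a single posting (C, D, E) contribute nothing, so only document pairs that share at least one term are ever materialized.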

• Hadoop 0.16.0
• 20 machines (4 GB memory, 100 GB disk)
• Similarity function: BM25
• Dataset: AQUAINT-2 (newswire text)
  • 2.5 GB
  • 906k documents
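The slides name BM25 but do not give the formula or parameters. The sketch below shows one common Okapi BM25 term-weight formulation; the variant, the defaults `k1=1.2` and `b=0.75`, and the function name are assumptions, not details taken from the slides. A weight of this kind would replace the raw term counts used in the toy diagrams.

```python
import math


def bm25_weight(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """One common Okapi BM25 weight of a term in a document (assumed variant).

    tf: term frequency in the document
    df: number of documents containing the term
    doc_len, avg_doc_len: this document's length and the collection average
    n_docs: number of documents in the collection
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)


# A term seen once in an average-length document gets exactly its idf.
w = bm25_weight(tf=1, df=10, doc_len=50, avg_doc_len=50, n_docs=100)
print(round(w, 4))
```

The weight saturates as `tf` grows and shrinks for documents longer than average, which is why it is usually preferred over raw counts.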

• Tokenization
• Stop word removal
• Stemming
• Df-cut
  • The fraction of terms with the highest document frequency is eliminated – 99% cut (9,093 terms)
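The df-cut step can be sketched as below, assuming the documents are already tokenized, stop-word filtered, and stemmed. The function name, the rounding rule, and the alphabetical tie-breaking are assumptions for the example; the slides only state that the highest-document-frequency terms are removed.

```python
from collections import Counter


def apply_df_cut(docs, cut=0.99):
    """Drop the (1 - cut) fraction of vocabulary terms with the highest document frequency."""
    # Document frequency: in how many documents each term appears.
    df = Counter(term for terms in docs.values() for term in set(terms))
    n_drop = round(len(df) * (1 - cut))
    # Highest df first; ties broken alphabetically to keep the result deterministic.
    ranked = sorted(df, key=lambda term: (-df[term], term))
    dropped = set(ranked[:n_drop])
    kept = {
        doc_id: [t for t in terms if t not in dropped]
        for doc_id, terms in docs.items()
    }
    return kept, dropped


docs = {
    "d1": ["A", "A", "B", "C"],
    "d2": ["B", "D", "D"],
    "d3": ["A", "B", "B", "E"],
}
# With only five distinct terms, an 80% cut removes the single most frequent one.
kept, dropped = apply_df_cut(docs, cut=0.8)
print(dropped)  # {'B'}
print(kept)     # {'d1': ['A', 'A', 'C'], 'd2': ['D', 'D'], 'd3': ['A', 'E']}
```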

Linear space and time complexity

• 3.7 billion pairs (vs) 81. trillion pairs

• Complexity: O(n²)

• A df-cut of 99 percent eliminates some meaning-bearing terms as well as some irrelevant terms
  • Meaning-bearing: Cornell, arthritis
  • Irrelevant: sleek, frail
• The df-cut can be relaxed to 99.9 percent
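One way to read the quadratic cost and the effect of the df-cut: a term that occurs in df documents makes the pairwise map step emit df(df−1)/2 intermediate pairs, so the most frequent terms dominate the work. The sketch below counts those pairs; the function name is hypothetical and the first set of document frequencies is taken from the toy example (A=2, B=3, C=D=E=1).

```python
def intermediate_pairs(doc_freqs):
    """Number of document pairs emitted by the pairwise map step, summed over terms."""
    return sum(df * (df - 1) // 2 for df in doc_freqs)


toy = {"A": 2, "B": 3, "C": 1, "D": 1, "E": 1}
print(intermediate_pairs(toy.values()))                               # 4
print(intermediate_pairs(df for t, df in toy.items() if t != "B"))    # 1

# A single term present in 1,000 documents already produces 499,500 pairs.
print(intermediate_pairs([1000]))                                     # 499500
```

Removing only the most frequent term cuts the toy workload from four pairs to one, which is the same trade-off the df-cut makes at collection scale.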

• The exact algorithms used for inverted index construction and pairwise document similarity are not specified.
• Df-cut: does a df-cut of 99 percent affect the quality of the results significantly?
• The results have not been evaluated.
