Tamer Elsayed, Jimmy Lin, and Douglas Oard


         Niveda Krishnamoorthy
 Pairwise Similarity
 MapReduce Framework
 Proposed algorithm
  • Inverted Index Construction
  • Pairwise document similarity calculation
 Results
 PubMed – “More like this”
 Similar blog posts
 Google – Similar pages
 Framework that supports distributed computing on clusters of computers
 Introduced by Google in 2004
 Map step
 Reduce step
 Combine step (Optional)
 Applications
 Consider two files:

      File 1       File 2        Word count
      Hello        Hello         Hello, 2
      World        Hadoop        World, 2
      Bye          Goodbye       Bye, 1
      World        Hadoop        Hadoop, 2
                                 Goodbye, 1
Map 1
Hello             <Hello,1>
World             <World,1>
Bye               <Bye,1>
World             <World,1>

Map 2
Hello             <Hello,1>
Hadoop            <Hadoop,1>
Goodbye           <Goodbye,1>
Hadoop            <Hadoop,1>
Map output: <Hello,1> <World,1> <Bye,1> <World,1> <Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>

Shuffle & sort       Reducer      Output
<Hello,(1,1)>        Reduce 1     Hello, 2
<World,(1,1)>        Reduce 2     World, 2
<Bye,(1)>            Reduce 3     Bye, 1
<Hadoop,(1,1)>       Reduce 4     Hadoop, 2
<Goodbye,(1)>        Reduce 5     Goodbye, 1
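The word count walk-through above can be sketched in a single process. This is only an illustration of the map, shuffle & sort, and reduce roles, not Hadoop code; the function names are hypothetical.

```python
from collections import defaultdict

def map_words(text):
    """Map step: emit <word, 1> for every word in one file."""
    return [(word, 1) for word in text.split()]

def shuffle_and_sort(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_counts(word, ones):
    """Reduce step: sum the 1s collected for one word."""
    return word, sum(ones)

def word_count(files):
    emitted = [pair for text in files for pair in map_words(text)]
    grouped = shuffle_and_sort(emitted)
    return dict(reduce_counts(word, ones) for word, ones in grouped.items())

# The two files from the slide.
counts = word_count(["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"])
print(counts)
```

In a real cluster each file would go to a separate mapper and each key to one of several reducers; here everything runs in order in one process.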
MAPREDUCE ALGORITHM (scalable and efficient)
• Inverted Index Computation
• Pairwise Similarity
Document       Terms        Mapper     Output
Document 1     A A B C      Map 1      <A,(d1,2)>  <B,(d1,1)>  <C,(d1,1)>
Document 2     B D D        Map 2      <B,(d2,1)>  <D,(d2,2)>
Document 3     A B B E      Map 3      <A,(d3,1)>  <B,(d3,2)>  <E,(d3,1)>
Map output: <A,(d1,2)> <B,(d1,1)> <C,(d1,1)> <B,(d2,1)> <D,(d2,2)> <A,(d3,1)> <B,(d3,2)> <E,(d3,1)>

Shuffle & sort                   Reducer      Output
<A,[(d1,2),(d3,1)]>              Reduce 1     <A,[(d1,2),(d3,1)]>
<B,[(d1,1),(d2,1),(d3,2)]>       Reduce 2     <B,[(d1,1),(d2,1),(d3,2)]>
<C,[(d1,1)]>                     Reduce 3     <C,[(d1,1)]>
<D,[(d2,2)]>                     Reduce 4     <D,[(d2,2)]>
<E,[(d3,1)]>                     Reduce 5     <E,[(d3,1)]>
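A minimal single-process sketch of the inverted index step, using the three example documents (A A B C, B D D, A B B E). The function names are hypothetical, and raw term frequency stands in for the term weight.

```python
from collections import Counter, defaultdict

def map_document(doc_id, text):
    """Map step: emit <term, (doc_id, tf)> for each distinct term in a document."""
    return [(term, (doc_id, tf)) for term, tf in Counter(text.split()).items()]

def build_index(docs):
    """Shuffle postings by term, then reduce each group to a sorted postings list."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_document(doc_id, text):
            postings[term].append(posting)
    return {term: sorted(plist) for term, plist in sorted(postings.items())}

docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}
index = build_index(docs)
for term, plist in index.items():
    print(term, plist)
```

Each reducer here only collects and orders the postings for one term, which matches the identical input and output columns in the diagram.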
 Group by document ID, not pairs
 Golomb’s compression for postings
 Individual Postings
 List of Postings
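The slides only name Golomb compression and do not give a parameter or a postings layout. The sketch below shows generic Golomb coding applied to the gaps between sorted document IDs; the integer IDs starting at 1, the parameter m, and the unary convention (1s terminated by a 0) are all assumptions made for illustration.

```python
def golomb_encode(n, m):
    """Golomb code of a non-negative integer n with parameter m >= 1, as a bit string."""
    q, r = divmod(n, m)
    unary = "1" * q + "0"          # quotient in unary
    b = (m - 1).bit_length()       # ceil(log2(m))
    if b == 0:                     # m == 1: the remainder is always 0
        return unary
    cutoff = (1 << b) - m          # remainders below cutoff use b - 1 bits
    if r < cutoff:
        return unary + format(r, "0{}b".format(b - 1))
    return unary + format(r + cutoff, "0{}b".format(b))

def encode_doc_ids(doc_ids, m):
    """Encode sorted integer doc IDs as Golomb-coded gaps (gap - 1, since gaps are >= 1)."""
    bits, previous = [], 0
    for doc_id in sorted(doc_ids):
        bits.append(golomb_encode(doc_id - previous - 1, m))
        previous = doc_id
    return "".join(bits)

print(golomb_encode(9, 4))         # quotient 2, remainder 1
print(encode_doc_ids([3, 7, 8], 4))
```

Small gaps get short codes, so long postings lists of frequent terms compress well when m is chosen to match the typical gap.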
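Postings (input)                 Mapper     Output
<A,[(d1,2),(d3,1)]>              Map 1      <(d1,d3),2>
<B,[(d1,1),(d2,1),(d3,2)]>       Map 2      <(d1,d2),1>  <(d2,d3),2>  <(d1,d3),2>
<C,[(d1,1)]>                                (single posting, no pairs emitted)
<D,[(d2,2)]>                                (single posting, no pairs emitted)
<E,[(d3,1)]>                                (single posting, no pairs emitted)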
Map output: <(d1,d3),2>  <(d1,d2),1>  <(d2,d3),2>  <(d1,d3),2>

Shuffle & sort        Reducer      Output
<(d1,d2),[1]>         Reduce 1     <(d1,d2),[1]>
<(d2,d3),[2]>         Reduce 2     <(d2,d3),[2]>
<(d1,d3),[2,2]>       Reduce 3     <(d1,d3),[4]>
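A minimal sketch of the pairwise similarity step on the example index. Each mapper takes one postings list and emits the product of the two weights for every document pair in it; each reducer sums the contributions for one pair. Raw term frequencies are used as weights, as in the diagram (the experiments use BM25), and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def map_postings(postings):
    """Map step: for one term, emit <(di, dj), wi * wj> for every pair of documents."""
    return [((di, dj), wi * wj)
            for (di, wi), (dj, wj) in combinations(sorted(postings), 2)]

def pairwise_similarity(index):
    """Shuffle partial products by document pair, then reduce by summing them."""
    partial = defaultdict(list)
    for postings in index.values():
        for pair, contribution in map_postings(postings):
            partial[pair].append(contribution)
    return {pair: sum(values) for pair, values in sorted(partial.items())}

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}
sims = pairwise_similarity(index)
print(sims)
```

A postings list with k entries produces k(k-1)/2 pairs, so terms that occur in many documents dominate the intermediate output.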
 Hadoop 0.16.0
 20 machines (4 GB memory, 100 GB disk)
 Similarity function - BM25
 Dataset: AQUAINT-2 (newswire text)
  • 2.5 GB
  • 906k documents
 Tokenization
 Stop word removal
 Stemming
 Df-cut
  • The fraction of terms with the highest document frequency is eliminated: a 99% cut removes the most frequent 1% of terms (9,093 terms)
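A minimal sketch of a df-cut over an inverted index, assuming the cut is given as the fraction of terms to keep. Rounding down and breaking ties alphabetically are assumptions; the slides do not say how either is handled, and the function name is hypothetical.

```python
def apply_df_cut(index, cut):
    """Keep the `cut` fraction of terms with the lowest document frequency."""
    # Document frequency of a term = length of its postings list.
    by_df = sorted(index, key=lambda term: (len(index[term]), term))
    keep = int(len(by_df) * cut)
    return {term: index[term] for term in by_df[:keep]}

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}
# An 80% cut on five terms drops the single most frequent term, B.
pruned = apply_df_cut(index, 0.8)
print(sorted(pruned))
```

Because the dropped terms have the longest postings lists, removing them cuts the number of emitted document pairs far more than their share of the vocabulary suggests.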

Linear space and time complexity
  • 3.7 billion intermediate pairs with the df-cut (vs) 81. trillion pairs without it
 Complexity: O(n²)
 Df-cut of 99 percent eliminates meaning-bearing terms and some irrelevant terms
  • Cornell, arthritis
  • sleek, frail
 Df-cut can be relaxed to 99.9 percent
 The exact algorithms used for inverted index construction and pairwise document similarity are not specified.
 Df-cut: does a df-cut of 99 percent significantly affect the quality of the results?
 The results have not been evaluated.
Pairwise document similarity in large collections with map reduce
