∂u∂u Multi-Tenanted Framework:
Distributed Near Duplicate Detection for Big Data
Pradeeban Kathiravelu, Helena Galhardas, Luís Veiga
INESC-ID Lisboa
Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
23rd International Conference on Cooperative Information Systems (CoopIS 2015)
28-30 October 2015, Rhodes, Greece.
Distributed Near Duplicate Detection ∂u∂u 1 / 23
Introduction
Data cleaning is essential for enterprise information systems.
Finding near duplicates is an important task in data cleaning.
Near duplicate detection algorithms find “almost” identical entries.
Massive datasets require large memory and processing power.
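To make "almost" identical concrete, here is a minimal Python sketch that scores two entries by the Jaccard similarity of their word tokens. The whitespace tokenizer, the function names, and the 0.8 threshold are assumptions chosen for illustration; the slides do not fix a similarity measure at this point.

```python
def tokens(text):
    """Lower-case a string and split it into a set of word tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty entries are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(x, y, threshold=0.8):
    """Two entries are near duplicates when their similarity reaches the threshold."""
    return jaccard(tokens(x), tokens(y)) >= threshold
```

Comparing every entry with every other entry this way is quadratic in the number of entries, which is why massive datasets call for blocking and distribution.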
Motivation
Most data cleaning algorithms are sequential.
Recent use of MapReduce frameworks in near duplicate detection.
In-Memory Data Grids (IMDG) offer the view of a single large computer by
unifying the resources across a distributed computer cluster.
What if..?
∂u∂u
A distributed architecture for near duplicate detection.
An efficient distribution strategy for the blocks over IMDGs.
Adapting the existing algorithms to execute on a computer cluster or a public/private cloud.
Leveraging the MapReduce framework offered by the IMDG in identifying the blocks.
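Identifying the blocks in a MapReduce style can be sketched as follows, in plain Python with no IMDG involved. The map step emits one (blocking key, record) pair per record, the reduce step groups records by key, and only records within the same block become candidate pairs. The record fields and the choice of `city` as blocking key are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(records, blocking_key):
    """Emit one (key, record) pair per record."""
    for record in records:
        yield blocking_key(record), record


def reduce_phase(pairs):
    """Group the emitted records by key; each group is one block."""
    blocks = defaultdict(list)
    for key, record in pairs:
        blocks[key].append(record)
    return dict(blocks)


def candidate_pairs(blocks):
    """Only records inside the same block are compared with each other."""
    for block in blocks.values():
        yield from combinations(block, 2)


# Hypothetical sample records.
records = [
    {"id": 1, "name": "Alice Smith", "city": "Lisbon"},
    {"id": 2, "name": "Alice Smyth", "city": "Lisbon"},
    {"id": 3, "name": "Bob Jones", "city": "Rhodes"},
]
blocks = reduce_phase(map_phase(records, lambda r: r["city"].lower()))
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
```

Here only records 1 and 2 are compared, because record 3 falls into a different block.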
∂u∂u Architecture
Contributions
Faster near duplicate detection over massive datasets, which may not have been possible to execute on utility computers.
High speedup and lower communication and coordination overhead.
Multi-tenanted parallel processing architecture.
Coordinated for multi-pass over multiple keys.
More accurate and precise duplicate detection.
Strategy and algorithms loosely coupled to the base algorithms.
Potential to distribute more algorithms.
Configuring based on user preferences.
Adaptively involving the instances in near duplicate detection.
Distributed Near Duplicate Detection
Deployment Architecture
Efficient Data Distribution
Partition of storage and execution across the instances
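One simple way to partition blocks across instances, so that each block is stored and processed in the same place, is to hash the blocking key. The Python sketch below assumes this hash-based placement; `owner` and `distribute` are hypothetical helpers, and the sketch does not claim to reproduce the placement used by ∂u∂u or by the IMDG.

```python
import zlib
from collections import defaultdict


def owner(block_key, instance_count):
    """Pick the instance for a block with a stable hash of its key."""
    return zlib.crc32(block_key.encode("utf-8")) % instance_count


def distribute(blocks, instance_count):
    """Assign whole blocks to instances so that a block is never split."""
    placement = defaultdict(dict)
    for key, records in blocks.items():
        placement[owner(key, instance_count)][key] = records
    return dict(placement)


# Hypothetical blocks: blocking key -> record ids.
blocks = {"lisbon": [1, 2], "rhodes": [3], "london": [4, 5]}
placement = distribute(blocks, instance_count=4)
```

Because every record of a block lives on one instance, the pairwise comparison within the block needs no data from other instances.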
Tenant-Aware Parallel Execution for Multiple Composite Blocking Keys
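The idea of running one pass per composite blocking key in parallel and merging the results can be sketched in Python as below. Threads stand in for the tenants or clusters of the slides, which is an assumption made only to keep the example self-contained; the field names and helper functions are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def composite_key(*fields):
    """Build a blocking-key function that concatenates several fields."""
    return lambda record: "|".join(str(record[f]).lower() for f in fields)


def one_pass(records, key_fn):
    """Block the records by one key and return the candidate id pairs."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key_fn(record)].append(record["id"])
    return {tuple(sorted(p)) for ids in blocks.values() for p in combinations(ids, 2)}


def multi_pass(records, key_fns):
    """Run one pass per blocking key in parallel and union the candidate pairs."""
    with ThreadPoolExecutor(max_workers=len(key_fns)) as pool:
        results = list(pool.map(lambda fn: one_pass(records, fn), key_fns))
    return set().union(*results)


# Hypothetical records with typos in different fields.
records = [
    {"id": 1, "surname": "Silva", "city": "Lisbon", "year": 1980},
    {"id": 2, "surname": "Silva", "city": "Lisboa", "year": 1980},
    {"id": 3, "surname": "Sylva", "city": "Lisbon", "year": 1980},
]
keys = [composite_key("surname", "year"), composite_key("city", "year")]
candidates = multi_pass(records, keys)
```

Each key alone misses one of the pairs; the union over both passes recovers pairs that a typo in a single field would otherwise hide.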
Matrix Notation
Software Architecture
∂u∂u Prototype
Prototype Implementation
Java 1.8.0 as the programming language.
Hazelcast 3.4 as the in-memory data grid.
Data sources connected through their respective Java driver APIs.
MongoDB 2.4.9.
MySQL 5.5.41-0ubuntu0.14.04.1.
PPJoin as the base near duplicate detection algorithm.
Extended for distributed execution on Hazelcast.
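For orientation, the sketch below shows the prefix filtering step on which PPJoin builds, written in plain Python for a Jaccard threshold. It is a simplified illustration, not the prototype's code: PPJoin's positional filtering and the Hazelcast extension are omitted, and the function names are assumptions.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


def prefix_filter_join(records, threshold):
    """Return index pairs (i, j), i < j, with Jaccard similarity >= threshold."""
    # Global token order: rarest tokens first, ties broken alphabetically.
    frequency = Counter(token for record in records for token in record)
    ordered = [sorted(record, key=lambda t: (frequency[t], t)) for record in records]

    index = defaultdict(list)  # token -> records holding it in their prefix
    result = set()
    for i, tokens in enumerate(ordered):
        # Two records can only reach the threshold if their prefixes share a token.
        # The small epsilon guards against floating-point overshoot in ceil().
        prefix_length = len(tokens) - math.ceil(threshold * len(tokens) - 1e-9) + 1
        candidates = set()
        for token in tokens[:prefix_length]:
            candidates.update(index[token])
            index[token].append(i)
        # Verification: the exact similarity is computed only for the candidates.
        for j in candidates:
            if jaccard(set(records[i]), set(records[j])) >= threshold:
                result.add((j, i))
    return result


# Hypothetical token sets.
records = [{"a", "b", "c", "d", "e"}, {"a", "b", "c", "d", "f"}, {"x", "y", "z"}]
matches = prefix_filter_join(records, threshold=0.6)
```

The filter prunes pairs that cannot reach the threshold before any exact comparison, so the verification step runs on far fewer pairs than a full pairwise scan.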
Evaluation
Prototype Deployment
Intel® Core™ i7-4700MQ CPU @ 2.40 GHz, 8 processors.
8 GB memory.
Ubuntu 14.04 LTS 64 bit operating system.
Two MongoDB databases connected as the data sources, holding the potential duplicate pairs.
Hadoop HDFS to store the detected duplicate pairs.
Evaluation System Configurations
Around 100 datasets of varying sizes above 1 GB.
With varying number of nodes configured to execute in a cluster.
Each cluster configured to have an executor instance.
Fairness in evaluations: the number of iterations and the blocking keys were kept the same across all the experiments.
Preliminary Assessments
Performance and speedup
With multi-pass in 4 different execution clusters.
Compared to the sequential execution of PPJoin in a single computer.
Efficiency in distributing the storage and execution.
With multiple instances in the execution cluster.
Variations of Speedup with the Number of Nodes
Super-linear speedup.
Up to $c \cdot n^2$, where $c$ is the number of clusters and $n$ is the number of nodes.
$c = 4$, as 4 clusters were used.
$n$ nodes $\Rightarrow$ $1/n^2$ of the search space in each block.
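One way to read this bound, as a back-of-the-envelope sketch rather than the authors' derivation: assume a block of $m$ records needs about $m^2/2$ pairwise comparisons, that it is split evenly over $n$ nodes which each compare only their own $m/n$ records, and that the $c$ clusters handle $c$ passes concurrently instead of one after another. The symbols $m$, $T_{\text{seq}}$ and $T_{\text{par}}$ are introduced here for illustration only.

```latex
\begin{align*}
T_{\text{seq}} &\approx c \cdot \frac{m^2}{2}
  && \text{($c$ passes, one after another, on a single computer)} \\
T_{\text{par}} &\approx \frac{1}{2}\left(\frac{m}{n}\right)^2 = \frac{m^2}{2n^2}
  && \text{(per node; all nodes and clusters run concurrently)} \\
\frac{T_{\text{seq}}}{T_{\text{par}}} &\approx c \cdot n^2
\end{align*}
```

Under these assumptions the figure is an upper limit, since communication and coordination costs are ignored.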
Variations of Memory Consumption with the Number of Nodes
Conclusion
Related Work
MapReduce frameworks for near duplicate detection.
MapDupReducer [CW 2010], Dedoop [LK 2012], . . .
Generalizing the existing algorithms to execute in a MapReduce framework.
Do not consider all aspects of near duplicate detection.
Coupled to the MapReduce framework or to the near duplicate detection algorithms.
In-Memory Data Grids such as Hazelcast and Infinispan are not
leveraged in existing data cleaning approaches.
Conclusions
In-memory data grids for a scalable near duplicate detection.
Adoption of the existing algorithms for a distributed environment.
Multi-tenanted environment for accurate near duplicate detection, with parallel usage of multiple blocking keys.
Future Work
Extending and leveraging the ∂u∂u distributed execution approach for data
warehouse construction and other data cleaning processes.
References
[CX 2011] Xiao, C., Wang, W., Lin, X., Yu, J. X., & Wang, G. (2011). Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS), 36(3), 15.
[LK 2012] Kolb, L., Thor, A., & Rahm, E. (2012). Dedoop: efficient deduplication with Hadoop. Proceedings of the VLDB Endowment, 5(12), 1878-1881.
[CW 2010] Wang, C., Wang, J., Lin, X., Wang, W., Wang, H., Li, H., ... & Li, R. (2010, June). MapDupReducer: detecting near duplicates over massive datasets. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (pp. 1119-1122). ACM.
[RV 2010] Vernica, R., Carey, M. J., & Li, C. (2010, June). Efficient parallel set-similarity joins using MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (pp. 495-506). ACM.
[PK 2014] Kathiravelu, P., & Veiga, L. (2014). An Adaptive Distributed Simulator for Cloud and MapReduce Algorithms and Architectures. In IEEE/ACM 7th International Conference on Utility and Cloud Computing (UCC 2014), London, UK (pp. 79-88). IEEE Computer Society.
Thank you!
Questions?
Distributed Near Duplicate Detection ∂u∂u 23 / 23

Más contenido relacionado

La actualidad más candente

Daniel Preoţiuc-Pietro - 2013 - A temporal model of text periodicities using ...
Daniel Preoţiuc-Pietro - 2013 - A temporal model of text periodicities using ...Daniel Preoţiuc-Pietro - 2013 - A temporal model of text periodicities using ...
Daniel Preoţiuc-Pietro - 2013 - A temporal model of text periodicities using ...Association for Computational Linguistics
 
Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...
Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...
Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...idescitation
 
OWL reasoning with WebPIE: calculating the closer of 100 billion triples
OWL reasoning with WebPIE: calculating the closer of 100 billion triplesOWL reasoning with WebPIE: calculating the closer of 100 billion triples
OWL reasoning with WebPIE: calculating the closer of 100 billion triplesMahdi Atawneh
 
An Introduction of Recent Research on MapReduce (2011)
An Introduction of Recent Research on MapReduce (2011)An Introduction of Recent Research on MapReduce (2011)
An Introduction of Recent Research on MapReduce (2011)Yu Liu
 
NNLO PDF fits with top-quark pair differential distributions
NNLO PDF fits with top-quark pair differential distributionsNNLO PDF fits with top-quark pair differential distributions
NNLO PDF fits with top-quark pair differential distributionsJuan Rojo
 
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...ijcsit
 
Document clustering for forensic analysis an approach for improving computer ...
Document clustering for forensic analysis an approach for improving computer ...Document clustering for forensic analysis an approach for improving computer ...
Document clustering for forensic analysis an approach for improving computer ...JPINFOTECH JAYAPRAKASH
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryIan Foster
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabVijay Srinivas Agneeswaran, Ph.D
 
Qiu bosc2010
Qiu bosc2010Qiu bosc2010
Qiu bosc2010BOSC 2010
 
A tree cluster-based data-gathering algorithm for industrial ws ns with a mob...
A tree cluster-based data-gathering algorithm for industrial ws ns with a mob...A tree cluster-based data-gathering algorithm for industrial ws ns with a mob...
A tree cluster-based data-gathering algorithm for industrial ws ns with a mob...LogicMindtech Nologies
 
Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...Valery Tkachenko
 
The Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource ProvisioningThe Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource ProvisioningRafael Ferreira da Silva
 
Modelling Multi-Component Predictive Systems as Petri Nets
Modelling Multi-Component Predictive Systems as Petri NetsModelling Multi-Component Predictive Systems as Petri Nets
Modelling Multi-Component Predictive Systems as Petri NetsManuel Martín
 

La actualidad más candente (20)

A0360109
A0360109A0360109
A0360109
 
Daniel Preoţiuc-Pietro - 2013 - A temporal model of text periodicities using ...
Daniel Preoţiuc-Pietro - 2013 - A temporal model of text periodicities using ...Daniel Preoţiuc-Pietro - 2013 - A temporal model of text periodicities using ...
Daniel Preoţiuc-Pietro - 2013 - A temporal model of text periodicities using ...
 
Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...
Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...
Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...
 
C0312023
C0312023C0312023
C0312023
 
OWL reasoning with WebPIE: calculating the closer of 100 billion triples
OWL reasoning with WebPIE: calculating the closer of 100 billion triplesOWL reasoning with WebPIE: calculating the closer of 100 billion triples
OWL reasoning with WebPIE: calculating the closer of 100 billion triples
 
An Introduction of Recent Research on MapReduce (2011)
An Introduction of Recent Research on MapReduce (2011)An Introduction of Recent Research on MapReduce (2011)
An Introduction of Recent Research on MapReduce (2011)
 
B0330811
B0330811B0330811
B0330811
 
NNLO PDF fits with top-quark pair differential distributions
NNLO PDF fits with top-quark pair differential distributionsNNLO PDF fits with top-quark pair differential distributions
NNLO PDF fits with top-quark pair differential distributions
 
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...
 
Document clustering for forensic analysis an approach for improving computer ...
Document clustering for forensic analysis an approach for improving computer ...Document clustering for forensic analysis an approach for improving computer ...
Document clustering for forensic analysis an approach for improving computer ...
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and Chemistry
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
 
Qiu bosc2010
Qiu bosc2010Qiu bosc2010
Qiu bosc2010
 
Spamcloud
SpamcloudSpamcloud
Spamcloud
 
A tree cluster-based data-gathering algorithm for industrial ws ns with a mob...
A tree cluster-based data-gathering algorithm for industrial ws ns with a mob...A tree cluster-based data-gathering algorithm for industrial ws ns with a mob...
A tree cluster-based data-gathering algorithm for industrial ws ns with a mob...
 
Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...
 
The Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource ProvisioningThe Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource Provisioning
 
Harvard poster
Harvard posterHarvard poster
Harvard poster
 
Modelling Multi-Component Predictive Systems as Petri Nets
Modelling Multi-Component Predictive Systems as Petri NetsModelling Multi-Component Predictive Systems as Petri Nets
Modelling Multi-Component Predictive Systems as Petri Nets
 
Eg4301808811
Eg4301808811Eg4301808811
Eg4301808811
 

Destacado

powerpoint feb
powerpoint febpowerpoint feb
powerpoint febimu409
 
Adaptive Intrusion Detection Using Learning Classifiers
Adaptive Intrusion Detection Using Learning ClassifiersAdaptive Intrusion Detection Using Learning Classifiers
Adaptive Intrusion Detection Using Learning ClassifiersPatrick Nicolas
 
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...Pradeeban Kathiravelu, Ph.D.
 
machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...Armando Vieira
 
Intrusion detection using data mining
Intrusion detection using data miningIntrusion detection using data mining
Intrusion detection using data miningbalbeerrawat
 
Analysis and Design for Intrusion Detection System Based on Data Mining
Analysis and Design for Intrusion Detection System Based on Data MiningAnalysis and Design for Intrusion Detection System Based on Data Mining
Analysis and Design for Intrusion Detection System Based on Data MiningPritesh Ranjan
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
 
Using Machine Learning in Networks Intrusion Detection Systems
Using Machine Learning in Networks Intrusion Detection SystemsUsing Machine Learning in Networks Intrusion Detection Systems
Using Machine Learning in Networks Intrusion Detection SystemsOmar Shaya
 
Data Mining and Intrusion Detection
Data Mining and Intrusion Detection Data Mining and Intrusion Detection
Data Mining and Intrusion Detection amiable_indian
 
NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique
NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique
NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique Sujeet Suryawanshi
 
Intrusion detection and prevention system
Intrusion detection and prevention systemIntrusion detection and prevention system
Intrusion detection and prevention systemNikhil Raj
 
Intrusion detection system ppt
Intrusion detection system pptIntrusion detection system ppt
Intrusion detection system pptSheetal Verma
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson
 

Destacado (14)

powerpoint feb
powerpoint febpowerpoint feb
powerpoint feb
 
Adaptive Intrusion Detection Using Learning Classifiers
Adaptive Intrusion Detection Using Learning ClassifiersAdaptive Intrusion Detection Using Learning Classifiers
Adaptive Intrusion Detection Using Learning Classifiers
 
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
 
machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...
 
Intrusion detection using data mining
Intrusion detection using data miningIntrusion detection using data mining
Intrusion detection using data mining
 
Ids presentation
Ids presentationIds presentation
Ids presentation
 
Analysis and Design for Intrusion Detection System Based on Data Mining
Analysis and Design for Intrusion Detection System Based on Data MiningAnalysis and Design for Intrusion Detection System Based on Data Mining
Analysis and Design for Intrusion Detection System Based on Data Mining
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
Using Machine Learning in Networks Intrusion Detection Systems
Using Machine Learning in Networks Intrusion Detection SystemsUsing Machine Learning in Networks Intrusion Detection Systems
Using Machine Learning in Networks Intrusion Detection Systems
 
Data Mining and Intrusion Detection
Data Mining and Intrusion Detection Data Mining and Intrusion Detection
Data Mining and Intrusion Detection
 
NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique
NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique
NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique
 
Intrusion detection and prevention system
Intrusion detection and prevention systemIntrusion detection and prevention system
Intrusion detection and prevention system
 
Intrusion detection system ppt
Intrusion detection system pptIntrusion detection system ppt
Intrusion detection system ppt
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 

Similar a ∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data

A Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersA Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersZac Darcy
 
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERSA MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERSZac Darcy
 
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersA Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersZac Darcy
 
Replication and Synchronization Algorithms for Distributed Databases - Lena W...
Replication and Synchronization Algorithms for Distributed Databases - Lena W...Replication and Synchronization Algorithms for Distributed Databases - Lena W...
Replication and Synchronization Algorithms for Distributed Databases - Lena W...distributed matters
 
Assessing data dissemination strategies
Assessing data dissemination strategiesAssessing data dissemination strategies
Assessing data dissemination strategiesOpen University, KMi
 
Implementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big dataImplementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big dataeSAT Publishing House
 
An Efficient Unsupervised AdaptiveAntihub Technique for Outlier Detection in ...
An Efficient Unsupervised AdaptiveAntihub Technique for Outlier Detection in ...An Efficient Unsupervised AdaptiveAntihub Technique for Outlier Detection in ...
An Efficient Unsupervised AdaptiveAntihub Technique for Outlier Detection in ...theijes
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
 
ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
 ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATANexgen Technology
 
Efficient Doubletree: An Algorithm for Large-Scale Topology Discovery
Efficient Doubletree: An Algorithm for Large-Scale Topology DiscoveryEfficient Doubletree: An Algorithm for Large-Scale Topology Discovery
Efficient Doubletree: An Algorithm for Large-Scale Topology DiscoveryIOSR Journals
 
Benefit based data caching in ad hoc networks (synopsis)
Benefit based data caching in ad hoc networks (synopsis)Benefit based data caching in ad hoc networks (synopsis)
Benefit based data caching in ad hoc networks (synopsis)Mumbai Academisc
 
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...IRJET Journal
 
An Empirical Study for Defect Prediction using Clustering
An Empirical Study for Defect Prediction using ClusteringAn Empirical Study for Defect Prediction using Clustering
An Empirical Study for Defect Prediction using Clusteringidescitation
 
A Novel Dencos Model For High Dimensional Data Using Genetic Algorithms
A Novel Dencos Model For High Dimensional Data Using Genetic Algorithms A Novel Dencos Model For High Dimensional Data Using Genetic Algorithms
A Novel Dencos Model For High Dimensional Data Using Genetic Algorithms ijcseit
 
Application-Aware Big Data Deduplication in Cloud Environment
Application-Aware Big Data Deduplication in Cloud EnvironmentApplication-Aware Big Data Deduplication in Cloud Environment
Application-Aware Big Data Deduplication in Cloud EnvironmentSafayet Hossain
 

Similar a ∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data (20)

A Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersA Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
 
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERSA MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
 
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersA Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
 
Replication and Synchronization Algorithms for Distributed Databases - Lena W...
Replication and Synchronization Algorithms for Distributed Databases - Lena W...Replication and Synchronization Algorithms for Distributed Databases - Lena W...
Replication and Synchronization Algorithms for Distributed Databases - Lena W...
 
Assessing data dissemination strategies
Assessing data dissemination strategiesAssessing data dissemination strategies
Assessing data dissemination strategies
 
Implementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big dataImplementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big data
 
An Efficient Unsupervised AdaptiveAntihub Technique for Outlier Detection in ...
An Efficient Unsupervised AdaptiveAntihub Technique for Outlier Detection in ...An Efficient Unsupervised AdaptiveAntihub Technique for Outlier Detection in ...
An Efficient Unsupervised AdaptiveAntihub Technique for Outlier Detection in ...
 
[IJET V2I3P11] Authors: Payal More, Rohini Pandit, Supriya Makude, Harsh Nirb...
[IJET V2I3P11] Authors: Payal More, Rohini Pandit, Supriya Makude, Harsh Nirb...[IJET V2I3P11] Authors: Payal More, Rohini Pandit, Supriya Makude, Harsh Nirb...
[IJET V2I3P11] Authors: Payal More, Rohini Pandit, Supriya Makude, Harsh Nirb...
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
 
ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
 ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
 
G44093135
G44093135G44093135
G44093135
 
Efficient Doubletree: An Algorithm for Large-Scale Topology Discovery
Efficient Doubletree: An Algorithm for Large-Scale Topology DiscoveryEfficient Doubletree: An Algorithm for Large-Scale Topology Discovery
Efficient Doubletree: An Algorithm for Large-Scale Topology Discovery
 
Benefit based data caching in ad hoc networks (synopsis)
Benefit based data caching in ad hoc networks (synopsis)Benefit based data caching in ad hoc networks (synopsis)
Benefit based data caching in ad hoc networks (synopsis)
 
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...
 
C04511822
C04511822C04511822
C04511822
 
An Empirical Study for Defect Prediction using Clustering
An Empirical Study for Defect Prediction using ClusteringAn Empirical Study for Defect Prediction using Clustering
An Empirical Study for Defect Prediction using Clustering
 
A Novel Dencos Model For High Dimensional Data Using Genetic Algorithms
A Novel Dencos Model For High Dimensional Data Using Genetic Algorithms A Novel Dencos Model For High Dimensional Data Using Genetic Algorithms
A Novel Dencos Model For High Dimensional Data Using Genetic Algorithms
 
Application-Aware Big Data Deduplication in Cloud Environment
Application-Aware Big Data Deduplication in Cloud EnvironmentApplication-Aware Big Data Deduplication in Cloud Environment
Application-Aware Big Data Deduplication in Cloud Environment
 
Declarative data analysis
Declarative data analysisDeclarative data analysis
Declarative data analysis
 
SSBSE10.ppt
SSBSE10.pptSSBSE10.ppt
SSBSE10.ppt
 

∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data

  • 1. ∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data. Pradeeban Kathiravelu, Helena Galhardas, Luís Veiga. INESC-ID Lisboa, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal. 23rd International Conference on Cooperative Information Systems (CoopIS 2015), 28-30 October 2015, Rhodes, Greece.
  • 2. Introduction: Data cleaning is essential for enterprise information systems. Finding near duplicates is an important task in data cleaning. Near duplicate detection algorithms find “almost” identical entries. Massive datasets require large memory and processing power.
  • 3. Motivation: Most data cleaning algorithms are sequential. MapReduce frameworks have recently been used in near duplicate detection. In-Memory Data Grids (IMDG) offer a view of a large computer by unifying the resources across a distributed computer cluster. What if..?
  • 4. ∂u∂u: A distributed architecture for near duplicate detection. An efficient distribution strategy for the blocks over IMDGs. Adapts the existing algorithms to execute on a computer cluster or a public/private cloud. Leverages the MapReduce framework offered by the IMDG in identifying the blocks.
  • 5. Contributions: Faster near duplicate detection over massive datasets, which may not have been possible to execute on the utility computers. High speedup and lower communication and coordination overhead. Multi-tenanted parallel processing architecture, coordinated for multi-pass over multiple keys. More accurate and precise duplicate detection. Strategy and algorithms loosely coupled to the base algorithms. Potential to distribute more algorithms. Configuring based on user preferences. Adaptively involving the instances in near duplicate detection.
  • 6. ∂u∂u Architecture: Distributed Near Duplicate Detection
  • 7. ∂u∂u Architecture: Distributed Near Duplicate Detection
  • 8. ∂u∂u Architecture: Distributed Near Duplicate Detection
  • 9. ∂u∂u Architecture: Deployment Architecture
  • 10. ∂u∂u Architecture: Efficient Data Distribution
  • 11. ∂u∂u Architecture: Partition of storage and execution across the instances
  • 12. ∂u∂u Architecture: Tenant-Aware Parallel Execution for Multiple Composite Blocking Keys
  • 13. ∂u∂u Architecture: Matrix Notation
  • 14. ∂u∂u Architecture: Software Architecture
  • 15. Prototype Implementation: Java 1.8.0 as the programming language. Hazelcast 3.4 as the in-memory data grid. Data sources connected through their respective Java driver APIs: MongoDB 2.4.9 and MySQL 5.5.41-0ubuntu0.14.04.1. PPJoin as the base near duplicate detection algorithm, extended for distributed execution on Hazelcast.
  • 16. Evaluation, Prototype Deployment: Intel® Core™ i7-4700MQ CPU @ 2.40GHz, 8 processors. 8 GB memory. Ubuntu 14.04 LTS 64-bit operating system. Two Mongo databases, holding the potential duplicate pairs, connected as the data sources. Hadoop HDFS to store the detected duplicate pairs.
  • 17. Evaluation, System Configurations: Around 100 datasets of varying sizes above 1 GB, with a varying number of nodes configured to execute in a cluster. Each cluster configured to have an executor instance. Fairness in evaluations: the number of iterations and the blocking keys were kept the same across all the experiments.
  • 18. Evaluation, Preliminary Assessments: Performance and speedup with multi-pass in 4 different execution clusters, compared to the sequential execution of PPJoin on a single computer. Efficiency in distributing the storage and execution with multiple instances in the execution cluster.
  • 19. Evaluation, Variations of Speedup with the Number of Nodes: Super-linear speedup, up to c × n², where c is the number of clusters and n is the number of nodes. c = 4, as 4 clusters were used. n nodes ⇒ 1/n² search space in each block.
  • 20. Evaluation: Variations of Memory Consumption with the Number of Nodes
  • 21. Related Work: MapReduce frameworks for near duplicate detection: MapDupReducer [CW 2010], Dedoop [LK 2012], . . . These generalize the existing algorithms to execute in a MapReduce framework. They do not consider all aspects of near duplicate detection, and are coupled to the MapReduce framework or the near duplicate detection algorithms. In-Memory Data Grids such as Hazelcast and Infinispan are not leveraged in existing data cleaning approaches.
  • 22. Conclusion. Conclusions: In-memory data grids for scalable near duplicate detection. Adoption of the existing algorithms for a distributed environment. Multi-tenanted environment for accurate near duplicate detection, with parallel usage of multiple blocking keys. Future Work: Extending and leveraging the ∂u∂u distributed execution approach for data warehouse construction and other data cleaning processes.
  • 23. References: [CX 2011] Xiao, C., Wang, W., Lin, X., Yu, J. X., & Wang, G. (2011). Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS), 36(3), 15. [LK 2012] Kolb, L., Thor, A., & Rahm, E. (2012). Dedoop: efficient deduplication with Hadoop. Proceedings of the VLDB Endowment, 5(12), 1878-1881. [CW 2010] Wang, C., Wang, J., Lin, X., Wang, W., Wang, H., Li, H., ... & Li, R. (2010, June). MapDupReducer: detecting near duplicates over massive datasets. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (pp. 1119-1122). ACM. [RV 2010] Vernica, R., Carey, M. J., & Li, C. (2010, June). Efficient parallel set-similarity joins using MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (pp. 495-506). ACM. [PK 2014] Kathiravelu, P., & Veiga, L. (2014). An Adaptive Distributed Simulator for Cloud and MapReduce Algorithms and Architectures. In IEEE/ACM 7th International Conference on Utility and Cloud Computing (UCC 2014), London, UK (pp. 79-88). IEEE Computer Society. Thank you! Questions?
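
The slides describe blocking, multi-pass execution over multiple blocking keys, and a reduced search space per block, but show no code. The following is a minimal single-machine sketch of that general idea in Java, not the ∂u∂u implementation: the class `BlockingSketch`, its method names, the first-letter blocking key, and the 0.7 threshold are all hypothetical, and plain token-set Jaccard similarity stands in for PPJoin (there is no prefix filtering and no Hazelcast distribution here).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical illustration only: blocking followed by pairwise matching.
class BlockingSketch {

    // Group records into blocks using a caller-supplied blocking key.
    static Map<String, List<String>> block(List<String> records,
                                           Function<String, String> key) {
        Map<String, List<String>> blocks = new TreeMap<>();
        for (String r : records) {
            blocks.computeIfAbsent(key.apply(r), k -> new ArrayList<>()).add(r);
        }
        return blocks;
    }

    // Jaccard similarity over lower-cased, whitespace-separated tokens.
    static double jaccard(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> inter = new HashSet<>(ta);
        inter.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Compare only records that share a block; keep pairs at or above the threshold.
    static List<String[]> nearDuplicates(List<String> records,
                                         Function<String, String> key,
                                         double threshold) {
        List<String[]> pairs = new ArrayList<>();
        for (List<String> blk : block(records, key).values()) {
            for (int i = 0; i < blk.size(); i++) {
                for (int j = i + 1; j < blk.size(); j++) {
                    if (jaccard(blk.get(i), blk.get(j)) >= threshold) {
                        pairs.add(new String[] {blk.get(i), blk.get(j)});
                    }
                }
            }
        }
        return pairs;
    }

    // Count the pairwise comparisons performed inside the blocks.
    static long comparisons(List<String> records, Function<String, String> key) {
        long total = 0;
        for (List<String> blk : block(records, key).values()) {
            long m = blk.size();
            total += m * (m - 1) / 2;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            "data cleaning for big data",
            "data cleaning for big datasets",
            "near duplicate detection",
            "near duplicate detections");
        // Hypothetical blocking key: the first character of the record.
        Function<String, String> firstLetter = r -> r.substring(0, 1);
        System.out.println(comparisons(records, firstLetter) + " comparisons inside blocks");
        for (String[] p : nearDuplicates(records, firstLetter, 0.7)) {
            System.out.println(p[0] + " ~ " + p[1]);
        }
    }
}
```

A multi-pass run in this sketch would simply call `nearDuplicates` once per blocking key and merge the resulting pairs. The sketch also makes the search-space argument concrete: if N records split evenly into n blocks, each block needs on the order of (N/n)² comparisons, which is 1/n² of the N² comparisons needed without blocking.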