A Brief Discussion on: Hadoop MapReduce, Pig, FlumeJava, Cascading & Dremel




Presented by: Somnath Mazumdar
29th Nov 2011
MapReduce
- Based on Google's MapReduce programming framework
- File system: GFS for Google's MapReduce, HDFS for Hadoop
- Language: Google's MapReduce is written in C++, while Hadoop is written in Java
- Basic functions: Map and Reduce, inspired by similar primitives in LISP and other languages

Why should we use it?
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Status and monitoring
MapReduce
Map function:
(1) Processes an input key/value pair
(2) Produces a set of intermediate key/value pairs
Syntax: map(key, value) -> list(key, inter_value)

Reduce function:
(1) Combines all intermediate values for a particular key
(2) Produces a set of merged output values
Syntax: reduce(out_key, list(inter_value)) -> list(out_value)
Programming Model (word count)

Input records (from HDFS):
  M1: "Hello World, Bye World!"
  M2: "Welcome to UCD, Goodbye to UCD."
  M3: "Hello MapReduce, Goodbye to MapReduce."

Map phase (intermediate pairs):
  M1: (Hello, 1) (Bye, 1) (World, 1) (World, 1)
  M2: (Welcome, 1) (to, 1) (to, 1) (Goodbye, 1) (UCD, 1) (UCD, 1)
  M3: (Hello, 1) (to, 1) (Goodbye, 1) (MapReduce, 1) (MapReduce, 1)

Reduce phase (merged output):
  R1: (Hello, 2) (Bye, 1) (Welcome, 1) (to, 3)
  R2: (World, 2) (UCD, 2) (Goodbye, 2) (MapReduce, 2)

Pipeline: HDFS -> Map Phase -> Intermediate Result -> Reduce Phase -> HDFS
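The word count above maps directly onto Hadoop's Mapper and Reducer classes. Below is a minimal sketch against the org.apache.hadoop.mapreduce API; the class name and the command-line input/output paths are illustrative, not part of the original slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map(key, value) -> list(word, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);        // emit (word, 1)
      }
    }
  }

  // reduce(word, list(1, 1, ...)) -> (word, count)
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}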
MapReduce
Applications:
(1) Distributed grep and distributed sort
(2) Web link-graph reversal
(3) Web access log statistics
(4) Document clustering
(5) Machine learning, and so on...

To know more:
- "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat, Google, Inc.
- "Hadoop: The Definitive Guide", O'Reilly Media
PIG
- Pig was first developed at Yahoo! Research around 2006 and later moved to the Apache Software Foundation.
- Pig is a data-flow programming environment for processing large files, based on MapReduce/Hadoop.
- A high-level platform for creating MapReduce programs, used with Hadoop and HDFS.
- An Apache library that interprets scripts written in Pig Latin and runs them on a Hadoop cluster.

At Yahoo! 40% of all Hadoop jobs are run with Pig.
PIG
Workflow:
Step 1: Load the input data.
Step 2: Manipulate the data with operations such as filter, foreach, distinct, or any user-defined function.
Step 3: Group the data.
Final step: Write the data back to the DFS, or repeat the steps if another dataset arrives.
(A minimal script following this workflow is sketched after this slide.)

Scripts written in Pig Latin --(Pig library/engine)--> Hadoop-ready jobs

Take-away point: do more with the data, not with the functions.
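For illustration, here is a minimal sketch of that load / filter / group / store workflow, driven from Java through Pig's PigServer class. The relation names, file paths, and schema are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWorkflowSketch {
  public static void main(String[] args) throws Exception {
    // Run the Pig Latin statements on a Hadoop cluster
    // (use ExecType.LOCAL to try this out locally).
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Step 1: load the input data (path and schema are hypothetical).
    pig.registerQuery("logs = LOAD 'access_log' AS (user:chararray, url:chararray);");

    // Step 2: manipulate the data (filter, foreach, distinct, UDFs, ...).
    pig.registerQuery("pages = FILTER logs BY url matches '.*html.*';");

    // Step 3: group the data and compute a per-group aggregate.
    pig.registerQuery("by_user = GROUP pages BY user;");
    pig.registerQuery("counts = FOREACH by_user GENERATE group, COUNT(pages);");

    // Final step: write the result back to the DFS.
    pig.store("counts", "page_counts");
  }
}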
Cascading
A query API and query planner for defining, sharing, and executing data-processing workflows.

Supports creating and executing complex data-processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.).

Originally authored by Chris Wensel (founder of Concurrent, Inc.)
What does it offer?
  - Data Processing API (core)
  - Process Planner
  - Process Scheduler
How to use it?
  1. Install Hadoop.
  2. Build the Hadoop job .jar, which must contain the Cascading .jars.
Cascading: 'Source-Pipe-Sink'
How does it work?
Source: data is captured from sources.
Pipes: are created independently of the data they will process; Cascading supports a reusable 'pipes' concept.
Sinks: results are stored in output files, or 'sinks'.
The Data Processing API provides this source-pipe-sink mechanism (see the sketch after this slide).
Once a pipe assembly is tied to data sources and sinks, it is called a 'flow' (topological scheduler). Flows can be grouped into a 'cascade' (the CascadeConnector class), and the process scheduler ensures that a given flow does not execute until all of its dependencies are satisfied.
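A minimal word-count assembly in the spirit of the source-pipe-sink model, written against the Cascading 1.x Java API as documented in its user guide; the field names, paths, and the word-splitting regex below are assumptions for illustration, not taken from the slides.

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class CascadingWordCount {
  public static void main(String[] args) {
    // Source: read lines of text from HDFS.
    Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
    // Sink: write (word, count) tuples back to HDFS.
    Tap sink = new Hfs(new TextLine(), args[1], SinkMode.REPLACE);

    // Pipe assembly: split lines into words, group by word, count each group.
    Pipe assembly = new Pipe("wordcount");
    assembly = new Each(assembly, new Fields("line"),
        new RegexGenerator(new Fields("word"), "\\S+")); // one tuple per word
    assembly = new GroupBy(assembly, new Fields("word"));
    assembly = new Every(assembly, new Count(new Fields("count")));

    // Flow: tie the assembly to its source and sink, then run it.
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, CascadingWordCount.class);
    Flow flow = new FlowConnector(properties).connect("wordcount", source, sink, assembly);
    flow.complete();
  }
}

Once connected, the planner turns this assembly into the graph of dependent MapReduce jobs described on the next slide.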
Cascading
Pipe assembly --(MR job planner)--> graph of dependent MapReduce jobs.
Also provides external data interfaces for other data sources.

It efficiently supports splits, joins, grouping, and sorting.

Uses: log-file analysis, bioinformatics, machine learning, predictive analytics, web content mining, etc.

Cascading was cited as one of the top five most powerful Hadoop projects by SD Times in 2011.
FlumeJava
A Java library API that makes it easy to develop, test, and run efficient data-parallel pipelines.
Born in May 2009 at Google.
The library is a collection of immutable parallel collection classes.
FlumeJava:
1. Abstracts how the data is represented: as an in-memory data structure or as a file.
2. Abstracts away implementation details, such as whether an operation runs as a local loop or as a remote MR job.
3. Implements parallel operations using deferred evaluation.
FlumeJava
How does it work?
Step 1: Invoke a parallel operation.
Step 2: Do not run it immediately. Instead:
       2.1. Record the operation and its arguments.
       2.2. Save them into an internal execution-plan graph structure.
       2.3. Construct the execution plan for the whole computation.
Step 3: Optimize the execution plan.
Step 4: Execute it.
Faster than a typical MR pipeline with the same logical structure, and easier to write.
FlumeJava
Data model:
PCollection<T>: the central class, an immutable bag of elements of type T.
Can be unordered (a collection, which is more efficient) or ordered (a sequence).
PTable<K, V>: the second central class.
An immutable multi-map with keys of class K and values of class V.
Operators (combined into pipelines like the sketch after this slide):
parallelDo(PCollection<T>): the core parallel primitive
groupByKey(PTable<Pair<K, V>>)
combineValues(PTable<Pair<K, Collection<V>>>)
flatten(): a logical view of multiple PCollections as one PCollection
join()
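The operators above compose into pipelines such as the word-count example in the FlumeJava paper. The sketch below follows the paper's published API; FlumeJava is an internal Google library, so helpers such as readTextFileCollection, splitIntoWords, collectionOf, tableOf, and SUM_INTS are taken from the paper and will not compile against a public artifact.

// Deferred pipeline: nothing below executes until FlumeJava.run() is called.
PCollection<String> lines = readTextFileCollection("/gfs/data/shakes/hamlet.txt");

// parallelDo: split each line into words.
PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
  void process(String line, EmitFn<String> emitFn) {
    for (String word : splitIntoWords(line)) {
      emitFn.emit(word);
    }
  }
}, collectionOf(strings()));

// parallelDo again: pair each word with a count of 1.
PTable<String, Integer> wordsWithOnes = words.parallelDo(
    new DoFn<String, Pair<String, Integer>>() {
      void process(String word, EmitFn<Pair<String, Integer>> emitFn) {
        emitFn.emit(Pair.of(word, 1));
      }
    }, tableOf(strings(), ints()));

// groupByKey + combineValues: sum the ones per word.
PTable<String, Collection<Integer>> grouped = wordsWithOnes.groupByKey();
PTable<String, Integer> wordCounts = grouped.combineValues(SUM_INTS);

// Optimize the recorded execution plan and run it (locally or as MR jobs).
FlumeJava.run();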
Dremel
A distributed system for interactive analysis of large datasets, in use at Google since 2006.
Provides a custom, scalable data-management solution built over shared clusters of commodity machines.
Three key aspects:
1. Storage format: a column-striped storage representation for non-relational nested data (a lossless representation).
Why nested?
Because it builds on the platform-neutral, extensible mechanism used at Google for serializing structured data.
What is the main aim?
Store all values of a given field consecutively to improve retrieval efficiency.
Dremel
2. Query language: provides a high-level, SQL-like language to express ad hoc queries.
   It is efficiently implementable on columnar nested storage.
   Fields are referenced using path expressions.
   Supports nested subqueries, inter- and intra-record aggregation, joins, etc.
3. Execution: a multi-level serving-tree concept (as in a distributed search engine).
   Several queries can execute simultaneously.
   A query dispatcher schedules queries based on priorities and balances the load.
I am lost... Are MR and Dremel the same?

Feature                     MapReduce (aka MR)                              Dremel
Birth year & place          Since 2004 @ Google lab                         Since 2006 @ Google lab
Type                        Distributed & parallel programming framework    Distributed interactive ad hoc query system
Scalable & fault tolerant   Yes                                             Yes
Data processing             Record oriented                                 Column oriented
Batch processing            Yes                                             No
In situ processing          No                                              Yes

Take-away point: Dremel complements MapReduce-based computing.