Hadoop Ecosystem
ACM Bay Area Data Mining Camp 2011
Patrick Nicolas
September 19, 2011
http://patricknicolas.blogspot.com
http://www.slideshare.net/pnicolas
https://github.com/prnicolas

Copyright 2011 Patrick Nicolas - All rights reserved

Overview
Besides providing developers and analysts with an open-source
implementation of the MapReduce functional model, the Hadoop
ecosystem incorporates analytical algorithms, task and workflow
managers, and NoSQL stores.
Client code, scripts
NoSQL: key-value stores, document stores, multi-column stores, graph databases
Analytics: Mahout
Configuration: Zookeeper
Workflow: Hive, Pig, Cascading
Map/Reduce framework
HDFS
Java Virtual Machine

Key Components
The Hadoop ecosystem can be described as a data-centric
taxonomy of tools to analyze, aggregate, store and report data.
File system: GFS, HDFS
MapReduce: Hadoop
Admin.: Zookeeper
NoSQL
  K-V stores: Redis, Memcache, Kyoto Cabinet
  Doc stores: MongoDB, CouchDB
  Multi-column stores: HBase, Hypertable, BigData, Cassandra, BerkeleyDB
  Graph DB: Neo4j, GraphDB, InfiniteGraph
Workflow
  Script: Pig
  API: Cascading
  SQL: Hive
Analytics: Mahout, Chukwa

NoSQL: Overview

Non-relational data stores allow large amounts of data to be
collected very efficiently. Unlike RDBMS schemas, NoSQL
schemas are optimized for sequential writes and are therefore
not well suited to querying and reporting.

Key -> Value -> Column families, nested structures

NoSQL stores share the same basic key-value schema but
provide different methods to describe values.
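
As a rough illustration of that point, the Python sketch below models the shared key-value shape with a plain dictionary and varies only how the value is described. The key, field names and sample values are invented for illustration and are not taken from any particular store.

```python
# Hypothetical records: the same key addressed under three value models.
key = "user:42"

# Key-value store: the value is an opaque string.
kv_store = {key: "Ada Lovelace"}

# Document store: the value is a nested, schema-free document.
doc_store = {key: {"name": "Ada Lovelace", "langs": ["en", "fr"]}}

# Multi-column store: the value is a map of column families,
# each holding named columns.
column_store = {
    key: {
        "profile": {"name": "Ada Lovelace"},
        "activity": {"last_login": "2011-09-19"},
    }
}


def lookup(store, k):
    """Every model is still reached through the same key lookup."""
    return store[k]


print(lookup(kv_store, key))
print(lookup(doc_store, key)["langs"])
print(lookup(column_store, key)["activity"]["last_login"])
```

Only the structure behind the key changes between the three dictionaries; the access path by key stays the same.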

NoSQL: Document Stores
Key-Value files (HDFS)
<key, value>
Distributed replicable blocks of sequential key-value string pairs

Key-Value stores (Redis, Memcache)
<key, value*>
Language-independent, distributed, sorted key-value pairs (values
can be lists, sets or hashes, as in Redis) with in-memory caching and
support for atomic operations.

Document stores (MongoDB, CouchDB)
{ "k1": val1, "k2": val2 }
Fault-tolerant, document-centric stores that use a dynamic schema of
sorted JavaScript objects and support a limited SQL-like syntax.
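
To make the dynamic-schema idea concrete, here is a minimal Python sketch of a document collection held in memory. The `collection` contents and the `find` helper are hypothetical names for this illustration; they do not reproduce the MongoDB or CouchDB API.

```python
# Hypothetical in-memory "collection"; documents need not share fields.
collection = [
    {"_id": 1, "k1": "a", "k2": 10},
    {"_id": 2, "k1": "b"},                      # no k2: the schema is dynamic
    {"_id": 3, "k1": "a", "k2": 7, "tags": ["x"]},
]


def find(docs, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [
        d for d in docs
        if all(d.get(field) == value for field, value in criteria.items())
    ]


matches = find(collection, k1="a")
print([d["_id"] for d in matches])
```

A document that lacks a queried field simply does not match, which is how this sketch tolerates documents of different shapes in one collection.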

NoSQL: Tuples & Graphs

Sorted, ordered tuples (Cassandra, HBase, ...)
{ name: x, value: { key1: { name: key1, value: v1, tstamp: x }, key2: x } }

Fault-tolerant, distributed stores of sorted, ordered columns grouped
into families; a 'super-column' is a map of an unbounded number of columns.
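
The nested structure above can be sketched in Python as a map of maps. The `put` and `columns` helpers are hypothetical, and resolving conflicting writes by newest timestamp is an assumption of this sketch rather than a description of any specific store's internals.

```python
from collections import defaultdict

# row key -> super-column name -> column name -> (value, timestamp)
table = defaultdict(lambda: defaultdict(dict))


def put(row, super_col, col, value, tstamp):
    """Write a column; keep the write with the newest timestamp (assumed rule)."""
    current = table[row][super_col].get(col)
    if current is None or tstamp >= current[1]:
        table[row][super_col][col] = (value, tstamp)


def columns(row, super_col):
    """Return the columns of one super-column sorted by column name."""
    return sorted(table[row][super_col].items())


put("x", "value", "key2", "v2", 1)
put("x", "value", "key1", "v1", 1)
put("x", "value", "key1", "v1-new", 2)   # newer timestamp replaces the old value
put("x", "value", "key1", "stale", 0)    # older write is ignored
print(columns("x", "value"))
```

Because each super-column is an ordinary map, a row can gain any number of columns without a schema change.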

Graph databases (Neo4j, GraphDB, InfiniteGraph, ...)
Efficient transactions, traversal and storage of entities (vertices),
attributes and relationships (edges)
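
As a small illustration of traversal over vertices, attributes and edges, the Python sketch below walks a toy property graph breadth-first. The vertex names, the `KNOWS` label and the `traverse` helper are made up for this example and do not reflect the API of Neo4j or the other products listed.

```python
from collections import deque

# Hypothetical property graph: vertices carry attributes,
# edges carry a relationship label.
vertices = {"alice": {"age": 30}, "bob": {"age": 25}, "carol": {"age": 41}}
edges = {
    "alice": [("KNOWS", "bob")],
    "bob": [("KNOWS", "carol")],
    "carol": [],
}


def traverse(start, relationship):
    """Breadth-first traversal that follows only edges with the given label."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        vertex = queue.popleft()
        order.append(vertex)
        for label, target in edges[vertex]:
            if label == relationship and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


print(traverse("alice", "KNOWS"))
```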

Data Flow Managers
Map and Reduce tasks can be abstracted behind task or workflow
managers that expose a high-level language such as scripts, SQL or a
UNIX-pipe-like API. These data flow tools hide the functional
complexity of MapReduce from domain experts.
Scripting: Pig
SQL: Hive
API (pipes & flows): Cascading

[Diagram: the three tools sit on top of the MapReduce API, which runs Map tasks, then Combine tasks, then Reduce tasks.]
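
The Map, Combine and Reduce stages that these tools generate can be sketched in plain Python. This is a single-process illustration under simplifying assumptions (in-memory lists stand in for input splits, a dictionary stands in for the shuffle); the function names and the sample records are invented here and are not Hadoop APIs.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (key, 1) pair per record; the key is the f1 field."""
    f1, _name = record
    yield (f1, 1)


def combine(pairs):
    """Pre-aggregate the pairs of one split before the shuffle."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def reduce_phase(key, values):
    """Sum all partial counts received for one key."""
    return (key, sum(values))


def run(splits):
    """Run map and combine per split, shuffle by key, then reduce."""
    shuffled = defaultdict(list)
    for split in splits:
        mapped = [pair for record in split for pair in map_phase(record)]
        for key, value in combine(mapped):
            shuffled[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in sorted(shuffled.items()))


# Two hypothetical input splits of (f1, name) records.
splits = [[(1, "a"), (2, "b"), (1, "c")], [(2, "d"), (1, "e")]]
print(run(splits))
```

The combine step is optional for correctness: it only reduces how many pairs cross the shuffle, which is why the result is the same with or without it.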

Data Flow Code Samples
Pig Latin
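-- Load the input with a typed schema (Pig's string type is chararray)
A = LOAD 'mydata' USING PigStorage() AS (f1:int, name:chararray);
-- Group the records by f1
B = GROUP A BY f1;
-- Count the records in each group's bag
C = FOREACH B GENERATE COUNT(A);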

Hive
LOAD DATA LOCAL INPATH 'xxx' OVERWRITE INTO TABLE z;
INSERT OVERWRITE TABLE z SELECT count(*) FROM y GROUP BY f1;

Cascading
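// inpath, outpath and props are assumed to be defined by the caller
Scheme srcScheme = new TextLine(new Fields("line"));
Tap src = new Hfs(srcScheme, inpath);
// The slide does not define the sink; a text sink at outpath is assumed here
Tap sink = new Hfs(new TextLine(), outpath);
Pipe counter = new Pipe("count");
// Parsing "line" into an "f1" field is omitted; GroupBy expects "f1" to exist
counter = new GroupBy(counter, new Fields("f1"));
// Count the tuples in each group
counter = new Every(counter, new Count());
FlowConnector connector = new FlowConnector(props);
Flow flow = connector.connect("count", src, sink, counter);
flow.complete();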

