The Hadoop Ecosystem


J Singh, DataThinks.org
March 12, 2012
The Hadoop Ecosystem
• Introduction
   – What Hadoop is, and what it’s not
   – Origins and History
   – Hello Hadoop
• The Hadoop Bestiary
• The Hadoop Providers
• Hosted Hadoop Frameworks




© J Singh, 2011
What Hadoop is, and what it’s not
• A Framework for Map Reduce

• A Top-level Apache Project

• Hadoop is
   – A Framework, not a “solution”
        • Think Linux or J2EE
   – Scalable
   – Great for pipelining massive amounts of data to achieve the end result
   – Sometimes the only option
• Hadoop is not
   – A painless replacement for SQL
   – Uniformly fast or efficient
   – Great for ad hoc Analysis


You are ready for Hadoop when…
• You no longer get enthused by the prospect of more data
   – Rate of data accumulation is increasing
   – The idea of moving data from hither to yon is positively scary
   – A hit man threatens to delete your data in the middle of the night
        • And you want to pay him to do it


• Seriously, you are ready for Hadoop when analysis is the bottleneck
   – Could be because of data size
   – Could be because of the complexity of the data
   – Could be because of the level of analysis required
   – Could be because the analysis requirements are fluid




MapReduce Conceptual Underpinnings
• Based on Functional Programming model
   – From Lisp
        • (map square '(1 2 3 4)) → (1 4 9 16)
        • (reduce plus '(1 4 9 16)) → 30
   – From APL
        • +/ N, where N ← 1 2 3 4


• Easy to distribute (based on each element of the vector)

• New for Map/Reduce: Nice failure/retry semantics
   – With hundreds or thousands of low-end servers running at the same time, some of them will fail mid-job
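
The same ideas in Python, as a minimal sketch. The chunking helper, the retry wrapper and the simulated failure are illustrative assumptions, not part of Hadoop; they only show why a side-effect-free map is easy to split across workers and safe to re-run.

```python
from functools import reduce
from operator import add


def square(x):
    return x * x


numbers = [1, 2, 3, 4]

# Lisp: (map square '(1 2 3 4))
squares = list(map(square, numbers))

# Lisp: (reduce plus '(1 4 9 16)), the same fold APL writes as +/
total = reduce(add, squares)


def split_into_chunks(items, chunk_size):
    """Cut the input vector into pieces that separate workers could take."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def run_with_retry(task, chunk, max_attempts=3):
    """Re-run a failed task; this is safe because the task has no side effects."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(chunk)
        except RuntimeError:
            if attempt == max_attempts:
                raise


# Hypothetical worker that fails once, standing in for a flaky low-end server.
failures = {"remaining": 1}


def flaky_square_chunk(chunk):
    if failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise RuntimeError("simulated worker failure")
    return [square(x) for x in chunk]


# Each chunk is mapped independently; concatenating the per-chunk outputs
# gives the same answer as one big map over the whole vector.
distributed = [
    y
    for chunk in split_into_chunks(numbers, 2)
    for y in run_with_retry(flaky_square_chunk, chunk)
]
```

Because each element is mapped on its own, the first chunk can fail, be retried, and still produce the same list as the single-machine `map`.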



MapReduce Flow

Word Count Example

Lines        MapOut      Result
foo bar      foo 1       foo 3
             bar 1       labs 1
quux foo     quux 1      quux 2
             foo 1       bar 1
foo labs     foo 1
             labs 1
quux         quux 1
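
The flow above can be traced in a few lines of Python. This is an in-memory sketch of the three stages (map, group by key, reduce), not Hadoop code; the function names are chosen for illustration.

```python
from collections import defaultdict

lines = ["foo bar", "quux foo", "foo labs", "quux"]


def map_line(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: add up the 1s emitted for a single word."""
    return word, sum(counts)


# MapOut column: one pair per word, in input order.
map_out = [pair for line in lines for pair in map_line(line)]

# Result column: one total per distinct word.
result = dict(reduce_counts(word, counts)
              for word, counts in shuffle(map_out).items())
```

Here `map_out` holds the seven pairs of the MapOut column and `result` holds the four totals of the Result column.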



Hello Hadoop
• Word Count
   – Example with Unstructured Data
   – Load 5 books from Gutenberg.org into /tmp/gutenberg
   – Load them into HDFS
   – Run Hadoop
        • Results are put into HDFS
   – Copy results into file system

   – What could be simpler?

   – DIY instructions for Amazon EC2 available on DataThinks.org blog
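
The slides do not say which API the word count job uses. One option for writing it in Python is Hadoop Streaming, which passes records to the mapper and reducer as lines of text (key, tab, value) and sorts them by key in between. The sketch below assumes that convention; `run_locally` is a hypothetical stand-in for the cluster so the logic can be tried on a laptop first.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, lower-cased."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each word; the input must already be sorted by word."""
    parsed = (record.rstrip("\n").split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Stand-in for the cluster: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


sample = ["The quick brown fox", "the lazy dog", "The end"]
counts = run_locally(sample)
```

On the three sample lines, `counts` ends with the record for "the" carrying a count of 3.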




The Hadoop Ecosystem
• Introduction
• The Hadoop Bestiary
   – Core: Hadoop Map Reduce and Hadoop Distributed File System
   – Data Access: HBase, Pig, Hive
   – Algorithms: Mahout
   – Data Import: Flume, Sqoop and Nutch
• The Hadoop Providers
• Hosted Hadoop Frameworks




The Core: Hadoop and HDFS
• Hadoop
   – One master, n slaves
   – Master
        • Schedules mappers & reducers
        • Connects pipeline stages
        • Handles failure semantics
• Hadoop Distributed File System
   – Robust Data Storage across machines, insulating against failure
   – Keeps n copies of each file
        • Configurable number of copies
        • Distributes copies across racks and locations
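
Why spreading copies across racks matters can be shown with a toy placement routine. This is a simplified round-robin sketch, not HDFS's actual placement policy, which is more involved; the rack and node names are made up for the example.

```python
def place_replicas(racks, copies):
    """Pick `copies` nodes, taking one node per rack before reusing a rack.

    racks maps a rack name to the list of node names in that rack.
    """
    total_nodes = sum(len(nodes) for nodes in racks.values())
    if copies > total_nodes:
        raise ValueError("not enough nodes for the requested number of copies")

    placement = []
    depth = 0  # how many nodes have already been taken from each rack
    while len(placement) < copies:
        for rack in sorted(racks):
            nodes = racks[rack]
            if depth < len(nodes) and len(placement) < copies:
                placement.append((rack, nodes[depth]))
        depth += 1
    return placement


def surviving_copies(placement, failed_rack):
    """Nodes that still hold a copy after a whole rack goes down."""
    return [node for rack, node in placement if rack != failed_rack]


# Hypothetical three-rack cluster.
cluster = {"rack-a": ["a1", "a2"], "rack-b": ["b1", "b2"], "rack-c": ["c1"]}
three_copies = place_replicas(cluster, copies=3)
after_failure = surviving_copies(three_copies, failed_rack="rack-a")
```

With three copies on three racks, losing any single rack still leaves two copies readable.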




Hadoop Bestiary (p1a): HBase, Pig
• Database Primitives
   – HBase
        • Wide column data structure built on HDFS
• Processing
   – Pig
        • A high(-ish) level data-flow language and execution framework for parallel computation
        • Accesses HDFS and HBase
        • Batch as well as Interactive
        • Integrates UDFs written in Java, Python, JavaScript
        • Compiles to map & reduce functions – not 100% efficiently




In Pig (Latin)

   Users    = load 'users' as (name, age);
   Filtered = filter Users by
                     age >= 18 and age <= 25;
   Pages    = load 'pages' as (user, url);
   Joined   = join Filtered by name, Pages by user;
   Grouped  = group Joined by url;
   Summed   = foreach Grouped generate group,
                      COUNT(Joined) as clicks;
   Sorted   = order Summed by clicks desc;
   Top5     = limit Sorted 5;

   store Top5 into 'top5sites';


Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
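
For readers new to Pig, here is the same data flow in plain Python over a few in-memory rows. The sample users and pages are invented for illustration, and the dictionary-based join assumes user names are unique. Pig runs these steps as distributed jobs; this sketch only mirrors the logic.

```python
from collections import Counter

# Invented sample data standing in for the 'users' and 'pages' inputs.
users = [("alice", 22), ("bob", 31), ("carol", 19), ("dave", 17)]
pages = [
    ("alice", "a.com"), ("alice", "b.com"), ("carol", "a.com"),
    ("bob", "a.com"), ("dave", "c.com"), ("carol", "c.com"),
    ("carol", "a.com"),
]

# Filtered = filter Users by age >= 18 and age <= 25;
filtered = [(name, age) for name, age in users if 18 <= age <= 25]

# Joined = join Filtered by name, Pages by user;
ages = dict(filtered)
joined = [(user, ages[user], url) for user, url in pages if user in ages]

# Grouped = group Joined by url;
# Summed  = foreach Grouped generate group, COUNT(Joined) as clicks;
summed = Counter(url for _, _, url in joined)

# Sorted = order Summed by clicks desc;  Top5 = limit Sorted 5;
# Ties are broken by url so the output order is deterministic.
top5 = sorted(summed.items(), key=lambda kv: (-kv[1], kv[0]))[:5]
```

Only alice and carol pass the age filter, so `top5` ranks a.com first with three clicks.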
Pig Translation into Map Reduce


Job     Map Reduce step     Pig statement
Job 1   Load Users          Users = load …
        Filter by age       Fltrd = filter …
        Load Pages          Pages = load …
        Join on name        Joined = join …
Job 2   Group on url        Grouped = group …
        Count clicks        Summed = … count()…
Job 3   Order by clicks     Sorted = order …
        Take top 5          Top5 = limit …


Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Hadoop Bestiary (p1b): HBase, Hive
• Database Primitives
   – HBase
        • Wide column data structure built on HDFS
• Processing
   – Hive
        • Data Warehouse Infrastructure
        • QL, a subset of SQL that supports primitives supportable by Map Reduce
        • Support for custom mappers and reducers for more sophisticated analysis
        • Compiles to map & reduce functions – not 100% efficiently

            Hive Example
        CREATE TABLE page_view(viewTime INT, userid BIGINT,
                         page_url STRING, referrer_url STRING,
                         ip STRING COMMENT 'IP Address of the User')
        :: ::
        STORED AS SEQUENCEFILE;
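
To see what "supportable by Map Reduce" means, take a GROUP BY count over this table. The sketch below runs the query in SQLite (used here only as a stand-in for Hive, with invented rows) and then computes the same answer with a map step and a reduce step. It illustrates the correspondence; it is not how Hive generates its jobs.

```python
import sqlite3
from collections import defaultdict

# Invented rows: (viewTime, userid, page_url, referrer_url, ip)
rows = [
    (1000, 1, "/home", "/search", "10.0.0.1"),
    (1001, 2, "/home", "/ads", "10.0.0.2"),
    (1002, 1, "/about", "/home", "10.0.0.1"),
    (1003, 3, "/home", "/search", "10.0.0.3"),
]

# SQL side: SQLite stands in for Hive so the query can run locally.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_view (viewTime INT, userid BIGINT, "
    "page_url TEXT, referrer_url TEXT, ip TEXT)"
)
conn.executemany("INSERT INTO page_view VALUES (?, ?, ?, ?, ?)", rows)
sql_result = dict(
    conn.execute("SELECT page_url, COUNT(*) FROM page_view GROUP BY page_url")
)
conn.close()


# Map Reduce side: the GROUP BY column becomes the key, COUNT(*) becomes a sum.
def map_row(row):
    return row[2], 1


grouped = defaultdict(list)
for key, value in map(map_row, rows):
    grouped[key].append(value)

mr_result = {key: sum(values) for key, values in grouped.items()}
```

Both `sql_result` and `mr_result` report three views of /home and one of /about.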

Hadoop Bestiary (p2): Mahout
• Algorithms
   – Mahout
        • Scalable machine learning and data mining
        • Runs on top of Hadoop
        • Written in Java
        • In active development
            – Algorithms being added
• Examples
   – Clustering Algorithms
        • Canopy Clustering
        • K-Means Clustering
        • …
   – Recommenders / Collaborative Filtering Algorithms
   – Other
        • Regression Algorithms
        • Neural Networks
        • Hidden Markov Models
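
K-Means shows why such algorithms fit on top of Hadoop: assigning each point to its nearest centroid is a map, and averaging the points of each cluster is a reduce. The sketch below is a one-dimensional toy in plain Python with made-up points and starting centroids; it is not Mahout's implementation.

```python
from collections import defaultdict


def assign(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)),
                  key=lambda i: abs(point - centroids[i]))
    return nearest, point


def recompute(points):
    """Reduce step: the new centroid is the mean of its assigned points."""
    return sum(points) / len(points)


def kmeans_step(points, centroids):
    """One iteration: map every point, group by cluster, reduce each group."""
    clusters = defaultdict(list)
    for point in points:
        index, value = assign(point, centroids)
        clusters[index].append(value)
    # A centroid with no assigned points keeps its old position.
    return [recompute(clusters[i]) if clusters[i] else centroids[i]
            for i in range(len(centroids))]


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = [0.0, 5.0]
for _ in range(3):
    centroids = kmeans_step(points, centroids)
```

Each iteration corresponds to one Map Reduce pass over the data, which is why iterative algorithms like this one cost several jobs.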




Hadoop Bestiary (p3): Data Import
• Data Import Mechanisms
   – Sqoop: Structured Data
   – Flume: Streams
• Data Import
   – Sqoop
        • Import from RDBMS to HDFS
        • Export too
   – Flume
        • Import streams
            – Text Files
            – System Logs
   – Nutch
        • Import from Web
        • Note: Nutch grew out of Lucene, and Hadoop grew out of Nutch




Hadoop Bestiary (p4): Complete Picture




The Hadoop Ecosystem
• Introduction
• The Hadoop Bestiary
• The Hadoop Providers
   – Apache
   – Cloudera
   – Options when your data lives in a Database
• Hosted Hadoop Frameworks




Apache Distribution
• The Definitive Repository
   – The hub for Code, Documentation, Tutorials

   – Many contributors, for example
        • Pig was a Yahoo! Contribution
        • Hive came from Facebook
        • Sqoop came from Cloudera


• Bare metal install option:
   – Download to your machine(s) from Apache
   – Install and Operate
        • Modify to fit your business better




Cloudera
• Cloudera : Hadoop :: Red Hat : Linux

• Cloudera’s Distribution Including Apache Hadoop (CDH)
   – A packaged set of Hadoop modules that work together
   – Now at CDH3
   – Largest contributor of code to Apache Hadoop


• $76M in Venture funding so far




When the data lives in a Database…

• Objective: keeping Analytics and Data as close as possible


• Options for RDBMS:
   – Sqoop data to/from HDFS
        • Need to move the data
   – In-database analytics
        • Available for Teradata, Greenplum, etc.
        • If you have the need
            – And the $$$
• Options for NoSQL Databases
   – Sqoop-like connectors
        • Need to move the data
        • Can utilize all parts of Hadoop
   – Built-in Map Reduce available for most NoSQL databases
        • Knows about and tuned to the storage mechanism
        • But typically only offers map and reduce
            – No Pig, Hive, …



The Hadoop Ecosystem
• Introduction
• The Hadoop Bestiary
• The Hadoop Providers
• Hadoop Platforms as a Service
   – Amazon Elastic MapReduce
   – Hadoop in Windows Azure
   – Google App Engine
   – Other
        • Infochimps
        • IBM SmartCloud




Amazon Elastic Map Reduce (EMR)
• Hosted Map Reduce
   – CLI on your laptop
        • Control over size of cluster
        • Automatic spin-up/down instances


   – Map & Reduce programs on S3
        • Pig, Hive or
        • Custom in Java, Ruby, Python, Perl, PHP, R, C++, Cascading


   – Data In/Out on S3 or
   – Data In/Out on DynamoDB


• Keep in mind:
   – Hadoop on EC2 is also an option

Hadoop in Windows Azure
• Basic Level
   – Hive Add-in for Excel
   – Hive ODBC Driver


• Hadoop-based Distribution for Windows Server and Azure
   – Strategic Partnership with Hortonworks
   – Windows-based CLI on your laptop


• Broadest Level
   – JavaScript framework for Hadoop
   – Hadoop connectors for SQL Server and Parallel Data Warehouse




Google App Engine MapReduce
• Map Reduce as a Service
   – Distinct from Google’s internal Map Reduce
   – Part of Google App Engine


• Works with Google Datastore
   – A Wide Column Store


• A “purely programmatic” environment
   – Write Map and Reduce functions in Python / Java




Map Reduce Use at Google




Takeaways
• There are many flavors of Hadoop.
   – The important part is Functional Programming and Map Reduce
   – Don’t let the proliferation of choices stump you.
   – Experiment with it!




Thank you
• J Singh
   – President, Early Stage IT
        • Technology Services and Strategy for Startups


• DataThinks.org is a new service of Early Stage IT
   – “Big Data” analytics solutions





Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
 
Introduction to NOSQL databases
Introduction to NOSQL databasesIntroduction to NOSQL databases
Introduction to NOSQL databases
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Managing and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonManaging and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in Python
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Nosql data models
Nosql data modelsNosql data models
Nosql data models
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
An Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBAn Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDB
 
Session 14 - Hive
Session 14 - HiveSession 14 - Hive
Session 14 - Hive
 
Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
 
Implementing a Data Lake
Implementing a Data LakeImplementing a Data Lake
Implementing a Data Lake
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
 
Apache Ranger Hive Metastore Security
Apache Ranger Hive Metastore Security Apache Ranger Hive Metastore Security
Apache Ranger Hive Metastore Security
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 

Destacado (6)

Media Buying Platform Ecosystem
Media Buying Platform EcosystemMedia Buying Platform Ecosystem
Media Buying Platform Ecosystem
 
Creating an Ecosystem Platform with Vertical PaaS
Creating an Ecosystem Platform with Vertical PaaSCreating an Ecosystem Platform with Vertical PaaS
Creating an Ecosystem Platform with Vertical PaaS
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Understanding the Online Advertising Technology Landscape
Understanding the Online Advertising Technology Landscape Understanding the Online Advertising Technology Landscape
Understanding the Online Advertising Technology Landscape
 
Business Ecosystem Design
Business Ecosystem DesignBusiness Ecosystem Design
Business Ecosystem Design
 

Similar a The Hadoop Ecosystem

Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
Jesus Rodriguez
 
Distributed Data processing in a Cloud
Distributed Data processing in a CloudDistributed Data processing in a Cloud
Distributed Data processing in a Cloud
elliando dias
 
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data Analytics
NetajiGandi1
 

Similar a The Hadoop Ecosystem (20)

Hive and Pig for .NET User Group
Hive and Pig for .NET User GroupHive and Pig for .NET User Group
Hive and Pig for .NET User Group
 
Hadoop
HadoopHadoop
Hadoop
 
Scalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worldsScalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worlds
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
Presentation
PresentationPresentation
Presentation
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Introduction to hadoop V2
Introduction to hadoop V2Introduction to hadoop V2
Introduction to hadoop V2
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
Introduction to pig
Introduction to pigIntroduction to pig
Introduction to pig
 
Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010
 
Distributed Data processing in a Cloud
Distributed Data processing in a CloudDistributed Data processing in a Cloud
Distributed Data processing in a Cloud
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
 
SpringPeople Introduction to Apache Hadoop
SpringPeople Introduction to Apache HadoopSpringPeople Introduction to Apache Hadoop
SpringPeople Introduction to Apache Hadoop
 
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data Analytics
 
Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010
 
Hadoop
HadoopHadoop
Hadoop
 

Más de J Singh

PaaS - google app engine
PaaS  - google app enginePaaS  - google app engine
PaaS - google app engine
J Singh
 
Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)
J Singh
 
Data Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and TradeoffsData Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and Tradeoffs
J Singh
 
CS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitCS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed Commit
J Singh
 
CS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency ControlCS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency Control
J Singh
 
CS 542 -- Query Optimization
CS 542 -- Query OptimizationCS 542 -- Query Optimization
CS 542 -- Query Optimization
J Singh
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query Execution
J Singh
 
CS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage ManagementCS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage Management
J Singh
 
CS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceCS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduce
J Singh
 
CS 542 Database Index Structures
CS 542 Database Index StructuresCS 542 Database Index Structures
CS 542 Database Index Structures
J Singh
 
CS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and PerformanceCS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and Performance
J Singh
 
CS 542 Overview of query processing
CS 542 Overview of query processingCS 542 Overview of query processing
CS 542 Overview of query processing
J Singh
 

Más de J Singh (20)

OpenLSH - a framework for locality sensitive hashing
OpenLSH  - a framework for locality sensitive hashingOpenLSH  - a framework for locality sensitive hashing
OpenLSH - a framework for locality sensitive hashing
 
Designing analytics for big data
Designing analytics for big dataDesigning analytics for big data
Designing analytics for big data
 
Open LSH - september 2014 update
Open LSH  - september 2014 updateOpen LSH  - september 2014 update
Open LSH - september 2014 update
 
PaaS - google app engine
PaaS  - google app enginePaaS  - google app engine
PaaS - google app engine
 
Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)
 
Data Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and TradeoffsData Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and Tradeoffs
 
Facebook Analytics with Elastic Map/Reduce
Facebook Analytics with Elastic Map/ReduceFacebook Analytics with Elastic Map/Reduce
Facebook Analytics with Elastic Map/Reduce
 
Big Data Laboratory
Big Data LaboratoryBig Data Laboratory
Big Data Laboratory
 
Social Media Mining using GAE Map Reduce
Social Media Mining using GAE Map ReduceSocial Media Mining using GAE Map Reduce
Social Media Mining using GAE Map Reduce
 
High Throughput Data Analysis
High Throughput Data AnalysisHigh Throughput Data Analysis
High Throughput Data Analysis
 
NoSQL and MapReduce
NoSQL and MapReduceNoSQL and MapReduce
NoSQL and MapReduce
 
CS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitCS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed Commit
 
CS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency ControlCS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency Control
 
CS 542 -- Query Optimization
CS 542 -- Query OptimizationCS 542 -- Query Optimization
CS 542 -- Query Optimization
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query Execution
 
CS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage ManagementCS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage Management
 
CS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceCS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduce
 
CS 542 Database Index Structures
CS 542 Database Index StructuresCS 542 Database Index Structures
CS 542 Database Index Structures
 
CS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and PerformanceCS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and Performance
 
CS 542 Overview of query processing
CS 542 Overview of query processingCS 542 Overview of query processing
CS 542 Overview of query processing
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Último (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

The Hadoop Ecosystem

  • 1. The Hadoop Ecosystem
      J Singh, DataThinks.org
      March 12, 2012
  • 2. The Hadoop Ecosystem
      • Introduction
          – What Hadoop is, and what it’s not
          – Origins and History
          – Hello Hadoop
      • The Hadoop Bestiary
      • The Hadoop Providers
      • Hosted Hadoop Frameworks
      © J Singh, 2011
  • 3. What Hadoop is, and what it’s not
      • A Framework for Map Reduce
      • A Top-level Apache Project
      • Hadoop is
          – A Framework, not a “solution”
              • Think Linux or J2EE
          – Scalable
          – Great for pipelining massive amounts of data to achieve the end result
          – Sometimes the only option
      • Hadoop is not
          – A painless replacement for SQL
          – Uniformly fast or efficient
          – Great for ad hoc Analysis
  • 4. You are ready for Hadoop when…
      • You no longer get enthused by the prospect of more data
          – Rate of data accumulation is increasing
          – The idea of moving data from hither to yon is positively scary
          – A hit man threatens to delete your data in the middle of the night
              • And you want to pay him to do it
      • Seriously, you are ready for Hadoop when analysis is the bottleneck
          – Could be because of data size
          – Could be because of the complexity of the data
          – Could be because of the level of analysis required
          – Could be because the analysis requirements are fluid
  • 5. MapReduce Conceptual Underpinnings
      • Based on Functional Programming model
          – From Lisp
              • (map square '(1 2 3 4)) → (1 4 9 16)
              • (reduce plus '(1 4 9 16)) → 30
          – From APL
              • +/ N, with N ← 1 2 3 4
      • Easy to distribute (based on each element of the vector)
      • New for Map/Reduce: Nice failure/retry semantics
          – Needed because hundreds or thousands of low-end servers are running at the same time
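The Lisp and APL one-liners on slide 5 have close equivalents in most languages. The following is a minimal Python sketch of the same idea; the names `square`, `numbers`, `squares`, `total`, and `apl_sum` are chosen here for illustration and do not come from the slides.

```python
from functools import reduce
import operator


def square(x):
    """Function applied independently to every element (the 'map' step)."""
    return x * x


numbers = [1, 2, 3, 4]

# (map square '(1 2 3 4)): each element is processed on its own,
# which is what makes the step easy to distribute.
squares = list(map(square, numbers))

# (reduce plus '(1 4 9 16)): fold the mapped values into one result.
total = reduce(operator.add, squares)

# APL's +/ N is a plus-reduction over the vector N itself.
apl_sum = reduce(operator.add, numbers)

print(squares, total, apl_sum)
```

Because `square` needs only its own element, the map step could in principle run on separate machines; only the reduce step has to bring values together.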
  • 6. MapReduce Flow: Word Count Example
      [Figure: word-count data flow in three columns. Lines: input text made up of the words foo, bar, quux and labs. MapOut: one (word, 1) pair per word occurrence. Result: foo 3, quux 2, bar 1, labs 1.]
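The word-count flow on slide 6 can be simulated in a few lines of plain Python. This is only a sketch of the three stages, not Hadoop code: the function names are hypothetical, and the input lines are an assumption (the slide's original line breaks are not recoverable from the transcript), chosen so that the totals match the slide's Result column.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit one (word, 1) pair per word occurrence, like the MapOut column."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts to get one total per word."""
    return {word: sum(counts) for word, counts in groups.items()}


# Assumed input: three lines containing the same words as the slide.
lines = ["foo bar", "quux foo", "foo labs quux"]

map_out = list(map_phase(lines))
result = reduce_phase(shuffle(map_out))
print(result)
```

In a real cluster the shuffle is performed by the framework, and each reducer sees only the keys routed to it.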
  • 7. Hello Hadoop
      • Word Count
          – Example with Unstructured Data
          – Load 5 books from Gutenberg.org into /tmp/gutenberg
          – Load them into HDFS
          – Run Hadoop
              • Results are put into HDFS
          – Copy results into file system
          – What could be simpler?
          – DIY instructions for Amazon EC2 available on DataThinks.org blog
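One common way to run a word count like the one on slide 7 without writing Java is Hadoop Streaming, where the mapper and reducer are programs that read lines from standard input and write tab-separated key/value lines to standard output. The sketch below is an assumption about how such a job might look, not the talk's actual code: `mapper`, `reducer`, and `run_local` are hypothetical names, and `run_local` merely imitates the sort that Hadoop performs between the two stages so the logic can be tried without a cluster.

```python
import io
from itertools import groupby


def mapper(stream):
    """Read text lines and emit 'word<TAB>1' records."""
    for line in stream:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum counts per word; input must already be sorted by word."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(text):
    """Stand-in for the cluster: map, sort by key, then reduce."""
    mapped = sorted(mapper(io.StringIO(text)))
    return list(reducer(mapped))


if __name__ == "__main__":
    for record in run_local("The cat\nthe hat\n"):
        print(record)
```

The reducer relies on sorted input, which is why it can use `groupby` instead of holding every word in memory.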
  • 8. The Hadoop Ecosystem
      • Introduction
      • The Hadoop Bestiary
          – Core: Hadoop Map Reduce and Hadoop Distributed File System
          – Data Access: HBase, Pig, Hive
          – Algorithms: Mahout
          – Data Import: Flume, Sqoop and Nutch
      • The Hadoop Providers
      • Hosted Hadoop Frameworks
  • 9. The Core: Hadoop and HDFS
      • Hadoop
          – One master, n slaves
          – Master
              • Schedules mappers & reducers
              • Connects pipeline stages
              • Handles failure semantics
      • Hadoop Distributed File System
          – Robust Data Storage across machines, insulating against failure
          – Keeps n copies of each file
              • Configurable number of copies
              • Distributes copies across racks and locations
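The idea of keeping a configurable number of copies and spreading them across racks can be illustrated with a toy placement function. This is explicitly not HDFS's actual placement policy; `place_replicas`, its parameters, and the rack and node names are all hypothetical, and the round-robin strategy is just the simplest one that puts copies on different racks where possible.

```python
def place_replicas(block_id, nodes_by_rack, copies=3):
    """Toy placement: walk the racks round-robin so copies land on
    different racks before any rack is used twice."""
    racks = sorted(nodes_by_rack)
    start = block_id % len(racks)  # vary the first rack per block
    placement = []
    depth = 0  # index of the node to take from each rack in this pass
    while len(placement) < copies:
        progressed = False
        for offset in range(len(racks)):
            rack = racks[(start + offset) % len(racks)]
            nodes = nodes_by_rack[rack]
            if depth < len(nodes):
                placement.append((rack, nodes[depth]))
                progressed = True
                if len(placement) == copies:
                    break
        if not progressed:
            raise ValueError("not enough nodes for the requested copies")
        depth += 1
    return placement


cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5"]}
print(place_replicas(0, cluster, copies=3))
```

With three racks and three copies, losing any single rack still leaves two copies of the block available.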
  • 10. Hadoop Bestiary (p1a): HBase, Pig
      • Database Primitives
          – HBase
              • Wide column data structure built on HDFS
      • Processing
          – Pig
              • A high(-ish) level data-flow language and execution framework for parallel computation
              • Accesses HDFS and HBase
              • Batch as well as Interactive
              • Integrates UDFs written in Java, Python, JavaScript
              • Compiles to map & reduce functions – not 100% efficiently
  • 11. In Pig (Latin)
      Users = load 'users' as (name, age);
      Filtered = filter Users by age >= 18 and age <= 25;
      Pages = load 'pages' as (user, url);
      Joined = join Filtered by name, Pages by user;
      Grouped = group Joined by url;
      Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
      Sorted = order Summed by clicks desc;
      Top5 = limit Sorted 5;
      store Top5 into 'top5sites';
      Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
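To see what the Pig script on slide 11 computes, it can help to run the same logic on a small in-memory data set. The Python sketch below mirrors the script step by step; `top_sites` and the sample records are hypothetical, and it assumes each user name appears only once in `users` (with duplicates, Pig's join would multiply rows, which this version does not reproduce).

```python
from collections import Counter


def top_sites(users, pages, limit=5):
    """users: (name, age) pairs; pages: (user, url) pairs."""
    # Filtered = filter Users by age >= 18 and age <= 25
    young = {name for name, age in users if 18 <= age <= 25}
    # Joined / Grouped / Summed: keep page views by those users,
    # then count clicks per url
    clicks = Counter(url for user, url in pages if user in young)
    # Sorted = order ... desc; Top5 = limit Sorted 5
    return clicks.most_common(limit)


users = [("ann", 20), ("bob", 30), ("cy", 18)]
pages = [("ann", "a.com"), ("bob", "a.com"), ("cy", "a.com"), ("ann", "b.com")]
print(top_sites(users, pages))
```

The Pig version expresses the same pipeline declaratively and leaves it to the compiler to split the work into map and reduce jobs.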
  • 12. Pig Translation into Map Reduce
      [Figure: the Pig script's statements mapped onto three Map Reduce jobs]
      • Job 1
          – Load Users (Users = load …)
          – Load Pages (Pages = load …)
          – Filter by age (Fltrd = filter …)
          – Join on name (Joined = join …)
      • Job 2
          – Group on url (Grouped = group …)
          – Count clicks (Summed = … count()…)
      • Job 3
          – Order by clicks (Sorted = order …)
          – Take top 5 (Top5 = limit …)
      Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
  • 13. Hadoop Bestiary (p1b): HBase, Hive
      • Database Primitives
          – HBase
              • Wide column data structure built on HDFS
      • Processing
          – Hive
              • Data Warehouse Infrastructure
              • QL, a subset of SQL that supports primitives supportable by Map Reduce
              • Support for custom mappers and reducers for more sophisticated analysis
              • Compiles to map & reduce functions – not 100% efficiently
      • Hive Example
          CREATE TABLE page_view(viewTime INT, userid BIGINT,
              page_url STRING, referrer_url STRING,
              ip STRING COMMENT 'IP Address of the User')
          ::
          ::
          STORED AS SEQUENCEFILE;
  • 14. Hadoop Bestiary (p2): Mahout
      • Algorithms
          – Mahout
              • Scalable machine learning and data mining
              • Runs on top of Hadoop
              • Written in Java
              • In active development
                  – Algorithms being added
      • Examples
          – Clustering Algorithms
              • Canopy Clustering
              • K-Means Clustering
              • …
          – Recommenders / Collaborative Filtering Algorithms
          – Other
              • Regression Algorithms
              • Neural Networks
              • Hidden Markov Models
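K-Means, one of the clustering algorithms listed on slide 14, fits the map/reduce shape naturally: assigning each point to its nearest centroid is a per-point (map-like) step, and recomputing each centroid from its assigned points is a per-cluster (reduce-like) step. The following is a plain-Python sketch of that structure under those assumptions. It does not use Mahout's API, and `kmeans`, `sq_dist`, and `mean_point` are names invented for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Alternate assignment (map-like) and centroid update (reduce-like)."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
print(kmeans(points, centroids=[(0, 0), (10, 10)]))
```

On a cluster, each iteration would typically be one Map Reduce job, with the current centroids shared with every mapper.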
  • 15. Hadoop Bestiary (p3): Data Import
      • Data Import Mechanisms
          – Sqoop: Structured Data
          – Flume: Streams
      • Data Import
          – Sqoop
              • Import from RDBMS to HDFS
              • Export too
          – Flume
              • Import streams
                  – Text Files
                  – System Logs
          – Nutch
              • Import from Web
              • Note: Nutch is built on Lucene, and Hadoop originated as part of Nutch
  • 16. Hadoop Bestiary (p4): Complete Picture
  • 17. The Hadoop Ecosystem
      • Introduction
      • The Hadoop Bestiary
      • The Hadoop Providers
          – Apache
          – Cloudera
          – Options when your data lives in a Database
      • Hosted Hadoop Frameworks
  • 18. Apache Distribution
      • The Definitive Repository
          – The hub for Code, Documentation, Tutorials
          – Many contributors, for example
              • Pig was a Yahoo! Contribution
              • Hive came from Facebook
              • Sqoop came from Cloudera
      • Bare metal install option:
          – Download to your machine(s) from Apache
          – Install and Operate
              • Modify to fit your business better
  • 19. Cloudera
      • Cloudera : Hadoop :: Red Hat : Linux
      • Cloudera’s Distribution Including Apache Hadoop (CDH)
          – A packaged set of Hadoop modules that work together
          – Now at CDH3
          – Largest contributor of code to Apache Hadoop
      • $76M in Venture funding so far
  • 20. When the data lives in a Database…
      • Objective: keeping Analytics and Data as close as possible
      • Options for RDBMS:
          – Sqoop data to/from HDFS
              • Need to move the data
              • Can utilize all parts of Hadoop
          – In-database analytics
              • Available for Teradata, Greenplum, etc.
              • If you have the need
                  – And the $$$
      • Options for NoSQL Databases
          – Sqoop-like connectors
              • Need to move the data
          – Built-in Map Reduce available for most NoSQL databases
              • Knows about and tuned to the storage mechanism
              • But typically only offers map and reduce
                  – No Pig, Hive, …
  • 21. The Hadoop Ecosystem
      • Introduction
      • The Hadoop Bestiary
      • The Hadoop Providers
      • Hadoop Platforms as a Service
          – Amazon Elastic MapReduce
          – Hadoop in Windows Azure
          – Google App Engine
          – Other
              • Infochimps
              • IBM SmartCloud
  • 22. Amazon Elastic Map Reduce (EMR)
      • Hosted Map Reduce
          – CLI on your laptop
              • Control over size of cluster
              • Automatic spin-up/down instances
          – Map & Reduce programs on S3
              • Pig, Hive or
              • Custom in Java, Ruby, Python, Perl, PHP, R, C++, Cascading
          – Data In/Out on S3 or
          – Data In/Out on DynamoDB
      • Keep in mind:
          – Hadoop on EC2 is also an option
  • 23. Hadoop in Windows Azure
      • Basic Level
          – Hive Add-in for Excel
          – Hive ODBC Driver
      • Hadoop-based Distribution for Windows Server and Azure
          – Strategic Partnership with Hortonworks
          – Windows-based CLI on your laptop
      • Broadest Level
          – JavaScript framework for Hadoop
          – Hadoop connectors for SQL Server and Parallel Data Warehouse
  • 24. Google App Engine MapReduce
      • Map Reduce as a Service
          – Distinct from Google’s internal Map Reduce
          – Part of Google App Engine
      • Works with Google Datastore
          – A Wide Column Store
      • A “purely programmatic” environment
          – Write Map and Reduce functions in Python / Java
  • 25. Map Reduce Use at Google
  • 26. Take Aways
      • There are many flavors of Hadoop.
          – The important part is Functional Programming and Map Reduce
          – Don’t let the proliferation of choices stump you.
          – Experiment with it!
  • 27. Thank you
      • J Singh
          – President, Early Stage IT
              • Technology Services and Strategy for Startups
      • DataThinks.org is a new service of Early Stage IT
          – “Big Data” analytics solutions

Editor's notes

  1. Sources: Top 5 Reasons Not to Use Hadoop for Analytics; The Dark Side of Hadoop; Hadoop Don’ts: What not to do to harvest Hadoop’s full potential
  2. Get started with Hadoop
  3. http://pig.apache.org/docs/r0.9.2/index.html; Apache Hadoop; Cascading
  4. http://pig.apache.org/docs/r0.9.2/index.html
  5. Flume Users Guide; Thrift Paper
  6. Missing components: Cascading