What is Big Data?
●   How big is “Big Data”?
    ●   Is 30 to 40 terabytes big data?
    ●   ….
●   Big data are datasets that grow so large that they
    become awkward to work with using on-hand
    database management tools
●   Today: terabytes, petabytes, exabytes
●   Tomorrow?
Enterprises & Big Data
●   Most companies are currently using traditional tools to
    store data
●   Big data: The next frontier for innovation, competition,
    and productivity
●   The use of big data will become a key basis of competition
●   Organisations across the globe need to take the rising
    importance of big data more seriously
Hadoop is an ecosystem, not a single product.




When you deal with Big Data, the data center is your computer.
•   A Brief History of Hadoop
•   Contributors and Development
•   What is Hadoop
•   Why Hadoop
•   Hadoop Ecosystem
A Brief History of Hadoop
•   Hadoop has its origins in Apache Nutch

•   Nutch was started in 2002

•   Challenge: the billions of pages on the Web

•   2003: Google published the GFS (Google File System) paper

•   2004: NDFS (Nutch Distributed File System)

•   2004: Google published the MapReduce paper

•   2005: Nutch developers started developing their own
    MapReduce implementation
Contributors and Development

Lifetime patches contributed for all Hadoop-related projects: community members by
current employer
* Source: JIRA tickets
Contributors and Development

* Source: Kerberos Konference (Yahoo), 2010
Development in ASF/Hadoop
●   Resources
    ●   Mailing lists
    ●   Wiki pages, blogs
    ●   Issue tracking: JIRA
    ●   Version control: SVN, Git
What is Hadoop
•   Open-source project administered by the ASF

•   Data-intensive storage

•   Massively parallel processing (MPP)

•   Enables applications to work with thousands of nodes and
    petabytes of data

•   Suitable for applications with large data sets
What is Hadoop ?

•   Scalable

•   Fault tolerant

•   Reliable data storage using the Hadoop Distributed
    File System (HDFS)

•   High-performance parallel data processing using a
    technique called MapReduce
What is Hadoop ?

•   Hadoop is becoming the de facto standard for large-scale
    data processing

•   It is becoming more than just MapReduce

•   The ecosystem is growing rapidly, with lots of great tools around it
What is Hadoop ?



Yahoo Hadoop Cluster
38,000 machines distributed across 20 different clusters.
Source: Yahoo 2010

50,000 m : January 2012
Source: http://www.computerworlduk.com/in-depth/applications/3329092/hadoop-could-save-you-money-over-a-traditional-rdbms/

SGI Hadoop Cluster
Why Hadoop?
•   Hadoop has its origins in Apache Nutch
•   Can process Big Data (petabytes and more)
•   Virtually unlimited data storage and analysis
•   No licence cost: Apache License 2.0
•   Can be built from commodity hardware
•   IT cost reduction
    •   Results
        •   Be one step ahead of the competition
        •   Stay there
Is Hadoop an alternative to an RDBMS?
 •   At the moment, Apache Hadoop is not a substitute for a database
 •   No relational model
 •   Key-value pairs
 •   Big Data
 •   Unstructured (text)
 •   Semi-structured (sequence / binary files)
 •   Structured (HBase, modeled after Google Bigtable)
 •   Works well together with an RDBMS
Hadoop Ecosystem
   ETL Tools           BI Reporting     RDBMS


Pig (Data   Flow)      Hive (SQL)        Sqoop


 MapReduce (Job     Scheduling/Execution System)

HBase (Key-Value store)



                        HDFS
        (Hadoop Distributed File System)
Hadoop Ecosystem
           Important components of Hadoop


•   HDFS: a distributed, fault-tolerant file system
•   MapReduce: a parallel data processing framework
•   Hive: an SQL-like query framework
•   Pig: a query scripting tool
•   HBase: realtime read/write access to your Big Data
Hadoop Ecosystem
Hadoop is a Distributed Data Computing Platform
HDFS




NameNode/DataNode interaction in HDFS. The NameNode keeps track of the file
metadata: which files are in the system and how each file is broken down into blocks. The
DataNodes provide the backing store for the blocks and constantly report to the NameNode to keep the
metadata current.
Hadoop Cluster
Writing Files To HDFS


•   Client consults NameNode
•   Client writes block directly to one DataNode
•   DataNode replicates block
•   Cycle repeats for next block
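The write steps above can be sketched as a small in-memory simulation. This is only an illustration, not the HDFS client API: the class names, the `write_file` helper, the replication factor of 3, and the random choice of target DataNodes are all assumptions made for the example.

```python
import random


class NameNode:
    """Toy NameNode: records which DataNodes hold each block of each file."""

    def __init__(self, datanode_names, replication=3):
        self.datanode_names = list(datanode_names)
        self.replication = replication
        self.metadata = {}  # filename -> list of (block_id, [DataNode names])

    def allocate_block(self, filename):
        """Step 1: the client asks where the next block should go."""
        blocks = self.metadata.setdefault(filename, [])
        block_id = f"{filename}#blk{len(blocks)}"
        # Assumption: targets are picked at random, ignoring racks and load.
        targets = random.sample(self.datanode_names, self.replication)
        blocks.append((block_id, targets))
        return block_id, targets


class DataNode:
    """Toy DataNode: stores blocks and forwards them down the pipeline."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data, pipeline, nodes):
        self.blocks[block_id] = data
        if pipeline:
            # Step 3: the DataNode, not the client, replicates the block.
            nodes[pipeline[0]].store(block_id, data, pipeline[1:], nodes)


def write_file(namenode, nodes, filename, data, block_size):
    """Split data into blocks and repeat the cycle for each block (step 4)."""
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        block_id, targets = namenode.allocate_block(filename)
        # Step 2: the client writes the block to the first DataNode only.
        nodes[targets[0]].store(block_id, chunk, targets[1:], nodes)


nodes = {f"DN{i}": DataNode(f"DN{i}") for i in range(1, 6)}
namenode = NameNode(nodes.keys())
write_file(namenode, nodes, "File.txt", b"abcdefghij", block_size=4)
for block_id, targets in namenode.metadata["File.txt"]:
    print(block_id, targets)
```

The point of the design is that the client sends each block once; the replicas are created by the DataNodes passing the block along.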
Reading Files From HDFS




•   Client consults NameNode
•   Client receives DataNode list for each block
•   Client picks first DataNode for each block
•   Client reads blocks sequentially
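The read steps above can be sketched in the same spirit. The `read_file` function and the two dictionaries are hypothetical stand-ins for the NameNode's answer and the DataNodes' storage; no real HDFS API is used.

```python
def read_file(block_locations, datanodes):
    """Reassemble a file from its blocks.

    block_locations: ordered list of (block_id, [DataNode names]), standing in
        for the list the NameNode returns to the client.
    datanodes: DataNode name -> {block_id: bytes}, standing in for the cluster.
    """
    data = bytearray()
    for block_id, locations in block_locations:  # blocks are read sequentially
        first = locations[0]                     # client picks the first DataNode
        data += datanodes[first][block_id]
    return bytes(data)


# Hypothetical cluster state: two blocks, each stored on two DataNodes.
block_locations = [("blk_A", ["DN1", "DN5"]), ("blk_B", ["DN2", "DN1"])]
datanodes = {
    "DN1": {"blk_A": b"hello ", "blk_B": b"world"},
    "DN2": {"blk_B": b"world"},
    "DN5": {"blk_A": b"hello "},
}
content = read_file(block_locations, datanodes)
print(content)
```

Note that the NameNode only hands out locations; the block data itself flows directly between the client and the DataNodes.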
Rack Awareness & Fault Tolerance

NameNode

Rack Aware
    Rack 1: DN1, DN2, DN3, DN5
    Rack 5: DN5, DN6, DN7, DN8
    Rack N

Metadata
    File.txt
    Blk A: DN1, DN5, DN6
    Blk B: DN1, DN2, DN9
    Blk C: DN5, DN9, DN10
•   Never lose all data if an entire rack fails
•   Within a rack, bandwidth is higher and latency is lower
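A rack-aware placement rule can be sketched as follows. This is a simplified policy written for illustration (first replica on the writer's node, the other two on a single remote rack), not Hadoop's actual placement code; the rack layout, the `place_replicas` function, and its parameters are assumptions.

```python
import random


def place_replicas(racks, writer_node, rng=random):
    """Choose three DataNodes so that the replicas span two racks.

    racks: rack name -> list of DataNode names (hypothetical layout).
    writer_node: the DataNode that receives the first replica.
    """
    rack_of = {dn: rack for rack, dns in racks.items() for dn in dns}
    local_rack = rack_of[writer_node]
    # A remote rack needs at least two nodes to host replicas two and three.
    remote_racks = [r for r, dns in racks.items()
                    if r != local_rack and len(dns) >= 2]
    if not remote_racks:
        raise ValueError("need a second rack with at least two DataNodes")
    remote_rack = rng.choice(remote_racks)
    # Two replicas share the remote rack, so copying between them stays in-rack.
    second, third = rng.sample(racks[remote_rack], 2)
    return [writer_node, second, third]


# Hypothetical layout, loosely following the slide's rack names.
racks = {
    "Rack 1": ["DN1", "DN2", "DN3", "DN4"],
    "Rack 5": ["DN5", "DN6", "DN7", "DN8"],
}
replicas = place_replicas(racks, "DN1")
print(replicas)
```

Because the replicas always sit on two different racks, losing any single rack leaves at least one copy of the block, while two of the three copies still benefit from in-rack bandwidth.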
Cluster Health
MapReduce-Paradigm
•   Simplified Data Processing on Large Clusters
•   Splitting a big problem/data set into little pieces
•   Key-Value
MapReduce-Batch Processing
•   Phases
    •   Map
    •   Sort/Shuffle
    •   Reduce (aggregation)
•   Coordination
    •   JobTracker
    •   TaskTracker
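The three phases can be illustrated with a word count that runs in a single Python process. It only mimics the data flow; the function names and the sample input are invented for the example and are not part of the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (key, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(mapped_pairs):
    """Sort/Shuffle: bring all values for the same key together, keys sorted."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: aggregate the values of one key (here, by summing them)."""
    return key, sum(values)


def run_job(splits):
    """Run map over every input split, then shuffle, then reduce."""
    mapped = [pair for split in splits for line in split
              for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Three input splits, standing in for data held on three DataNodes.
splits = [["a b a"], ["b c"], ["a c c"]]
counts = run_job(splits)
print(counts)
```

In a real cluster each split is mapped on the node that holds it, and the shuffle moves the pairs over the network so that one reducer sees every value for a given key.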
MapReduce-Map
[Figure: on each of Datanode 1, Datanode 2, and Datanode 3, a MAP task emits four key-value (K, V) pairs, each with the value 1.]
MapReduce-Sort/Shuffle
[Figure: a SORT step on each of Datanode 1, Datanode 2, and Datanode 3 regroups the twelve emitted 1s.]
MapReduce-Reduce
[Figure: after SORT, the REDUCE tasks on Datanode 1, Datanode 2, and Datanode 3 aggregate the 1s into key-value (K, V) results with the values 4, 2, 3, and 3.]
MapReduce-All Phases
[Figure: all phases together. Three MAP tasks emit 1s, SORT regroups them, and the REDUCE tasks output the values 4, 2, 3, and 3.]
MapReduce-Job & Task Tracker

[Figure labels: Namenode, Datanodes]



JobTracker and TaskTracker interaction. After a client calls the JobTracker to begin a data
processing job, the JobTracker partitions the work and assigns different map and reduce tasks
to each TaskTracker in the cluster
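The partition-and-assign step in the caption can be sketched as a round-robin assignment. This is a deliberate simplification made for the example: the `assign_tasks` function and the task and tracker names are hypothetical, and the sketch ignores data locality and tracker capacity.

```python
from itertools import cycle


def assign_tasks(tasks, trackers):
    """Hand out tasks to TaskTrackers in round-robin order.

    tasks: list of task names produced by partitioning the job.
    trackers: list of TaskTracker names in the cluster.
    """
    assignment = {tracker: [] for tracker in trackers}
    for task, tracker in zip(tasks, cycle(trackers)):
        assignment[tracker].append(task)
    return assignment


tasks = ["map-0", "map-1", "map-2", "reduce-0"]
trackers = ["tasktracker-1", "tasktracker-2"]
assignment = assign_tasks(tasks, trackers)
print(assignment)
```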
Summary of HDFS and MR
Hive
•   Data warehousing package built on top of Hadoop
•   Began its life at Facebook, processing large amounts of user
    and log data
•   Hadoop subproject with many contributors
•   Ad hoc queries, summarization, and data analysis on Hadoop-scale data
•   Directly queries data in different formats (text/binary) and file
    formats (flat/sequence)
•   HiveQL: an SQL-like query language
Hive Components
[Figure: Hive components. Labels: Mgmt. Web UI; Hive CLI (Browsing, Queries, DDL); Thrift API; MetaStore; Hive QL (Parser, Planner, Execution); Map Reduce; HDFS.]
* Thrift: an interface definition language
Pig
•   Pig Latin: the language used to express data flows
•   Pig Latin can be extended with UDFs (User Defined Functions)
•   Pig was originally developed at Yahoo Research
•   PigPen is an Eclipse plug-in that provides an environment for
    developing Pig programs
•   Running Pig programs
    •   Script: a script file that contains Pig commands
    •   Grunt: an interactive shell
    •   Embedded: run from Java programs
Pig
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
      AS (year:chararray, temperature:int, quality:int);

grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}

grunt> filtered_records = FILTER records BY temperature != 22;
grunt> DUMP filtered_records;

grunt> grouped_records = GROUP records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
HBase
•   Random, realtime read/write access to your Big Data

•   Billions of rows × millions of columns

•   Column-oriented store modeled after Google's Bigtable

•   Provides Bigtable-like capabilities on top of Hadoop and HDFS

•   HBase is not a column-oriented database in the typical RDBMS
    sense, but utilizes an on-disk column storage format
HBase-Datamodel
•   (Table, RowKey, Family, Column, Timestamp) → Value

•   Think of tags. Values can be any length, with no predefined names or widths

•   Column names carry info (just like tags)
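The mapping (Table, RowKey, Family, Column, Timestamp) → Value can be modeled with a nested lookup in Python. The `ToyTable` class below is a hypothetical illustration of the data model only; it is not an HBase client, and the limit of three versions per cell is an assumption chosen for the example.

```python
class ToyTable:
    """In-memory model of one table: (row, family, column, timestamp) -> value."""

    def __init__(self, max_versions=3):
        # (row, family, column) -> list of (timestamp, value), newest first
        self.cells = {}
        self.max_versions = max_versions

    def put(self, row, family, column, value, timestamp):
        """Add a new version of a cell, keeping only the newest versions."""
        versions = self.cells.setdefault((row, family, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row, family, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        return self.cells.get((row, family, column), [])[:versions]


table = ToyTable()
# Column names are not declared up front; any name can be used per row.
table.put("row1", "cf", "a", "value11", timestamp=1)
table.put("row1", "cf", "a", "value12", timestamp=2)
table.put("row2", "cf", "b", "value2", timestamp=3)
latest = table.get("row1", "cf", "a")
history = table.get("row1", "cf", "a", versions=3)
print(latest, history)
```

Writing to the same row and column twice does not overwrite the old value; it adds a newer version, and a plain read returns only the newest one.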
Create Sample Table
hbase(main):003:0> create 'test', 'cf'
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value11'
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value12'
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
hbase(main):007:0> scan 'test'
ROW       COLUMN+CELL
row1     column=cf:a, timestamp=1288380727188, value=value12
row2     column=cf:b, timestamp=1288380738440, value=value2
row3     column=cf:c, timestamp=1288380747365, value=value3
hbase(main):007:0> scan 'test', { VERSIONS => 3 }
ROW       COLUMN+CELL
row1     column=cf:a, timestamp=1288380727188, value=value12
row1     column=cf:a, timestamp=1288380727188, value=value11
row2     column=cf:b, timestamp=1288380738440, value=value2
row3     column=cf:c, timestamp=1288380747365, value=value3
HBase-Architecture
•   Splits

•   Auto-Sharding

•   Master

•   Region Servers

•   HFile
Splits & RegionServers




•   Rows grouped in regions and served by different servers
•   Table dynamically split into “regions”
•   Each region contains values [startKey, endKey)
•   Regions hosted on a regionserver
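Locating the region for a row key comes down to a search over sorted start keys, because each region covers the half-open range [startKey, endKey). The sketch below uses Python's `bisect` module; the `RegionMap` class, the split keys, and the server names are assumptions made for illustration and do not reflect how HBase stores its region catalog.

```python
import bisect


class RegionMap:
    """Maps row keys to the region (and server) whose key range contains them."""

    def __init__(self, split_keys, servers):
        # The first region starts at the empty key; every split key starts a new one.
        self.start_keys = [""] + sorted(split_keys)
        if len(servers) != len(self.start_keys):
            raise ValueError("need exactly one server entry per region")
        self.servers = list(servers)

    def locate(self, row_key):
        """Return (start_key, server) for the region holding row_key."""
        # bisect_right makes the start key inclusive and the end key exclusive.
        index = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.start_keys[index], self.servers[index]


# Hypothetical table split at "g" and "p", giving three regions.
region_map = RegionMap(["g", "p"], ["regionserver-1", "regionserver-2", "regionserver-3"])
for key in ["apple", "g", "orange", "zebra"]:
    print(key, region_map.locate(key))
```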
HBase-Architecture
Other Components
•   Flume

•   Sqoop
Commercial Products
•   Oracle Big Data Appliance

•   Microsoft Azure + Excel + MapReduce

•   Cloud computing, Amazon elastic computing

•   IBM Hadoop-based InfoSphere BigInsights

•   VMware Spring for Apache Hadoop

•   Toad for Cloud Databases

•   MapR, Cloudera, Hortonworks, Datameer
Thank You



Faruk Berksöz
fberksoz@gmail.com

Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 

Hadoop hbase mapreduce

  • 1.
  • 2. What is Big Data? ● How big is “Big Data”? ● Is 30-40 terabytes big data? ● …. ● Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools ● Today: terabytes, petabytes, exabytes ● Tomorrow?
  • 3. Enterprises & Big Data ● Most companies are currently using traditional tools to store data ● Big data: The next frontier for innovation, competition, and productivity ● The use of big data will become a key basis of competition ● Organisations across the globe need to take the rising importance of big data more seriously
  • 4. Hadoop is an ecosystem, not a single product. When you deal with Big Data, the data center is your computer.
  • 5. A Brief History of Hadoop • Contributors and Development • What is Hadoop • Why Hadoop • Hadoop Ecosystem
  • 6. A Brief History of Hadoop • Hadoop has its origins in Apache Nutch • Nutch was started in 2002 • Challenge: the billions of pages on the Web? • 2003 GFS (Google File System) • 2004 NDFS (Nutch Distributed File System) • 2004 Google published the MapReduce paper • 2005 Nutch developers started developing MapReduce
  • 8. Contributors and Development Lifetime patches contributed for all Hadoop-related projects: community members by current employer * Source: JIRA tickets
  • 10. Contributors and Development * Source: Kerberos Konference (Yahoo) – 2010
  • 11. Development in ASF/Hadoop ● Resources ● Mailing list ● Wiki pages, blogs ● Issue tracking – JIRA ● Version control: SVN – Git
  • 13. What is Hadoop • Open-source project administered by the ASF • Data-intensive storage • and Massively Parallel Processing (MPP) • Enables applications to work with thousands of nodes and petabytes of data • Suitable for applications with large data sets
  • 14. What is Hadoop? • Scalable • Fault tolerant • Reliable data storage using the Hadoop Distributed File System (HDFS) • High-performance parallel data processing using a technique called MapReduce
  • 15. What is Hadoop? • Hadoop is becoming the de facto standard for large-scale data processing • Becoming more than just MapReduce • The ecosystem is growing rapidly, with lots of great tools around it
  • 16. What is Hadoop? Yahoo Hadoop cluster: 38,000 machines distributed across 20 different clusters (source: Yahoo, 2010); 50,000 m (January 2012). Source: http://www.computerworlduk.com/in-depth/applications/3329092/hadoop-could-save-you-money-over-a-traditional-rdbms/ [Image: SGI Hadoop Cluster]
  • 21. Why Hadoop? • Hadoop has its origins in Apache Nutch • Can process Big Data (petabytes and more) • Unlimited data storage and analysis • No licence cost: Apache License 2.0 • Can be built from commodity hardware • IT cost reduction • Results • Be one step ahead of the competition • Stay there
  • 22. Is Hadoop an alternative to an RDBMS? • At the moment Apache Hadoop is not a substitute for a database • No relations • Key-value pairs • Big Data • Unstructured (text) • Semi-structured (Seq / binary files) • Structured (HBase, modeled on Google Bigtable) • Works fine together with RDBMSs
  • 24. Hadoop Ecosystem [Diagram, top to bottom: ETL Tools, BI Reporting, RDBMS; Pig (Data Flow), Hive (SQL), Sqoop; MapReduce (Job Scheduling/Execution System); HBase (Key-Value store); HDFS (Hadoop Distributed File System)]
  • 25. Hadoop Ecosystem Important components of Hadoop • HDFS: a distributed, fault-tolerant file system • MapReduce: a parallel data processing framework • Hive: a query framework (like SQL) • Pig: a query scripting tool • HBase: realtime read/write access to your Big Data
  • 26. Hadoop Ecosystem Hadoop is a Distributed Data Computing Platform
  • 27. HDFS
  • 28. HDFS NameNode/DataNode interaction in HDFS. The NameNode keeps track of the file metadata: which files are in the system and how each file is broken down into blocks. The DataNodes provide backup store of the blocks and constantly report to the NameNode to keep the metadata current.
  • 30. Writing Files To HDFS • Client consults NameNode • Client writes block directly to one DataNode • DataNode replicates block • Cycle repeats for next block
  • 31. Reading Files From HDFS • Client consults NameNode • Client receives DataNode list for each block • Client picks first DataNode for each block • Client reads blocks sequentially
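The write and read steps on slides 30 and 31 can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions, not the HDFS client API: the class and function names (`NameNode`, `DataNode`, `write_file`, `read_file`), the 4-byte block size, the replication factor of 3, and the round-robin placement are all hypothetical choices made for the demo.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4    # bytes per block; tiny on purpose (real HDFS blocks are far larger)
REPLICATION = 3   # replicas per block (assumed for this sketch)


@dataclass
class DataNode:
    name: str
    blocks: dict = field(default_factory=dict)  # block_id -> bytes


@dataclass
class NameNode:
    datanode_names: list
    files: dict = field(default_factory=dict)   # path -> [(block_id, [DataNode names])]
    cursor: int = 0

    def allocate_block(self, path):
        """Record a new block for `path` and choose its DataNodes (round-robin)."""
        n = len(self.datanode_names)
        targets = [self.datanode_names[(self.cursor + i) % n] for i in range(REPLICATION)]
        self.cursor += 1
        blocks = self.files.setdefault(path, [])
        block_id = f"{path}#{len(blocks)}"
        blocks.append((block_id, targets))
        return block_id, targets

    def locate(self, path):
        """Return the block list and the DataNode list for each block."""
        return self.files[path]


def write_file(namenode, datanodes, path, data):
    for offset in range(0, len(data), BLOCK_SIZE):
        block_id, targets = namenode.allocate_block(path)   # client consults NameNode
        first = datanodes[targets[0]]
        first.blocks[block_id] = data[offset:offset + BLOCK_SIZE]  # write to one DataNode
        for name in targets[1:]:                             # that DataNode replicates
            datanodes[name].blocks[block_id] = first.blocks[block_id]
        # the cycle repeats for the next block


def read_file(namenode, datanodes, path):
    # Client picks the first DataNode listed for each block and reads sequentially.
    return b"".join(datanodes[targets[0]].blocks[block_id]
                    for block_id, targets in namenode.locate(path))
```

Note that the NameNode in this sketch never touches file contents; it only hands out metadata, which mirrors the division of labour described on slide 28.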
  • 32. Rack Awareness & Fault Tolerance [Diagram: NameNode rack-aware metadata. Rack 1: DN1, DN2, DN3, DN5; Rack 5: DN5, DN6, DN7, DN8; Rack N. File.txt: Blk A on DN1, DN5, DN6; Blk B on DN1, DN2, DN9; Blk C on DN5, DN9, DN10] • Never lose all data if an entire rack fails • Within a rack: higher bandwidth, lower latency
  • 34. Hadoop Ecosystem Important components of Hadoop • HDFS: a distributed, fault-tolerant file system • MapReduce: a parallel data processing framework • Hive: a query framework (like SQL) • Pig: a query scripting tool • HBase: a column-oriented database for OLTP
  • 35. MapReduce-Paradigm • Simplified data processing on large clusters • Splitting a big problem/data set into little pieces • Key-Value
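The claim that no data is lost when an entire rack fails can be checked with a short sketch. The block-to-DataNode map is taken from slide 32; the rack assignment is partly an assumption: the slide lists DN5 under two racks and does not show where DN9 and DN10 live, so this sketch places DN5 in rack 5 and invents a hypothetical `rack9` for DN9 and DN10. The function names are illustrative only.

```python
# Rack membership: DN1-DN3 and DN5-DN8 follow the slide; "rack9" is hypothetical.
RACK_OF = {
    "DN1": "rack1", "DN2": "rack1", "DN3": "rack1",
    "DN5": "rack5", "DN6": "rack5", "DN7": "rack5", "DN8": "rack5",
    "DN9": "rack9", "DN10": "rack9",
}

# Replica locations for File.txt as listed on the slide.
BLOCKS = {
    "A": ["DN1", "DN5", "DN6"],
    "B": ["DN1", "DN2", "DN9"],
    "C": ["DN5", "DN9", "DN10"],
}


def surviving_replicas(replicas, failed_rack, rack_of=RACK_OF):
    """DataNodes that still hold the block after `failed_rack` goes down."""
    return [dn for dn in replicas if rack_of[dn] != failed_rack]


def lost_blocks(blocks, failed_rack, rack_of=RACK_OF):
    """Blocks with no surviving replica after `failed_rack` goes down."""
    return sorted(block for block, replicas in blocks.items()
                  if not surviving_replicas(replicas, failed_rack, rack_of))
```

Under these assumptions every block spans two racks, so `lost_blocks(BLOCKS, rack)` is empty for any single rack, while a block whose replicas all sit in one rack would be lost with it.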
  • 36. MapReduce-Batch Processing • Phases • Map • Sort/Shuffle • Reduce (Aggregation) • Coordination • Job Tracker • Task Tracker
  • 37. MapReduce-Map [Diagram: a MAP task on each of Datanodes 1-3 emits key-value (K, V) pairs with value 1]
  • 38. MapReduce-Sort/Shuffle [Diagram: a SORT step on each of Datanodes 1-3 groups the emitted pairs]
  • 39. MapReduce-Reduce [Diagram: on Datanodes 1-3, REDUCE aggregates the sorted pairs into (K, V) totals of 4, 3, and 3]
  • 40. MapReduce-All Phases [Diagram: MAP, SORT, and REDUCE chained on each node, ending in the aggregated totals]
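The three phases on slides 36 to 40 can be sketched in plain Python. This is a single-process illustration of the paradigm, not Hadoop's MapReduce API; the word-count job and the function names are assumptions chosen for the example.

```python
from itertools import groupby


def map_phase(split):
    """Map: emit a (key, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle_sort(mapped_outputs):
    """Sort/Shuffle: bring all pairs with the same key together."""
    pairs = sorted(pair for output in mapped_outputs for pair in output)
    return {key: [value for _, value in group]
            for key, group in groupby(pairs, key=lambda kv: kv[0])}


def reduce_phase(grouped):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(splits):
    mapped = [map_phase(split) for split in splits]  # one map task per split
    return reduce_phase(shuffle_sort(mapped))
```

For example, `run_job(["a b a", "b c", "a c c"])` treats each string as the split held by one node and returns the per-key totals.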
  • 41. MapReduce-Job & Task Tracker [Diagram labels: NameNode, DataNodes] JobTracker and TaskTracker interaction. After a client calls the JobTracker to begin a data processing job, the JobTracker partitions the work and assigns different map and reduce tasks to each TaskTracker in the cluster.
  • 42. Summary of HDFS and MR
  • 44. Hive
  • 45. Hive • Data warehousing package built on top of Hadoop • It began its life at Facebook, processing large amounts of user and log data • Hadoop subproject with many contributors • Ad hoc queries, summarization, and data analysis on Hadoop-scale data • Directly query data in different formats (text/binary) and file formats (flat/sequence) • HiveQL: like SQL
  • 46. Hive Components [Diagram: Hive CLI (Browsing, Queries, DDL), Mgmt. Web UI, Thrift API; Hive QL (Parser, Planner, Execution); MetaStore; running on Map Reduce and HDFS] * Thrift: Interface Definition Language
  • 48. Pig • Data flows are expressed in a language called Pig Latin • Pig Latin can be extended using UDFs (User Defined Functions) • Originally developed at Yahoo Research • PigPen is an Eclipse plug-in that provides an environment for developing Pig programs • Running Pig programs • Script: a script file that contains Pig commands • Grunt: interactive shell • Embedded: Java
  • 49. Pig
      grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int);
      grunt> DUMP records;
      (1950,0,1)
      (1950,22,1)
      (1950,-11,1)
      (1949,111,1)
      (1949,78,1)
      grunt> DESCRIBE records;
      records: {year: chararray,temperature: int,quality: int}
      grunt> filtered_records = FILTER records BY temperature != 22;
      grunt> DUMP filtered_records;
      grunt> grouped_records = GROUP records BY year;
      grunt> DUMP grouped_records;
      (1949,{(1949,111,1),(1949,78,1)})
      (1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
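For readers who do not know Pig Latin, the FILTER and GROUP steps of the Grunt session on slide 49 behave roughly like the following Python. This is an analogy only, not how Pig executes the script (Pig compiles data flows into MapReduce jobs); the helper names `filter_records` and `group_by` are made up for the illustration, and the rows are the five sample records shown on the slide.

```python
from collections import defaultdict

# The five sample rows from the slide: (year, temperature, quality).
records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
]


def filter_records(rows, predicate):
    """Rough equivalent of FILTER ... BY: keep rows matching the predicate."""
    return [row for row in rows if predicate(row)]


def group_by(rows, key_index):
    """Rough equivalent of GROUP ... BY: map each key to the bag of its rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(sorted(groups.items()))


filtered_records = filter_records(records, lambda row: row[1] != 22)
grouped_records = group_by(records, 0)
```

As in the session, the grouping runs on the unfiltered `records`, so the 1950 group still contains the row with temperature 22.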
  • 51. HBase • Random, realtime read/write access to your Big Data • Billions of rows × millions of columns • Column-oriented store modeled after Google's Bigtable • Provides Bigtable-like capabilities on top of Hadoop and HDFS • HBase is not a column-oriented database in the typical RDBMS sense, but utilizes an on-disk column storage format
  • 52. HBase-Datamodel • (Table, RowKey, Family, Column, Timestamp) → Value • Think of tags: values of any length, no predefined names or widths • Column names carry information (just like tags)
  • 53. HBase-Datamodel • (Table, RowKey, Family, Column, Timestamp) → Value
  • 54. HBase-Datamodel • (Table, RowKey, Family, Column, Timestamp) → Value
  • 55. Create Sample Table
      hbase(main):003:0> create 'test', 'cf'
      hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value11'
      hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value12'
      hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
      hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
      hbase(main):007:0> scan 'test'
      ROW     COLUMN+CELL
      row1    column=cf:a, timestamp=1288380727188, value=value12
      row2    column=cf:b, timestamp=1288380738440, value=value2
      row3    column=cf:c, timestamp=1288380747365, value=value3
      hbase(main):007:0> scan 'test', { VERSIONS => 3 }
      ROW     COLUMN+CELL
      row1    column=cf:a, timestamp=1288380727188, value=value12
      row1    column=cf:a, timestamp=1288380727188, value=value11
      row2    column=cf:b, timestamp=1288380738440, value=value2
      row3    column=cf:c, timestamp=1288380747365, value=value3
  • 56. HBase-Architecture • Splits • Auto-Sharding • Master • Region Servers • HFile
  • 57. Splits & RegionServers • Rows are grouped in regions and served by different servers • Tables are dynamically split into “regions” • Each region contains values [startKey, endKey) • Regions are hosted on a RegionServer
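The `(Table, RowKey, Family, Column, Timestamp) → Value` model and the `VERSIONS => 3` scan can be mimicked with a small in-memory class. This is a conceptual sketch of one table, not the HBase client API: `VersionedTable`, its methods, and the counter used as a stand-in for timestamps are all assumptions made for the example.

```python
import itertools


class VersionedTable:
    """One table: (row, 'family:qualifier') -> versions, newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.cells = {}                    # (row, column) -> [(timestamp, value), ...]
        self._clock = itertools.count(1)   # stand-in for wall-clock timestamps

    def put(self, row, column, value, timestamp=None):
        ts = next(self._clock) if timestamp is None else timestamp
        versions = self.cells.setdefault((row, column), [])
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first
        del versions[self.max_versions:]                   # drop versions beyond the limit

    def get(self, row, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        return self.cells.get((row, column), [])[:versions]
```

Writing `value11` and then `value12` to `('row1', 'cf:a')` keeps both: a default `get` returns only the newest, while `get(..., versions=3)` returns both, which is the behaviour the two scans on the slide illustrate.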
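Because each region covers a half-open key range `[startKey, endKey)`, finding the region for a row key is a sorted lookup on start keys. The sketch below illustrates that idea only; the three regions, their split points `g` and `p`, and the server names `rs1` to `rs3` are hypothetical, and this is not how an HBase client actually locates regions.

```python
from bisect import bisect_right

# (start_key, region_server), sorted by start key; "" means "from the first row".
REGIONS = [("", "rs1"), ("g", "rs2"), ("p", "rs3")]
START_KEYS = [start for start, _ in REGIONS]


def locate_region(row_key):
    """Return (server, start_key, end_key) for the region holding `row_key`.

    end_key is None for the last region, which has no upper bound.
    """
    index = bisect_right(START_KEYS, row_key) - 1
    start = START_KEYS[index]
    end = START_KEYS[index + 1] if index + 1 < len(START_KEYS) else None
    return REGIONS[index][1], start, end
```

A key equal to a split point, such as `"g"`, lands in the region that starts there, since the start key is inclusive and the end key exclusive.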
  • 59. Other Components • Flume • Sqoop
  • 60. Commercial Products • Oracle Big Data Appliance • Microsoft Azure + Excel + MapReduce • Cloud computing, Amazon elastic computing • IBM Hadoop-based InfoSphere BigInsights • VMware Spring for Apache Hadoop • Toad for Cloud Databases • MapR, Cloudera, Hortonworks, Datameer