SlideShare una empresa de Scribd logo
1 de 30
Intro to Big Data using Hadoop




                       Sergejus Barinovas
                       sergejus.blogas.lt
                       fb.com/ITishnikai
                       @sergejusb
Information is powerful…
but it is how we use it that will define us
Data Explosion


                                      text
                                   audio
                                    video
                                  images



                              relational



                 picture from Big Data Integration
Big Data (globally)


– creates over 30 billion pieces of content per day

– stores 30 petabytes of data



– produces over 90 million tweets per day
Big Data (our example)



– logs over 300 gigabytes of transactions per day

– stores more than 1,5 terabyte of aggregated data
4 Vs of Big Data


              volume
               volume
              velocity
               velocity
              variety
               variety
               variability
              variability
Big Data Challenges


Sort 10TB on 1 node = 2,5 days

   100-node cluster = 35 mins
Big Data Challenges

“Fat” servers implies high cost
 – use cheap commodity nodes instead


Large # of cheap nodes implies often failures
 – leverage automatic fault-tolerance
                      fault-tolerance
Big Data Challenges



We need new data-parallel programming
model for clusters of commodity machines
MapReduce
to the rescue!
MapReduce


Published in 2004 by Google
 – MapReduce: Simplified Data Processing on Large Clusters




Popularized by Apache Hadoop project
 – used by Yahoo!, Facebook, Twitter, Amazon, …
MapReduce
Word Count Example
 Input       Map   Shuffle & Sort   Reduce   Output


the quick                                    the, 3
  brown      Map                             brown, 2
   fox                              Reduce   fox, 2
                                             how, 1
                                             now, 1
 the fox
 ate the     Map
 mouse
                                             quick, 1
                                             ate, 1
                                    Reduce   mouse, 1
how now
 brown       Map                             cow, 1
  cow
Word Count Example
 Input      Map      Shuffle & Sort       Reduce   Output

                  the, 1       the, 1
the quick         quick, 1     brown, 1
  brown     Map   brown, 1     fox, 1
   fox            fox, 1       the, 1     Reduce
                               fox, 1
                               the, 1
                  the, 1       how, 1
 the fox          fox, 1       now, 1
 ate the    Map   ate, 1       brown, 1
 mouse            the, 1
                  mouse, 1
                               quick, 1
                               ate, 1
                  how, 1       mouse, 1   Reduce
how now           now, 1       cow, 1
 brown      Map   brown, 1
  cow             cow, 1
Word Count Example
 Input      Map   Shuffle & Sort           Reduce   Output

                            the, [1,1,1]
the quick                   brown, [1,1]            the, 3
  brown     Map             fox, [1,1]              brown, 2
   fox                      how, [1]       Reduce   fox, 2
                            now, [1]                how, 1
                                                    now, 1
 the fox
 ate the    Map
 mouse
                            quick, [1]              quick, 1
                            ate, [1]                ate, 1
                            mouse, [1]     Reduce   mouse, 1
how now
                            cow, [1]                cow, 1
 brown      Map
  cow
MapReduce philosophy
 – hide complexity

 – make it scalable

 – make it cheap
MapReduce popularized by

 Apache Hadoop project
Hadoop Overview

Open source implementation of
 – Google MapReduce paper

 – Google File System (GFS) paper


First release in 2008 by Yahoo!
 – wide adoption by Facebook, Twitter, Amazon, etc.
Hadoop Core



MapReduce (Job Scheduling / Execution System)


    Hadoop Distributed File System (HDFS)
Hadoop Core (HDFS)



     MapReduce (Job Scheduling / Execution System)

          Hadoop Distributed File System (HDFS)

• Name Node stores file metadata
• files split into 64 MB blocks
• blocks replicated across 3 Data Nodes
Hadoop Core (HDFS)



MapReduce (Job Scheduling / Execution System)

    Hadoop Distributed File System (HDFS)



  Name Node                    Data Node
Hadoop Core (MapReduce)
• Job Tracker distributes tasks and handles failures
• tasks are assigned based on data locality
• Task Trackers can execute multiple tasks


     MapReduce (Job Scheduling / Execution System)
          Hadoop Distributed File System (HDFS)



       Name Node                      Data Node
Hadoop Core (MapReduce)

  Job Tracker                 Task Tracker




MapReduce (Job Scheduling / Execution System)
    Hadoop Distributed File System (HDFS)



  Name Node                    Data Node
Hadoop Core (Job submission)

                           Task Tracker
Client




            Job Tracker




            Name Node      Data Node
Hadoop Ecosystem

            Pig (ETL)          Hive (BI)       Sqoop (RDBMS)


            MapReduce (Job Scheduling / Execution System)
Zookeeper




                                                               Avro
                  HBase


                 Hadoop Distributed File System (HDFS)
JavaScript MapReduce
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
Pig

words = LOAD '/example/count' AS (
      word: chararray,
      count: int
);
popular_words = ORDER words BY count DESC;
top_popular_words = LIMIT popular_words 10;
DUMP top_popular_words;
Hive
CREATE EXTERNAL TABLE WordCount (
      word string,
      count int
)
ROW FORMAT DELIMITED
      FIELDS TERMINATED BY 't'
      LINES TERMINATED BY 'n'
STORED AS TEXTFILE
LOCATION "/example/count";

SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;
Über Demo
Demo
Hadoop in the Cloud
Thanks!
Questions?

Más contenido relacionado

La actualidad más candente

InfiniCortex and the Renaissance in Polish Supercomputing
InfiniCortex and the Renaissance in Polish Supercomputing InfiniCortex and the Renaissance in Polish Supercomputing
InfiniCortex and the Renaissance in Polish Supercomputing inside-BigData.com
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopbigdatasyd
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceDenis Shestakov
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Cloudera, Inc.
 
Hadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsHadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsChien Chung Shen
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using PigDavid Wellman
 
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Zekeriya Besiroglu
 
Distributed computing the Google way
Distributed computing the Google wayDistributed computing the Google way
Distributed computing the Google wayEduard Hildebrandt
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To HadoopAdeel Ahmad
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache HadoopOleksiy Krotov
 
The Family of Hadoop
The Family of HadoopThe Family of Hadoop
The Family of HadoopNam Nham
 
Hive and data analysis using pandas
 Hive  and  data analysis  using pandas Hive  and  data analysis  using pandas
Hive and data analysis using pandasPurna Chander K
 
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Cloudera, Inc.
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zingzingopen
 
Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleDataWorks Summit
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingMitsuharu Hamba
 

La actualidad más candente (20)

InfiniCortex and the Renaissance in Polish Supercomputing
InfiniCortex and the Renaissance in Polish Supercomputing InfiniCortex and the Renaissance in Polish Supercomputing
InfiniCortex and the Renaissance in Polish Supercomputing
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practice
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
 
Hadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsHadoop Essential for Oracle Professionals
Hadoop Essential for Oracle Professionals
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using Pig
 
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
 
Distributed computing the Google way
Distributed computing the Google wayDistributed computing the Google way
Distributed computing the Google way
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache Hadoop
 
The Family of Hadoop
The Family of HadoopThe Family of Hadoop
The Family of Hadoop
 
Hive and data analysis using pandas
 Hive  and  data analysis  using pandas Hive  and  data analysis  using pandas
Hive and data analysis using pandas
 
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zing
 
Hadoop at Rakuten, 2011/07/06
Hadoop at Rakuten, 2011/07/06Hadoop at Rakuten, 2011/07/06
Hadoop at Rakuten, 2011/07/06
 
Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at Scale
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
 

Destacado

YARN - Hadoop's Resource Manager
YARN - Hadoop's Resource ManagerYARN - Hadoop's Resource Manager
YARN - Hadoop's Resource ManagerVertiCloud Inc
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in SearchAmund Tveit
 
How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014James Chittenden
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processingsscdotopen
 
The Google File System (GFS)
The Google File System (GFS)The Google File System (GFS)
The Google File System (GFS)Romain Jacotin
 

Destacado (6)

YARN - Hadoop's Resource Manager
YARN - Hadoop's Resource ManagerYARN - Hadoop's Resource Manager
YARN - Hadoop's Resource Manager
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
 
How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
The Google File System (GFS)
The Google File System (GFS)The Google File System (GFS)
The Google File System (GFS)
 

Similar a Intro to Big Data using Hadoop

Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceDr Ganesh Iyer
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 
Tackling Big Data with the Elephant in the Room
Tackling Big Data with the Elephant in the RoomTackling Big Data with the Elephant in the Room
Tackling Big Data with the Elephant in the RoomBTI360
 
WOOster: A Map-Reduce based Platform for Graph Mining
WOOster: A Map-Reduce based Platform for Graph MiningWOOster: A Map-Reduce based Platform for Graph Mining
WOOster: A Map-Reduce based Platform for Graph Miningaravindan_raghu
 
データ解析技術入門(Hadoop編)
データ解析技術入門(Hadoop編)データ解析技術入門(Hadoop編)
データ解析技術入門(Hadoop編)Takumi Asai
 
Elephant in the cloud
Elephant in the cloudElephant in the cloud
Elephant in the cloudrhatr
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkThoughtWorks
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderDmitry Makarchuk
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupCsaba Toth
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptMaruthiPrasad96
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueShay Sofer
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalabilityWANdisco Plc
 
Mapreduse model
Mapreduse modelMapreduse model
Mapreduse modelKalyaniwan
 
12-BigDataMapReduce.pptx
12-BigDataMapReduce.pptx12-BigDataMapReduce.pptx
12-BigDataMapReduce.pptxShree Shree
 

Similar a Intro to Big Data using Hadoop (20)

Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
Tackling Big Data with the Elephant in the Room
Tackling Big Data with the Elephant in the RoomTackling Big Data with the Elephant in the Room
Tackling Big Data with the Elephant in the Room
 
WOOster: A Map-Reduce based Platform for Graph Mining
WOOster: A Map-Reduce based Platform for Graph MiningWOOster: A Map-Reduce based Platform for Graph Mining
WOOster: A Map-Reduce based Platform for Graph Mining
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop
HadoopHadoop
Hadoop
 
データ解析技術入門(Hadoop編)
データ解析技術入門(Hadoop編)データ解析技術入門(Hadoop編)
データ解析技術入門(Hadoop編)
 
Elephant in the cloud
Elephant in the cloudElephant in the cloud
Elephant in the cloud
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the Rescue
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
 
Mapreduse model
Mapreduse modelMapreduse model
Mapreduse model
 
2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure
 
12-BigDataMapReduce.pptx
12-BigDataMapReduce.pptx12-BigDataMapReduce.pptx
12-BigDataMapReduce.pptx
 

Más de Sergejus Barinovas

Bringing Developers to the Next Level
Bringing Developers to the Next LevelBringing Developers to the Next Level
Bringing Developers to the Next LevelSergejus Barinovas
 
True story of re architecting website for scale on windows azure
True story of re architecting website for scale on windows azureTrue story of re architecting website for scale on windows azure
True story of re architecting website for scale on windows azureSergejus Barinovas
 
Continuous Happiness by Continuous Delivery
Continuous Happiness by Continuous DeliveryContinuous Happiness by Continuous Delivery
Continuous Happiness by Continuous DeliverySergejus Barinovas
 
Windows Azure from practical point of view
Windows Azure from practical point of viewWindows Azure from practical point of view
Windows Azure from practical point of viewSergejus Barinovas
 
Flashback: QCon San Francisco 2012
Flashback: QCon San Francisco 2012Flashback: QCon San Francisco 2012
Flashback: QCon San Francisco 2012Sergejus Barinovas
 
Optimizing ASP.NET application performance: tough but necessary
Optimizing ASP.NET application performance: tough but necessaryOptimizing ASP.NET application performance: tough but necessary
Optimizing ASP.NET application performance: tough but necessarySergejus Barinovas
 
Kaip Agile skatina gerųjų praktikų panaudojimą
Kaip Agile skatina gerųjų praktikų panaudojimąKaip Agile skatina gerųjų praktikų panaudojimą
Kaip Agile skatina gerųjų praktikų panaudojimąSergejus Barinovas
 
Introduction to Windows Azure Platform
Introduction to Windows Azure PlatformIntroduction to Windows Azure Platform
Introduction to Windows Azure PlatformSergejus Barinovas
 
Moving applications to the cloud
Moving applications to the cloudMoving applications to the cloud
Moving applications to the cloudSergejus Barinovas
 
Cloud Computing and Microsoft Azure Platform
Cloud Computing and Microsoft Azure PlatformCloud Computing and Microsoft Azure Platform
Cloud Computing and Microsoft Azure PlatformSergejus Barinovas
 

Más de Sergejus Barinovas (15)

Bringing Developers to the Next Level
Bringing Developers to the Next LevelBringing Developers to the Next Level
Bringing Developers to the Next Level
 
True story of re architecting website for scale on windows azure
True story of re architecting website for scale on windows azureTrue story of re architecting website for scale on windows azure
True story of re architecting website for scale on windows azure
 
Continuous Happiness by Continuous Delivery
Continuous Happiness by Continuous DeliveryContinuous Happiness by Continuous Delivery
Continuous Happiness by Continuous Delivery
 
Windows Azure from practical point of view
Windows Azure from practical point of viewWindows Azure from practical point of view
Windows Azure from practical point of view
 
Flashback: QCon San Francisco 2012
Flashback: QCon San Francisco 2012Flashback: QCon San Francisco 2012
Flashback: QCon San Francisco 2012
 
Optimizing ASP.NET application performance: tough but necessary
Optimizing ASP.NET application performance: tough but necessaryOptimizing ASP.NET application performance: tough but necessary
Optimizing ASP.NET application performance: tough but necessary
 
Release Often Release Safely
Release Often Release SafelyRelease Often Release Safely
Release Often Release Safely
 
Kaip Agile skatina gerųjų praktikų panaudojimą
Kaip Agile skatina gerųjų praktikų panaudojimąKaip Agile skatina gerųjų praktikų panaudojimą
Kaip Agile skatina gerųjų praktikų panaudojimą
 
Introduction to Windows Azure Platform
Introduction to Windows Azure PlatformIntroduction to Windows Azure Platform
Introduction to Windows Azure Platform
 
Web Scale with NoSQL
Web Scale with NoSQLWeb Scale with NoSQL
Web Scale with NoSQL
 
Moving applications to the cloud
Moving applications to the cloudMoving applications to the cloud
Moving applications to the cloud
 
NoSQL - what's that
NoSQL - what's thatNoSQL - what's that
NoSQL - what's that
 
Demystifying HTML5
Demystifying HTML5Demystifying HTML5
Demystifying HTML5
 
Architecting Windows Azure
Architecting Windows AzureArchitecting Windows Azure
Architecting Windows Azure
 
Cloud Computing and Microsoft Azure Platform
Cloud Computing and Microsoft Azure PlatformCloud Computing and Microsoft Azure Platform
Cloud Computing and Microsoft Azure Platform
 

Último

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 

Último (20)

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 

Intro to Big Data using Hadoop

  • 1. Intro to Big Data using Hadoop Sergejus Barinovas sergejus.blogas.lt fb.com/ITishnikai @sergejusb
  • 2. Information is powerful… but it is how we use it that will define us
  • 3. Data Explosion text audio video images relational picture from Big Data Integration
  • 4. Big Data (globally) – creates over 30 billion pieces of content per day – stores 30 petabytes of data – produces over 90 million tweets per day
  • 5. Big Data (our example) – logs over 300 gigabytes of transactions per day – stores more than 1,5 terabyte of aggregated data
  • 6. 4 Vs of Big Data volume volume velocity velocity variety variety variability variability
  • 7. Big Data Challenges Sort 10TB on 1 node = 2,5 days 100-node cluster = 35 mins
  • 8. Big Data Challenges “Fat” servers implies high cost – use cheap commodity nodes instead Large # of cheap nodes implies often failures – leverage automatic fault-tolerance fault-tolerance
  • 9. Big Data Challenges We need new data-parallel programming model for clusters of commodity machines
  • 11. MapReduce Published in 2004 by Google – MapReduce: Simplified Data Processing on Large Clusters Popularized by Apache Hadoop project – used by Yahoo!, Facebook, Twitter, Amazon, …
  • 13. Word Count Example Input Map Shuffle & Sort Reduce Output the quick the, 3 brown Map brown, 2 fox Reduce fox, 2 how, 1 now, 1 the fox ate the Map mouse quick, 1 ate, 1 Reduce mouse, 1 how now brown Map cow, 1 cow
  • 14. Word Count Example Input Map Shuffle & Sort Reduce Output the, 1 the, 1 the quick quick, 1 brown, 1 brown Map brown, 1 fox, 1 fox fox, 1 the, 1 Reduce fox, 1 the, 1 the, 1 how, 1 the fox fox, 1 now, 1 ate the Map ate, 1 brown, 1 mouse the, 1 mouse, 1 quick, 1 ate, 1 how, 1 mouse, 1 Reduce how now now, 1 cow, 1 brown Map brown, 1 cow cow, 1
  • 15. Word Count Example Input Map Shuffle & Sort Reduce Output the, [1,1,1] the quick brown, [1,1] the, 3 brown Map fox, [1,1] brown, 2 fox how, [1] Reduce fox, 2 now, [1] how, 1 now, 1 the fox ate the Map mouse quick, [1] quick, 1 ate, [1] ate, 1 mouse, [1] Reduce mouse, 1 how now cow, [1] cow, 1 brown Map cow
  • 16. MapReduce philosophy – hide complexity – make it scalable – make it cheap
  • 17. MapReduce popularized by Apache Hadoop project
  • 18. Hadoop Overview Open source implementation of – Google MapReduce paper – Google File System (GFS) paper First release in 2008 by Yahoo! – wide adoption by Facebook, Twitter, Amazon, etc.
  • 19. Hadoop Core MapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS)
  • 20. Hadoop Core (HDFS) MapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS) • Name Node stores file metadata • files split into 64 MB blocks • blocks replicated across 3 Data Nodes
  • 21. Hadoop Core (HDFS) MapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS) Name Node Data Node
  • 22. Hadoop Core (MapReduce) • Job Tracker distributes tasks and handles failures • tasks are assigned based on data locality • Task Trackers can execute multiple tasks MapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS) Name Node Data Node
  • 23. Hadoop Core (MapReduce) Job Tracker Task Tracker MapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS) Name Node Data Node
  • 24. Hadoop Core (Job submission) Task Tracker Client Job Tracker Name Node Data Node
  • 25. Hadoop Ecosystem Pig (ETL) Hive (BI) Sqoop (RDBMS) MapReduce (Job Scheduling / Execution System) Zookeeper Avro HBase Hadoop Distributed File System (HDFS)
  • 26. JavaScript MapReduce var map = function (key, value, context) { var words = value.split(/[^a-zA-Z]/); for (var i = 0; i < words.length; i++) { if (words[i] !== "") { context.write(words[i].toLowerCase(), 1); } } }; var reduce = function (key, values, context) { var sum = 0; while (values.hasNext()) { sum += parseInt(values.next()); } context.write(key, sum); };
  • 27. Pig words = LOAD '/example/count' AS ( word: chararray, count: int ); popular_words = ORDER words BY count DESC; top_popular_words = LIMIT popular_words 10; DUMP top_popular_words;
  • 28. Hive CREATE EXTERNAL TABLE WordCount ( word string, count int ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' LINES TERMINATED BY 'n' STORED AS TEXTFILE LOCATION "/example/count"; SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;

Notas del editor

  1. So, which is really the “enterprise” now?
  2. Volume – exceeds physical limits of vertical scalabilityVelocity – decision window small compared to data change rateVariety – many different formats makes integration expensiveVariability – many options or variable interpretations confound analysis
  3. --run MapReducerunJs(&quot;/example/mr/WordCount.js&quot;, &quot;/example/data/davinci.txt&quot;, &quot;/example/count&quot;);--create Hive table for the existing dataCREATE EXTERNAL TABLE WordCount ( word string, count int)ROW FORMAT DELIMITED FIELDS TERMINATED BY &apos;\\t&apos; LINES TERMINATED BY &apos;\\n&apos; STORED AS TEXTFILELOCATION &quot;/example/count&quot;;--select top wordsSELECT * FROM WordCountORDER BY count DESC LIMIT 10;--execute Hive selecthive.exec(&quot;select * from WordCount order by count desc limit 10;&quot;);--execute LINQ style Hive queryhive.from(&quot;WordCount&quot;).orderBy(&quot;count DESC&quot;).take(10).run();--execute Pig scriptwords = LOAD &apos;/example/count&apos; AS ( word: chararray, count: int);popular_words = ORDER words by count DESC; top_popular_words = LIMIT popular_words 10;DUMP top_popular_words;--execute LINQ style Pig scriptpig.from(&quot;/example/count&quot;, &quot;word: chararray, count: int&quot;).orderBy(&quot;count DESC&quot;).take(10).run();
  4. --run MapReducerunJs(&quot;/example/mr/WordCount.js&quot;, &quot;/example/data/davinci.txt&quot;, &quot;/example/count&quot;);--create Hive table for the existing dataCREATE EXTERNAL TABLE WordCount ( word string, count int)ROW FORMAT DELIMITED FIELDS TERMINATED BY &apos;\\t&apos; LINES TERMINATED BY &apos;\\n&apos; STORED AS TEXTFILELOCATION &quot;/example/count&quot;;--select top wordsSELECT * FROM WordCountORDER BY count DESC LIMIT 10;--execute Hive selecthive.exec(&quot;select * from WordCount order by count desc limit 10;&quot;);--execute LINQ style Hive queryhive.from(&quot;WordCount&quot;).orderBy(&quot;count DESC&quot;).take(10).run();--execute Pig scriptwords = LOAD &apos;/example/count&apos; AS ( word: chararray, count: int);popular_words = ORDER words by count DESC; top_popular_words = LIMIT popular_words 10;DUMP top_popular_words;--execute LINQ style Pig scriptpig.from(&quot;/example/count&quot;, &quot;word: chararray, count: int&quot;).orderBy(&quot;count DESC&quot;).take(10).run();