Big Data / Hadoop 
Ciclo de Seminários 
MO655B – Gerência de Redes de Computadores 
Alunos: Flavio Vit 
Marco Aurelio Wolf 
Professor: Edmundo Madeira 
Dez/2014
Agenda 
 Big data 
 Hadoop 
 MapReduce 
 HDFS 
 Hadoop Ecosystem 
 Conclusion
Big Data Hot Topic 
Big Buzz Word
Big Data Hot Topic 
Big Buzz Word 
Web 
2.0 
SOA 
Social 
Networks 
Cloud 
Computing 
Big Data
Why is data getting bigger? 
 New devices generating data 
 Decreasing storage costs 
 Increasing processor speeds 
 Use of commodity hardware 
 Open source software usage
Data Sources 
 From Humans: 
 Blogs 
 Forums 
 Web Sites 
 Documents 
 Social Networks 
 From Machines: 
 Sensors 
 App logs 
 Web site tracking info 
 Household appliances 
 Hadoop MapReduce application results 
 Internet of Things (Computers, cell phones, cars …)
Big Data Drivers 
 Science (CERN 40TB / second) 
 Financial (Risk analysis) 
 Web (logs, online retail, cookies) 
 Social Media (Facebook, LinkedIn, Twitter) 
 Mobile devices (~6 Billion cell phones / Sensory data) 
 Internet of Things (Wearables / Sensors / Home Automation) 
 You!!!
Big Data Examples 
 Click trails 
 Books
Big Data 3V Model 
Variety 
Volume Velocity
Velocity 
 Data Concurrent access 
 Real time requirements 
 Illness detection 
 Traffic congestion for bus routes 
 Patient care – brain signal analysis 
 Huge amounts of new data are generated: 
• 500 million tweets/day 
• 1 million transactions per hour – Walmart 
• Every 60 seconds on Facebook: 
• 510 comments are posted 
• 293,000 statuses are updated 
• 136,000 photos are uploaded
Volume 
 Big data implies enormous 
volumes of data 
 Terabytes / Petabytes / … 
 Transactions: Walmart’s database estimated at 2.5+ petabytes. 
 100 terabytes of data uploaded daily to 
Facebook 
Data never sleeps…
Variety 
 Structured Data 
 Tables and well defined schemas (RDB) 
 Regular structures 
 Semi Structured Data 
 Irregular structures (XML) 
 Schemas are not mandatory 
 Unstructured Data 
 No specific data model (free text, emails, logs) 
 Heterogeneous data (audio, video) 
 All the above
Storage Scale 
 Storage is now cheap or free 
 More devices kicking off more data all the time 
 Average cost of storage per GB by year (chart): 
 2014 $0.03 
 2013 $0.05 
 2010 $0.09 
 2005 $1.24 
 2000 $11.00 
 1990 $11,200 
 1980 $437,500
Data Volume vs Disk Speed 

                          90s          00s           10s 
Capacity                  2.1 GB       200 GB        3000 GB 
Price                     US$ 157/GB   US$ 1.05/GB   US$ 0.05/GB 
Speed                     16.6 MB/s    56.5 MB/s     210 MB/s 
Time to read whole disk   126 sec      58 min        4 hours
Processing Scale 
 Analyzing large datasets requires distributed processing 
 Multiple concurrent access to a given dataset is 
required 
 Organizations sitting on decades of raw data 
 How to process such huge amounts of data?
How Big will it get? 
 Nobody knows! 
 Systems need to: 
 Scale horizontally and linearly 
 Be distributed from the start 
 Be cost-effective 
 Be easy to use
Hadoop History 
 In 2003, Doug Cutting was creating Nutch 
 An open source “Google” 
 Web crawler 
 Indexer 
 Crawling and indexing at this scale proved difficult 
 A massive storage and processing problem 
 In 2003 Google published the GFS paper, and in 2004 the MapReduce paper 
 Based on Google’s papers, Doug redesigned Nutch
What is Hadoop? 
 A framework of tools 
 Open source, maintained by Apache under the Apache License 
 Supports running applications on Big Data 
 Addresses the Big Data challenges: 
Variety 
Volume Velocity
Hadoop Main Attributes 
 Distributed Master/Slave Architecture 
 Fault-tolerant 
 Commodity Hardware 
 Written in Java 
 Mature language 
 Each daemon runs in a dedicated JVM 
 Abstracts away all infrastructure from the developer 
 Developers think in and code for processing individual records, or “Key->Value pairs”
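That record-oriented, key->value mindset can be illustrated with a tiny sketch (plain Python, not Hadoop API code; the "user_id,action" record format is invented for illustration):

```python
def to_key_value(record):
    """Turn one input record into a (key, value) pair.
    The "user_id,action" format here is invented for illustration."""
    user_id, _, action = record.partition(",")
    return user_id, action

# The developer reasons about one record at a time,
# never about the dataset as a whole
pairs = [to_key_value(r) for r in ["u1,login", "u2,click", "u1,logout"]]
print(pairs)  # [('u1', 'login'), ('u2', 'click'), ('u1', 'logout')]
```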
Hadoop Architecture - Main 
Components 
Hadoop Ecosystem 
MapReduce 
File System 
(HDFS)
Hadoop Architecture 
 Slaves 
 Task Tracker: executes a small piece of the main global task 
 Data Node: stores a small piece of the total data 
 Master, same as Slave plus: 
 Job Tracker: breaks the high-level job coming from the application into tasks and sends them to the appropriate Task Trackers. 
 Name Node: keeps an index tracking where, i.e. on which Data Node, each piece of the total data resides.
Hadoop Architecture 
[Diagram: the Master runs the Job Tracker (MapReduce) and the Name Node (HDFS); each Slave runs a Task Tracker and a Data Node. Applications submit jobs to a queue on the Master, which distributes work across the Slaves.]
Hadoop Daemons 
 Dedicated JVMs are created for the Hadoop daemons (Data Nodes, Task Trackers, Name Node and Job Tracker) as well as for the developer’s algorithm code. 
 Task tracker daemons are responsible for instantiating and 
populating these JVMs with the Mapping and Reducing 
code. 
 Hadoop Daemons and developer tasks are isolated from 
one another 
 Problems like “stack overflow”, “out of memory” are isolated 
and do not jump out of containers 
 Each has dedicated memory / independently tunable 
 Automatically “garbage collected”
Hadoop 
Easier life for programmers 
 Programmers don’t need to worry about: 
 Where files are located 
 How to manage failures 
 How to break computation into pieces 
 How to program for scaling
Why Hadoop? 
 Scalable 
 Breaks data into smaller equal pieces (blocks, typically 64/128 MB) 
 Breaks big computation task down into smaller individual 
tasks 
 More slaves, more processing and storage power 
 Cheap 
 Commodity hardware, open source software 
 Extremely fault tolerant 
 “Easy” to use
MapReduce 
 Programming Model initially developed by Google 
 For processing and generating large data sets 
 Parallel, distributed algorithms on clusters 
 Easy for programmers to use, hiding details of: 
 parallelization 
 fault-tolerance 
 locality optimization 
 load balancing
MapReduce 
 Scales to large clusters of machines (thousands of 
machines) 
 Easy to parallelize and distribute computations 
 Makes computations fault-tolerant 
 Tasks are executed where the data is located: 
 Optimizations for reducing the amount of data sent 
across the network
The Map 
 Master Node orchestrates the distributed work 
 Data is split and sent to Worker nodes 
 Workers apply the map() function over the data 
 Output is written to intermediate storage
The Map
Shuffling 
 The intermediate result is sorted and redistributed 
among the Workers
Reducing 
 Shuffled and sorted data is processed by Workers in parallel, one key at a time
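The three phases above can be simulated end to end with the classic word-count example (plain Python, a sketch of the data flow only — no Hadoop involved, and names like map_phase are ours):

```python
from collections import defaultdict

def map_phase(document):
    """map(): emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all intermediate values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """reduce(): aggregate (here, sum) the values for one key."""
    return key, sum(values)

# Two input "splits", as if distributed to two Worker nodes
splits = ["the quick brown fox", "the lazy dog and the fox"]

intermediate = []
for split in splits:                    # each Worker maps its own split
    intermediate.extend(map_phase(split))

shuffled = shuffle_phase(intermediate)  # sort + redistribute by key
result = dict(reduce_phase(k, v) for k, v in shuffled.items())
print(result["the"], result["fox"])     # prints: 3 2
```

In real Hadoop the shuffle also moves data between machines; here it is just an in-memory grouping step.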
MapReduce Usage 
 Distributed Pattern based search 
 Distributed sorting 
 Inverted index (word belonging to which documents?) 
 Web access log statistics (URL access frequency) 
 Machine learning 
 Data mining
Many Ways to MapReduce 
 Raw Java Code 
 Hard to write well!!! 
 Best performance if well written 
 Hadoop Streaming 
 Uses Standard In / Out 
 Written in any language 
 25% lower performance than Java 
 Hive or Pig 
 Further Processing Abstraction (SQL and scripts data access) 
 10% lower performance than Java
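The Hadoop Streaming option above can be sketched in Python (a minimal sketch under assumed file names; in a real job each script reads sys.stdin and is wired up with something like `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input ... -output ...`):

```python
from itertools import groupby

def mapper(lines):
    """Streaming mapper: read raw input lines, emit "key<TAB>value" lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    """Streaming reducer: input arrives grouped (sorted) by key,
    which is what the framework's shuffle guarantees."""
    parsed = (line.split("\t") for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"

# In a real job each script reads sys.stdin; here we pipe the stages
# by hand, sorting in between as Hadoop's shuffle would.
mapped = sorted(mapper(["the fox", "the dog"]))
for out in reducer(mapped):
    print(out)  # dog 1, fox 1, the 2 (tab-separated)
```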
HDFS 
Hadoop Distributed File System 
 Distributed File System for large Data Sets 
 Focused on Batch processing execution 
 High throughput data access rather than low latency 
 Uses Native File System 
 Scalable and Fault Tolerant 
 Simple Coherency Model => write once, read many 
 Portable across heterogeneous HW/SW
HDFS Architecture 
[Diagram: the Master Node runs the NameNode (HDFS data and metadata management) plus a Checkpoint NameNode (metadata snapshots); DataNodes of ~2 TB each, spread across Racks 1–3, provide data storage, data replication and read/write access.]
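The NameNode’s block bookkeeping can be sketched as a toy simulation (Python; the block size and replication factor mirror common HDFS defaults, but the round-robin placement is our simplification — real HDFS placement is rack-aware):

```python
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default
REPLICATION = 3                  # default replication factor

def plan_blocks(file_size, datanodes):
    """Toy placement: split a file into fixed-size blocks and assign
    each block's replicas round-robin over the DataNodes. This only
    illustrates the index the NameNode keeps, not its real policy."""
    n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    plan = {}
    for b in range(n_blocks):
        plan[b] = [datanodes[(b + r) % len(datanodes)]
                   for r in range(REPLICATION)]
    return plan

plan = plan_blocks(file_size=300 * 1024 * 1024,
                   datanodes=["dn1", "dn2", "dn3", "dn4"])
print(len(plan))   # 300 MB / 128 MB -> 3 blocks
print(plan[0])     # ['dn1', 'dn2', 'dn3']
```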
Hadoop HDFS Accessibility 
 Natively => FileSystem Java API (or a C wrapper) 
 HTTP browser over an HDFS instance 
 FS Shell => Command line: 
bin/hadoop dfs -mkdir /foodir 
bin/hadoop dfs -rmr /foodir 
bin/hadoop dfs -cat /foodir/myfile.txt
Hadoop Ecosystem 
ZooKeeper • Resources management 
Mahout • Algorithms for Machine Learning 
Flume • Streaming of data (pull real-time data from HDFS) 
Sqoop • Pull/push data from/to RDBMS 
Avro • Data in JSON format 
Pig • MR abstraction via a functional programming interface 
Hive • MR abstraction via SQL-like data support 
MapReduce 
HBase • Distributed data 
HDFS • Distributed file system
Hadoop Usage 
 Retail: Amazon, eBay, American Airlines 
 Social Networks: Facebook, Yahoo 
 Financial: Federal Reserve Board 
 Search tools: Yahoo 
 Government
Conclusion 
 We live in the information era, where everything is connected and generates huge amounts of data. Such data, if well analyzed, can add value to society. 
 Hadoop addresses the Big Data challenges, proving to be 
an efficient framework of tools. 
 Hadoop is: 
 Scalable 
 Cost Effective 
 Flexible 
 Fast 
 Resilient to failures
Question 
 What is the overall flow of a 
MapReduce operation proposed 
by Google? 
 http://goo.gl/O5he92
References 
 http://hadoop.apache.org/ 
 “MapReduce: Simplified Data Processing on Large Clusters” – Jeffrey Dean and Sanjay Ghemawat 
 http://www.statisticbrain.com/average-cost-of-hard-drive-storage/ 
 http://zerotoprotraining.com 
 https://zephoria.com/

More Related Content

What's hot

Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY pptsravya raju
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopApache Apex
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem pptsunera pathan
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisVincenzo Gulisano
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 

What's hot (20)

Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Introduction to HBase
Introduction to HBaseIntroduction to HBase
Introduction to HBase
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data Analysis
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 

Similar to Big Data and Hadoop

Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopMr. Ankit
 
Apache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyApache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyJay Nagar
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampSpotle.ai
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 

Similar to Big Data and Hadoop (20)

Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Apache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyApache Hadoop Big Data Technology
Apache Hadoop Big Data Technology
 
hadoop
hadoophadoop
hadoop
 
hadoop
hadoophadoop
hadoop
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
 
Big data
Big dataBig data
Big data
 
Hadoop
HadoopHadoop
Hadoop
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
HADOOP
HADOOPHADOOP
HADOOP
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Seminar ppt
Seminar pptSeminar ppt
Seminar ppt
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 

Recently uploaded

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 

Recently uploaded (20)

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

Big Data and Hadoop

  • 1. Big Data / Hadoop Ciclo de Seminários MO655B – Gerência de Redes de Computadores Alunos: Flavio Vit Marco Aurelio Wolf Professor: Edmundo Madeira Dez/2014
  • 2. Agenda  Big data  Hadoop  MapReduce  HDFS  Hadoop Ecosystem  Conclusion
  • 3. Big Data Hot Topic Big Buzz Word
  • 4. Big Data Hot Topic Big Buzz Word Web 2.0 SOA Social Networks Cloud Computing Big Data
  • 5. Why is data getting bigger?  New devices generating data  Decreasing costs with storage  Increasing processors speed  Use of hardware commodity  Open Source code usage
  • 6. Data Sources  From Humans:  Blogs  Forums  Web Sites  Documents  Social Networks  From machines  Sensors  App logs  Web site tracking info  House hood appliances  Hadoop MapReduce application results  Internet of Things (Computers, cell phones, cars …)
  • 7. Big Data Drivers  Science (CERN 40TB / second)  Financial (Risk analysis)  Web (logs, online retail, cookies)  Social Medias (Facebook, LinkedIn, Twitter)  Mobile devices (~6 Billion cell phones / Sensory data)  Internet of Thinks (Wearables / Sensors / Home Automation)  You!!!
  • 8. Big Data Examples Click Trails Books
  • 9. Big Data 3V Model Variety Volume Velocity
  • 10. Velocity  Data Concurrent access  Real time requirements  Illness detection  Traffic congestion for bus routes  Patient care – brain signals analyzes  Huge amount of new data is generated: • 500 millions tweets/day • 1 million transactions per hour – Walmart • Every 60 seconds on Facebook: • 510 comments are posted • 293,000 statuses are updated • 136,000 photos are uploaded
  • 11. Volume  Big data implies enormous volumes of data  Terabytes / Petabytes / …  Transactions: Walmart’s database estimated in 2.5+ petabytes.  100 terabytes of data uploaded daily to Facebook Data never sleeps…
  • 12. Variety  Structured Data  Tables and well defined schemas (RDB)  Regular structures  Semi Structured Data  Irregular structures (xml)  Schemas are not mandatory  Unstructured Data  No specific data model (free text, emails, logs)  heterogeneous data (audio, video)  All the above
  • 13. Storage Scale  Storage now cheep or free  More devices kicking off more data all the time  Year average cost per GB US$ 500,000.00 450,000.00 400,000.00 350,000.00 300,000.00 250,000.00 200,000.00 150,000.00 100,000.00 50,000.00 0.00 1980 1990 2000 2005 2010 2013 2014 Year  2014 $0.03  2013 $0.05  2010 $0.09  2005 $1.24  2000 $11.00  1990 $11,200  1980 $437,500
  • 14. Data Volume vs Disk Speed 90s 00s 10s Capacity 2.1 GB 200 GB 3000 GB Price US$ 157/GB US$ 1.05/GB US$ 0,05/GB Speed 16.6 Mb/s 56,5 Mb/s 210 Mb/s Time to Read 126 sec 58 min 4 hours Whole Disk
  • 15. Processing Scale  Analyzing Large datasets requires distributed processing  Multiple concurrent access to a given dataset is required  Organizations sitting on decades of raw data  How to process huge amount of data?
  • 16. How Big will it get?  Nobody knows!  Systems need to:  Use horizontal linear scale  Distributed from the start  Cost effective  Easy to use
  • 17.
  • 18. Hadoop History  2003 Doug Cutting was creating Nutch  Open Source “Google”  Web Crawler  Indexer  Crawler and Indexing processing was difficult  Massive storage and processing problem  In 2003 Google publishes GFS paper and in 2004 MapReduce paper  Based in Google’s paper, Doug redesign Nutch
  • 19. What is Hadoop?  Framework of tools  Open source maintained by and under Apache License  Support running apps for BigData  Addressing the BigData challenges: Variety Volume Velocity
  • 20. Hadoop Main Attributes  Distributed Master/Slave Architecture  Fault-tolerant  Commodity Hardware  Written in Java  Mature language  Each daemon runs in a dedicated JVM  Abstract away all infrastructure from Developer  Developers think in and codes for processing individual records, or “Key->Value pairs”
  • 21. Hadoop Architecture - Main Components Hadoop Ecosystem MapReduce File System (HDFS)
  • 22. Hadoop Architecture  Slaves  Task Tracker: execute small piece of main global task  Data Node: store small piece of the total data  Master, same as Slave plus:  Job Tracker: break the higher task coming from application and send them to the appropriate task tracker.  Name Node: keep and index to track where, or on which Data Node, is residing each piece of the total data.
  • 23. Hadoop Architecture Task Tracker Data Node Job Tracker Name Node Task Tracker Data Node Task Tracker Data Node Queue Application Task Tracker Data Node Task Tracker Data Node MapReduce HDFS Master Slaves
  • 24. Hadoop Daemons  Dedicated JVMs are created Hadoop Daemons (Data Nodes, Task Trackers, Name Nodes and Job Tracker) as well as for developer’s algorithm code.  Task tracker daemons are responsible for instantiating and populating these JVMs with the Mapping and Reducing code.  Hadoop Daemons and developer tasks are isolated from one another  Problems like “stack overflow”, “out of memory” are isolated and do not jump out of containers  Each has dedicated memory / independently tunable  Automatically “garbage collected”
  • 25. Hadoop Easier life for programmers  Programmers don’t need to worry about:  Where files are located  How to manage failures  How to break computation into pieces  How to program for scaling
  • 26. Why Hadoop?  Scalable  Breaks data into smaller equal pieces (blocks, typically 64/128 MB)  Breaks a big computation task down into smaller individual tasks  More slaves, more processing and storage power  Cheap  Commodity hardware, open source software  Extremely fault tolerant  “Easy” to use
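The fixed-size block splitting mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop code: the 128 MB block size matches the "64/128 MB" figure on the slide, and the 300 MB file size is an invented example.

```python
# Sketch of HDFS-style fixed-size block splitting (illustrative values only).

def split_into_blocks(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Return (offset, length) pairs describing each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(block_size_bytes, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file yields two full 128 MB blocks plus one 44 MB tail block.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Adding more slaves then simply means more nodes over which these fixed-size blocks, and the tasks that process them, can be spread.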
  • 27. MapReduce  Programming model initially developed by Google  Processing and generation of large data sets  Parallel and distributed algorithm on clusters  Easy for programmers to use, hiding the details of  parallelization  fault-tolerance  locality optimization  load balancing
  • 28. MapReduce  Scales to large clusters of machines (thousands of machines)  Easy to parallelize and distribute computations  Makes computations fault-tolerant  Tasks are executed at the same place where the data is located:  Optimizations for reducing the amount of data sent across the network
  • 29. The Map  The Master Node orchestrates the distributed work  Data is split and sent to Worker nodes  Workers apply the map() function over the data  Output is written to intermediate storage
  • 31. Shuffling  The intermediate result is sorted and redistributed among the Workers
  • 32. Reducing  Shuffled and sorted data is processed by Workers per Key in parallel
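The Map, Shuffling and Reducing slides above can be tied together with a minimal in-process word-count sketch. A real Hadoop job would run each phase on distributed Task Trackers; this Python version only mirrors the data flow between the three phases.

```python
from collections import defaultdict

def map_phase(records):
    """map(): emit an intermediate (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: sort and group the intermediate pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    """reduce(): combine the values of each key -- here, by summing them."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(["big data", "big hadoop"])))
# counts == {"big": 2, "data": 1, "hadoop": 1}
```

Since each key's group is reduced independently, the reducers can run in parallel per key, which is exactly what the "processed by Workers per Key in parallel" bullet describes.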
  • 33. MapReduce Usage  Distributed pattern-based search  Distributed sorting  Inverted index (which documents does a word belong to?)  Web access log statistics (URL access frequency)  Machine learning  Data mining
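The inverted-index use case on this slide follows the same map/shuffle/reduce pattern: the mapper emits (word, document id) pairs and the reducer collects, for each word, the documents it appears in. The document names below are invented for illustration, and this is an in-process sketch rather than a distributed job.

```python
from collections import defaultdict

def inverted_index(docs):
    """docs maps doc_id -> text; returns word -> sorted list of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():   # "map": emit (word, doc_id) pairs
        for word in text.split():
            index[word].add(doc_id)     # "shuffle/reduce": group doc ids by word
    return {word: sorted(ids) for word, ids in index.items()}

index = inverted_index({"d1": "big data", "d2": "big hadoop"})
# index["big"] == ["d1", "d2"]
```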
  • 34. Many Ways to MapReduce  Raw Java Code  Hard to write well!!!  Best performance if well written  Hadoop Streaming  Uses Standard In / Out  Written in any language  25% lower performance than Java  Hive or Pig  Further Processing Abstraction (SQL and scripts data access)  10% lower performance than Java
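The Hadoop Streaming option above works because a Streaming mapper is just a program that reads records from standard input and writes tab-separated key/value lines to standard output. The sketch below shows only the mapper logic for word count; the wiring of stdin/stdout that Streaming performs is indicated in a comment.

```python
def streaming_map(lines):
    """Emit 'word<TAB>1' lines, the format a Streaming mapper writes to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

# In a real Streaming job the framework feeds records to the mapper's stdin:
#   import sys
#   for out in streaming_map(sys.stdin):
#       print(out)
emitted = list(streaming_map(["big data", "big hadoop"]))
# emitted == ["big\t1", "data\t1", "big\t1", "hadoop\t1"]
```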
  • 35. HDFS Hadoop Distributed File System  Distributed File System for large Data Sets  Focused on Batch processing execution  High throughput data access rather than low latency  Uses Native File System  Scalable and Fault Tolerant  Simple Coherency Model => write once, read many  Portable across heterogeneous HW/SW
  • 36. HDFS Architecture Rack 1 NameNode NN 2T Bytes DN 2T Bytes DN 2T Bytes Rack 2 Checkpoint NameNode 2T Bytes DN 2T Bytes DN 2T Bytes Data storage Data replication Read Write Access Rack 3 Datanode DN 2T Bytes DN 2T Bytes DN 2T Bytes Master Node HDFS Data and Metadata management Metadata snapshots
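The "Data replication" arrow in the diagram can be illustrated with a simplified placement model. HDFS commonly keeps three replicas of each block: one on the writer's rack and two together on one remote rack. Rack and node names below are invented, and a real NameNode also weighs node load and free space when choosing targets.

```python
def place_replicas(local_rack, racks, replication=3):
    """Pick Data Nodes for one block: 1 replica local, the rest on one remote rack.

    racks maps rack_name -> list of Data Nodes. Simplified: no load balancing.
    """
    remote_rack = next(r for r in racks if r != local_rack)
    chosen = [racks[local_rack][0]]                  # replica 1: the writer's rack
    chosen += racks[remote_rack][: replication - 1]  # replicas 2..n: one remote rack
    return chosen

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas("rack1", racks)
# replicas == ["dn1", "dn3", "dn4"]
```

Spreading replicas across racks is what keeps a block readable when a whole rack (switch or power) fails, at the cost of one cross-rack copy per write.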
  • 37. Hadoop HDFS Accessibility  Natively => FileSystem Java API // or wrapper in C  HTTP browser over a HDFS instance  FS Shell => Command line: bin/hadoop dfs -mkdir /foodir bin/hadoop dfs -rmr /foodir bin/hadoop dfs -cat /foodir/myfile.txt
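The "HTTP browser over a HDFS instance" access path on this slide is served by the WebHDFS REST API. The sketch below only builds the request URLs and contacts no cluster; the host name is an invented example and the port is an assumption, since the NameNode HTTP port differs between Hadoop versions.

```python
def webhdfs_url(host, path, op, port=50070):
    """Build a WebHDFS v1 URL, e.g. .../webhdfs/v1/foodir?op=LISTSTATUS.

    port=50070 is only an assumed example; check your NameNode's HTTP port.
    """
    return f"http://{host}:{port}/webhdfs/v1{path}?op={op}"

url = webhdfs_url("namenode.example.com", "/foodir/myfile.txt", "OPEN")
# url == "http://namenode.example.com:50070/webhdfs/v1/foodir/myfile.txt?op=OPEN"
```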
  • 38. Hadoop Ecosystem  ZooKeeper: coordination of distributed services  Mahout: algorithms for Machine Learning  Flume: streaming of data (moves real-time data into HDFS)  Sqoop: pull/push data from/to RDBMS  Avro: data in JSON format  Pig: MR abstraction via a functional programming interface  Hive: MR abstraction via SQL-like data support  MapReduce  HBase: distributed data  HDFS: distributed file system
  • 39. Hadoop Usage  Retail: Amazon, eBay, American Airlines  Social Networks: Facebook, Yahoo  Financial: Federal Reserve Board  Search tools: Yahoo  Government
  • 40. Conclusion  We live in the information era, where everything is connected and generates huge amounts of data. Such data, if well analyzed, could add value to society.  Hadoop addresses the Big Data challenges, proving to be an efficient framework of tools.  Hadoop is:  Scalable  Cost Effective  Flexible  Fast  Resilient to failures
  • 41. Question  What is the overall flow of a MapReduce operation proposed by Google?  http://goo.gl/O5he92
  • 42. References  http://hadoop.apache.org/  “MapReduce: Simplified Data Processing on Large Clusters” - Jeffrey Dean and Sanjay Ghemawat  http://www.statisticbrain.com/average-cost-of-hard-drive-storage/  http://zerotoprotraining.com  https://zephoria.com/

Editor's Notes

  1. Average cost per GB of hard drive storage: 2014 $0.03; 2013 $0.05; 2010 $0.09; 2005 $1.24; 2000 $11.00; 1990 $11,200; 1980 $437,500. Source: http://www.statisticbrain.com/average-cost-of-hard-drive-storage/
  2. 500 million tweets/day – REF http://expandedramblings.com/ Digital Marketing Ramblings. 1 million transactions per hour - Walmart. 6 million web pages visited on Facebook.
  3. Data volume doubles every 18 months
  4. The increase of disk speed is linear, almost flat. The increase of data volume is exponential, doubling every 18 months.
  5. In 2003, Doug Cutting was working on an open source “Google” based on two main components: Web Crawler Indexer Processing of such components was difficult because of massive storage and processing requirements. Google then released two papers between 2003 and 2004, the GFS paper and the MapReduce paper. Doug decided to re-design the whole architecture of Nutch and delivered it in 2006 as Hadoop.
  6. Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.
  7. Master/Slave distributed architecture. Few masters, many slaves… Exceptionally fault tolerant. Meant to run on common, cheap and abundant hardware. Developers design thinking about individual key/value pairs. They write their code to consume/produce key/value pairs.
  8. Slaves Task Tracker: small piece of task Data Node Master, same as Slave plus: Job Tracker Name Node
  9. Each daemon runs in an individual JVM. Both the Hadoop core and the developer’s algorithm run in individual JVMs. The Task Tracker instantiates Mapper/Reducer code into its own JVM. Crashes like unhandled exceptions, out of memory problems, or freezes do not affect the entire solution, only the specific daemon.
  10. MapReduce is useful in a wide range of applications, including distributed pattern-based searching, distributed sorting, web link-graph reversal, Singular Value Decomposition,[9] web access log stats, inverted index construction, document clustering, machine learning,[10] and statistical machine translation. Moreover, the MapReduce model has been adapted to several computing environments like multi-core and many-core systems,[11][12][13] desktop grids,[14] volunteer computing environments,[15] dynamic cloud environments,[16] and mobile environments.[17]
  11. The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.
  12. NameNode and DataNodes HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations (on files and directories): opening, closing, and renaming. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.