Big Data raises challenges about how to process such a vast pool of raw data and how to extract value from it for our lives. To address these demands, an ecosystem of tools named Hadoop was conceived.
Big Data and Hadoop
1. Big Data / Hadoop
Seminar Series
MO655B – Computer Network Management
Students: Flavio Vit
Marco Aurelio Wolf
Professor: Edmundo Madeira
Dec/2014
2. Agenda
Big data
Hadoop
MapReduce
HDFS
Hadoop Ecosystem
Conclusion
4. Big Data Hot Topic
The big buzz word, following earlier waves: Web 2.0, SOA, Social Networks, Cloud Computing, and now Big Data.
5. Why is data getting bigger?
New devices generating data
Decreasing storage costs
Increasing processor speed
Use of commodity hardware
Use of open source code
6. Data Sources
From Humans:
Blogs
Forums
Web Sites
Documents
Social Networks
From machines:
Sensors
App logs
Web site tracking info
Household appliances
Hadoop MapReduce application results
Internet of Things (computers, cell phones, cars …)
7. Big Data Drivers
Science (CERN 40TB / second)
Financial (Risk analysis)
Web (logs, online retail, cookies)
Social media (Facebook, LinkedIn, Twitter)
Mobile devices (~6 billion cell phones / sensor data)
Internet of Things (wearables / sensors / home automation)
You!!!
10. Velocity
Concurrent data access
Real-time requirements:
Illness detection
Traffic congestion for bus routes
Patient care – brain signal analysis
A huge amount of new data is generated:
• 500 million tweets/day
• 1 million transactions per hour – Walmart
• Every 60 seconds on Facebook:
• 510 comments are posted
• 293,000 statuses are updated
• 136,000 photos are uploaded
11. Volume
Big data implies enormous volumes of data
Terabytes / Petabytes / …
Transactions: Walmart's database estimated at 2.5+ petabytes
100 terabytes of data uploaded daily to Facebook
Data never sleeps…
12. Variety
Structured data
Tables and well-defined schemas (RDB)
Regular structures
Semi-structured data
Irregular structures (XML)
Schemas are not mandatory
Unstructured data
No specific data model (free text, emails, logs)
Heterogeneous data (audio, video)
All of the above
13. Storage Scale
Storage is now cheap or free
More devices kicking off more data all the time
Average cost per GB of storage by year:
1980  $437,500
1990  $11,200
2000  $11.00
2005  $1.24
2010  $0.09
2013  $0.05
2014  $0.03
14. Data Volume vs Disk Speed
                        | 90s         | 00s          | 10s
Capacity                | 2.1 GB      | 200 GB       | 3000 GB
Price                   | US$ 157/GB  | US$ 1.05/GB  | US$ 0.05/GB
Speed                   | 16.6 MB/s   | 56.5 MB/s    | 210 MB/s
Time to read whole disk | 126 sec     | 58 min       | 4 hours
15. Processing Scale
Analyzing large datasets requires distributed processing
Multiple concurrent accesses to a given dataset are required
Organizations are sitting on decades of raw data
How to process such huge amounts of data?
16. How big will it get?
Nobody knows!
Systems need to:
Scale horizontally and linearly
Be distributed from the start
Be cost effective
Be easy to use
18. Hadoop History
In 2003, Doug Cutting was creating Nutch, an open source “Google”:
Web crawler
Indexer
Crawling and indexing processing was difficult
A massive storage and processing problem
In 2003 Google published the GFS paper, and in 2004 the MapReduce paper
Based on Google's papers, Doug redesigned Nutch
19. What is Hadoop?
A framework of tools
Open source, maintained by Apache under the Apache License
Supports running applications on Big Data
Addresses the Big Data challenges:
Volume, Velocity, Variety
20. Hadoop Main Attributes
Distributed Master/Slave Architecture
Fault-tolerant
Commodity Hardware
Written in Java
Mature language
Each daemon runs in a dedicated JVM
Abstracts away all infrastructure from the developer
Developers think in, and code for, processing individual records, or “key→value pairs”
22. Hadoop Architecture
Slaves
Task Tracker: executes a small piece of the main global task
Data Node: stores a small piece of the total data
Master, same as a Slave plus:
Job Tracker: breaks the higher-level task coming from the application and sends the pieces to the appropriate Task Trackers
Name Node: keeps an index tracking where (on which Data Node) each piece of the total data resides
23. Hadoop Architecture
[Diagram: an application submits work to a queue on the Master, which runs the Job Tracker (MapReduce) and the Name Node (HDFS); each Slave runs a Task Tracker and a Data Node.]
24. Hadoop Daemons
Dedicated JVMs are created for the Hadoop daemons (Data Nodes, Task Trackers, Name Node and Job Tracker) as well as for the developer's algorithm code.
Task Tracker daemons are responsible for instantiating and populating these JVMs with the mapping and reducing code.
Hadoop daemons and developer tasks are isolated from one another
Problems like “stack overflow” and “out of memory” stay isolated and do not jump out of their containers
Each has dedicated memory and is independently tunable (see the example after this slide)
Automatically “garbage collected”
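As a hedged illustration of that per-JVM tunability (the 512 MB value is arbitrary, not from the slides): in Hadoop 1.x the heap of each task JVM can be set in mapred-site.xml, independently of the daemon JVMs:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>  <!-- heap for each map/reduce task JVM -->
</property>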
25. Hadoop
Easier life for programmers
Programmers don’t need to worry about:
Where files are located
How to manage failures
How to break computation into pieces
How to program for scaling
26. Why Hadoop?
Scalable
Breaks data into smaller, equal-sized pieces (blocks, typically 64/128 MB; see the example after this list)
Breaks a big computation task down into smaller individual tasks
More slaves, more processing and storage power
Cheap
Commodity hardware, open source software
Extremely fault tolerant
“Easy” to use
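For instance, assuming a Hadoop 1.x cluster, the block size can be chosen per file at write time from the FS Shell (the file name is hypothetical; dfs.block.size is the 1.x property name, and 134217728 bytes = 128 MB):

bin/hadoop dfs -D dfs.block.size=134217728 -put localfile.txt /foodir/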
27. MapReduce
A programming model initially developed by Google
For processing and generating large data sets
A parallel and distributed algorithm on clusters
Easy for programmers to use, hiding the details of:
parallelization
fault tolerance
locality optimization
load balancing
28. MapReduce
Scales to large clusters of machines (thousands of machines)
Makes it easy to parallelize and distribute computations
Makes computations fault-tolerant
Tasks are executed at the same place where the data is located:
optimizations to reduce the amount of data sent across the network
29. The Map
The Master node orchestrates the distributed work
Data is split and sent to Worker nodes
Workers apply the map() function over the data
Output is written to intermediate storage
31. Shuffling
The intermediate result is sorted and redistributed among the Workers
32. Reducing
The shuffled and sorted data is processed by Workers, per key, in parallel
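To make the map → shuffle → reduce flow concrete, here is a minimal word-count sketch using Hadoop's Java MapReduce API; the class names are our own illustration, not from the slides:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(): emit an intermediate (word, 1) pair for every word in the input split
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// reduce(): after the shuffle, each word arrives with all its 1s; sum them
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();
    context.write(word, new IntWritable(sum));
  }
}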
33. MapReduce Usage
Distributed pattern-based search
Distributed sorting
Inverted index (word belonging to which documents?)
Web access log statistics (URL access frequency)
Machine learning
Data mining
34. Many Ways to MapReduce
Raw Java code
Hard to write well!!!
Best performance if well written
Hadoop Streaming (see the example after this slide)
Uses standard in / out
Can be written in any language
25% lower performance than Java
Hive or Pig
Further processing abstraction (SQL- and script-based data access)
10% lower performance than Java
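A hedged Hadoop Streaming example (the jar location varies by distribution, and the input/output paths are hypothetical): any executables that read stdin and write stdout can serve as mapper and reducer:

bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
  -input /user/data/input \
  -output /user/data/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc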
35. HDFS
Hadoop Distributed File System
A distributed file system for large data sets
Focused on batch processing
High-throughput data access rather than low latency
Uses the native file system
Scalable and fault tolerant
Simple coherency model => write once, read many
Portable across heterogeneous HW/SW
37. Hadoop HDFS Accessibility
Natively => FileSystem Java API // or a wrapper in C
HTTP browser over an HDFS instance
FS Shell => Command line:
bin/hadoop dfs -mkdir /foodir
bin/hadoop dfs -rmr /foodir
bin/hadoop dfs -cat /foodir/myfile.txt
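A minimal sketch of the native FileSystem Java API performing the same operations as the FS Shell commands above (error handling omitted; assumes a configured HDFS client):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // handle to the configured file system
    fs.mkdirs(new Path("/foodir"));                // like: dfs -mkdir /foodir
    try (FSDataInputStream in = fs.open(new Path("/foodir/myfile.txt"))) {
      BufferedReader reader = new BufferedReader(new InputStreamReader(in));
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);                  // like: dfs -cat /foodir/myfile.txt
      }
    }
  }
}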
38. Hadoop Ecosystem
ZooKeeper • Coordination and resource management
Mahout • Algorithms for machine learning
Flume • Streaming of data (pulls real-time data into HDFS)
Sqoop • Pulls/pushes data from/to an RDBMS
Avro • Data serialization (schemas in JSON format)
Pig • MR abstraction via a functional programming interface
Hive • MR abstraction via SQL-like data support
MapReduce • Distributed data processing
HBase • Distributed data store
HDFS • Distributed file system
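As a taste of one ecosystem tool, a hedged Sqoop import sketch (the JDBC URL, table, and target directory are hypothetical) that pulls an RDBMS table into HDFS:

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table orders \
  --target-dir /user/hadoop/orders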
39. Hadoop Usage
Retail: Amazon, eBay, American Airlines
Social Networks: Facebook, Yahoo
Financial: Federal Reserve Board
Search tools: Yahoo
Government
40. Conclusion
We live in the information era, where everything is connected and generates huge amounts of data. Such data, if well analyzed, can add value to society.
Hadoop addresses the Big Data challenges, proving to be an efficient framework of tools.
Hadoop is:
Scalable
Cost Effective
Flexible
Fast
Resilient to failures
41. Question
What is the overall flow of a
MapReduce operation proposed
by Google?
http://goo.gl/O5he92
42. References
http://hadoop.apache.org/
“MapReduce: Simplified Data Processing on Large Clusters” – Jeffrey Dean and Sanjay Ghemawat
http://www.statisticbrain.com/average-cost-of-hard-drive-storage/
http://zerotoprotraining.com
https://zephoria.com/
Editor's Notes
500 million tweets/day – ref: http://expandedramblings.com/ (Digital Marketing Ramblings)
1 million transactions per hour – Walmart
6 million web pages visited on Facebook
Data volume doubles every 18 months.
The increase in disk speed is linear, almost flat, while the increase in data volume is exponential, doubling every 18 months.
In 2003, Doug Cutting was working on an open source “Google” based on two main components:
Web Crawler
Indexer
Processing in these components was difficult because of massive storage and processing requirements.
Google then released two papers between 2003 and 2004: the GFS paper and the MapReduce paper.
Doug decided to redesign the whole architecture of Nutch and delivered it in 2006 as Hadoop.
Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.
Master Slave distributed architecture. Few masters, many slaves…
Exceptionally fault tolerant
Meant to run on common, cheap and abundant hardware
Developers design thinking about individual key/value pairs. They write their code to consume/produce key/value pairs.
Slaves
Task Tracker: small piece of task
Data Node
Master, same as Slave plus:
Job Tracker
Name Node
Each daemon runs in an individual JVM. Both the Hadoop core and the developer's algorithm run in individual JVMs.
The Task Tracker instantiates the Mapper/Reducer code into JVMs of their own.
Crashes such as unhandled exceptions, out-of-memory problems, or freezes do not affect the entire solution, only the specific daemon.
MapReduce is useful in a wide range of applications, including
distributed pattern-based searching,
distributed sorting,
web link-graph reversal,
Singular Value Decomposition,[9]
web access log stats,
inverted index construction,
document clustering,
machine learning,[10]
and statistical machine translation.
Moreover, the MapReduce model has been adapted to several computing environments like multi-core and many-core systems,[11][12][13] desktop grids,[14] volunteer computing environments,[15] dynamic cloud environments,[16] and mobile environments.[17]
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.
NameNode and DataNodes
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.
The NameNode executes file system namespace operations on files and directories: opening, closing, and renaming.
It also determines the mapping of blocks to DataNodes.
The DataNodes are responsible for serving read and write requests from the file system’s clients.
The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
The existence of a single NameNode in a cluster greatly simplifies the architecture of the system.
The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.