Roman Nikitchenko, 09.05.2014
BIG.DATA technologies
& HADOOP infrastructure
www.vitech.com.ua
Agenda
● BIG DATA APPROACH: Why is Hadoop such a promising solution?
● HADOOP ENVIRONMENT: What technology is behind this name?
● INDUSTRY FACE IS CHANGING: Hadoop causes real changes in the big data industry.
No escape for you ;-)
What is BIG DATA?
● Really BIG DATA things: photo banks, video storage,
historical measurements.
● Intensive data transactions and high distribution: stores
(offline or online), banks, advertising networks.
● Realtime data: measurements and monitoring, gaming.
● Intensive processing: science, modelling.
● High volumes of small things: social networks,
healthcare
BIG DATA IS EVERYWHERE
BIG DATA in just 3 words
Indeed any real big
data is just about
DIGITAL LIFE
FOOTPRINT
WORLD is big data itself
Yet to remember....
WORLD ITSELF CAN
BE DIGITIZED TOO
● Earth weather and environment: realtime, really big data volume, high potential for processing, a lot of things to be analysed, historical data.
● Space: unlimited potential for analysis, an ocean of yet unknown volume.
● The Internet of things is going to be a digital world in itself.
● ???
So...
BIG DATA is not about the
data. It is about OUR ABILITY
TO HANDLE IT.
But how can I handle big data?
BIG DATA … BUT HOW TO HANDLE IT?
BIG DATA storage: requirements
NO BACKUPS
BIG DATA storage: requirements
SIMPLE BUT
RELIABLE
● Really big amount of
data is to be stored
in reliable manner.
● Storage is to be
simple, recoverable
and cheap.
BIG DATA storage: requirements
DECENTRALIZED
● No single point of failure.
● Scalable as close to
linear as possible.
● No manual actions to
recover in case of
failures
BIG DATA processing: requirements
SIMPLE TO USE
● Complexity is to
be buried inside.
● Interface is to be
functional and
compatible
between versions.
BIG DATA processing: requirements
TOOLS TO BE CLOSE
TO WORK
● Process data on the
same nodes as it is
stored on.
● Distributed storage
— distributed
processing.
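The "process data where it is stored" idea can be sketched as a tiny scheduler. This is only an illustration under stated assumptions: the block map, the node names and the `assign_tasks` helper are hypothetical and are not Hadoop's scheduler API.

```python
# Hypothetical placement map: block id -> nodes that hold a replica of it.
BLOCK_LOCATIONS = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-b", "node-c"],
    "block-3": ["node-c", "node-a"],
}


def assign_tasks(block_locations, free_slots):
    """Give each block's task to a node holding a replica, if it has a free slot.

    Falls back to any node with a free slot (a remote read) otherwise.
    Returns {block_id: (node, is_local)}.
    """
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    plan = {}
    for block_id in sorted(block_locations):
        replicas = block_locations[block_id]
        local = [n for n in replicas if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            candidates = sorted(n for n, s in slots.items() if s > 0)
            if not candidates:
                raise RuntimeError("no free slots left in the cluster")
            node, is_local = candidates[0], False
        slots[node] -= 1
        plan[block_id] = (node, is_local)
    return plan


# One slot per node: every task can run next to its data.
balanced_plan = assign_tasks(BLOCK_LOCATIONS, {"node-a": 1, "node-b": 1, "node-c": 1})

# Only node-a has slots: block-2 has no replica there, so it is read remotely.
skewed_plan = assign_tasks(BLOCK_LOCATIONS, {"node-a": 3})
```

In the second call the task for block-2 is the only one that has to pull its data over the network, which is the cost the locality requirement tries to avoid.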
BIG DATA processing: requirements
SHARE LOAD
● Work is to be balanced.
● Data placement is to be appropriate to balanced work.
● Amount of work is to be balanced in accordance with resources.
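Balancing work in accordance with resources can be illustrated with a proportional split. The sketch below is an assumption-laden toy: `split_work` and the capacity numbers are hypothetical, and real schedulers weigh far more than a single capacity figure.

```python
def split_work(total_units, node_capacity):
    """Split total_units of work across nodes in proportion to capacity.

    Uses the largest-remainder method so the shares are whole numbers
    that add up exactly to total_units.
    """
    total_capacity = sum(node_capacity.values())
    if total_capacity <= 0:
        raise ValueError("cluster has no capacity")
    exact = {n: total_units * c / total_capacity for n, c in node_capacity.items()}
    shares = {n: int(x) for n, x in exact.items()}
    leftover = total_units - sum(shares.values())
    # Hand the remaining units to the nodes with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda n: (shares[n] - exact[n], n))
    for n in by_remainder[:leftover]:
        shares[n] += 1
    return shares


# Hypothetical cluster: capacities could be cores, task slots or any other unit.
shares = split_work(100, {"small": 4, "medium": 8, "large": 16})
```

With these numbers the node with four times the capacity receives roughly four times the work, and no unit of work is lost to rounding.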
Solution requirements in general
WHAT FINALLY DO
WE NEED?
● CPU+HDD in one place
● Cluster of replaceable nodes
● Lot of storage space
● Way to control resources
and balance load
● Everything is to be
relatively simple and
affordable
[Figure: "x MAX + = BIG DATA" illustration]
… and what is the solution?
HADOOP magic is here!
What is it?
What is
HADOOP?
● Hadoop is an open source framework for big data. It covers both distributed storage and processing.
● Hadoop is reliable and fault tolerant without relying on hardware for these properties.
● Hadoop has unique horizontal scalability. Currently it scales from a single computer up to thousands of cluster nodes.
Facts and trends
● 2004: inspired by Google's MapReduce idea. Originally named after a toy elephant belonging to its creator's son.
● On June 13, 2012
Facebook announced their
Hadoop cluster has 100 PB
of data. On November 8,
2012 they announced the
warehouse grows by
roughly half a PB per day.
● On February 19, 2008, Yahoo! Inc. launched what it
claimed was the world's largest Hadoop production
application. The Yahoo! Search Webmap is a Hadoop
application that runs on a Linux cluster with more than
10,000 cores.
Hadoop: classical picture
Hadoop
historical
top view
● HDFS serves as file
system layer
● MapReduce originally
served as distributed
processing framework.
● The native client API is Java, but there are a lot of alternatives.
● This is only the initial architecture; it is now more complex.
HDFS top view
● The namenode is the 'management' component. It keeps a 'directory' of which file blocks are stored where.
● Actual work is performed by the datanodes.
HDFS files handling
● Files are stored in large blocks. Every block is replicated to several datanodes.
● Replication is tracked by the namenode. Clients only locate blocks through the namenode; the actual load is taken by the datanodes.
● A datanode failure triggers replication recovery. The namenode can be backed by a standby scheme.
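The file handling described above can be modelled in a few lines. This is a toy sketch and not HDFS code: `ToyNameNode`, the round-robin placement, the 128 MB block size and the replication factor of 3 are assumptions chosen for illustration (real HDFS makes both values configurable and uses rack-aware placement).

```python
import itertools

BLOCK_SIZE_MB = 128  # assumed block size
REPLICATION = 3      # assumed replication factor


class ToyNameNode:
    """Keeps metadata only: which block of which file lives on which datanodes."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.block_map = {}  # (path, block index) -> set of datanode names
        self._next = itertools.cycle(self.datanodes)

    def add_file(self, path, size_mb):
        n_blocks = -(-size_mb // BLOCK_SIZE_MB)  # ceiling division
        for index in range(n_blocks):
            targets = set()
            # Round-robin placement, a simplification of real placement policy.
            while len(targets) < min(REPLICATION, len(self.datanodes)):
                targets.add(next(self._next))
            self.block_map[(path, index)] = targets
        return n_blocks

    def locate(self, path, index):
        """What a client asks for before reading from the datanodes directly."""
        return sorted(self.block_map[(path, index)])

    def handle_datanode_failure(self, dead):
        """Drop the dead node and re-replicate every block it was holding."""
        self.datanodes.remove(dead)
        self._next = itertools.cycle(self.datanodes)
        for replicas in self.block_map.values():
            replicas.discard(dead)
            spare = [n for n in self.datanodes if n not in replicas]
            while len(replicas) < REPLICATION and spare:
                replicas.add(spare.pop(0))


nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
n_blocks = nn.add_file("/logs/a", 300)  # 300 MB -> 3 blocks
before = nn.locate("/logs/a", 0)
nn.handle_datanode_failure("dn1")
after = nn.locate("/logs/a", 0)
```

After the simulated failure every block is back at three replicas on the surviving nodes, with no manual step involved.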
HDFS properties
HDFS is ...
● Designed for throughput, not for latency.
● Blocks are expected to be large. There is an issue with lots of small files.
● Write once, read many times ideology.
● Only append, no 'edit' ability.
● Special tools such as Apache HBase are required to implement OLTP.
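The small files issue can be made concrete by counting the entries the namenode has to track. The sketch below is a rough model under assumptions: one entry per file plus one per block, a 128 MB block size, and a hypothetical helper name `namenode_objects`.

```python
def namenode_objects(total_mb, file_size_mb, block_size_mb=128):
    """Count file and block entries the namenode must track (toy model)."""
    n_files = -(-total_mb // file_size_mb)  # ceiling division
    blocks_per_file = max(1, -(-file_size_mb // block_size_mb))
    return n_files + n_files * blocks_per_file


ONE_TB_IN_MB = 1024 * 1024

big = namenode_objects(ONE_TB_IN_MB, 1024)  # 1 TB stored as 1 GB files
small = namenode_objects(ONE_TB_IN_MB, 1)   # 1 TB stored as 1 MB files
```

In this model the same terabyte costs the namenode a few thousand entries as large files and about two million entries as 1 MB files, which is why the design expects large blocks.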
MapReduce framework model
● Two-step data processing: transform (map) and then reduce. Really nice for doing things in a distributed manner.
● A large class of jobs can be adapted to this model, but not all of them.
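The classic illustration of the model is a word count. The sketch below runs in a single process and only mimics the three stages (map, group by key, reduce); the function names are illustrative and this is not the Hadoop MapReduce API.

```python
from collections import defaultdict


def map_phase(line):
    """Transform step: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = run_job(["big data is big", "data is everywhere"])
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread over many nodes, which is what makes the model fit distributed processing.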
MapReduce service: top view
● One JobTracker with
redundancy
possible.
● Multiple
TaskTrackers doing
actual job.
● Ideology is similar
to HDFS handling.
● HDFS is usually
used as storage on
all phases.
MapReduce service
Technology: Hadoop 2.0 concept
● A new component (YARN) forms the resource management layer and completes a real distributed data OS.
● MapReduce is from now on only one among other YARN applications.
YARN: notable addition
● Resource
manager
dispatches
client requests.
● Node managers
manage node
resources.
● Any application
is set of
containers
including
application
master.
YARN service
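The roles listed above can be sketched as a toy allocator. Everything here is a simplifying assumption: `ToyResourceManager`, the memory-only resource model and the first-fit placement are hypothetical, and in real YARN the application master itself negotiates further containers with the resource manager.

```python
class ToyResourceManager:
    """Tracks free memory per node and hands out containers (toy model)."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)  # node name -> free memory in MB
        self.containers = []              # (app_id, role, node, memory_mb)

    def _allocate(self, app_id, role, memory_mb):
        # First fit across nodes in name order; a stand-in for real scheduling.
        for node in sorted(self.free):
            if self.free[node] >= memory_mb:
                self.free[node] -= memory_mb
                self.containers.append((app_id, role, node, memory_mb))
                return node
        return None

    def submit(self, app_id, am_memory_mb, task_memory_mb, n_tasks):
        """Start an application: one master container, then task containers."""
        if self._allocate(app_id, "application-master", am_memory_mb) is None:
            raise RuntimeError("no room for the application master")
        granted = 0
        for _ in range(n_tasks):
            if self._allocate(app_id, "task", task_memory_mb) is None:
                break  # cluster is full; remaining tasks would have to wait
            granted += 1
        return granted


rm = ToyResourceManager({"node-1": 4096, "node-2": 4096})
granted = rm.submit("app-1", am_memory_mb=1024, task_memory_mb=2048, n_tasks=4)
```

Here the application asks for four task containers but the cluster only has room for three next to the application master, so the fourth request is not granted.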
YARN: notable addition
Why is YARN SO important?
● Better resource balance for heterogeneous clusters and multiple applications.
● Dynamic applications over static services.
● A much wider application model than simple MapReduce. Things like Spark or Tez.
Hadoop current picture
● HDFS2 is now about storage and YARN is about processing resources.
● There are a lot of things to do on top of this data OS, starting from traditional MapReduce. Now there are a lot of alternatives.
Just several items around
Infrastructure
● HBase: Scalable structured data
storage for large tables.
● Hive: A data warehouse
infrastructure that provides data
summarization and ad hoc
querying.
● Mahout: A scalable machine
learning and data mining library.
● Pig: A high-level data-flow
language and execution
framework for parallel
computation.
● ZooKeeper: A high-performance
distributed coordination service.
Most important concept
World's first ever
DATA OS
10,000-node computer...
Recent technology changes are focused on
higher scale. Better resource usage and
control, lower MTTR, higher security,
redundancy, fault tolerance.
Big data industry is changing.
HADOOP has influence
on whole BIG DATA
INDUSTRY face
New concepts
DATA LAKE
Take as much data
about your business
processes as you can
take. The more data
you have the more
value you could get
from it.
New concepts
ENTERPRISE DATA HUB
Don't ruin your existing data warehouse.
Just extend it with new, centralized big
data storage through data migration
solution.
Trends
Big data is going BIGGER
● SSDs are going to be widely used as storage, and memory-based replicas are not a miracle anymore.
● Memory and SSD based caching schemes are going to be more and more aggressive, particularly in HDFS and HBase.
● Clusters grow. Currently some open source features are targeted at clusters of 1K nodes. How about a 300-node staging cluster in companies like eBay?
● Production clusters go beyond 4000 nodes (up to 10K). A node fails nearly every day.
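The "node failure nearly every day" remark follows from simple arithmetic. The sketch below assumes, purely for illustration, that a single node fails about once every ten years; that figure is not from the slides, and real failure rates vary widely with hardware and workload.

```python
# Illustrative assumption, not a measured value: one failure per node per 10 years.
ASSUMED_NODE_MTBF_DAYS = 10 * 365


def expected_failures_per_day(nodes, mean_days_between_failures):
    """Expected node failures per day, treating nodes as independent."""
    return nodes / mean_days_between_failures


at_4k = expected_failures_per_day(4000, ASSUMED_NODE_MTBF_DAYS)
at_10k = expected_failures_per_day(10000, ASSUMED_NODE_MTBF_DAYS)
```

Under that assumption a 4000-node cluster sees about one failure per day and a 10K-node cluster almost three, so recovery without manual action stops being optional at this scale.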
Trends
● Typical node is expected to
include at least 64G memory
● Starting from 4 x 2T drives for
storage. 8-16 x 4T drives are not
so rare. This is for general
'workload' node.
● 10 and more CPU cores. 2 CPUs
is normal approach.
● SSD is starting to be widely used
not only for OS and caching but
for data itself.
● Main outcome — per node costs
model is changing.
HARDWARE
IS GETTING
CHEAPER
Most important concept
● You need to limit things
you are guessing
For whom the bell tolls?
Old way
● Make assumptions
about data you
need.
● Make assumptions
about data model.
● Make assumptions
about algorithms
you need.
● Get confirmation for
your initial guess
about result. Are you
surprised?
New way
● Get as much data as you can.
● Detect data model based on
set of algorithms with
extensive approach.
● Cluster your data, detect
correlations, clean out
anomalies... in every way you
can afford on the whole data set.
● Get grounded results. You still
can miss some fundamental
aspects but isn't it much
better in any case?
Major Hadoop distributions
● Hortonworks are 'barely open source'. Innovative, but 'running too fast'. Most of their key technologies are not so mature yet.
● Cloudera is stable enough but not stale. Hadoop 2.3 with YARN, HBase 0.96.x. Balance.
● MapR focuses on performance per node, but they are slightly outdated in terms of functionality, and their distribution costs money. For cases where node performance is a high priority.
● Intel is a newcomer to this market. Not for the near future.
Questions and discussion

 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 

Big data technologies and Hadoop infrastructure

  • 1. Roman Nikitchenko, 09.05.2014 BIG.DATA technologies & HADOOP infrastructure
  • 2. www.vitech.com.ua Agenda ● BIG DATA APPROACH: Why is Hadoop such a promising solution? ● HADOOP ENVIRONMENT: What technology is behind this name? ● INDUSTRY FACE IS CHANGING: Hadoop causes real changes in the big data industry.
  • 4. What is BIG DATA? ● Really BIG DATA things: photo banks, video storage, historical measurements. ● Intensive data transactions and high distribution: stores (offline or online), banks, advertising networks. ● Realtime data: measurements and monitoring, gaming. ● Intensive processing: science, modelling. ● High volumes of small things: social networks, healthcare. BIG DATA IS EVERYWHERE
  • 5. BIG DATA in just 3 words: Indeed, any real big data is just about a DIGITAL LIFE FOOTPRINT.
  • 6. WORLD is big data itself. Also worth remembering: THE WORLD ITSELF CAN BE DIGITIZED TOO ● Earth weather and environment: realtime, really big data volume, high potential for processing, a lot of things to be analysed, historical data. ● Space: unlimited potential for analysis; the ocean is a still unknown volume. ● The Internet of things is going to be a digital world in itself. ● ???
  • 7. So... BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE IT.
  • 8. But how can I handle big data? BIG DATA … BUT HOW TO HANDLE IT?
  • 9. BIG DATA storage: requirements NO BACKUPS
  • 10. BIG DATA storage: requirements SIMPLE BUT RELIABLE ● A really big amount of data has to be stored reliably. ● Storage has to be simple, recoverable and cheap.
  • 11. BIG DATA storage: requirements DECENTRALIZED ● No single point of failure. ● Scalability as close to linear as possible. ● No manual action needed to recover from failures.
  • 12. BIG DATA processing: requirements SIMPLE TO USE ● Complexity has to be buried inside. ● The interface has to be functional and compatible between versions.
  • 13. BIG DATA processing: requirements TOOLS TO BE CLOSE TO WORK ● Process data on the same nodes it is stored on. ● Distributed storage calls for distributed processing.
  • 14. BIG DATA processing: requirements SHARE LOAD ● Work has to be balanced. ● Data placement has to support balanced work. ● The amount of work per node has to be balanced according to its resources.
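The "balance work according to resources" requirement on slide 14 can be illustrated with a minimal sketch. It assumes each node advertises a single numeric capacity; the function name `split_work` and the node names are hypothetical and are not part of Hadoop.

```python
def split_work(total_tasks, capacities):
    """Split total_tasks across nodes in proportion to each node's capacity."""
    total_capacity = sum(capacities.values())
    # Ideal (possibly fractional) share for every node.
    exact = {node: total_tasks * cap / total_capacity
             for node, cap in capacities.items()}
    # Start from the rounded-down share.
    shares = {node: int(value) for node, value in exact.items()}
    leftover = total_tasks - sum(shares.values())
    # Hand the remaining tasks to the nodes with the largest fractional part.
    by_remainder = sorted(exact, key=lambda n: exact[n] - shares[n], reverse=True)
    for node in by_remainder[:leftover]:
        shares[node] += 1
    return shares


# Hypothetical cluster: node-b has twice the resources of the others.
capacities = {"node-a": 4, "node-b": 8, "node-c": 4}
shares = split_work(100, capacities)
uneven = split_work(10, {"a": 1, "b": 1, "c": 1})
```

A real scheduler also has to account for where the data lives, which this sketch ignores.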
  • 15. Solution requirements in general WHAT DO WE FINALLY NEED? ● CPU+HDD in one place ● A cluster of replaceable nodes ● A lot of storage space ● A way to control resources and balance load ● Everything has to be relatively simple and affordable [Figure equation: … x MAX + … = BIG DATA]
  • 16. … and what is the solution? HADOOP magic is here!
  • 17. What is it? What is HADOOP? ● Hadoop is an open source framework for big data, covering both distributed storage and processing. ● Hadoop is reliable and fault tolerant without relying on hardware for these properties. ● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.
  • 18. Facts and trends ● 2004: inspired by Google's MapReduce idea. Originally named after a toy elephant belonging to the creator's son. ● On June 13, 2012 Facebook announced that their Hadoop cluster held 100 PB of data. On November 8, 2012 they announced that the warehouse grows by roughly half a PB per day. ● On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores.
  • 19. Hadoop: classical picture Hadoop historical top view ● HDFS serves as the file system layer. ● MapReduce originally served as the distributed processing framework. ● The native client API is Java, but there are a lot of alternatives. ● This is only the initial architecture; it is now more complex.
  • 20. HDFS top view ● The namenode is the 'management' component. It keeps a 'directory' of which file blocks are stored where. ● The actual work is performed by datanodes.
  • 21. HDFS files handling ● Files are stored in fairly large blocks. Every block is replicated to several datanodes. ● Replication is tracked by the namenode. Clients use the namenode only to locate blocks; the actual load is taken by the datanodes. ● A datanode failure triggers replication recovery. The namenode can be backed by a standby scheme.
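The block handling described on slide 21 can be sketched as a toy model. This is not the HDFS implementation or its API: the class `ToyNamenode`, the block size of 128 bytes, the replication factor of 3 and the "least loaded node first" placement are all assumptions chosen for illustration (real HDFS placement is rack-aware).

```python
BLOCK_SIZE = 128   # assumed toy block size, in bytes
REPLICATION = 3    # assumed number of replicas per block


class ToyNamenode:
    """Keeps the 'directory' of which datanodes hold which block."""

    def __init__(self, datanodes):
        self.datanodes = set(datanodes)
        self.block_map = {}  # block id -> set of datanode names

    def _pick(self, count, exclude=()):
        # Count blocks per live datanode, then prefer the least loaded ones.
        load = {node: 0 for node in self.datanodes}
        for holders in self.block_map.values():
            for node in holders:
                if node in load:
                    load[node] += 1
        candidates = sorted((n for n in self.datanodes if n not in exclude),
                            key=lambda n: (load[n], n))
        return set(candidates[:count])

    def add_file(self, name, size):
        # Split the file into blocks and place replicas of each block.
        n_blocks = -(-size // BLOCK_SIZE)  # ceiling division
        block_ids = []
        for index in range(n_blocks):
            block_id = f"{name}#{index}"
            self.block_map[block_id] = self._pick(REPLICATION)
            block_ids.append(block_id)
        return block_ids

    def locate(self, block_id):
        # Clients ask only for locations; data is read from the datanodes.
        return sorted(self.block_map[block_id])

    def datanode_failed(self, node):
        # Replication recovery: restore the replica count of affected blocks.
        self.datanodes.discard(node)
        for holders in self.block_map.values():
            holders.discard(node)
            missing = REPLICATION - len(holders)
            if missing > 0:
                holders |= self._pick(missing, exclude=holders)


namenode = ToyNamenode(["dn1", "dn2", "dn3", "dn4"])
blocks = namenode.add_file("logs.txt", 300)   # 300 bytes -> 3 blocks
before = [namenode.locate(b) for b in blocks]
namenode.datanode_failed("dn1")
after = [namenode.locate(b) for b in blocks]
```

After the simulated failure every block is back at three replicas without any manual action, which is the property the slide emphasises.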
  • 22. HDFS properties HDFS is ... ● Designed for throughput, not for latency. ● Blocks are expected to be large. A lot of small files is a problem. ● Write once, read many times ideology. ● Append only, no 'edit' ability. ● Special tools such as Apache HBase are required to implement OLTP.
  • 23. MapReduce framework model ● Two-step data processing: transform (map) and then reduce. Really nice for doing things in a distributed manner. ● A large class of jobs can be adapted to this model, but not all of them.
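The two-step model on slide 23 can be shown with the classic word count, written here as a single-process sketch in plain Python rather than with the Hadoop API. The function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are illustrative assumptions; in a real cluster the map and reduce calls run on different nodes and the framework performs the grouping step.

```python
from collections import defaultdict


def map_phase(record):
    """Transform one input record into (key, value) pairs."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, as the framework does between the two steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values of one key into a single result."""
    return key, sum(values)


def run_job(records):
    mapped = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(mapped).items())


counts = run_job(["big data is big", "hadoop handles big data"])
```

Each map call sees only its own record and each reduce call sees only one key, which is why the work distributes so easily across nodes.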
  • 24. MapReduce service: top view ● One JobTracker, with redundancy possible. ● Multiple TaskTrackers do the actual work. ● The ideology is similar to HDFS handling. ● HDFS is usually used as storage in all phases.
  • 25. Technology: Hadoop 2.0 concept ● A new component (YARN) forms the resource management layer and completes a real distributed data OS. ● MapReduce is from now on only one among other YARN applications.
  • 26. YARN: notable addition YARN service ● The resource manager dispatches client requests. ● Node managers manage node resources. ● Any application is a set of containers, including an application master.
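The roles on slide 26 can be sketched as a toy model. This is not the YARN API: the classes, the memory-only resource model and the "most free memory first" placement rule are assumptions for illustration. In real YARN the application master itself negotiates further containers with the resource manager, which this sketch collapses into one `submit` call.

```python
class NodeManager:
    """Tracks the free resources of a single node (memory only, in MB)."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

    def launch(self, memory_mb):
        # Start a container if the node still has enough free memory.
        if memory_mb > self.free_mb:
            return False
        self.free_mb -= memory_mb
        return True


class ResourceManager:
    """Dispatches container requests to node managers."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb):
        # Assumed placement rule: the node with the most free memory wins.
        node = max(self.nodes, key=lambda n: n.free_mb)
        return node.name if node.launch(memory_mb) else None

    def submit(self, master_mb, task_mb, tasks):
        # An application is a set of containers, application master included.
        containers = [("application-master", self.allocate(master_mb))]
        containers += [(f"task-{i}", self.allocate(task_mb))
                       for i in range(tasks)]
        return containers


nodes = [NodeManager("nm1", 4096), NodeManager("nm2", 8192)]
manager = ResourceManager(nodes)
containers = manager.submit(master_mb=1024, task_mb=2048, tasks=4)
free_total = sum(node.free_mb for node in nodes)
```

Because the resource manager only hands out containers, any framework that can run inside them can share the cluster with MapReduce.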
  • 27. YARN: notable addition Why is YARN SO important? ● Better resource balance for heterogeneous clusters and multiple applications. ● Dynamic applications instead of static services. ● A much wider application model than simple MapReduce: things like Spark or Tez.
  • 28. Hadoop current picture ● HDFS2 is now about storage and YARN is about processing resources. ● There are a lot of things to run on top of this data OS, starting with traditional MapReduce. There are now a lot of alternatives.
  • 29. Just a few items around the infrastructure ● HBase: Scalable structured data storage for large tables. ● Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying. ● Mahout: A scalable machine learning and data mining library. ● Pig: A high-level data-flow language and execution framework for parallel computation. ● ZooKeeper: A high-performance distributed coordination service.
  • 30. Most important concept: the world's first ever DATA OS. A 10,000-node computer... Recent technology changes focus on higher scale: better resource usage and control, lower MTTR, higher security, redundancy, fault tolerance.
  • 31. The big data industry is changing. HADOOP influences the face of the whole BIG DATA INDUSTRY.
  • 32. New concepts: DATA LAKE. Take as much data about your business processes as you can. The more data you have, the more value you can get from it.
  • 33. New concepts: ENTERPRISE DATA HUB. Don't ruin your existing data warehouse. Just extend it with new, centralized big data storage through a data migration solution.
  • 34. Trends Big data is getting BIGGER ● SSDs are going to be widely used as storage, and a memory-based replica is no longer a miracle. ● Memory and SSD based caching schemes are going to be more and more aggressive, particularly in HDFS and HBase. ● Clusters grow. Currently some open source features are targeted at clusters of 1K nodes. How about a 300-node staging cluster in companies like eBay? ● Production clusters go beyond 4000 nodes (up to 10K). A node fails nearly every day.
  • 35. Trends HARDWARE IS GETTING CHEAPER ● A typical node is expected to include at least 64 GB of memory. ● Storage starts from 4 x 2 TB drives. 8-16 x 4 TB drives are not so rare. This is for a general 'workload' node. ● 10 or more CPU cores. 2 CPUs is the normal approach. ● SSD is starting to be widely used not only for the OS and caching but for the data itself. ● Main outcome: the per-node cost model is changing.
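The drive figures on slide 35 translate into usable capacity only after replication is taken into account. The small calculation below is a sketch that assumes three replicas per block (the slides only say "several") and ignores space reserved for the OS and temporary data; `usable_tb` is a hypothetical helper.

```python
def usable_tb(drives, drive_tb, replication=3):
    """Usable capacity of one node when every block is stored `replication` times."""
    raw_tb = drives * drive_tb
    return raw_tb / replication


# Entry-level node from the slide: 4 x 2 TB drives (8 TB raw).
small_node = usable_tb(4, 2)
# Larger node from the slide: 16 x 4 TB drives (64 TB raw).
large_node = usable_tb(16, 4)
```

Under these assumptions the entry-level node contributes under 3 TB of unique data and the larger one a little over 21 TB.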
  • 36. Most important concept ● You need to limit the things you are guessing.
  • 37. For whom the bell tolls? Old way: ● Make assumptions about the data you need. ● Make assumptions about the data model. ● Make assumptions about the algorithms you need. ● Get confirmation of your initial guess about the result. Are you surprised? New way: ● Get as much data as you can. ● Detect the data model using a set of algorithms and an extensive approach. ● Cluster your data, detect correlations, clean out anomalies... in every way you can afford, on the whole data set. ● Get grounded results. You can still miss some fundamental aspects, but isn't it much better in any case?
  • 38. Major Hadoop distributions ● Hortonworks is 'purely open source'. Innovative, but 'running too fast'. Most of their key technologies are not so mature yet. ● Cloudera is stable enough but not stale. Hadoop 2.3 with YARN, HBase 0.96.x. Balance. ● MapR focuses on performance per node, but they are slightly outdated in terms of functionality and their distribution comes at a cost. For cases where node performance is a high priority. ● Intel is a newcomer to this market. Not an option for the near future.