Introduction to Big Data & Hadoop

www.edureka.co/big-data-and-hadoop
CMC Contact: aparna.jaiswal@cmcltd.com
Edureka Contact: corp@edureka.co

Objectives
At the end of this session, you will understand:
• Big Data Introduction
• Use Cases of Big Data in Multiple Industry Verticals
• Hadoop and Its Eco-System
• Hadoop Architecture
• Learning Path for Developers, Administrators, Testing Professionals and Aspiring Data Scientists

Unstructured Data is Exploding
Source: Twitter

IBM’s Definition of Big Data
IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/

Annie’s Introduction
Hello there! My name is Annie.
I love quizzes and puzzles, and I am here to make you think and answer my questions.

Annie’s Question
Map each of the following to its corresponding data type:
» XML files, e-mail body
» Audio, Video, Images, Archived documents
» Data from Enterprise systems (ERP, CRM etc.)

Annie’s Answer
Ans.
» XML files, e-mail body → Semi-structured data
» Audio, Video, Images, Archived documents → Unstructured data
» Data from Enterprise systems (ERP, CRM etc.) → Structured data

Further Reading
More on Big Data
http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop?
http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop
http://www.edureka.in/blog/jobs-in-hadoop/
Big Data
http://en.wikipedia.org/wiki/Big_Data
IBM’s definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/

Common Big Data Customer Scenarios
• Web and e-tailing
» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection
• Telecommunications
» Customer Churn Prevention
» Network Performance Optimization
» Call Detail Record (CDR) Analysis
» Analysing Network to Predict Failure
http://wiki.apache.org/hadoop/PoweredBy

Common Big Data Customer Scenarios (Contd.)
• Government
» Fraud Detection and Cyber Security
» Welfare Schemes
» Justice
• Healthcare and Life Sciences
» Health Information Exchange
» Gene Sequencing
» Serialization
» Healthcare Service Quality Improvements
» Drug Safety

Common Big Data Customer Scenarios (Contd.)
• Banks and Financial services
» Modeling True Risk
» Threat Analysis
» Fraud Detection
» Trade Surveillance
» Credit Scoring and Analysis
• Retail
» Point-of-Sale Transaction Analysis
» Customer Churn Analysis
» Sentiment Analysis

Why DFS?
Read 1 TB Data
• 1 Machine (4 I/O Channels, Each Channel – 100 MB/s): 43 Minutes
• 10 Machines (4 I/O Channels each, Each Channel – 100 MB/s): 4.3 Minutes
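
The 43-minute and 4.3-minute figures can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes binary units (1 TB = 1,048,576 MB), perfectly parallel reads, and no network or coordination overhead. The function name and parameters are hypothetical, not part of any Hadoop API.

```python
MB_PER_TB = 1024 * 1024  # assumption: binary units, 1 TB = 1,048,576 MB


def read_time_minutes(data_tb: float, machines: int,
                      channels_per_machine: int = 4,
                      mb_per_s_per_channel: float = 100.0) -> float:
    """Minutes needed to read data_tb terabytes when all channels read in parallel."""
    if machines < 1 or channels_per_machine < 1 or mb_per_s_per_channel <= 0:
        raise ValueError("machines, channels and throughput must be positive")
    total_mb = data_tb * MB_PER_TB
    # Aggregate throughput grows linearly with the number of machines
    # (idealised: ignores network and coordination overhead).
    aggregate_mb_per_s = machines * channels_per_machine * mb_per_s_per_channel
    return total_mb / aggregate_mb_per_s / 60


if __name__ == "__main__":
    for machines in (1, 10):
        minutes = read_time_minutes(1, machines)
        print(f"{machines} machine(s): {minutes:.2f} minutes")
```

Under these assumptions the results are roughly 43.7 and 4.37 minutes, which the slide shows as 43 and 4.3. The point is that spreading the same data over ten machines cuts the read time by a factor of ten.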

Hadoop Cluster: A Typical Use Case
Active NameNode
• RAM: 64 GB
• Hard disk: 1 TB
• Processor: Xeon with 8 cores
• OS: 64-bit CentOS
• Power: Redundant Power Supply
Secondary NameNode
• RAM: 32 GB
• Hard disk: 1 TB
• OS: 64-bit CentOS
StandBy NameNode
• RAM: 64 GB
• Hard disk: 1 TB
• OS: 64-bit CentOS
DataNode
• RAM: 16 GB
• Hard disk: 6 x 2 TB
• Processor: Xeon with 2 cores
• Ethernet: 3 x 10 GB/s
• OS: 64-bit CentOS
DataNode
• RAM: 16 GB
• Hard disk: 6 x 2 TB
• Processor: Xeon with 2 cores
• OS: 64-bit CentOS

Hidden Treasure
• Insight into data can provide a business advantage.
• Some key early indicators can mean fortunes to a business.
• More data allows more precise analysis.
Case Study: Sears Holding Corporation
*Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process its customer activity and sales data.
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?

Limitations of Existing Data Analytics Architecture
Architecture (top to bottom):
• BI Reports + Interactive Apps
• RDBMS (Aggregated Data)
• ETL Compute Grid: Processing
• Storage only Grid (Original Raw Data): Storage, Mostly Append
• Collection
• Instrumentation
Data availability:
• A meagre 10% of the ~2PB data is available for BI.
• 90% of the ~2PB is archived.
Limitations:
1. Can’t explore original high fidelity raw data
2. Moving data to compute doesn’t scale
3. Premature data death

Solution: A Combined Storage and Compute Layer
Architecture (top to bottom):
• BI Reports + Interactive Apps
• RDBMS (Aggregated Data)
• Hadoop: Storage + Compute Grid (both Storage and Processing), Mostly Append
• Collection
• Instrumentation
Data availability:
• Entire ~2PB data is available for processing.
• No data archiving.
Benefits:
1. Data exploration & advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever
*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with its existing non-Hadoop solutions.

Annie’s Question
Hadoop is a framework that allows for the distributed
processing of:
» Small Data Sets
» Large Data Sets

Annie’s Answer
Ans. Large Data Sets.
Hadoop can also process small data sets. However, to
experience its true power, one needs data in the terabyte
range. This is where an RDBMS can take hours or fail,
whereas Hadoop can finish the same work in a couple of minutes.
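
To make "distributed processing" more concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps that Hadoop's MapReduce framework spreads across a cluster. It is not Hadoop's API. The function names are hypothetical, and on a real cluster each phase would run on many machines over blocks of a much larger data set.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, as happens between the map and reduce phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence inside one process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

Each call to `map_phase` and `reduce_phase` depends only on its own input. That independence is what lets a framework run the calls in parallel on different nodes.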

Hadoop Ecosystem (Hadoop 2.0)
• Mahout (Machine Learning)
• Pig Latin (Data Analysis)
• Hive (DW System)
• HBase
• MapReduce Framework
• Other YARN Frameworks (MPI, GRAPH)
• Apache Oozie (Workflow)
• YARN (Cluster Resource Management)
• HDFS (Hadoop Distributed File System)
• Flume (Unstructured or Semi-structured Data)
• Sqoop (Structured Data)

Hadoop Cluster: Facebook
Facebook
• We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
• Currently we have 2 major clusters:
» A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
» A 300-machine cluster with 2400 cores and about 3 PB raw storage.
» Each (commodity) node has 8 cores and 12 TB of storage.
» We are heavy users of both streaming as well as the Java APIs. We have built a higher level data warehousing framework using these features called Hive (see http://Hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.

YARN – Moving beyond MapReduce
• BATCH (MapReduce)
• INTERACTIVE (Tez)
• ONLINE (HBase)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• OTHER (Search, Weave, …)
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html

Hadoop Cluster Modes
Hadoop can run in any of the following three modes:
Standalone (or Local) Mode
• No daemons, everything runs in a single JVM.
• Suitable for running MapReduce programs during development.
• Has no DFS.
Pseudo-Distributed Mode
• Hadoop daemons run on the local machine.
Fully-Distributed Mode
• Hadoop daemons run on a cluster of machines.
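
In practice, the mode is selected through Hadoop's `*-site.xml` configuration files. The sketch below is a hypothetical helper, not part of Hadoop, that renders such a file from a dictionary. The property values shown (`fs.defaultFS` pointing at `hdfs://localhost:9000` and `dfs.replication` set to `1`) are assumed example settings for a pseudo-distributed setup on one machine. Adjust them for a real installation.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Assumed example values for a pseudo-distributed (single machine) setup.
PSEUDO_DISTRIBUTED_CORE_SITE = {"fs.defaultFS": "hdfs://localhost:9000"}
PSEUDO_DISTRIBUTED_HDFS_SITE = {"dfs.replication": "1"}


def render_site_xml(properties: Dict[str, str]) -> str:
    """Render properties in the <configuration>/<property> layout of Hadoop site files."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_site_xml(PSEUDO_DISTRIBUTED_CORE_SITE))  # core-site.xml content
    print(render_site_xml(PSEUDO_DISTRIBUTED_HDFS_SITE))  # hdfs-site.xml content
```

In standalone mode no such overrides are needed, because Hadoop's defaults use the local file system. A fully distributed cluster would point `fs.defaultFS` at the NameNode host instead of `localhost`.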

Big Data Learning Path
Developer/Testing
• Java / Python / Ruby
• Hadoop Eco-system
• NoSQL DB
• Spark
Administration
• Linux Administration
• Cluster Management
• Cluster Performance
• Virtualization
Data Analyst
• Statistics Skills
• Machine Learning
• Hadoop Essentials
• Expertise in R
Courses
• Big Data and Hadoop
• MapReduce Design Patterns
• Apache Spark & Scala
• Apache Cassandra
• Linux Administration
• Hadoop Administration
• Data Science
• Business Analytics Using R
• Advanced Predictive Modelling in R
• Talend for Big Data
• Data Visualization Using Tableau

Learning Path to Certification
Course
• LIVE Online Class
• Class Recording in LMS
• 24/7 Post Class Support
• Module Wise Quiz and Assignment
• Project Work
1. Assistance from Peers and Support team
2. Review for Certification
• Verifiable Certificate

Further Reading
• Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
• Apache Hadoop HDFS Architecture
http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/

Assignment
Referring to the documents in the LMS under Assignment, solve the problem below.
How many such DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?
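
One way to approach the assignment is to invert the read-time arithmetic from the "Why DFS?" slides. The sketch below assumes that "such DataNodes" means machines with 4 I/O channels at 100 MB/s each, that 1 TB = 1,048,576 MB, and that reads are perfectly parallel. The LMS documents may specify different figures. The function name and parameters are hypothetical.

```python
import math

MB_PER_TB = 1024 * 1024  # assumption: binary units, 1 TB = 1,048,576 MB


def datanodes_needed(data_tb: float, minutes: float,
                     channels_per_node: int = 4,
                     mb_per_s_per_channel: float = 100.0) -> int:
    """Smallest number of nodes that can read data_tb terabytes within the time limit."""
    if data_tb <= 0 or minutes <= 0:
        raise ValueError("data size and time limit must be positive")
    if channels_per_node < 1 or mb_per_s_per_channel <= 0:
        raise ValueError("channels and throughput must be positive")
    total_mb = data_tb * MB_PER_TB
    # Data one node can read within the time limit (idealised, no overhead).
    mb_per_node = channels_per_node * mb_per_s_per_channel * minutes * 60
    return math.ceil(total_mb / mb_per_node)


if __name__ == "__main__":
    print(datanodes_needed(100, 5))
```

The result is rounded up because a fraction of a node cannot be provisioned. A real cluster would need extra headroom for replication and network overhead, which this sketch ignores.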

Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make the course better!
Please spare a few minutes to take the survey after the webinar.
