Hadoop simplifies your job as a Data Warehousing professional. With Hadoop, you can manage any volume, variety and velocity of data reliably and in comparatively less time. As a Data Warehousing professional, you already have troubleshooting and data-processing skills, and these skills are sufficient for you to become a proficient Hadoop-er.
Key Questions Answered
What is Big Data and Hadoop?
What are the limitations of current Data Warehouse solutions?
How does Hadoop solve these problems?
What does a real-world Hadoop use case in a Data Warehouse solution look like?
1. Slide 1
Hadoop for Data Warehousing
professionals
View Complete Course at: www.edureka.in/hadoop
Post your Questions on Twitter @edurekaIN: #askEdureka
2. Slide 2
Objectives of this Session
• What is Big Data
• Traditional Data Warehousing Solutions
• Problems with Traditional DWH Solutions
• Data Warehousing and Hadoop
• Hadoop for Big Data
• Hadoop and MapReduce
• Why Hadoop for Big Data?
• Real-world Use Case – Sears Holding Corp.
• Where to use what
For Queries during the session and class recording:
Post on Twitter @edurekaIN: #askEdureka
Post on Facebook /edurekaIN
3. Slide 3
Big Data
Lots of Data (Terabytes or Petabytes)
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
[Word cloud: cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing, mobile, Big Data]
4. Slide 4
Unstructured Data is Exploding
2,500 exabytes of new information were created in 2012, with the Internet as the primary driver.
“Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 zettabytes” this year.
5. Slide 5
Big Data Scenarios : Hospital Care
Hospitals are analyzing medical data and patient records to predict which patients are likely to need readmission within a few months of discharge. The hospital can then intervene in the hope of preventing another costly hospital stay.
A medical diagnostics company analyzed millions of lines of data to develop the first non-intrusive test for predicting coronary artery disease. To do so, researchers at the company analyzed over 100 million gene samples to ultimately identify the 23 primary predictive genes for coronary artery disease.
6. Slide 6
Big Data Scenarios: Amazon.com
Amazon has an unrivalled bank of data on online consumer purchasing behaviour that it can mine from its 152 million customer accounts.
Amazon also uses Big Data to monitor, track and secure the 1.5 billion items in its retail store that are lying around its 200 fulfilment centres around the world. Amazon stores the product catalogue data in S3.
S3 can write, read and delete objects of up to 5 TB each. The catalogue stored in S3 receives more than 50 million updates a week, and every 30 minutes all data received is crunched and reported back to the different warehouses and the website.
7. Slide 7
Big Data Scenarios: Netflix
Netflix uses 1 petabyte of storage for the videos it streams.
BitTorrent Sync has transferred over 30 petabytes of data since its pre-alpha release in January 2013.
The 2009 movie Avatar is reported to have taken over 1 petabyte of local storage at Weta Digital for the rendering of the 3D CGI effects.
One petabyte of average MP3-encoded songs (for mobile, roughly one megabyte per minute) would take about 2,000 years to play.
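A quick back-of-the-envelope check on that last figure: 1 PB is roughly 10^9 MB, so at about 1 MB per minute of audio that is about 10^9 minutes of playback, and 10^9 / (60 × 24 × 365) ≈ 1,900 years, which is consistent with the ~2,000 years quoted above.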
8. Slide 8
IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
IBM characterizes Big Data along three dimensions: Volume, Velocity and Variety.
Typical sources include web logs, images, videos, audio and sensor data.
9. Slide 9
Traditional Data Warehousing Solutions
Data sources (CRM, OLTP, SCM, Legacy ERP) feed an Extract, Transform, Load layer that runs as batch processing; the ETL output is loaded into the Data Warehouse, which serves BI and Analytics.
10. Slide 10
Problems with Traditional DWH Solutions
Increasing data volumes, new data sources and types, and the need for real-time data processing.
The data now spans transactions, OLTP and OLAP as well as email and documents, social media, web logs, and machine/device (scientific) data.
11. Slide 11
Data Warehousing and Hadoop
Data sources such as sensor data and web logs are ingested into Hadoop via file copy or streaming; Hadoop performs the Extract and Transform steps, and the results are loaded into the DW for query and presentation.
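To make the “file copy” ingestion step concrete, here is a minimal sketch using the Hadoop FileSystem Java API. The class name and the local/HDFS paths are illustrative, and the NameNode address is assumed to come from the usual core-site.xml configuration.

```java
// Minimal sketch: copy raw web logs from the local filesystem into HDFS
// so that a downstream ETL job can process them.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadWebLogs {
  public static void main(String[] args) throws Exception {
    // Reads fs.defaultFS (the NameNode address) from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Copy a local directory of raw web logs into an HDFS landing directory
    fs.copyFromLocalFile(new Path("/var/log/weblogs"),    // local source (illustrative)
                         new Path("/data/raw/weblogs"));  // HDFS destination (illustrative)
    fs.close();
  }
}
```

The same step is often done from the command line with `hdfs dfs -put /var/log/weblogs /data/raw/weblogs`, or with tools such as Flume when the data arrives as a stream.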
12. Slide 12
Hadoop for Big Data
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
13. Slide 13
Hadoop and MapReduce
Hadoop is a system for large-scale data processing. It has two main components:
HDFS – Hadoop Distributed File System (Storage)
• Highly fault-tolerant
• High-throughput access to application data
• Suitable for applications that have large data sets
• Natively redundant
MapReduce (Processing)
• A software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner
• Splits a task across processors
• Works on key-value pairs
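To illustrate the key-value programming model, below is a minimal WordCount sketch written against the classic Hadoop MapReduce Java API (org.apache.hadoop.mapreduce): the mapper emits (word, 1) pairs for every word it sees, and the reducer sums the counts per word. The input and output paths are supplied on the command line and are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /data/raw/weblogs
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /data/out/wordcount
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this is typically packaged into a jar and launched with something like `hadoop jar wordcount.jar WordCount /data/raw/weblogs /data/out/wordcount` (paths assumed for the example).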
14. Slide 14
Hadoop vs. RDBMS
Many enterprises are turning to Hadoop, especially for applications generating big data: web applications, social networks and scientific applications.
Hadoop is built around the MapReduce computing paradigm, whereas traditional database systems are built around RDBMS concepts.
15. Slide 15
Why Hadoop for Big Data?
Hadoop offers:
• Scalability (petabytes of data, thousands of machines)
• Flexibility in accepting all data formats (no schema)
• An efficient and simple fault-tolerance mechanism
• Commodity, inexpensive hardware
Traditional databases offer:
• Performance (tons of indexing, tuning and data organization techniques)
• Features such as provenance tracking, annotation management, …
16. Slide 16
Real-world Use Case – Sears Holding Corp.
Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process its customer activity and sales data. Sears then incorporated Hadoop into its Data Warehousing solution.
• Insight into data can provide business advantage.
• Some key early indicators can mean fortunes to a business.
• More precise analysis is possible with more data.
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
17. Slide 17
Limitations of Existing Data Analytics Architecture
The existing pipeline: Instrumentation → Collection → storage-only grid holding the original raw data (mostly append) → ETL compute grid → RDBMS (aggregated data) → BI reports + interactive apps.
90% of the ~2 PB ends up in archived storage; a meagre 10% of the ~2 PB of data is available for BI.
1. Can’t explore original high-fidelity raw data
2. Moving data to compute doesn’t scale
3. Premature data death
18. Slide 18
Solution: Data Warehousing with Hadoop
Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% that was available with the existing non-Hadoop solutions.
The new pipeline: Instrumentation → Collection → Hadoop as a combined storage + compute grid (mostly append; both storage and processing; no data archiving; the entire ~2 PB of data is available for processing) → RDBMS (aggregated data) → BI reports + interactive apps.
1. Data exploration & advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever
20. Slide 20
Where to use what?
Requirement (best met by either the Data Warehouse or Hadoop):
• Low latency, interactive reports, and OLAP
• ANSI 2003 SQL compliance is required
• Preprocessing or exploration of raw unstructured data
• Online archive as an alternative to tape
• High-quality, cleansed and consistent data
• 100s to 1000s of concurrent users
• Discovering unknown relationships in the data
• Parallel complex processing logic
• CPU-intensive analysis
• Systems, users and data governance
• Many flexible programming languages running in parallel
• Unrestricted, ungoverned sandbox explorations
• Analysis of provisional data
• Extensive security and regulatory compliance
• Real-time data loading and 1-second tactical queries
http://www.teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which
21. Slide 21
Questions?
Buy Complete Course at : www.edureka.in/hadoop
Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for Questions