Hadoop simplifies your job as a Data Warehousing professional. With Hadoop, you can manage any volume, variety and velocity of data reliably and in comparatively less time. As a Data Warehousing professional, you already have troubleshooting and data-processing skills, and these skills are sufficient for you to become a proficient Hadoop-er.
Key Questions Answered
What is Big Data and Hadoop?
What are the limitations of current Data Warehouse solutions?
How does Hadoop solve these problems?
What does a real-world Hadoop use case in a Data Warehouse solution look like?
1. Slide 1
Hadoop for Data Warehousing
professionals
View Complete Course at: www.edureka.in/hadoop
Post your Questions on Twitter @edurekaIN: #askEdureka
2. Slide 2
Objectives of this Session
• What is Big Data
• Traditional Data Warehousing Solutions
• Problems with Traditional DWH Solutions
• Data Warehousing and Hadoop
• Hadoop for Big Data
• Hadoop and MapReduce
• Why Hadoop for Big Data?
• Real-world Use Case – Sears Holding Corp.
• Where to use what
For Queries during the session and class recording:
Post on Twitter @edurekaIN: #askEdureka
Post on Facebook /edurekaIN
3. Slide 3
Big Data
Lots of Data (Terabytes or Petabytes)
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
[Word cloud: cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing, mobile, Big Data]
4. Slide 4
Unstructured Data is Exploding
2,500 exabytes of new information were created in 2012, with the Internet as the primary driver.
“Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 zettabytes” this year.
5. Slide 5
Big Data Scenarios : Hospital Care
Hospitals are analyzing medical data and patient records to predict which patients are likely to need readmission within a few months of discharge. The hospital can then intervene in the hope of preventing another costly hospital stay.
A medical diagnostics company analyzed millions of lines of data to develop the first non-intrusive test for predicting coronary artery disease. To do so, researchers at the company analyzed over 100 million gene samples to ultimately identify the 23 primary predictive genes for coronary artery disease.
6. Slide 6
Big Data Scenarios: Amazon.com
Amazon has an unrivalled bank of data on online consumer purchasing behaviour that it can mine from its 152 million customer accounts.
Amazon also uses Big Data to monitor, track and secure the 1.5 billion items in its retail store that are lying around its 200 fulfilment centres around the world. Amazon stores the product catalogue data in S3.
S3 can write, read and delete objects of up to 5 TB each. The catalogue stored in S3 receives more than 50 million updates a week, and every 30 minutes all data received is crunched and reported back to the different warehouses and the website.
7. Slide 7
Big Data Scenarios: Netflix
Netflix uses 1 petabyte of storage for the videos it streams.
BitTorrent Sync has transferred over 30 petabytes of data since its pre-alpha release in January 2013.
The 2009 movie Avatar is reported to have taken over 1 petabyte of local storage at Weta Digital for the rendering of the 3D CGI effects.
One petabyte of average MP3-encoded songs (for mobile, roughly one megabyte per minute) would take about 2,000 years to play.
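A quick back-of-the-envelope check on that last figure: 1 PB is roughly 10^9 MB, so at about 1 MB per minute of audio that is about 10^9 minutes of playback, and 10^9 / (60 × 24 × 365) ≈ 1,900 years, which is consistent with the ~2,000 years quoted above.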
8. Slide 8
IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
IBM characterizes Big Data along three dimensions: Volume, Velocity and Variety.
Typical sources include web logs, images, videos, audio and sensor data.
9. Slide 9
Traditional Data Warehousing Solutions
Data sources (CRM, OLTP, SCM, Legacy ERP) feed an Extract, Transform, Load layer that runs as batch processing; the ETL output is loaded into the Data Warehouse, which serves BI and Analytics.
10. Slide 10
Problems with Traditional DWH Solutions
Increasing data volumes, new data sources and types, and the need for real-time data processing.
The data now spans transactions, OLTP and OLAP as well as email and documents, social media, web logs, and machine/device (scientific) data.
11. Slide 11
Data Warehousing and Hadoop
Data sources such as sensor data and web logs are ingested into Hadoop via file copy or streaming; Hadoop performs the Extract and Transform steps, and the results are loaded into the DW for query and presentation.
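To make the “file copy” ingestion step concrete, here is a minimal sketch using the Hadoop FileSystem Java API. The class name and the local/HDFS paths are illustrative, and the NameNode address is assumed to come from the usual core-site.xml configuration.

```java
// Minimal sketch: copy raw web logs from the local filesystem into HDFS
// so that a downstream ETL job can process them.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadWebLogs {
  public static void main(String[] args) throws Exception {
    // Reads fs.defaultFS (the NameNode address) from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Copy a local directory of raw web logs into an HDFS landing directory
    fs.copyFromLocalFile(new Path("/var/log/weblogs"),    // local source (illustrative)
                         new Path("/data/raw/weblogs"));  // HDFS destination (illustrative)
    fs.close();
  }
}
```

The same step is often done from the command line with `hdfs dfs -put /var/log/weblogs /data/raw/weblogs`, or with tools such as Flume when the data arrives as a stream.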
12. Slide 12
Hadoop for Big Data
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
13. Slide 13
Hadoop and MapReduce
Hadoop is a system for large-scale data processing. It has two main components:
HDFS – Hadoop Distributed File System (Storage)
• Highly fault-tolerant
• High-throughput access to application data
• Suitable for applications that have large data sets
• Natively redundant
MapReduce (Processing)
• A software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner
• Splits a task across processors
• Works on key-value pairs
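To illustrate the key-value programming model, below is a minimal WordCount sketch written against the classic Hadoop MapReduce Java API (org.apache.hadoop.mapreduce): the mapper emits (word, 1) pairs for every word it sees, and the reducer sums the counts per word. The input and output paths are supplied on the command line and are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /data/raw/weblogs
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /data/out/wordcount
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this is typically packaged into a jar and launched with something like `hadoop jar wordcount.jar WordCount /data/raw/weblogs /data/out/wordcount` (paths assumed for the example).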
14. Slide 14
Hadoop vs. RDBMS
Many enterprises are turning to Hadoop, especially for applications generating big data: web applications, social networks and scientific applications.
Hadoop is built around the MapReduce computing paradigm, whereas traditional database systems are built around RDBMS concepts.
15. Slide 15
Why Hadoop for Big Data?
Hadoop offers:
• Scalability (petabytes of data, thousands of machines)
• Flexibility in accepting all data formats (no schema)
• An efficient and simple fault-tolerance mechanism
• Commodity, inexpensive hardware
Traditional databases offer:
• Performance (tons of indexing, tuning and data organization techniques)
• Features such as provenance tracking, annotation management, …
16. Slide 16
Real-world Use Case – Sears Holding Corp.
Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process its customer activity and sales data. Sears then incorporated Hadoop into its Data Warehousing solution.
• Insight into data can provide business advantage.
• Some key early indicators can mean fortunes to a business.
• More precise analysis is possible with more data.
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
17. Slide 17
Limitations of Existing Data Analytics Architecture
The existing pipeline: Instrumentation → Collection → storage-only grid holding the original raw data (mostly append) → ETL compute grid → RDBMS (aggregated data) → BI reports + interactive apps.
90% of the ~2 PB ends up in archived storage; a meagre 10% of the ~2 PB of data is available for BI.
1. Can’t explore original high-fidelity raw data
2. Moving data to compute doesn’t scale
3. Premature data death
18. Slide 18
Solution: Data Warehousing with Hadoop
Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% that was available with the existing non-Hadoop solutions.
The new pipeline: Instrumentation → Collection → Hadoop as a combined storage + compute grid (mostly append; both storage and processing; no data archiving; the entire ~2 PB of data is available for processing) → RDBMS (aggregated data) → BI reports + interactive apps.
1. Data exploration & advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever
20. Slide 20
Where to use what?
Requirement (best met by either the Data Warehouse or Hadoop):
• Low latency, interactive reports, and OLAP
• ANSI 2003 SQL compliance is required
• Preprocessing or exploration of raw unstructured data
• Online archive as an alternative to tape
• High-quality, cleansed and consistent data
• 100s to 1000s of concurrent users
• Discovering unknown relationships in the data
• Parallel complex processing logic
• CPU-intensive analysis
• Systems, users and data governance
• Many flexible programming languages running in parallel
• Unrestricted, ungoverned sandbox explorations
• Analysis of provisional data
• Extensive security and regulatory compliance
• Real-time data loading and 1-second tactical queries
http://www.teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which
21. Slide 21
Questions?
Buy Complete Course at : www.edureka.in/hadoop
Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for Questions