Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3
- 1. We’ll get started soon…
Q&A box is available for your questions
Webinar will be recorded for future viewing
Thank you for joining!
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
- 2. Deliver the Data Lake (demo/deep dive)
…using HDP and Red Hat JBoss Data Virtualization
We do Hadoop.
- 3. Your speakers…
Raghu Thiagarajan, Dir, Partner Product Management, Hortonworks
Kimberly Palko, Principal Product Manager, Red Hat
Kenny Peeples, Principal Technical Marketing Manager, Red Hat
- 4. An architectural shift towards an HDP Data Lake
Unlocking the Data Lake
[Diagram: SCALE and SCOPE axes positioning RDBMS, MPP, EDW, and the Data Lake (enabled by YARN)]
• Single data repository, shared infrastructure
• Multiple biz apps accessing all the data
• Enable a shift from reactive to proactive interactions
• Gain new insight across the entire enterprise
New Analytic Apps or IT Optimization
HDP 2.1: Governance & Integration, Security, Operations, Data Access, YARN, Data Management
- 5. What is a Data Lake?
Architectural Pattern in the Data Center
Uses Hadoop to deliver deeper insight across a large, broad, diverse set of data efficiently
• Multipurpose, Open PLATFORM for Data (NOT a database)
• Land all data in a single place and interact with it in many ways
• Allows for the ecosystem to provide higher level services (SAS, SAP, Microsoft for Streaming, MPP, In-memory, etc.)
• First class data management capabilities (metadata management, security, transformation pipelines, replication, retention, etc.)
- 6. HDP Data Lake Solution Architecture
[Diagram: Data Lake HDP Grid]
Manage Steps 1-4: Data Management with Falcon, Security with HDP Advanced Security
Step 1: Extract & Load
Step 2: Model/Apply Metadata (HCATALOG: table & user-defined metadata)
Step 3: Transform, Aggregate & Materialize (HIVE, PIG, Cascading)
Step 4: Schedule and Orchestrate
SOURCE DATA: Click Stream, Sales Transactions, Product Data, Marketing/Inventory, Social Data, EDW
Ingestion: SQOOP, FLUME, Web HDFS, NFS, REST, HTTP, Streaming, STORM, JMS
Grid: YARN, AMBARI, compute & storage
FALCON (Data pipeline & flow management)
Apache Argus (Unified Access Controls and Audit)
Data processing: MR2, TEZ, Mahout, Graph, SAS, SolR, Storm
Use Case Type 1: Materialize & Exchange (HBase Client, Sqoop/Hive) to Downstream Data Sources: OLTP, HBase, EDW (Teradata)
Use Case Type 2: Explore/Visualize with Interactive Hive Server (Tez/Stinger) and Query/Analytics/Reporting Tools: Tableau, Excel, Microstrategy, Datameer, Platfora, Business Objects
YARN Apps (Stream Processing, Real-time Search, MPI, etc.): Opens up Many New Use Cases
- 7. HDP Data Lake Solution Architecture + Virtual Data Mart
[Diagram: the HDP Data Lake solution architecture from slide 6, repeated, with a virtual data mart layer added: Dept Base Virtual Database (VDB), Team 1 VDB, Team 2 VDB, View1, View2]
- 8. Yarn allows for new processing engines
[Diagram: the HDP Data Lake solution architecture from slide 6, repeated]
- 9. Falcon enables Governance of Data Pipelines
[Diagram: the HDP Data Lake solution architecture from slide 6, repeated]
- 10. Apache Falcon: Data Governance in the Lake
Falcon adds the required data governance features
Data pipeline (Raw, Clean, Prep), defined in Falcon
Auto generate & orchestrate:
• Multiple complex Oozie workflows (Job1, Job2, ... JobN)
• Other Hadoop ecosystem tools, e.g. DistCp
DEFINITION
Replication | Retention
Eviction | Late data
MONITORING
TRACING
Audit | Lineage
Tagging
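The retention, eviction and late-data items above can be illustrated with a small sketch. This is not Falcon's API: the function names, the 90-day limit and the 6-hour cut-off used in the usage note are hypothetical, and the code only shows the kind of rule a feed definition expresses.

```python
from datetime import datetime, timedelta


def split_by_retention(instances, retention, now):
    """Split feed instance timestamps into (kept, evicted).

    Hypothetical helper: an instance older than `now - retention`
    is treated as eligible for eviction.
    """
    cutoff = now - retention
    kept = [t for t in instances if t >= cutoff]
    evicted = [t for t in instances if t < cutoff]
    return kept, evicted


def is_acceptable_late_arrival(expected_at, arrived_at, cut_off):
    """True if data arrived after it was expected but within the cut-off."""
    return expected_at < arrived_at <= expected_at + cut_off
```

For example, with a 90-day retention evaluated on 2014-09-01, an instance from 2014-05-01 would be evicted while one from 2014-08-15 would be kept.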
- 11-16. Mashing up diverse data types in the Data Lake
- 17. Virtual Data Marts with Red Hat JBoss
Data Virtualization and Hortonworks HDP
Kimberly Palko
- 18. Data Supply and Integration Solution
Data Virtualization sits in front of multiple data sources and
• allows them to be treated as a single source
• delivering the desired data
• in the required form
• at the right time
• to any application and/or user.
THINK VIRTUAL MACHINE FOR DATA
- 19. Easy Access to Big Data
• Reporting tool accesses the data virtualization server via rich SQL dialect
• The data virtualization server translates rich SQL dialect to HiveQL
• Hive translates HiveQL to MapReduce
• MapReduce runs MR job on big data
[Diagram: Analytical Reporting Tool -> Data Virtualization Server -> Hive -> MapReduce -> HDFS (Hadoop, Big Data)]
- 20. Use Case 1: Combine data from
Hadoop with traditional data
sources
Problem:
Data from new data sources like social media,
clickstream and sensors needs to be combined
with data from traditional sources to get the full
value.
Solution:
Leverage JBoss Data Virtualization to mashup
new data in Hadoop with data in traditional data
sources without moving or copying any data and
access it through a variety of BI tools and SOA
technologies.
[Diagram: JBoss Data Virtualization (Consume, Compose, Connect) over two sources. Data can be accessed by multiple tools and methods already in-house.]
SOURCE 1: Hive/Hadoop contains data from new data sources like social media, clickstream and sensor data
SOURCE 2: Traditional relational databases in the enterprise
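A minimal sketch of the idea behind Use Case 1, under loud assumptions: two separate in-memory SQLite databases stand in for Hive/Hadoop and for a traditional relational database, and a temporary view stands in for a virtual view. The table names, columns and sample rows are hypothetical, and nothing here uses JBoss Data Virtualization itself.

```python
import sqlite3

# Stand-ins only: the main database plays the relational source and the
# attached database plays the Hive/Hadoop source.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS hive")

conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ana"), (2, "Bo")],
)

conn.execute("CREATE TABLE hive.clicks (customer_id INTEGER, page TEXT)")
conn.executemany(
    "INSERT INTO hive.clicks VALUES (?, ?)",
    [(1, "/home"), (1, "/cart"), (2, "/home")],
)

# The view joins both sources at query time; no rows are copied from one
# database into the other. SQLite requires a TEMP view for cross-database
# references.
conn.execute(
    """
    CREATE TEMP VIEW customer_clicks AS
    SELECT c.name AS name, COUNT(k.page) AS clicks
    FROM main.customers AS c
    JOIN hive.clicks AS k ON k.customer_id = c.customer_id
    GROUP BY c.name
    """
)


def clicks_per_customer():
    """Query the combined view the way a reporting tool would."""
    return conn.execute(
        "SELECT name, clicks FROM customer_clicks ORDER BY name"
    ).fetchall()
```

A consumer only sees `customer_clicks`; which source each column came from is hidden behind the view.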
- 21. Use Case 2: Federating across
Geographically Distributed
Hadoop Clusters
Problem:
Geographically distributed Hadoop clusters contain
sensitive data like patient records or customer
identification that cannot be accessed by other
regions due to regulatory policy. IT needs access to
all data, but users can only access the data in their
region.
Solution:
Leverage JBoss Data Virtualization to provide Row
Level Security and Masking of columns while
federating across Hadoop clusters.
[Diagram: JBoss Data Virtualization (Consume, Compose, Connect) over two Hive sources. Data can be accessed by multiple tools and methods already in-house.]
Hive: Hadoop cluster in one geographic region
Hive: Hadoop cluster in a second geographic region
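Row-level security and column masking can be sketched as a filter applied to federated rows before they reach the user. Everything below is an assumption for illustration: the `User` type, the sample rows, and the policy (IT sees all rows unmasked, regional users see only their own region with `patient_id` masked). It is not the product's configuration or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    region: str
    is_it: bool = False


# Hypothetical rows as they might look after federating two regional clusters.
ROWS = [
    {"region": "EU", "patient_id": "P-1001", "diagnosis": "flu"},
    {"region": "US", "patient_id": "P-2002", "diagnosis": "cold"},
]


def mask(value: str) -> str:
    """Hide all but the last two characters of a value."""
    return "*" * (len(value) - 2) + value[-2:]


def query(user, rows=ROWS):
    """Apply the assumed row filter and column mask for this user."""
    result = []
    for row in rows:
        # Row-level security: non-IT users only see rows from their region.
        if not user.is_it and row["region"] != user.region:
            continue
        visible = dict(row)
        # Column masking: non-IT users get a masked identifier.
        if not user.is_it:
            visible["patient_id"] = mask(row["patient_id"])
        result.append(visible)
    return result
```

With this policy, an EU analyst gets one row with `patient_id` shown as `****01`, while an IT user gets both rows unchanged.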
- 22. Data for entire organization in Hadoop Data Lake
Problem: How does IT control access and give business users just the
data they need?
- Does every line of business have access to everyone’s data?
- How do business users get access to the data they need in a
simple (even self-service) way?
[Diagram: Hadoop Data Lake containing HR Employee Files, Server Logs, Marketing Clickstream Data, Finance Expense Reports, Sales Transactions, Customer Accounts, Twitter Sentiment Data]
- 23. Secure, Self-Service Virtual Data Marts for Hadoop
Solution: Use JBoss Data Virtualization to create virtual data marts
on top of a Hadoop cluster
- Lines of Business get access to the data they need in a simple manner
- IT maintains the process and control it needs
- All data remains in the data lake, nothing is copied or moved
[Diagram labels: Marketing, Finance, IT, Sales]
[Diagram: Hadoop Data Lake containing Marketing Clickstream Data, HR Employee Files, Sales Transactions, Finance Expense Reports, Customer Accounts, Twitter Sentiment Data, Server Logs]
- 24. Optional hierarchical data architectures with virtual data mart
Can be combined with security features like user role access and row and
column masking
[Diagram: Dept Base Virtual Database (VDB); Team 1 VDB; Team 2 VDB; View1; View2]
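The layering described on this slide can be sketched with plain SQL views, assuming SQLite as a stand-in for the virtual database layer. The `sales` table, its columns and the view names are hypothetical; the point is only that team-level views are defined on a department base view rather than on the raw data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw data, standing in for a table in the data lake.
conn.execute(
    "CREATE TABLE sales (team TEXT, customer TEXT, amount REAL, card_number TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("team1", "Ana", 120.0, "4111111111111111"),
        ("team2", "Bo", 80.0, "5500000000000004"),
    ],
)

# Department base layer: drops the sensitive column for everyone above it.
conn.execute(
    "CREATE VIEW dept_base AS SELECT team, customer, amount FROM sales"
)

# Team layers: each one narrows the base layer to that team's rows.
conn.execute(
    "CREATE VIEW team1_view AS "
    "SELECT customer, amount FROM dept_base WHERE team = 'team1'"
)
conn.execute(
    "CREATE VIEW team2_view AS "
    "SELECT customer, amount FROM dept_base WHERE team = 'team2'"
)

ALLOWED_VIEWS = {"team1_view", "team2_view"}


def team_rows(view):
    """Read one team view; the name is checked against a fixed allow-list."""
    if view not in ALLOWED_VIEWS:
        raise ValueError(f"unknown view: {view}")
    return conn.execute(f"SELECT customer, amount FROM {view}").fetchall()
```

Because the team views are built on `dept_base`, a restriction added at the base layer applies to every team view without touching them.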
- 25. Want most recent data in an operational data store
Problem: All the legacy and archived data is in the Hadoop data lake.
We want to access the most recent, up to the minute, operational data
often and quickly.
[Diagram: Hadoop Data Lake (Historical Data) containing Marketing Clickstream Data, Finance Expense Reports, HR Employee Files, Server Logs, Sales Transactions, Customer Accounts, Twitter Sentiment Data]
- 26. Caching For Faster Performance – Materialized View
[Diagram: Query 1 and Query 2 against a Virtual Database (VDB); View 1 is cached or materialized]
• Same cached view for multiple queries
• Refreshed automatically or manually
• Cache repository can be any supported data source
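The three bullets above can be sketched as a small cache wrapper. This is an illustration under assumptions, not the product's caching mechanism: the `MaterializedView` class, its `ttl_seconds` parameter and the `loads` counter are hypothetical names, and an in-process list stands in for the cache repository.

```python
import time


class MaterializedView:
    """Cache the result of an expensive query and share it across callers."""

    def __init__(self, load, ttl_seconds=None, clock=time.monotonic):
        self._load = load          # callable that runs the underlying query
        self._ttl = ttl_seconds    # None means manual refresh only
        self._clock = clock
        self._rows = None
        self._loaded_at = None
        self.loads = 0             # how many times the source was queried

    def refresh(self):
        """Manual refresh: re-run the underlying query now."""
        self._rows = self._load()
        self._loaded_at = self._clock()
        self.loads += 1

    def rows(self):
        """Serve cached rows, refreshing automatically once the TTL expires."""
        expired = (
            self._ttl is not None
            and self._loaded_at is not None
            and self._clock() - self._loaded_at >= self._ttl
        )
        if self._rows is None or expired:
            self.refresh()
        return self._rows
```

Two queries that call `rows()` within the TTL hit the source once; a later call after expiry, or an explicit `refresh()`, reloads it.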
- 27. Want most recent data in an operational data store
Solution: Use JBoss Data Virtualization to integrate up-to-the-minute data from
multiple diverse data sources that can be quickly queried.
- Use HDP for all data older than today.
- Use JDV to materialize the data in HDP for faster access and to combine with the operational VDB.
[Diagram: Materialized View; Operational VDB with up to the minute data; Historical Data]
[Diagram: Hadoop Data Lake containing Marketing Clickstream Data, HR Employee Files, Finance Expense Reports, Server Logs, Sales Transactions, Customer Accounts, Twitter Sentiment Data]
Nightly Transfer from Data Sources
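The split between historical and operational data can be sketched as a combined view that takes rows older than today from one source and today's rows from the other. The function name, the `day` field and the row shape are hypothetical; this only illustrates the "older than today" rule from the slide.

```python
from datetime import date


def combined_view(historical_rows, operational_rows, today):
    """Merge historical rows (before today) with operational rows (today on).

    Each row is a dict with a `day` key holding a datetime.date.
    """
    older = [r for r in historical_rows if r["day"] < today]
    current = [r for r in operational_rows if r["day"] >= today]
    return sorted(older + current, key=lambda r: r["day"])
```

Filtering both sides on the same boundary keeps a row from being counted twice if the nightly transfer has already copied it into the historical store.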
- 28. Demonstration
Virtual Data Marts with Hadoop Data Lake
Kenny Peeples
- 29. Use Case 3 - Overview
Objective:
- Purpose-oriented data views for functional teams over a rich variety of semi-structured and structured data
Problem:
- Data Lakes have large volumes of consolidated clickstream data, product and customer data that need to be constrained for multi-departmental use.
Solution:
- Leverage HDP to mash up Clickstream analysis data with product and customer data on HDP to answer
- Leverage JBoss Data Virt to provide Virtual data marts for each of Marketing and Product teams to …..
- 30. Use Case 3 - Architecture
[Diagram]
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
VIRTUAL DATA MART
DATA SYSTEM: HDP 2.1 (Governance & Integration, Security, Operations, Data Access, Data Management)
SOURCES: Emerging Sources (Sensor, Sentiment, Geo, Unstructured); Existing Sources (CRM, ERP, Clickstream, Logs)
- 31. Use Case 3 - Resources
• GUIDE
How to guide: https://github.com/DataVirtualizationByExample/HortonworksUseCase3
Tutorial: Available soon
• VIDEOS:
http://vimeo.com/user16928011/hwxuc3configuration
http://vimeo.com/user16928011/hwxuc3run
http://vimeo.com/user16928011/hwxuc3overview
• SOURCE:
https://github.com/DataVirtualizationByExample/HortonworksUseCase3
- 32. Benefits of JBoss Data Virtualization with
Hortonworks HDP 2.1
• Creates virtual databases for controlling
access to data in a data lake while giving
lines of business the autonomy they seek
• Combines new data in Hadoop with data in
traditional data sources without moving or
copying data
• Gives access to a variety of BI and analytics
tools
• Provides caching for faster access to data
• Provides consistent security policy across
multiple data sources
- 33. Thank you!
Hortonworks and Red Hat JBoss Data Virtualization
- 34. Next Steps...
More about Red Hat & Hortonworks
http://hortonworks.com/partner/redhat
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop 2
Contact us: events@hortonworks.com
- 35. Don’t Forget to Register for our Next Webinar!
September 17th, 10 AM PST
Red Hat JBoss Data Virtualization and Hortonworks Data Platform
http://info.hortonworks.com/RedHatSeries_Hortonworks.html