Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3

We’ll get started soon…
Q&A box is available for your questions
Webinar will be recorded for future viewing
Thank you for joining!
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Deliver the Data Lake (demo/deep dive)
…using HDP and Red Hat JBoss Data Virtualization
We do Hadoop.

Your speakers…
Raghu Thiagarajan, Dir, Partner Product Management, Hortonworks
Kimberly Palko, Principal Product Manager, Red Hat
Kenny Peeples, Principal Technical Marketing Manager, Red Hat

An architectural shift towards an HDP Data Lake
Unlocking the Data Lake
[Diagram: SCALE and SCOPE axes, with RDBMS, MPP, EDW and Data Lake (enabled by YARN)]
• Single data repository, shared infrastructure
• Multiple biz apps accessing all the data
• Enable a shift from reactive to proactive interactions
• Gain new insight across the entire enterprise
New Analytic Apps or IT Optimization
HDP 2.1: Governance & Integration, Security, Operations, Data Access, YARN, Data Management

What is a Data Lake?
Architectural Pattern in the Data Center
Uses Hadoop to deliver deeper insight across a large, broad, diverse set
of data efficiently
• Multipurpose, Open PLATFORM for Data (NOT a database)
• Land all data in a single place and interact with it in many ways
• Allows for the ecosystem to provide higher level services (SAS, SAP, Microsoft for Streaming, MPP, In-memory, etc.)
• First class data management capabilities (metadata management, security, transformation pipelines, replication, retention, etc.)

HDP Data Lake Solution Architecture
Manage Steps 1-4: Data Management with Falcon, Security with HDP Advanced Security
Step 1: Extract & Load
Step 2: Model/Apply Metadata
Step 3: Transform, Aggregate & Materialize
Step 4: Schedule and Orchestrate
[Architecture diagram. Labels:]
• SOURCE DATA: Click Stream, Sales Transactions, Product Data, Marketing/Inventory, Social Data, EDW
• Ingestion: SQOOP, FLUME, NFS, Web HDFS, REST, HTTP, Streaming, STORM, JMS
• Data Lake HDP Grid: compute & storage nodes, YARN, AMBARI
• HCATALOG (table & user-defined metadata)
• Data processing: HIVE, PIG, Cascading, MR2, TEZ, Mahout, Graph, SAS, SolR, Storm
• FALCON (Data pipeline & flow management)
• Apache Argus (Unified Access Controls and Audit)
• Use Case Type 1: Materialize & Exchange. Exchange via HBase Client and Sqoop/Hive to Downstream Data Sources: OLTP, HBase, EDW (Teradata)
• Use Case Type 2: Explore/Visualize. Interactive Hive Server (Tez/Stinger) with Query/Analytics/Reporting Tools: Tableau, Excel, Microstrategy, Datameer, Platfora, Business Objects
• YARN Apps (Stream Processing, Real-time Search, MPI, etc.): Opens up Many New Use Cases

HDP Data Lake Solution Architecture + Virtual Data Mart
Manage Steps 1-4: Data Management with Falcon, Security with HDP Advanced Security
[Diagram: the HDP Data Lake solution architecture (source data, ingestion, Data Lake HDP Grid, Use Case Types 1 and 2, YARN Apps) with a virtual data mart layer added: Dept Base Virtual Database (VDB), Team 1 VDB, Team2 VDB, View1, View2]

YARN allows for new processing engines
[Diagram: the HDP Data Lake solution architecture, repeated. YARN Apps (Stream Processing, Real-time Search, MPI, etc.) open up many new use cases.]

Falcon enables Governance of Data Pipelines
[Diagram: the HDP Data Lake solution architecture, repeated.]

Apache Falcon: Data Governance in the Lake
Falcon adds the required data governance features
[Diagram labels:]
• Data pipeline: Raw, Clean, Prep
• Auto generate & orchestrate multiple complex Oozie workflows (Job1, Job2, Job3, Job4, Job6, Job7, JobN)
• Other Hadoop ecosystem tools, e.g. DistCp
DEFINITION
Replication | Retention | Eviction | Late data
MONITORING
TRACING
Audit | Lineage | Tagging
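To make the Retention and Eviction items concrete, here is a minimal Python sketch of the idea. It is not Falcon's API: the function name `evict_expired`, the paths and the 90-day limit are all hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def evict_expired(instances, retention, now):
    """Split feed instances into (kept, evicted) by age.

    instances: dict mapping an instance path to its creation time.
    retention: timedelta; instances older than this are evicted.
    """
    kept, evicted = {}, {}
    for path, created in instances.items():
        target = evicted if now - created > retention else kept
        target[path] = created
    return kept, evicted


# Hypothetical feed instances (paths and dates are made up).
now = datetime(2014, 9, 1)
instances = {
    "/data/clicks/raw/2014-05-01": datetime(2014, 5, 1),
    "/data/clicks/raw/2014-08-30": datetime(2014, 8, 30),
}
kept, evicted = evict_expired(instances, timedelta(days=90), now)
print("evicted:", sorted(evicted))
print("kept:", sorted(kept))
```

In Falcon itself such a policy is declared on the data set rather than coded by hand; the sketch only shows what the policy decides.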

Mashing up diverse data types in the Data Lake

Virtual Data Marts with Red Hat JBoss
Data Virtualization and Hortonworks HDP
Kimberly Palko

Data Supply and Integration Solution
Data Virtualization sits in front of multiple data sources and
✓ allows them to be treated as a single source
✓ delivering the desired data
✓ in the required form
✓ at the right time
✓ to any application and/or user.
THINK VIRTUAL MACHINE FOR DATA

Easy Access to Big Data
• Reporting tool accesses the data virtualization server via rich SQL dialect
• The data virtualization server translates rich SQL dialect to HiveQL
• Hive translates HiveQL to MapReduce
• MapReduce runs MR job on big data
[Diagram: Analytical Reporting Tool, Data Virtualization Server, Hive, MapReduce, HDFS (Hadoop, Big Data)]

Use Case 1: Combine data from Hadoop with traditional data sources
Problem:
Data from new data sources like social media, clickstream and sensors needs to be combined with data from traditional sources to get the full value.
Solution:
Leverage JBoss Data Virtualization to mashup new data in Hadoop with data in traditional data sources without moving or copying any data and access it through a variety of BI tools and SOA technologies.
[Diagram: JBoss Data Virtualization (Consume, Compose, Connect) over Hive and relational sources]
• Data can be accessed by multiple tools and methods already in-house
• SOURCE 1: Hive/Hadoop contains data from new data sources like social media, clickstream and sensor data
• SOURCE 2: Traditional relational databases in the enterprise
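The mashup in Use Case 1 can be sketched in a few lines of Python. This is an analogy, not JBoss Data Virtualization itself: two attached in-memory SQLite databases stand in for Hive and a relational source, and all table, column and view names are hypothetical. The point is that one view joins both sources without copying either table.

```python
import sqlite3

# Two attached in-memory databases stand in for the two sources.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS hive")  # stand-in for Hive/Hadoop
conn.execute("ATTACH DATABASE ':memory:' AS crm")   # stand-in for an RDBMS

conn.execute("CREATE TABLE hive.clicks (customer_id INTEGER, page TEXT)")
conn.execute("CREATE TABLE crm.customers (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO hive.clicks VALUES (?, ?)",
    [(1, "/pricing"), (1, "/docs"), (2, "/pricing")],
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# In SQLite only a TEMP view may reference more than one attached database.
conn.execute("""
    CREATE TEMP VIEW customer_clicks AS
    SELECT c.name AS name, COUNT(*) AS clicks
    FROM hive.clicks AS k
    JOIN crm.customers AS c ON c.customer_id = k.customer_id
    GROUP BY c.name
""")

rows = conn.execute(
    "SELECT name, clicks FROM customer_clicks ORDER BY name"
).fetchall()
print(rows)
```

A BI tool would see only `customer_clicks`, much as it would see a single virtual database rather than the underlying sources.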

Use Case 2: Federating across Geographically Distributed Hadoop Clusters
Problem:
Geographically distributed Hadoop clusters contain sensitive data like patient records or customer identification that cannot be accessed by other regions due to regulatory policy. IT needs access to all data, but users can only access the data in their region.
Solution:
Leverage JBoss Data Virtualization to provide Row Level Security and Masking of columns while federating across Hadoop clusters.
[Diagram: JBoss Data Virtualization (Consume, Compose, Connect) over two Hive sources]
• Data can be accessed by multiple tools and methods already in-house
• Hive: Hadoop cluster in one geographic region
• Hive: Hadoop cluster in a second geographic region
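The following Python sketch illustrates row-level security and column masking over federated rows. It does not use any JBoss Data Virtualization API. The function `secure_view`, the column names and the policy are assumptions: regular users see only their own region with `patient_name` masked, and the IT role sees everything.

```python
def secure_view(rows, user_region, is_it=False, masked_columns=("patient_name",)):
    """Apply row-level security and column masking to federated rows."""
    if is_it:
        # Assumed policy: the IT role sees all regions, unmasked.
        return [dict(row) for row in rows]
    visible = []
    for row in rows:
        if row["region"] != user_region:
            continue  # row-level security: hide rows from other regions
        visible.append({
            col: ("****" if col in masked_columns else value)
            for col, value in row.items()
        })
    return visible


# Rows as they might arrive from two regional clusters (made-up data).
us_cluster = [{"region": "US", "patient_name": "A. Smith", "visits": 4}]
eu_cluster = [{"region": "EU", "patient_name": "B. Jones", "visits": 3}]
federated = us_cluster + eu_cluster

eu_user_rows = secure_view(federated, "EU")
it_rows = secure_view(federated, "EU", is_it=True)
print(eu_user_rows)
print(it_rows)
```

Applying the policy in the federation layer means each regional cluster can stay unchanged while every consumer gets the same rules.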

Data for entire organization in Hadoop Data Lake
Problem: How does IT control access and give business users just the
data they need?
- Does every line of business have access to everyone’s data?
- How do business users get access to the data they need in a
simple (even self-service) way?
[Diagram: Hadoop Data Lake containing HR Employee Files, Server Logs, Marketing Clickstream Data, Finance Expense Reports, Sales Transactions, Customer Accounts, Twitter Sentiment Data]

Secure, Self-Service Virtual Data Marts for Hadoop
Solution: Use JBoss Data Virtualization to create virtual data marts
on top of a Hadoop cluster
- Lines of Business get access to the data they need in a simple manner
- IT maintains the process and control it needs
- All data remains in the data lake, nothing is copied or moved
[Diagram: virtual data marts (Marketing, Finance, IT, Sales) on top of the Hadoop Data Lake, which contains Marketing Clickstream Data, HR Employee Files, Sales Transactions, Finance Expense Reports, Customer Accounts, Twitter Sentiment Data, Server Logs]

Optional hierarchical data architectures with virtual data mart
Can be combined with security features like user role access and row and column masking
[Diagram: Dept Base Virtual Database (VDB), Team 1 VDB, Team2 VDB, View1, View2]

Want most recent data in an operational data store
Problem: All the legacy and archived data is in the Hadoop data lake.
We want to access the most recent, up to the minute, operational data
often and quickly.
[Diagram: Hadoop Data Lake (Historical Data) containing Marketing Clickstream Data, Finance Expense Reports, HR Employee Files, Server Logs, Sales Transactions, Customer Accounts, Twitter Sentiment Data]

Caching For Faster Performance – Materialized View
[Diagram: Query 1 and Query 2 against a Virtual Database (VDB); View 1 is served from a cached or materialized View 1]
• Same cached view for multiple queries
• Refreshed automatically or manually
• Cache repository can be any supported data source
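The three caching points above can be sketched in Python. This is a toy model, not how JBoss Data Virtualization implements materialization. The class `MaterializedView`, its `ttl_seconds` parameter and the sample rows are hypothetical.

```python
import time


class MaterializedView:
    """Cache the result of an expensive load and reuse it across queries."""

    def __init__(self, load, ttl_seconds, clock=time.monotonic):
        self._load = load          # callable that runs the expensive query
        self._ttl = ttl_seconds    # age after which the cache is refreshed
        self._clock = clock
        self._rows = None
        self._loaded_at = None
        self.loads = 0             # how many times the source was hit

    def refresh(self):
        """Manual refresh: reload from the source unconditionally."""
        self._rows = list(self._load())
        self._loaded_at = self._clock()
        self.loads += 1

    def rows(self):
        """Return cached rows, refreshing automatically once they expire."""
        expired = (
            self._loaded_at is None
            or self._clock() - self._loaded_at >= self._ttl
        )
        if expired:
            self.refresh()
        return self._rows


view = MaterializedView(lambda: [("Acme", 2)], ttl_seconds=3600)
query1 = view.rows()
query2 = view.rows()       # served from the same cached view
loads_after_queries = view.loads
view.refresh()             # manual refresh hits the source again
print(loads_after_queries, view.loads)
```

Both queries share one load of the source, which is the benefit the slide describes for slow back ends such as Hive.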

Want most recent data in an operational data store
Solution: Use JBoss Data Virtualization to integrate up to the minute data from
multiple diverse data sources that can be quickly queried.
- Use HDP for all data older than today.
- Use JDV to materialize the data in HDP for faster access and to combine with operational VDB
[Diagram: an Operational VDB with up to the minute data, combined with a Materialized View of Historical Data from the Hadoop Data Lake (Marketing Clickstream Data, HR Employee Files, Finance Expense Reports, Server Logs, Sales Transactions, Customer Accounts, Twitter Sentiment Data); Nightly Transfer from Data Sources]
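One way to picture the combined view is a union of the materialized historical data and the operational data. The Python sketch below uses SQLite as a stand-in for both stores. The tables `historical_mv` and `operational`, the view `all_sales` and the sample figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-ins: a materialized copy of historical data from the data lake,
# and an operational store holding today's records.
conn.execute("CREATE TABLE historical_mv (sale_date TEXT, amount REAL)")
conn.execute("CREATE TABLE operational (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO historical_mv VALUES (?, ?)",
    [("2014-08-30", 100.0), ("2014-08-31", 50.0)],
)
conn.executemany(
    "INSERT INTO operational VALUES (?, ?)",
    [("2014-09-01", 25.0)],
)

# One view presents both stores as a single table to the consumer.
conn.execute("""
    CREATE VIEW all_sales AS
    SELECT sale_date, amount, 'historical' AS origin FROM historical_mv
    UNION ALL
    SELECT sale_date, amount, 'operational' AS origin FROM operational
""")

total = conn.execute("SELECT SUM(amount) FROM all_sales").fetchone()[0]
latest = conn.execute(
    "SELECT sale_date, origin FROM all_sales ORDER BY sale_date DESC LIMIT 1"
).fetchone()
print(total, latest)
```

After the nightly transfer, today's operational rows would move into the historical side, and the view definition would stay the same.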

Demonstration
Virtual Data Marts
with
Hadoop Data Lake
Kenny Peeples

Use Case 3 - Overview
Objective:
– Purpose oriented data views for functional teams over a rich variety of semi-structured and structured data
Problem:
– Data Lakes have large volumes of consolidated clickstream data, product and customer data that need to be constrained for multi-departmental use.
Solution:
– Leverage HDP to mashup Clickstream analysis data with product and customer data on HDP to answer
– Leverage JBoss Data Virtualization to provide Virtual data marts for each of Marketing and Product teams to …..

Use Case 3 - Architecture
[Architecture diagram. Labels:]
• APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
• DATA SYSTEM: HDP 2.1 (Governance & Integration, Security, Operations, Data Access, Data Management) with VIRTUAL DATA MART
• SOURCES: Emerging Sources (Sensor, Sentiment, Geo, Unstructured), Existing Sources (CRM, ERP, Clickstream, Logs)

Use Case 3 - Resources
• GUIDE
How to guide: https://github.com/DataVirtualizationByExample/HortonworksUseCase3
Tutorial: Available soon
• VIDEOS:
http://vimeo.com/user16928011/hwxuc3configuration
http://vimeo.com/user16928011/hwxuc3run
http://vimeo.com/user16928011/hwxuc3overview
• SOURCE:
https://github.com/DataVirtualizationByExample/HortonworksUseCase3

Benefits of JBoss Data Virtualization with
Hortonworks HDP 2.1
• Creates virtual databases for controlling
access to data in a data lake while giving
lines of business the autonomy they seek
• Combines new data in Hadoop with data in
traditional data sources without moving or
copying data
• Gives access to a variety of BI and analytics
tools
• Provides caching for faster access to data
• Provides consistent security policy across
multiple data sources

Thank you!
Hortonworks and Red Hat JBoss Data Virtualization

Next Steps...
More about Red Hat & Hortonworks
http://hortonworks.com/partner/redhat
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop 2
Contact us: events@hortonworks.com

Don’t Forget to Register for our Next Webinar!
September 17th, 10 AM PST
Red Hat JBoss Data Virtualization and Hortonworks Data Platform
http://info.hortonworks.com/RedHatSeries_Hortonworks.html
