Strata+Hadoop World | New York, NY | Sept 29-Oct 1, 2015
About the talk:
Data lakes represent a powerful new data architecture, providing enterprises with the scale and flexibility required for big data: unbounded storage for unbounded questions. Hadoop is the de facto standard for implementing data lakes, but significant expertise, time, and effort are still required for organizations to deliver one. Today, enterprises building their own data lakes on Hadoop are effectively implementing their own internal platforms from a collection of individual open source technologies.
The many projects provided by open source and commercial Hadoop distributions must be integrated with each other, integrated with the existing environment, and operationalized into new and existing processes. With no established best practices or standards, each organization is left to find its own way and rely on expensive, external experts. Data lake proofs of concept can take months.
This talk introduces Cask Hydrator, a new open source data lake framework included in the latest release of the Cask Data App Platform (CDAP). Hydrator is a self-service data ingestion and ETL framework with a drag-and-drop user interface and JSON-based pipeline configurations. Enforcing best practices and providing out-of-the-box functionality, Hydrator enables enterprises to build data lakes in a matter of days. Integrations are included with open source and traditional data sources, from Kafka and Flume to Oracle and Teradata. Completely open source, Cask Hydrator is highly extensible and can be easily integrated with new data sources and sinks, and extended with custom transformations and validations.
Attendees will learn about data lakes, the different approaches and architectures enterprises are utilizing, the benefits and challenges associated with them, and how Cask Hydrator can enable the rapid creation of data lakes and dramatically decrease the complexity in operationalizing them.
This session is sponsored by Cask.
Speaker Bio:
Jonathan Gray, founder and CEO of Cask, is an entrepreneur and software engineer with a background in startups, open source, and all things data. Prior to founding Cask, Jonathan was a software engineer at Facebook where he helped drive HBase engineering efforts, including Facebook Messages and several other large-scale projects, from inception to production.
An open source evangelist, Jonathan was responsible for helping build the Facebook engineering brand through developer outreach and refocusing the open source strategy of the company. Prior to Facebook, Jonathan founded Streamy.com, where he became an early adopter of Hadoop and HBase and is now a core contributor and active committer in the community.
Jonathan holds a bachelor’s degree in electrical and computer engineering from Carnegie Mellon University.
http://cask.co/
3. SIMPLE ACCESS TO POWERFUL TECHNOLOGY
Cask’s goal is to enable every developer and enterprise to quickly and easily build and run modern data applications using open source big data technologies like Hadoop.
4. PROPRIETARY & CONFIDENTIAL
Introduction to Data Lakes
James Dixon (Pentaho), Hadoop World NYC 2010 — Data Lake: "Data streams in from sources to fill the lake, and various users of the lake can come to examine, dive in, or sample."
Gartner — Data Lake: "Enterprise-wide data management platforms for analyzing disparate sources of data in its native format."
Hortonworks — Data Lake: "Collect everything, dive in anywhere, give flexible access. Maximum scale and insight with the lowest possible friction and cost."
Cloudera — Data Hub: "A centralized, unified data source that can quickly provide diverse business users with the information they need to do their jobs."
5. The Journey to Data Lakes
The journey to data lakes is not easy. Our customers are some of the most advanced users of Hadoop and have years invested in their journeys. The goal of CDAP is to provide a framework and set of abstractions to avoid the pitfalls and long timelines that plague Hadoop projects. CDAP drastically accelerates your adoption and utilization of big data.
6. Types of Water… er, Data
Raw (a.k.a. Level 0): data that has been left in its native form without any transformation.
Defined (a.k.a. Level 1): data that has a defined schema and has been wrangled and cleansed.
Refined (a.k.a. Level 2): data that has been aggregated from the source records, like counts or models.
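The three levels above can be sketched in a few lines of code. This is an illustrative example, not CDAP or Hydrator code; the log format and field names are invented for the sketch:

```python
import json
from collections import Counter

# Level 0 -- Raw: records kept in their native form, untouched.
raw = [
    '2015-09-29T10:00:00 GET /index.html 200',
    '2015-09-29T10:00:01 GET /missing 404',
    '2015-09-29T10:00:02 GET /index.html 200',
]

# Level 1 -- Defined: a schema is applied and fields are typed/cleansed.
def define(line):
    ts, method, path, status = line.split()
    return {'ts': ts, 'method': method, 'path': path, 'status': int(status)}

defined = [define(line) for line in raw]

# Level 2 -- Refined: aggregates derived from the defined records,
# such as counts per status code.
refined = Counter(rec['status'] for rec in defined)

print(json.dumps(defined[0]))
print(dict(refined))
```

Each level is derived from the one below it, which is why the raw copy is worth keeping: a new schema or a new aggregate can always be recomputed from Level 0.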
7. Types of Data Users
Analysts: vertical expertise; utilize BI tools; no programming; need a UI for access.
Scientists: mixed expertise; utilize Python/R/SQL/etc.; basic programming; need tools for access.
Developers: horizontal expertise; utilize Java/scripting; advanced programming; need code for access.
8. Data Lake Architectures
Data Reservoir: raw + defined data which is governed and audited to ensure compliance and security.
Data Pond: raw data copied from existing internal data stores and pulled from external data sources.
Data Lake: raw + defined data pushed from other systems into a centralized, shared storage cluster.
9. Data Pond: SME / Enterprise Line of Business
Raw data copied from existing internal data stores and pulled from external data sources.
Customer 360° View: bring together siloed datasets, combine with external data sources, ask new questions, find unknown unknowns.
10. Data Lake: Web Startup Company
Raw + defined data pushed from other systems into a centralized, shared storage cluster.
Log Storage and Analytics: ingestion of data from multiple sources, transforming and processing of data, centralized storage and analytics of log data.
11. Data Reservoir: Fortune 500 Enterprise
Raw + defined data which is governed and audited to ensure compliance and security.
Enterprise Data Hub: storage and processing for all enterprise data, centralized auditing and enforcement, any data available while ensuring compliance.
12. Data Lake Challenges
Manual processes requiring hand-coding and reliance on command-line tools.
Hard to find data and its lineage for data discovery and exploration.
Operationalizing processes for production and maintaining SLAs.
Coupling of ingestion and processing drives architecture decisions.
Ensuring data is in canonical forms with a shared schema usable by others.
Sharing infrastructure in a multi-tenant environment without low-level QoS support.
Multiple architectures and technologies used by different teams on different clusters.
Guaranteeing compliance in a system designed for schema-on-read and raw data.
Coding or filing tickets often required to perform manual ingestion and processing tasks.
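The compliance challenge above comes from schema-on-read: the lake stores records as written, and structure is imposed only when data is consumed. A minimal sketch (illustrative only, not CDAP code; field names and defaults are invented):

```python
import json

# Records land in the lake exactly as produced -- no schema is enforced
# at write time, so fields may be missing or loosely typed.
raw_store = [
    '{"user": "alice", "bytes": "1024"}',
    '{"user": "bob"}',
]

def read_with_schema(record):
    """Apply the schema at read time; absent fields get defaults."""
    parsed = json.loads(record)
    return {'user': parsed.get('user', 'unknown'),
            'bytes': int(parsed.get('bytes', 0))}

rows = [read_with_schema(r) for r in raw_store]
print(rows)
```

Because no check ran at write time, every consumer must apply the schema (and any compliance rules) itself, which is exactly why auditing a schema-on-read system is hard.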
13. CASK DATA APPLICATION PLATFORM
Integrated Framework for Building and Running Data Applications on Hadoop
Integrates the Latest Big Data Technologies
Supports All Major Hadoop Distributions
Fully Open Source and Highly Extensible
14. Key Features: CASK DATA APPLICATION PLATFORM
Infrastructure INTEGRATION: provide an integrated product experience with out-of-the-box capabilities.
Architecture STANDARDS: define a reference architecture to standardize support for mixed infrastructure.
Programming ABSTRACTIONS: utilize abstraction layers to encapsulate complex patterns and insulate developers.
Production SERVICES: provide development tools and runtime services to enable production apps and data.
16. Self-Service Ingestion and ETL for Hadoop Data Lakes
Built for Production on CDAP
Rich Drag-and-Drop User Interface
Open Source & Highly Extensible
17. DISCOVER data using user- and machine-generated metadata.
INGEST any data from any source in real-time and batch.
BUILD drag-and-drop ETL/ELT pipelines that run on Hadoop.
EGRESS any data to any destination in real-time and batch.
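The talk abstract describes Hydrator pipelines as JSON configurations behind the drag-and-drop UI. The sketch below shows the general shape of such a source → transform → sink pipeline as a Python dict serialized to JSON. It is a hypothetical illustration: the stage names, property keys, and overall layout are invented for this example and are not the actual Hydrator configuration schema:

```python
import json

# Hypothetical pipeline definition: one source, one transform, one sink.
pipeline = {
    'name': 'logIngestPipeline',
    'stages': [
        {'name': 'kafkaSource', 'type': 'source',
         'properties': {'topic': 'weblogs'}},
        {'name': 'parseLog', 'type': 'transform',
         'properties': {'format': 'clf'}},
        {'name': 'hdfsSink', 'type': 'sink',
         'properties': {'path': '/data/raw/weblogs'}},
    ],
    # Connections wire the drag-and-drop stages into a DAG.
    'connections': [
        {'from': 'kafkaSource', 'to': 'parseLog'},
        {'from': 'parseLog', 'to': 'hdfsSink'},
    ],
}

config_json = json.dumps(pipeline, indent=2)
print(config_json)
```

Representing the pipeline as declarative configuration, rather than code, is what makes the same definition editable from a UI, versionable, and deployable without recompilation.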
18. Data Lakes on CDAP
Hydrator framework with templates and plugins enables production workflows in minutes.
Never lose data: all ingested data is tracked with metadata and lineage.
Operationalize workflows using scheduling and SLA monitoring with time / partition awareness.
Separation of ingestion and processing supports any type, format and rate.
Common transformations and a shared system for defining and exposing schema.
Multi-tenant namespacing provides data and app isolation, tying together infrastructure.
Reference architecture ensures a common platform across teams, orgs, ops and security.
Ensure compliance by requiring the use of specific transformations and validations.
Self-service access through Cask Hydrator for the discovery, ingest and exploration of data.
20. CDAP Community
100% Open Source (ASL2)
Website: http://cdap.io
Mailing Lists: cdap-user@googlegroups.com, cdap-dev@googlegroups.com
IRC: #cdap on freenode.net

CDAP Enterprise
100% Commercially Supported
Website: http://cask.co
Contact Sales: sales@cask.co
Contact Me: jon@cask.co or @jgrayla

Accelerate Your Data Lake Journey. Tap In @ cask.co