Strata+Hadoop World | New York, NY | Sept 29-Oct 1, 2015
About the talk:
Data lakes represent a powerful new data architecture, providing enterprises with the scale and flexibility required for big data: unbounded storage for unbounded questions. Hadoop is the de facto standard for implementing data lakes, but significant expertise, time, and effort are still required for organizations to deliver one. Today, enterprises building their own data lakes on Hadoop are effectively implementing their own internal platforms from a collection of individual open source technologies.
The many projects provided by open source and commercial Hadoop distributions must be integrated with each other, integrated with the existing environment, and operationalized into new and existing processes. With no established best practices or standards, each organization is left to find its own way and rely on expensive, external experts. Data lake proofs of concept can take months.
This talk introduces Cask Hydrator, a new open source data lake framework included in the latest release of the Cask Data App Platform (CDAP). Hydrator is a self-service data ingestion and ETL framework with a drag-and-drop user interface and JSON-based pipeline configurations. Enforcing best practices and providing out-of-the-box functionality, Hydrator enables enterprises to build data lakes in a matter of days. Integrations are included with open source and traditional data sources, from Kafka and Flume to Oracle and Teradata. Completely open source, Cask Hydrator is highly extensible and can be easily integrated with new data sources and sinks, and extended with custom transformations and validations.
Attendees will learn about data lakes, the different approaches and architectures enterprises are utilizing, the benefits and challenges associated with them, and how Cask Hydrator can enable the rapid creation of data lakes and dramatically decrease the complexity in operationalizing them.
This session is sponsored by Cask.
Speaker Bio:
Jonathan Gray, founder and CEO of Cask, is an entrepreneur and software engineer with a background in startups, open source, and all things data. Prior to founding Cask, Jonathan was a software engineer at Facebook where he helped drive HBase engineering efforts, including Facebook Messages and several other large-scale projects, from inception to production.
An open source evangelist, Jonathan was responsible for helping build the Facebook engineering brand through developer outreach and refocusing the open source strategy of the company. Prior to Facebook, Jonathan founded Streamy.com, where he became an early adopter of Hadoop and HBase and is now a core contributor and active committer in the community.
Jonathan holds a bachelor’s degree in electrical and computer engineering from Carnegie Mellon University.
http://cask.co/
3. SIMPLE ACCESS TO POWERFUL TECHNOLOGY
Cask’s goal is to enable every developer and enterprise to quickly and easily build and run modern data applications using open source big data technologies like Hadoop.
4. PROPRIETARY & CONFIDENTIAL
Introduction to Data Lakes
James Dixon (Pentaho), Hadoop World NYC 2010 — Data Lake: "Data streams in from sources to fill the lake, and various users of the lake can come to examine, dive in, or sample."
Gartner — Data Lake: "Enterprise-wide data management platforms for analyzing disparate sources of data in its native format."
Hortonworks — Data Lake: "Collect everything, dive in anywhere, give flexible access. Maximum scale and insight with the lowest possible friction and cost."
Cloudera — Data Hub: "A centralized, unified data source that can quickly provide diverse business users with the information they need to do their jobs."
5. The Journey to Data Lakes
The journey to data lakes is not easy. Our customers are some of the most advanced users of Hadoop and have years invested in their journeys. The goal of CDAP is to provide a framework and set of abstractions to avoid the pitfalls and long timelines that plague Hadoop projects. CDAP drastically accelerates your adoption and utilization of big data.
6. Types of Water… er, Data
Raw (a.k.a. Level 0): data that has been left in its native form without any transformation.
Defined (a.k.a. Level 1): data that has a defined schema and has been wrangled and cleansed.
Refined (a.k.a. Level 2): data that has been aggregated from the source records, like counts or models.
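The three levels above can be sketched in a few lines of code. This is an illustrative example, not CDAP or Hydrator code; the log format and field names are invented for the sketch:

```python
import json
from collections import Counter

# Level 0 -- Raw: records kept in their native form, untouched.
raw = [
    '2015-09-29T10:00:00 GET /index.html 200',
    '2015-09-29T10:00:01 GET /missing 404',
    '2015-09-29T10:00:02 GET /index.html 200',
]

# Level 1 -- Defined: a schema is applied and fields are typed/cleansed.
def define(line):
    ts, method, path, status = line.split()
    return {'ts': ts, 'method': method, 'path': path, 'status': int(status)}

defined = [define(line) for line in raw]

# Level 2 -- Refined: aggregates derived from the defined records,
# such as counts per status code.
refined = Counter(rec['status'] for rec in defined)

print(json.dumps(defined[0]))
print(dict(refined))
```

Each level is derived from the one below it, which is why the raw copy is worth keeping: a new schema or a new aggregate can always be recomputed from Level 0.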
7. Types of Data Users
Analysts: vertical expertise; utilize BI tools; no programming; need a UI for access.
Scientists: mixed expertise; utilize Python/R/SQL/etc.; basic programming; need tools for access.
Developers: horizontal expertise; utilize Java/scripting; advanced programming; need code for access.
8. Data Lake Architectures
Data Reservoir: raw + defined data which is governed and audited to ensure compliance and security.
Data Pond: raw data copied from existing internal data stores and pulled from external data sources.
Data Lake: raw + defined data pushed from other systems into a centralized, shared storage cluster.
9. Data Pond: SME / Enterprise Line of Business
Raw data copied from existing internal data stores and pulled from external data sources.
Customer 360° View: bring together siloed datasets, combine with external data sources, ask new questions, find unknown unknowns.
10. Data Lake: Web Startup Company
Raw + defined data pushed from other systems into a centralized, shared storage cluster.
Log Storage and Analytics: ingestion of data from multiple sources, transforming and processing of data, centralized storage and analytics of log data.
11. Data Reservoir: Fortune 500 Enterprise
Raw + defined data which is governed and audited to ensure compliance and security.
Enterprise Data Hub: storage and processing for all enterprise data, centralized auditing and enforcement, any data available while ensuring compliance.
12. Data Lake Challenges
Manual processes requiring hand-coding and reliance on command-line tools.
Hard to find data and its lineage for data discovery and exploration.
Operationalizing processes for production and maintaining SLAs.
Coupling of ingestion and processing drives architecture decisions.
Ensuring data is in canonical forms with a shared schema usable by others.
Sharing infrastructure in a multi-tenant environment without low-level QoS support.
Multiple architectures and technologies used by different teams on different clusters.
Guaranteeing compliance in a system designed for schema-on-read and raw data.
Coding or filing tickets often required to perform manual ingestion and processing tasks.
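The compliance challenge above comes from schema-on-read: the lake stores records as written, and structure is imposed only when data is consumed. A minimal sketch (illustrative only, not CDAP code; field names and defaults are invented):

```python
import json

# Records land in the lake exactly as produced -- no schema is enforced
# at write time, so fields may be missing or loosely typed.
raw_store = [
    '{"user": "alice", "bytes": "1024"}',
    '{"user": "bob"}',
]

def read_with_schema(record):
    """Apply the schema at read time; absent fields get defaults."""
    parsed = json.loads(record)
    return {'user': parsed.get('user', 'unknown'),
            'bytes': int(parsed.get('bytes', 0))}

rows = [read_with_schema(r) for r in raw_store]
print(rows)
```

Because no check ran at write time, every consumer must apply the schema (and any compliance rules) itself, which is exactly why auditing a schema-on-read system is hard.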
13. CASK DATA APPLICATION PLATFORM
Integrated Framework for Building and Running Data Applications on Hadoop
Integrates the Latest Big Data Technologies
Supports All Major Hadoop Distributions
Fully Open Source and Highly Extensible
14. Key Features: CASK DATA APPLICATION PLATFORM
Infrastructure INTEGRATION: provide an integrated product experience with out-of-the-box capabilities.
Architecture STANDARDS: define a reference architecture to standardize support for mixed infrastructure.
Programming ABSTRACTIONS: utilize abstraction layers to encapsulate complex patterns and insulate developers.
Production SERVICES: provide development tools and runtime services to enable production apps and data.
16. Self-Service Ingestion and ETL for Hadoop Data Lakes
Built for Production on CDAP
Rich Drag-and-Drop User Interface
Open Source & Highly Extensible
17. DISCOVER data using user- and machine-generated metadata.
INGEST any data from any source in real-time and batch.
BUILD drag-and-drop ETL/ELT pipelines that run on Hadoop.
EGRESS any data to any destination in real-time and batch.
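The talk abstract describes Hydrator pipelines as JSON configurations behind the drag-and-drop UI. The sketch below shows the general shape of such a source → transform → sink pipeline as a Python dict serialized to JSON. It is a hypothetical illustration: the stage names, property keys, and overall layout are invented for this example and are not the actual Hydrator configuration schema:

```python
import json

# Hypothetical pipeline definition: one source, one transform, one sink.
pipeline = {
    'name': 'logIngestPipeline',
    'stages': [
        {'name': 'kafkaSource', 'type': 'source',
         'properties': {'topic': 'weblogs'}},
        {'name': 'parseLog', 'type': 'transform',
         'properties': {'format': 'clf'}},
        {'name': 'hdfsSink', 'type': 'sink',
         'properties': {'path': '/data/raw/weblogs'}},
    ],
    # Connections wire the drag-and-drop stages into a DAG.
    'connections': [
        {'from': 'kafkaSource', 'to': 'parseLog'},
        {'from': 'parseLog', 'to': 'hdfsSink'},
    ],
}

config_json = json.dumps(pipeline, indent=2)
print(config_json)
```

Representing the pipeline as declarative configuration, rather than code, is what makes the same definition editable from a UI, versionable, and deployable without recompilation.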
18. Data Lakes on CDAP
Hydrator framework with templates and plugins enables production workflows in minutes.
Never lose data: all ingested data is tracked with metadata and lineage.
Operationalize workflows using scheduling and SLA monitoring with time / partition awareness.
Separation of ingestion and processing supports any type, format and rate.
Common transformations and a shared system for defining and exposing schema.
Multi-tenant namespacing provides data and app isolation, tying together infrastructure.
Reference architecture ensures a common platform across teams, orgs, ops and security.
Ensure compliance by requiring the use of specific transformations and validations.
Self-service access through Cask Hydrator for the discovery, ingest and exploration of data.
20. CDAP Community
100% Open Source (ASL2)
Website: http://cdap.io
Mailing Lists: cdap-user@googlegroups.com, cdap-dev@googlegroups.com
IRC: #cdap on freenode.net

CDAP Enterprise
100% Commercially Supported
Website: http://cask.co
Contact Sales: sales@cask.co
Contact Me: jon@cask.co or @jgrayla

Accelerate Your Data Lake Journey. Tap In @ cask.co