Hydrator
Code-free Data Pipelines for Hadoop, Spark, and HBase
Jonathan Gray, CEO @ Cask
Big Data Day LA - July 9th, 2016
cask.co
Cask, CDAP, Cask Hydrator and Cask Tracker are trademarks or registered trademarks of Cask Data. Apache Spark, Spark, the Spark logo, Apache Hadoop, Hadoop and the Hadoop logo are trademarks or registered trademarks of the Apache Software Foundation. All other trademarks and registered trademarks are the property of their respective owners.
About Me
Hadoop Enables New Apps and Patterns
ENTERPRISE DATA LAKES
• Batch and Realtime Data Ingestion: Any type of data from any type of source in any volume
• Batch and Streaming ETL: Code-free self-service creation and management of pipelines
• SQL Exploration and Data Science: All data is automatically accessible via SQL and client SDKs
• Data as a Service: Easily expose generic or custom REST APIs on any data

BIG DATA ANALYTICS
• 360° Customer View: Integrate data from any source and expose through queries and APIs
• Realtime Dashboards: Perform realtime OLAP aggregations and serve them through REST APIs
• Time Series Analysis: Store, process and serve massive volumes of time-series data
• Realtime Log Analytics: Ingestion and processing of high-throughput streaming log events

PRODUCTION DATA APPS
• Recommendation Engines: Build models in batch using historical data and serve them in realtime
• Anomaly Detection Systems: Process streaming events and predictably compare them in realtime to historical data
• NRT Event Monitoring: Reliably monitor large streams of data and perform defined actions within a specified time
• Internet of Things: Ingestion, storage and processing of events that is highly-available, scalable and consistent
PROPRIETARY & CONFIDENTIAL
Web Analytics and Reporting Use Case
Load web log data from S3 into a Hadoop cluster every hour for backup, and perform analytics and enable realtime reporting of metrics such as the number of successful and failed responses, most popular pages, etc.
The Challenges:
✦ Hadoop ETL pipeline stitched together using hard-to-maintain, brittle scripts
✦ Not enough people with expertise in all the Hadoop components (HDFS, MapReduce, Spark, YARN, HBase, Kafka), or a general lack of expertise
✦ Hard to debug and validate, resulting in frequent failures in the production environment
✦ Difficult to integrate into SQL / BI reporting solutions for business users
✦ As use cases advance into Data Science, Machine Learning, and Predictive Analytics, you need to include scientists and advanced ML programmers
The Many Faces of Hadoop
• Developer: Advanced Programming. Focused on App Logic
• Data Scientist: Basic Dev & Complex Analytics. Focused on Data & Algorithms
• IT Pro / Ops: Configuring & Monitoring. Focused on Infrastructure & SLAs
• LOB / Product: Decision Making & Driving Revenue. Focused on Apps & Insights
Challenge: The tools are missing to connect these users and take apps from prototype to production
Enter Cask
Key Customers and Partners
Named a Gartner Cool Vendor 2016
Founded in 2011 by early Hadoop engineers from Facebook and Yahoo!
Introducing the Data Application Platform
Deployment Models: On-premises, Hybrid, Cloud
Governance, Operations
Pre-packaged Integrations
Orchestration/Automation/Workflows
Core Application and Data Integration
Role-based User Experience: Developer, Data Scientist, IT/Ops
Introducing the Cask Data App Platform
Open Source, Integrated Framework for Building and Running Data Applications on Hadoop and Spark
• Supports all major Hadoop distros
• Integrates the latest Big Data technologies
• 100% open source and highly extensible
What’s in CDAP?
• A self-service, re-configurable, code-free framework to build, run and operate real-time or batch data pipelines in the cloud or on-premises.
• A self-service tool for tracking the flow of data in and out of the data lake. Track, index and search technical, business and operational metadata of applications and pipelines.
• An integration platform that integrates and abstracts the underlying Hadoop technologies. Build data analytics solutions in the cloud or on-premises. The platform is powerful and versatile enough to build, publish and manage operational self-service analytics applications.
Your Apps
A self-service, code-free framework to build, run and operate data pipelines on Apache Hadoop and Spark
• Built for Production on CDAP
• Rich Drag-and-Drop User Interface
• Open Source & Highly Extensible
Hydrator Data Pipelines
• INGEST any data from any source in real-time and batch
• BUILD drag-and-drop ETL/ELT pipelines that run on Hadoop
• EGRESS any data to any destination in real-time and batch
Hydrator Data Pipelines provide the ability to automate complex workflows that involve fetching data, possibly from multiple data sources, combining it, performing non-trivial transformations and aggregations on the data, writing it to one or more data sinks, and making it available for applications and analytics.
Stack of Data Enablers
Hydrator Studio
✦ Drag-and-drop GUI for visual Data Pipeline creation
✦ Rich library of pre-built sources, transforms, sinks for data ingestion and ETL use cases
✦ Separation of pipeline creation from execution framework - MapReduce, Spark, Spark Streaming etc.
✦ Hadoop-native and Hadoop Distro agnostic
Hydrator Data Pipeline
✦ Captures Metadata, Audit, Lineage info, discovered and visualized using Cask Tracker
✦ Notifications, scheduling, and monitoring with centralized metrics and log collection for ease of operability
✦ Simple Java API to build your own source, transforms, sinks with class loading isolation
✦ JavaScript and Python transforms
✦ Include arbitrary Spark jobs
Out of the box Integrations
✦ Elastic, SFTP, Cassandra, Kafka, RDBMS, EDW and many more sources and sinks
✦ Parse/Encode/Hash, Distinct/Group By, Custom JavaScript/Python Transforms
Custom Plugins
✦ Implement your own batch (or realtime) source, transform, sink plugins using simple Java API
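The source, transform and sink roles named on the custom plugin slide can be illustrated with a minimal sketch. The slides describe a Java API but do not show it, so this is not the CDAP plugin API: the Python classes `Source`, `Transform`, `Sink`, the concrete subclasses and `run_pipeline` are all hypothetical names chosen only to show how the three roles compose.

```python
from abc import ABC, abstractmethod


class Source(ABC):
    """Hypothetical source role: produces records (plain dicts here)."""

    @abstractmethod
    def read(self):
        """Yield records one at a time."""


class Transform(ABC):
    """Hypothetical transform role: maps one record to one record or drops it."""

    @abstractmethod
    def apply(self, record):
        """Return the transformed record, or None to drop it."""


class Sink(ABC):
    """Hypothetical sink role: consumes records."""

    @abstractmethod
    def write(self, record):
        """Store a single record."""


class ListSource(Source):
    def __init__(self, records):
        self.records = records

    def read(self):
        yield from self.records


class MinStatusFilter(Transform):
    """Keeps only records whose HTTP status is at least min_status."""

    def __init__(self, min_status):
        self.min_status = min_status

    def apply(self, record):
        return record if record["status"] >= self.min_status else None


class ListSink(Sink):
    def __init__(self):
        self.written = []

    def write(self, record):
        self.written.append(record)


def run_pipeline(source, transforms, sink):
    """Push every record from the source through the transforms into the sink."""
    for record in source.read():
        for transform in transforms:
            record = transform.apply(record)
            if record is None:
                break  # record was dropped by a transform
        else:
            sink.write(record)  # reached only if no transform dropped the record


source = ListSource([{"status": 200}, {"status": 404}, {"status": 500}])
sink = ListSink()
run_pipeline(source, [MinStatusFilter(400)], sink)
```

Keeping each role behind its own small interface is what lets a pipeline be assembled from interchangeable parts, which is the idea the drag-and-drop studio builds on.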
Pipeline Implementation
[Diagram: Logical Pipeline, Physical Workflow, MR/Spark Executions, Planner, CDAP]
✦ Planner converts logical pipeline to a physical execution plan
✦ Optimizes and bundles functions into one or more MR/Spark jobs
✦ CDAP is the runtime environment where all the components of the data pipeline are executed
✦ CDAP provides centralized log and metrics collection, transaction, lineage and audit information
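As a rough illustration of the planner idea above, the sketch below orders the stages of a logical pipeline and bundles consecutive stages into jobs. The slides do not describe the real planner's rules, so the bundling rule here (close a job after any stage of type `aggregator` or `joiner`) is purely an assumption, as are the function name `plan` and the stage names. It uses `graphlib`, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def plan(stages, connections, shuffle_types=("aggregator", "joiner")):
    """Turn a logical pipeline into an ordered list of jobs.

    stages: dict mapping stage name -> stage type (hypothetical type names)
    connections: list of (from_stage, to_stage) edges
    Returns a list of jobs, each job being a list of stage names.
    """
    sorter = TopologicalSorter({name: set() for name in stages})
    for src, dst in connections:
        sorter.add(dst, src)  # dst depends on src
    order = list(sorter.static_order())

    jobs, current = [], []
    for name in order:
        current.append(name)
        # Assumed rule: a stage that needs a shuffle ends the current job.
        if stages[name] in shuffle_types:
            jobs.append(current)
            current = []
    if current:
        jobs.append(current)
    return jobs


logical_pipeline = {
    "s3_source": "source",
    "log_parser": "transform",
    "group_by_ip": "aggregator",
    "hdfs_sink": "sink",
}
edges = [
    ("s3_source", "log_parser"),
    ("log_parser", "group_by_ip"),
    ("group_by_ip", "hdfs_sink"),
]
jobs = plan(logical_pipeline, edges)
```

For this linear pipeline the sketch yields two jobs: the first runs the source, parser and aggregation, and the second writes the result to the sink.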
3.5 - Latest Features
• Security - Authentication and Authorization: Support for fine-grained role-based authorization of entities in CDAP; integration with Sentry and Ranger
• Security - Impersonation and Encryption: Ability to run CDAP and CDAP Apps as specified users, and to encrypt/decrypt sensitive configuration
• Tracker - Data Usage Analytics: Learn how datasets are being used and which top applications access them
• Metadata Taxonomy: Support for annotating business metadata based on a business-specified taxonomy
• Hydrator - Spark Streaming: Build and run Hydrator real-time pipelines using Spark Streaming
• Hydrator - Preview Mode: Ability to preview pipelines with real or injected data before deploying (Standalone)
• Hydrator - Join & Action: Capability to join multiple streams (inner & outer) and to configure actions that run binaries on designated nodes
• Hydrator - Plugins: Support for XML, Mainframe (COBOL Copybook), Value Mapper, Normalizer, Denormalizer, JsonToXml, SSH Action, Excel Reader, Solr & Spark ML
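The difference between the inner and outer joins mentioned in the feature list can be shown with a small in-memory sketch. It is not how Hydrator implements joins (the slides do not say); the function `join`, its `how` parameter and the sample records are assumptions made for illustration.

```python
from collections import defaultdict


def join(left, right, key, how="inner"):
    """Join two lists of dict records on a shared key.

    how="inner": only records whose key appears on both sides.
    how="outer": additionally keep unmatched records from either side.
    """
    index = defaultdict(list)
    for record in right:
        index[record[key]].append(record)

    joined, matched_keys = [], set()
    for record in left:
        matches = index.get(record[key], [])
        if matches:
            matched_keys.add(record[key])
            # Left-side values win if both sides share a field name.
            joined.extend({**other, **record} for other in matches)
        elif how == "outer":
            joined.append(dict(record))

    if how == "outer":
        for value, records in index.items():
            if value not in matched_keys:
                joined.extend(dict(r) for r in records)
    return joined


requests = [{"ip": "10.0.0.1", "requests": 3}, {"ip": "10.0.0.2", "requests": 1}]
locations = [{"ip": "10.0.0.1", "country": "US"}, {"ip": "10.0.0.3", "country": "DE"}]

inner = join(requests, locations, "ip")
outer = join(requests, locations, "ip", how="outer")
```

Here the inner join keeps only `10.0.0.1`, while the outer join also keeps the unmatched records for `10.0.0.2` and `10.0.0.3`.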
Hydrator Roadmap
✦ Join across multiple data sources (CDAP-5588)
✦ Live Debug/Preview of pipelines in build mode
✦ Macro substitutions for configuration/properties
✦ Custom Actions anywhere in pipeline
✦ Spark streaming support for real-time pipelines
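Macro substitution for configuration properties, listed on the roadmap, can be sketched as replacing placeholders in property values with arguments supplied at run time. The `${name}` placeholder syntax, the function `substitute` and the property names below are assumptions for illustration; the slides do not specify the syntax.

```python
import re

# Assumed placeholder form: ${name}
MACRO = re.compile(r"\$\{([^}]+)\}")


def substitute(properties, arguments):
    """Return a copy of properties with every ${name} replaced from arguments.

    Raises KeyError if a macro has no matching runtime argument.
    """
    def resolve(value):
        def lookup(match):
            name = match.group(1)
            if name not in arguments:
                raise KeyError(f"no runtime argument for macro '{name}'")
            return str(arguments[name])
        return MACRO.sub(lookup, value)

    return {key: resolve(value) for key, value in properties.items()}


config = {"path": "s3://${bucket}/logs/${date}", "format": "text"}
resolved = substitute(config, {"bucket": "weblogs", "date": "2015-02-08"})
```

The same pipeline definition can then be reused across runs or environments by changing only the runtime arguments.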
Use case mapping
• Build operational analytics applications
• Micro-service Enablement
• Self-Service Data Analytics / Data Science
• Data-As-A-Service
• Empower developers to easily build solutions on Hadoop
• Abstract technologies, future proof

• Ingestion, Transformation, Blending (complex joins) and Lookup
• Machine Learning, Aggregation and Reporting
• Realtime and Batch data pipelines
• DW Offloading (Netezza, Teradata, etc.)
• Painless and Fast Ingest into Impala operationalized
• Data Ingestion from varied sources

• Easy way to catalog application and pipeline level metadata
• Search across technical, business and operational metadata
• Track Lineage and Provenance
• Track across non-Hadoop integrations
• Usage Analytics of cluster data
• Data Quality Measure
• Integration with other MDM systems including Navigator
Demo Example
Load Log Files from S3 to HDFS and perform aggregations/analysis
• Start with web access logs stored in Amazon S3
• Store the raw logs into HDFS Avro Files
• Parse the access log lines into individual fields
• Calculate the total number of requests by IP and status code
• Find the IPs that received the most successful status codes and the most error codes
Sample Web access log (Combined Log Format):
69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] "GET /ajax/planStatusHistory HTTP/1.1" 200 508 "http://builds.cask.co/log" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit Chrome/38.0.2125.122 Safari/537.36"
Fields: IP Address, Timestamp, Http Method, URI, Http Status, Response Size, Referrer, Client Info
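The parse and aggregate steps of the demo can be sketched in plain Python, outside of Hydrator. This is only an illustration of the logic, not the pipeline shown in the talk: the regular expression, the function names (`parse_line`, `count_by_ip_and_status`, `top_ip`) and the field names are assumptions, and the S3 and HDFS steps are left out.

```python
import re
from collections import Counter

# Combined Log Format: host ident user [time] "request" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<client>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match the format."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def count_by_ip_and_status(lines):
    """Count requests per (ip, status) pair, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[(record["ip"], record["status"])] += 1
    return counts


def top_ip(counts, predicate):
    """Return the IP with the most requests whose status satisfies predicate."""
    per_ip = Counter()
    for (ip, status), n in counts.items():
        if predicate(status):
            per_ip[ip] += n
    return per_ip.most_common(1)[0][0] if per_ip else None


SAMPLE = (
    '69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] '
    '"GET /ajax/planStatusHistory HTTP/1.1" 200 508 '
    '"http://builds.cask.co/log" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
    'AppleWebKit Chrome/38.0.2125.122 Safari/537.36"'
)

record = parse_line(SAMPLE)
counts = count_by_ip_and_status([SAMPLE, SAMPLE, "not a log line"])
most_successful = top_ip(counts, lambda status: 200 <= status < 300)
most_errors = top_ip(counts, lambda status: status >= 400)
```

Treating 2xx as successful and 4xx/5xx as errors is a choice made for this sketch; the slides do not define which status codes count as which.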
Thanks!
Jonathan Gray
@jgrayla
Download CDAP w/ Hydrator: http://cask.co/downloads/
