Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to Increase Enterprise Adoption

  1. Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to Increase Enterprise Adoption. Krishna Potluri – Big Data Tech Lead, Progressive; Tendu Yogurtcu (@TenduYogurtcu) – CTO, Syncsort
  2. Data Liberation, Integrity & Integration for Next-Generation Analytics
     • Trusted Industry Leadership: marquee global customer base of leaders and emerging businesses across all major industries. We provide unique data management solutions and expertise to over 2,500 large enterprises worldwide, with an unmatched focus on customer success & value.
     • Best Quality, Top Performance, Lower Costs: our proven software efficiently delivers all critical enterprise data assets with the highest integrity to Big Data environments, on premise or in the cloud.
     • Highly Acclaimed & Award Winning: Data Quality "Leader" in the Gartner Magic Quadrant; IT World Awards® 2016 "Innovations in IT" Gold Winner; Database Trends & Applications "Companies That Matter Most in Data".
     • Mainframe Access & Integration for Application Data; High-Performance ETL Data Access & Transformation; Mainframe Access & Integration for Machine Data
     • Data Quality: Big Data Quality & Integration; Data Enrichment & Validation; Data Governance; Customer 360
     • Data Infrastructure Optimization: Enterprise Data Warehouse Optimization; Application Modernization; Mainframe Optimization
  3. Syncsort Makes All Data Accessible & Usable for Analytics
  4. Syncsort + Hortonworks Solution: Enabling the Enterprise Data Lake. Fast. Secure. Enterprise Grade. Two use cases: ETL Onboarding and Mainframe Access & Integration.
     • Free up your budget: dramatically reduce EDW costs and avoid continual upgrades
     • Get fresher data: from right-time to real-time
     • Keep all data as long as you want: from weeks to months, years, and beyond
     • Optimize your data warehouse: faster database queries for faster analysis
     • Blend new data, fast: from structured to unstructured, from mainframe to the cloud
     • Deliver powerful new insights: combine mainframe data with Big Data
     • Securely access mainframe data when you need it: directly access and translate mainframe data
     • Reduce storage costs: go from $100K/TB to $2K/TB by migrating data to HDFS
     • Use your MIPS wisely: save an average of up to $7K per MIPS when offloading batch workloads to Hadoop
  5. Syncsort + Hortonworks Reference Architecture
     • Apache Ambari integration: deploy DMX-h across the cluster; monitor DMX-h jobs
     • Process in MapReduce or Spark
     • Source relational and non-relational data (including mainframes)
     • Syncsort DMX-h & HDF: batch and real time
     • Out-of-the-box integration, interoperability & certifications
     • Kerberos-secured clusters; Apache Sentry/Ranger security certified
     • Early beta, release certification
     • Metadata lineage export from DMX; Atlas integration
  6. Our Strategy: Simplify Big Data Integration
     • Deploy on premise or in the cloud
     • Choose among multiple execution frameworks: Hadoop, Spark, Spark 2.0, Linux, Unix, Windows
     • Integrate streaming and batch data with a single data pipeline for innovative applications, like IoT
     • Future-proof applications to avoid rewriting jobs in order to take advantage of innovations in new execution frameworks
     • Access and integrate ALL enterprise data sources, including mainframe, for advanced analytics
  7. Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to Increase Enterprise Adoption. Krishna Potluri, Big Data Tech Lead
  8. Big Data Use Cases: Telematics, Ads, Claims
  9. Tools and Ingestion Flow - Past (architecture diagram): external data and Progressive data arrive in batch on the Hortonworks cluster via Sqoop/Java and Hive/Java into Hive text staging; Oozie-scheduled Pig, Hive, and MapReduce jobs transform it into Hive ORC targets, with Hive instrumentation and Hive balancing steps along the way; a scheduler drives the batch flows, and a target MPP appliance is loaded in batch and queried on demand through PolyBase.
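As a rough illustration of the staging-to-target hop in this old flow, the HiveQL below sketches how a Sqoop-landed text extract gets rewritten into a partitioned ORC target. All table and column names (staging.policy_text, target.policy, policy_id, etc.) are hypothetical, not Progressive's actual schema.

```sql
-- Hypothetical Hive Text Staging -> Hive ORC hop; names are illustrative.
CREATE TABLE IF NOT EXISTS target.policy (
  policy_id BIGINT,
  premium   DECIMAL(10,2),
  status    STRING
)
PARTITIONED BY (load_dt STRING)
STORED AS ORC;

-- Rewrite one day's text extract into the columnar target partition.
INSERT OVERWRITE TABLE target.policy PARTITION (load_dt = '2017-06-13')
SELECT policy_id, premium, status
FROM   staging.policy_text;
```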
  10. Support Issues
     • Developer skill set: mainframe and SSIS
     • Hadoop upgrade issues
     • Maven, Java, and Artifactory
     • Custom Oozie actions
     • Custom Oozie triggers
     • Pig and Hive UDFs
     • Custom Pig loaders
     • Complex N-way Hive reports using JDBC
     • Custom Hive SerDes
     • Complex Oozie workflows, bundles, and coordinators
  11. Centralized Team for Hadoop (diagram): a single ingestion team works a queue of ingest and ETL requests into Hadoop and Microsoft PDW on behalf of IT warehouse teams and business analysts.
  12. Again, Support Issues
     1. Custom ETL in Pig and Hive: tied up resources; maintenance and knowledge-transfer issues
     2. Ingestion projects: no consistency in code; reconciliation and balancing issues; each developer had their own pattern
     3. Projects were queued up!
  13. Ingestion/ETL Tool Selection
     Requirements:
     • Industrial strength
     • Runs natively on Hadoop
     • Supports all data sources
     • Usable by non-Java programmers
     • Security / cloud / IDE
     • Pricing / stability
     • Enterprise support
     • Ability to customize frameworks
     • Keeps pace with Hortonworks
     • Fits with internal build and elevate processes
     Tools evaluated: Syncsort, Talend, Informatica, Actian, CDAP, Cascading
     Selected: Syncsort
  14. Data Lake Ingestion - Current (architecture diagram): external data and Progressive data are landed in batch by Syncsort DMX-h into Hive text staging on the Hortonworks cluster and transformed, again with DMX-h, into Hive ORC targets; a scheduler drives the batch flows, and Microsoft PDW is loaded in batch and queried on demand through PolyBase.
  15. Syncsort Implementation (deployment diagram): the Hortonworks cluster runs DMX-h on the name node, secondary name node, and every data node; edge nodes run the DMX-h daemon. Across the firewall, developer machines use the DMX-h client for interactive work against non-prod, and a build machine deploys jobs from TFS to prod.
  16. Sqoop vs. Syncsort DMX-h

      Feature                     Sqoop                  Syncsort DMX-h
      Connectivity                Database driver JARs   DataDirect
      In-memory transformations   Not supported          Supported
      Parallelism                 Supported              Supported
      Edge-node ingestion         Not supported          Supported
  17. Choosing the Right Interface for the Job

      Attribute          Syncsort DMX-h DTL   Syncsort DMX-h GUI
      Complexity         High                 Low
      Development time   High                 Medium
      Flexibility        High                 Medium
      Reuse              High                 Medium
  18. Syncsort DMX-h DTL vs. GUI - Visual (side-by-side screenshots of the same copy task, once as a DMX-h DTL script and once in the DMX-h GUI)
  19. CDC Patterns – Syncsort DMX-h DTL (diagram): source extracts arrive as full extracts (no CDC), incremental extracts, or CDC on source. DMX-h DTL lands them in Hive text staging with a 7-day rolling partition, applies the changes (I/U/D or I/U) into a daily-partitioned Hive ORC raw history table, and a second DMX-h DTL step builds the current view in the Hive ORC target.
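To make the last step concrete: raw history retains every insert/update/delete record, and the current view keeps only the latest surviving row per business key. The HiveQL below is a hedged sketch of that idea with hypothetical names (raw_history, target_current, policy_id, chg_cd, load_ts); Progressive's real logic lives in DMX-h DTL, not Hive.

```sql
-- Hypothetical current-view rebuild from CDC raw history.
-- chg_cd is an illustrative change code: 'I' insert, 'U' update, 'D' delete.
INSERT OVERWRITE TABLE target_current
SELECT policy_id, premium, status, load_ts
FROM (
  SELECT h.*,
         ROW_NUMBER() OVER (PARTITION BY policy_id
                            ORDER BY load_ts DESC) AS rn
  FROM raw_history h
) latest
WHERE rn = 1          -- most recent record per business key
  AND chg_cd <> 'D';  -- drop keys whose latest change is a delete
```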
  20. Syncsort DataFunnel™ (diagram): the DataFunnel UI drives DMX-h DTL jobs that generate Hive schemas and ingest multiple tables from multiple data sources into HDFS and Hive, applying in-memory transformations along the way.
  21. Hadoop Ingestor (diagram): a Hadoop metadata collector pulls source metadata over JDBC; a job configuration UI lets the ingestion team create and edit jobs; and the Hadoop Ingestor service generates and runs the DMX-h code (with Sqoop as an alternate path) that lands data in HDFS and Hive.
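The collector itself is internal to Progressive, but a JDBC-based metadata pull often boils down to an information_schema query like the hedged sketch below; the schema name and the use of information_schema are assumptions about the source database, not details from the talk.

```sql
-- Hypothetical metadata pull for one source schema: column names, types,
-- and nullability, which an ingestor could use to generate Hive DDL.
SELECT table_name,
       column_name,
       data_type,
       is_nullable
FROM information_schema.columns
WHERE table_schema = 'claims'   -- illustrative source schema
ORDER BY table_name, ordinal_position;
```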
  22. Hadoop Ingestor Service Flow (diagram): any number of tables (Table 1, Table 2, … Table n) flow through one of several patterns. Each pattern moves data through Hive staging, Hive raw, and Hive target, using DMX-h DTL (and, in one pattern, Sqoop) for each hop, with a Hive reconciliation step after every hop; the service also captures logging, metrics, and balancing.
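A per-hop reconciliation step can be as simple as comparing row counts on the two sides of a hop and recording the outcome. The HiveQL below is a hedged sketch of that balancing idea; the metrics table and all names (metrics.ingest_balancing, staging.policy, target.policy) are hypothetical, not Progressive's implementation.

```sql
-- Hypothetical balancing check between two hops of the ingestion flow.
INSERT INTO TABLE metrics.ingest_balancing
SELECT 'policy'          AS table_name,
       s.cnt             AS staging_rows,
       t.cnt             AS target_rows,
       CASE WHEN s.cnt = t.cnt
            THEN 'BALANCED' ELSE 'OUT_OF_BALANCE' END AS status,
       CURRENT_TIMESTAMP AS checked_at
FROM       (SELECT COUNT(*) AS cnt FROM staging.policy) s
CROSS JOIN (SELECT COUNT(*) AS cnt FROM target.policy)  t;
```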
  23. Hadoop Ingestor + Syncsort DMX-h (Benefits) - Ingestion Team
     • One code base for ingestion
     • Choice of CDC patterns
     • Every table is ingested the same way
     • Jobs don't need to be recoded for a new framework: DMX-h shields us
     • Every table is reconciled and balanced
     • Metrics captured on run times and row counts
     • Ingestion time remains the same no matter the table count
     "Ingestion has gone from days to hours, and Progressive IT has a single entry point and code base for Data Lake ingestion."
  24. THANK YOU
  25. Learn More! Visit us at Booth #1102. Get a demo & pick up your Data Liberator t-shirt!

Editor's notes

  1. Move Access and Transformation to the top
  2. We focus on two main use cases. The first is ETL/ELT onboarding: moving data processing off expensive platforms like Teradata and into Hadoop. The second, and one of our biggest, is what we call mainframe access and integration.
  3. Syncsort/Hortonworks reference architecture: DMX-h is deployed by Ambari on every node, handling data movement and transformation in MapReduce or Spark.
  4. Syncsort's data integration has always delivered the ability to process large data volumes in less time, with fewer resources. However, performance and efficiency are just our starting points. It became apparent in speaking with our customers a few years ago, particularly when "Big Data" and Hadoop took off, that they were facing new challenges with a common theme of complexity. The rapid evolution of Big Data technologies presents several challenges on its own:
     • New technologies require new specialized skills that continue to be in short supply, and are very expensive if you can find them.
     • Because execution frameworks continue to improve, customers don't want to feel locked in; they don't want to redevelop all their jobs to take advantage of innovative new frameworks. A great example is MapReduce v1 to MapReduce v2 to Spark.
     • New sources and types of data add to the complexity as well, with streaming sources as a recent example.
     • Connectivity and skills challenges: many of our customers are large enterprises that still rely heavily on the mainframe, and these companies found it very difficult to bring the mainframe into the rest of their Big Data integration strategy.
     • Organizations were building data lakes but then struggling to fill them. We heard from customers and partners alike that ingesting data from all enterprise sources (mainframe, data warehouse, etc.) into Hadoop was a big problem to solve.
     So our product strategy has focused not only on delivering data integration products with exceptional performance, efficiency, and lower TCO, but also on simplifying the data integration process for all enterprise data sources and across all platforms (Linux, Unix, Windows, Hadoop, Spark), on premise or in the cloud.