Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to Increase Enterprise Adoption

  1. Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to Increase Enterprise Adoption Krishna Potluri – Big Data Tech Lead, Progressive @TenduYogurtcu – CTO, Syncsort
  2. Data Liberation, Integrity & Integration for Next-Generation Analytics
     • Trusted Industry Leadership – a marquee global customer base of leaders and emerging businesses across all major industries. We provide unique data management solutions and expertise to over 2,500 large enterprises worldwide, with an unmatched focus on customer success and value.
     • Best Quality, Top Performance, Lower Costs – our proven software efficiently delivers all critical enterprise data assets, with the highest integrity, to Big Data environments on premise or in the cloud.
     • Highly Acclaimed & Award Winning – Data Quality “Leader” in the Gartner Magic Quadrant; IT World Awards® 2016 “Innovations in IT” Gold Winner; Database Trends & Applications “Companies That Matter Most in Data”.
     Product areas:
     • Mainframe Access & Integration for application data; high-performance ETL data access & transformation; mainframe access & integration for machine data
     • Data Quality – Big Data quality & integration; data enrichment & validation; data governance; Customer 360
     • Data Infrastructure Optimization – enterprise data warehouse optimization; application modernization; mainframe optimization
  3. Syncsort Makes All Data Accessible & Usable for Analytics
  4. Syncsort + Hortonworks Solution: Enabling the Enterprise Data Lake – Fast. Secure. Enterprise Grade.
     ETL Onboarding:
     • Free up your budget: dramatically reduce EDW costs and avoid continual upgrades
     • Get fresher data: from right-time to real-time
     • Keep all data as long as you want: from weeks to months, years and beyond
     • Optimize your data warehouse: faster database queries for faster analysis
     Mainframe Access & Integration:
     • Blend new data, fast: from structured to unstructured, from mainframe to the cloud
     • Deliver powerful new insights: combine mainframe data with Big Data
     • Securely access mainframe data when you need it: directly access and translate mainframe data
     • Reduce storage costs: go from $100K/TB to $2K/TB by migrating data to HDFS
     • Use your MIPS wisely: save an average of up to $7K per MIPS when offloading batch workloads to Hadoop
  5. Syncsort + Hortonworks Reference Architecture
     • Apache Ambari integration: deploy DMX-h across the cluster, monitor DMX-h jobs
     • Process in MapReduce or Spark
     • Source relational and non-relational data (including mainframes)
     • Syncsort DMX-h & HDF – batch and real time
     • Out-of-the-box integration, interoperability & certifications
     • Kerberos-secured clusters; Apache Sentry/Ranger security certified
     • Early beta, release certification
     • Metadata lineage export from DMX; Atlas integration
  6. Our Strategy: Simplify Big Data Integration
     • Deploy on premise or in the cloud
     • Choose among multiple execution frameworks – Hadoop, Spark, Spark 2.0, Linux, Unix, Windows
     • Integrate streaming and batch data with a single data pipeline for innovative applications, like IoT
     • Future-proof applications to avoid re-writing jobs in order to take advantage of innovations in new execution frameworks
     • Access and integrate ALL enterprise data sources – including mainframe – for advanced analytics
  7. Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to Increase Enterprise Adoption – Krishna Potluri, Big Data Tech Lead
  8. Big Data Use Cases: Telematics, Ads, Claims
  9. Tools and Ingestion Flow – Past (architecture diagram: external and Progressive data land in batch, via Sqoop/Java and Hive/Java, into Hive text staging on the Hortonworks cluster; Oozie-scheduled Pig, Hive, and MapReduce jobs transform it into Hive ORC targets, with instrumentation and balancing tracked in Hive and an MPP appliance reading results in batch or on demand through PolyBase)
  10. Support Issues
     • Developer skill set – mainframe and SSIS
     • Hadoop upgrade issues
     • Maven, Java, and Artifactory
     • Custom Oozie actions and triggers
     • Pig and Hive UDFs
     • Custom Pig loaders
     • Complex N-way Hive reports using JDBC
     • Custom Hive SerDes
     • Complex Oozie workflows, bundles, coordinators
  11. Centralized Team for Hadoop (diagram: a single ingestion team queues ingest and ETL work into Hadoop and Microsoft PDW on behalf of IT warehouse teams and business analysts)
  12. Again, Support Issues
     1. Custom ETL in Pig and Hive: tied up resources; maintenance and knowledge-transfer issues
     2. Ingestion projects: no consistency in code; reconciliation and balancing issues; each developer had their own pattern
     3. Projects were queued up!
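Reconciliation and balancing issues like these usually come down to every pipeline checking (or not checking) row counts its own way. A minimal Python sketch of a uniform balancing check, applied identically to every ingested table – the names (`ReconciliationResult`, `balance`) are illustrative, not Progressive's actual code:

```python
# Hypothetical sketch: one reconciliation routine for every ingested table,
# comparing source vs. target row counts captured by each ingestion step.
from dataclasses import dataclass

@dataclass
class ReconciliationResult:
    table: str
    source_rows: int
    target_rows: int

    @property
    def balanced(self) -> bool:
        # A table is "in balance" when source and target row counts match.
        return self.source_rows == self.target_rows

def balance(counts: dict[str, tuple[int, int]]) -> list[ReconciliationResult]:
    """Run the same check for every table: (source_rows, target_rows)."""
    return [ReconciliationResult(t, s, d) for t, (s, d) in counts.items()]

results = balance({"claims": (1_000, 1_000), "telematics": (5_432, 5_431)})
out_of_balance = [r.table for r in results if not r.balanced]
```

Because the check is shared, an out-of-balance table surfaces the same way regardless of which developer built the pipeline.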
  13. Ingestion/ETL Tool Selection
     Requirements: industrial strength; runs natively on Hadoop; supports all data sources; usable by non-Java programmers; security/cloud/IDE; pricing/stability; enterprise support; ability to customize frameworks; keeps pace with Hortonworks; fits with internal build and elevate processes
     Tools evaluated: Syncsort, Talend, Informatica, Actian, CDAP, Cascading
     Selected: Syncsort
  14. Data Lake Ingestion – Current (architecture diagram: external and Progressive data land in batch through DMX-h into Hive text staging on the Hortonworks cluster; scheduled DMX-h jobs transform it into Hive ORC targets, with Microsoft PDW reading results in batch or on demand through PolyBase)
  15. Syncsort Implementation (deployment diagram: the DMX-h client on developer machines connects through a firewall; DMX-h daemons run on edge nodes, and DMX-h is installed on every data node of the non-prod and prod Hortonworks clusters, alongside the name node and secondary name node; a build machine deploys interactively from TFS)
  16. Sqoop vs. Syncsort DMX-h

      Feature                     Sqoop                   Syncsort DMX-h
      Connectivity                Database driver JARs    DataDirect
      In-memory transformations   Not supported           Supported
      Parallelism                 Supported               Supported
      Edge-node ingestion         Not supported           Supported
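The "in-memory transformations" row is the key operational difference: Sqoop lands raw data first and transforms in a second pass, while a tool that transforms in memory can clean records as they stream from source to target. A hedged Python sketch of that streaming idea – the function names are illustrative and not a Sqoop or DMX-h API:

```python
# Illustrative only: transform records while they stream from source to target,
# instead of landing raw data and transforming in a second pass.
def extract(rows):
    yield from rows  # stands in for a JDBC fetch from the source database

def transform(rows):
    for row in rows:
        # In-flight cleanup applied per record, no intermediate copy on disk
        yield {**row, "name": row["name"].strip().upper()}

def load(rows):
    return list(rows)  # stands in for an HDFS/Hive write

source = [{"id": 1, "name": " alice "}, {"id": 2, "name": "bob"}]
landed = load(transform(extract(source)))  # single pass, no staging hop
```

Because the generators are chained, each record is extracted, transformed, and loaded in one pass, which is the behavior the comparison table is pointing at.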
  17. Choosing the Right Interface for the Job

      Attribute          DMX-h DTL   DMX-h GUI
      Complexity         High        Low
      Development time   High        Medium
      Flexibility        High        Medium
      Reuse              High        Medium
  18. Syncsort DMX-h DTL vs. GUI – Visual (side-by-side screenshots of the same copy task built in DMX-h DTL and in the DMX-h GUI)
  19. CDC Patterns – Syncsort DMX-h DTL (diagram: four source patterns – no CDC, incremental extract, full extract, and CDC on source – land in daily-partitioned Hive text staging with a 7-day rolling partition; DMX-h DTL jobs apply inserts/updates/deletes (I/U/D) to build a Hive ORC raw history and a current-view target)
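The "CDC on source" pattern merges insert/update/delete (I/U/D) change records from the staging partition into the current view. A minimal Python sketch of that merge logic – the key field, the `op` flag values, and the function name are assumptions for illustration, not the DMX-h DTL implementation:

```python
# Hypothetical CDC merge: apply I/U/D change records to a keyed current view.
def apply_cdc(current: dict, changes: list[dict]) -> dict:
    view = dict(current)
    for c in changes:
        if c["op"] in ("I", "U"):   # insert or update: upsert by key
            view[c["key"]] = c["value"]
        elif c["op"] == "D":        # delete: drop the key if present
            view.pop(c["key"], None)
    return view

current = {1: "open", 2: "pending"}
changes = [
    {"op": "U", "key": 2, "value": "closed"},  # update existing row
    {"op": "D", "key": 1, "value": None},      # delete a row
    {"op": "I", "key": 3, "value": "open"},    # insert a new row
]
new_view = apply_cdc(current, changes)
```

The raw-history layer would keep every change record as-is; only the current view applies the merge.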
  20. Syncsort DataFunnel™ (diagram: the DataFunnel UI drives DMX-h DTL to generate Hive schemas and ingest multiple tables from multiple data sources into HDFS and Hive, with in-memory transformations)
  21. Hadoop Ingestor (diagram: a job-configuration UI lets users create and edit jobs; the Hadoop Ingestor service collects source metadata over JDBC via a Hadoop metadata collector, generates DMX-h code, and drives ingestion into HDFS and Hive alongside Sqoop)
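The point of collecting source metadata is that job definitions can be generated rather than hand-written, so every table is ingested the same way. A simplified Python sketch of the metadata-driven idea – the real service generates DMX-h code, but here we emit a plain job-definition dict, and all field names are illustrative:

```python
# Illustrative sketch of a metadata-driven ingestor: a job definition is
# derived from collected source metadata instead of being hand-coded.
def make_job(table_meta: dict) -> dict:
    cols = ", ".join(c["name"] for c in table_meta["columns"])
    return {
        "source_query": f"SELECT {cols} FROM {table_meta['schema']}.{table_meta['table']}",
        "hive_staging": f"staging.{table_meta['table']}",
        "hive_target": f"target.{table_meta['table']}",
        # Default pattern when the metadata doesn't specify one
        "cdc_pattern": table_meta.get("cdc_pattern", "full_extract"),
    }

meta = {"schema": "policy", "table": "claims",
        "columns": [{"name": "claim_id"}, {"name": "status"}]}
job = make_job(meta)
```

Adding a new table then means adding metadata, not writing a new pipeline.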
  22. Hadoop Ingestor Service Flow (diagram: for each source table, one of several patterns moves data through Hive staging, raw, and target layers using Sqoop and DMX-h DTL steps, with Hive reconciliation at every hop and logging, metrics, and balancing throughout)
  23. Hadoop Ingestor + Syncsort DMX-h Benefits – Ingestion Team
     • One code base for ingestion
     • Choice of CDC patterns
     • Every table is ingested the same way
     • Jobs don’t need to be recoded for a new framework – DMX-h shields us
     • Every table is reconciled and balanced
     • Metrics captured around run time and row counts
     • Ingestion time remains the same no matter the table count
     “Ingestion has gone from days to hours, and Progressive IT has a single entry point and code base for Data Lake ingestion.”
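Capturing run time and row counts for every run is easiest when the measurement wraps the ingestion step itself. A small Python sketch of that pattern – the wrapper name and metric fields are assumptions, not the actual service's schema:

```python
# Hypothetical metrics wrapper: every ingestion run records its table name,
# row count, and elapsed wall-clock time in a uniform structure.
import time

def run_with_metrics(table: str, ingest) -> dict:
    start = time.monotonic()
    rows = ingest()  # the ingestion callable returns the rows it moved
    return {
        "table": table,
        "rows": rows,
        "seconds": round(time.monotonic() - start, 3),
    }

# Stand-in ingestion step that "moves" 1,000 rows
metrics = run_with_metrics("claims", lambda: 1_000)
```

Emitting the same metric record from every job is what makes run-over-run comparisons and balancing reports possible.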
  25. Learn More! Visit us at Booth #1102 Get a demo & pick up your Data Liberator t-shirt!

Editor’s notes

  1. Move Access and Transformation to the top
  2. We focus on two main use cases. The first is ETL/ELT onboarding – moving data processing off expensive platforms like Teradata and onto Hadoop. The second, one of our biggest, is what we call mainframe access and integration.
  3. Syncsort/Hortonworks reference architecture: deployed by Ambari on every node; handles data movement and transformation in MapReduce or Spark.
  4. Syncsort’s data integration has always delivered the ability to process large data volumes in less time, with fewer resources. However, performance and efficiency are just starting points. It became apparent in speaking with our customers a few years ago – particularly when “Big Data” and Hadoop took off – that they were facing new challenges with a common theme of complexity. The rapid evolution of Big Data technologies presents several challenges on its own:
     • New technologies require new specialized skills that continue to be in short supply, and are very expensive if you can find them.
     • Because execution frameworks continue to improve, customers don’t want to feel locked in. They don’t want to redevelop all their jobs just to take advantage of innovative new frameworks – a great example is MapReduce v1 to MapReduce v2 to Spark.
     • New sources and types of data add to the complexity as well, with streaming sources as a recent example.
     • Connectivity and skills challenges: many of our customers are large enterprises that still rely significantly on the mainframe, and these companies found it very difficult to bring the mainframe into the rest of their Big Data integration strategy.
     • Organizations were building data lakes but then struggling to fill them. We heard from customers and partners alike that ingesting data from all enterprise sources – mainframe, data warehouse, etc. – into Hadoop was a big problem to solve.
     So our product strategy has focused not only on delivering data integration products with exceptional performance, efficiency, and lower TCO, but also on simplifying the data integration process for all enterprise data sources and across all platforms – Linux, Unix, Windows, Hadoop, Spark – on premise or in the cloud.