Dealing With Drift - Building an Enterprise Data Lake

Data drift, the gradual morphing of data structure and semantics, is a fact of life in enterprise IT. New requirements force schema changes, the meaning of database columns changes over time, and infrastructure upgrades add new fields to log files. Left unchecked, drift in data sources can cause applications and dataflows to fail, with costly downtime and, in the worst case, corruption in downstream data stores.
Cox Automotive comprises more than 25 companies dealing with different aspects of the car ownership lifecycle, with data as the common language they all share. The challenge for Cox was to create an efficient engine for the timely, trustworthy ingestion of an unknown but large number of data assets from practically any source. Discover how its big data engineering team overcame data drift and is now populating a data lake, giving analysts easy access to data from the subsidiary companies and producing new data assets unique to the industry.
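
As a rough illustration of the structural drift described above (new fields appearing, fields disappearing, types changing), the following minimal Python sketch compares one incoming record against an expected schema. The function name `detect_structure_drift`, the schema layout, and the sample field names are assumptions made for this example; they are not taken from the talk or from StreamSets.

```python
from typing import Any, Dict, List


def detect_structure_drift(expected: Dict[str, type], record: Dict[str, Any]) -> List[str]:
    """List the structural differences between one record and an expected schema.

    `expected` maps field names to Python types (a hypothetical schema format).
    """
    findings: List[str] = []
    # Fields present in the record but unknown to the schema.
    for name in record:
        if name not in expected:
            findings.append(f"new field: {name}")
    # Fields the schema expects that are absent or have a different type.
    for name, expected_type in expected.items():
        if name not in record:
            findings.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            findings.append(
                f"type changed: {name} is {actual}, expected {expected_type.__name__}"
            )
    return findings


# Hypothetical example: the source added a column and started sending price as text.
expected_schema = {"vin": str, "price": float}
record = {"vin": "1HGCM82633A004352", "price": "18500", "dealer_id": 42}
print(detect_structure_drift(expected_schema, record))
```

A check like this only covers structure drift. Semantic drift (the same column changing meaning) would need value-level checks that this sketch does not attempt.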

Published in: Software

  1. Dealing with Drift: Building an Enterprise Data Lake
  2. Speakers: Nathan Swetye, Sr. Manager of Platform Engineering, Cox Automotive; Michael Gay, Lead Technical Architect, Cox Automotive; Pat Patterson, Community Champion, StreamSets
  3. Cox Automotive: 25 (and growing) companies dealing with the automotive space. Spans the full vehicle ownership lifecycle. Data perceived as the integration point for all companies.
  4. StreamSets Overview. Mission: empower enterprises to harness their data in motion. Enterprise data DNA; commercial customers across verticals; 150,000 downloads; 40 of the Fortune 100; doubling each quarter; strong partner ecosystem; open source success.
  5. StreamSets Enterprise. StreamSets Data Collector™: instrumented, open source UI and engine to build any-to-any dataflows. StreamSets Dataflow Performance Manager (DPM™): cloud service to map, measure and master dataflow operations. Dataflow lifecycle: Develop, Operate, Evolve (proactive), Remediate (reactive). Roles shown: developers, scientists, architects; operators, stewards, architects.
  6. StreamSets Data Collector: a complete IDE for building and executing any-to-any ingest pipelines. Efficiency: intent-driven flows; batch and streaming ingest; in-stream sanitization. Control: fine-grained stage and flow metrics; drift handling; lineage and impact analysis capture. Agility: flexible deployment; exception handling; seamless evolution.
  7. StreamSets Dataflow Performance Manager (DPM): a single pane of glass to map, measure and master your dataflow operations. Master: availability and accuracy; proactive remediation. Measure: any path; any time. Map: dataflow lineage; live data architecture.
  8. Data Drift: Change is the New Normal. The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data. Types: structure drift, semantic drift, infrastructure drift.
  9. Example: Data Loss and Corrosion. SQL on Hadoop (Hive); Y/Y click-through rate. 80% of analyst time is spent preparing and validating data, while the remaining 20% is actual data analysis.
  10. Data Drift and Scale. At the micro level, data drift leads to breakage and errors. At the macro level, data drift brings your system to a grinding halt!
  11. The Problem of Data Exchange at Scale. Everyone wants each other's data, but it is often difficult to acquire. A tangled mess of data flow. A source of anguish and sorrow.
  12. The Problem of Data Exchange at Scale: Enter the Data Lake. The central store for valuable data. Mission: Data Lake, not Data Swamp.
  13. Great. A Data Lake. But how do you populate it? Problem: $$ cost, a question of scale • 25 companies • 9+ source types, mostly DBs • 1-to-many schemas per database • Many tables per schema. Example: AutoTrader -> Oracle -> ATM1: ~1600 tables.
  14. Great. A Data Lake. But how do you populate it? Problem: $$ cost, a question of scale • 25 companies • 9+ source types, mostly DBs • 1-to-many schemas per database • Many tables per schema. Example: AutoTrader -> Oracle -> ATM1: ~1600 tables. We've ingested about that much.
  15. Great. A Data Lake. But how do you populate it?
  16. Back to Square 0
  17. Back to Square 0
  18. Cox Automotive's StreamSets Architecture. Components: sources (databases, Amazon S3, files, FTP); StreamSets Acquisition; StreamSets RPC with load balancer; StreamSets Ingestion; targets (Hadoop filesystem, Big Data SQL, Amazon S3). Captions: data comes from a variety of sources; pipelines are established for each source; ingestion back pressure; scaling, secure, load-balanced; actual ingestion activities; on-premises and cloud big data systems. Data pipelines: separates acquisition from ingestion; dynamic error handling; encrypted data in transit. Data standards applied automatically: • Compression • File formats • Partitioning schemes • Row-level watermarks • Time-stamping. Ingestion farm scales with demand. Auto-creates schemas en route.
  19. Acquisition Deployment Model. Actors: ingestion team member; DevOps team member. Components: ingest form; virtual host deployment; StreamSets pipeline deployment; StreamSets acquisition pipeline; enterprise data sources; enterprise data lake. Diagram labels: start workflow; submit form; start workflow; build virtual host; deploy data pipeline.
  20. Throughput! [Chart: monthly ingestion requests, January through September, y-axis 0 to 400; annotated "StreamSets" and "7x"]
  21. Live Environment
  22. Where do we go from here? • Amazon Web Services • StreamSets Dataflow Performance Manager • Acquire/ingest decision point: centralized, federated, or democratized? • Quality • Streamline access to sources • Change data capture • Integration with enterprise data catalogs • Ingestion post-processing
  23. Questions
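
Three of the data standards named on the architecture slide (partitioning schemes, row-level watermarks, time-stamping) could be applied per record along the following lines. This is a minimal sketch and not Cox Automotive's implementation: the slides do not define "watermark", so it is interpreted here as a SHA-256 fingerprint of the source name and row content, and the field names `_ingested_at` and `_watermark`, the function names, and the `year=/month=/day=` path layout are all assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict


def apply_standards(record: Dict[str, Any], source: str, ingested_at: datetime) -> Dict[str, Any]:
    """Return a copy of the record with a timestamp and a row-level watermark.

    Assumption: the watermark is a fingerprint of the source name and row content.
    """
    # Serialize with sorted keys so the fingerprint does not depend on field order.
    body = json.dumps(record, sort_keys=True, default=str)
    enriched = dict(record)
    enriched["_ingested_at"] = ingested_at.isoformat()
    enriched["_watermark"] = hashlib.sha256(f"{source}|{body}".encode("utf-8")).hexdigest()
    return enriched


def partition_path(source: str, ingested_at: datetime) -> str:
    """Build a date-based partition path (a hypothetical scheme) for the target store."""
    return f"{source}/year={ingested_at:%Y}/month={ingested_at:%m}/day={ingested_at:%d}"


# Hypothetical usage with an arbitrary timestamp.
ts = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
row = apply_standards({"vin": "1HGCM82633A004352"}, "autotrader", ts)
print(row["_ingested_at"])
print(partition_path("autotrader", ts))
```

Applying the standards in one shared function, rather than in each source's pipeline, matches the slide's point that acquisition is separated from ingestion: every source passes through the same enrichment step.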