Building Data Pipelines with Spark and StreamSets

Big data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Metadata in upstream sources can ‘drift’ due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail. StreamSets Data Collector (SDC) is an Apache 2.0 licensed open source platform for building big data ingest pipelines that allows you to design, execute and monitor robust data flows. In this session we’ll look at how SDC’s “intent-driven” approach keeps the data flowing, with a particular focus on clustered deployment with Spark and other exciting Spark integrations in the works.

  1. Building Data Pipelines with Spark and StreamSets. Pat Patterson, Community Champion, @metadaddy, pat@streamsets.com
  2. Agenda: Data Drift; StreamSets Data Collector; Running Pipelines on Spark Today; Future Spark Integration; Demo
  3. The Evolution of Data-in-Motion. Past: ETL, ETL. Emerging: Ingest, Analyze. Data Sources, Data Stores, Data Consumers
  4. Data Drift - a Data Engineering Headache: the unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data. Structure Drift; Semantic Drift; Infrastructure Drift
  5. Example: Data Loss and Corrosion. SQL on Hadoop (Hive); Y/Y Click Through Rate. 80% of analyst time is spent preparing and validating data, while the remaining 20% is actual data analysis
  6. Solving Data Drift (diagram labels): Delayed and False Insights; Tools; Applications; Data Stores; Data Consumers; Poor Data Quality; Data Drift; Custom code; Fixed-schema; Data Sources; // DIY Custom Code
  7. Solving Data Drift (diagram labels): Trusted Insights; Data KPIs; Tools; Applications; Data Drift; Intent-Driven; Drift-Handling; Data Stores; Data Consumers; Data Sources; // DIY Custom Code
  8. StreamSets Data Collector: open source software for the rapid development and reliable operation of complex data flows. ➢ Intent-driven ➢ UI Abstraction ➢ Extensible
  9. Handling Drift with Hive: ➢ Monitor data structure ➢ Detect schema change ➢ Alter Hive Metadata
  10. Running Pipelines on Spark Today: spark-submit --num-executors ... --archives ... --files ... --jar ... --class ... ● Container on Spark ● Leverage Kafka RDD ● Scale out for performance
  11. SDC on Spark - Connectivity. Sources: Kafka. Destinations: HDFS, HBase, S3, Kudu, MapR DB, Cassandra, ElasticSearch, Kafka, MapR Streams, Kinesis, etc.
  12. Future Directions
  13. Run Pipelines on Databricks: ● Container on Databricks ● Leverage REST API ● Add S3 origin. Example job definition: { "name": "StreamSets Data Collector", "new_cluster": { "num_workers": 1 }, "spark_jar_task": { "main_class_name": "com.streamsets..." } }
  14. Break Out Spark Processor (Local Mode): ● Standalone containers, Spark processor ● Leverage Spark code ● Custom RDD ● Start local Spark job for each batch ● Example use cases: running image classification, sentiment analysis
  15. Spark Processor - Connectivity. Sources: Kafka, S3, MapR Streams, JDBC, MongoDB, Local Filesystem, Redis, JMS, HTTP, UDP, etc. Destinations: HDFS, HBase, S3, Kudu, MapR DB, Cassandra, ElasticSearch, Kafka, MapR Streams, JDBC, etc.
  16. Deepen Spark Integration: ● Container on Spark, Spark processor ● Leverage Spark code ● Custom RDD ● Start Spark job ‘on cluster’ for each pipeline ● Example use cases: training image classification, sentiment analysis
  17. Spark Integration Architecture
  18. SDC on Spark - Connectivity Tomorrow. Sources: Kafka, S3, MapR Streams, JDBC, MongoDB, Redis, JMS, HTTP, UDP, ...any partitionable data source... Destinations: HDFS, HBase, S3, Kudu, MapR DB, Cassandra, ElasticSearch, Kafka, MapR Streams, JDBC, etc.
  19. Demo
  20. Conclusion: StreamSets Data Collector brings a UI abstraction to Spark. Standalone container + local Spark Processor bring wide connectivity to Spark code. Spark Container + Spark Processor allow iterative Spark code in pipelines.
  21. Resources: Download StreamSets Data Collector (https://streamsets.com/opensource); Contribute Code (https://github.com/streamsets/datacollector); Get Involved (https://streamsets.com/community)
  22. Thank You!
  23. Backup Slides
  24. Structure Drift: data structures and formats evolve and change unexpectedly. Implication: data loss, data squandering. Delimited data. Attribute format changes: 107.3.137.195, fe80::21b:21ff:fe83:90fa. Data structure evolution: { "first": "jon", "last": "smith", "email": "jsmith@acme.com", "add1": "123 Washington", "add2": "", "city": "Tucson", "state": "AZ", "zip": "85756" } { "first": "jane", "last": "smith", "email": "jane@earth.net", "add1": "456 Fillmore", "add2": "Apt 120", "city": "Fairfield", "state": "VA", "zip": "24435-1001", "phone": "401-555-1212" }
  25. Semantic Drift: data semantics change with evolving applications. Implication: data corrosion, data loss. Account number expansion: 24122-52172 becomes 00-24122-52172. Namespace qualification: "M134: user {jsmith} read access granted {ac:24122-52172}" becomes "M134: user {jsmith} read access granted {ca.ac:24122-52172}". Outlier / anomaly detection: …… …,3588310669797950,$91.41,jcb,K1088-W#9,… …,6759006011936944,$155.04,switch,A6504-Y#9,… …,6771111111151415,$37.78,laser,Q9936-T#9,… …,3585905063294299,$164.48,jcb,S4643-H#9,… …,5363527828638736,$117.52,mastercard,X3286-P#9,… …,4903080150282806,$168.03,switch,I9133-W#3,… ……
  26. Infrastructure Drift: physical and logical infrastructure changes rapidly. Implication: poor agility, operational downtime. Diagram labels: Data Center 1, Data Center 2, Data Center n; 3rd Party Service Provider; App a, App k, App q; Cloud Infrastructure
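
The drift-handling steps on the "Handling Drift with Hive" slide (monitor data structure, detect schema change, alter Hive metadata) can be sketched in Python using the two customer records from the Structure Drift backup slide. This is only an illustration of the idea, not how StreamSets Data Collector implements it: the function names, the type mapping, and the `customers` table name are all assumptions made for the example.

```python
from typing import Dict

# Assumed mapping from Python value types to Hive column types (illustrative only).
HIVE_TYPES = {str: "STRING", int: "BIGINT", float: "DOUBLE", bool: "BOOLEAN"}


def infer_schema(record: Dict[str, object]) -> Dict[str, str]:
    """Map each field name to a Hive type, defaulting to STRING."""
    return {name: HIVE_TYPES.get(type(value), "STRING")
            for name, value in record.items()}


def new_columns(known: Dict[str, str], record: Dict[str, object]) -> Dict[str, str]:
    """Return the fields present in the record but missing from the known schema."""
    incoming = infer_schema(record)
    return {name: hive_type for name, hive_type in incoming.items()
            if name not in known}


def alter_statement(table: str, added: Dict[str, str]) -> str:
    """Build a Hive ALTER TABLE ... ADD COLUMNS statement for the new fields."""
    cols = ", ".join(f"`{name}` {hive_type}" for name, hive_type in added.items())
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"


# The two records shown on the Structure Drift slide.
first = {"first": "jon", "last": "smith", "email": "jsmith@acme.com",
         "add1": "123 Washington", "add2": "", "city": "Tucson",
         "state": "AZ", "zip": "85756"}
second = {"first": "jane", "last": "smith", "email": "jane@earth.net",
          "add1": "456 Fillmore", "add2": "Apt 120", "city": "Fairfield",
          "state": "VA", "zip": "24435-1001", "phone": "401-555-1212"}

known = infer_schema(first)          # schema observed so far
added = new_columns(known, second)   # detect the drifted field
statement = alter_statement("customers", added)  # "customers" is a hypothetical table
print(statement)
```

The second record introduces a `phone` field, so the sketch emits a single `ADD COLUMNS` clause for it. A real pipeline would also need to handle type changes and removed fields, which this example deliberately leaves out.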
