  1. Data Pipelines with Azure Synapse: Real-life scenarios and solutions. Dustin Vannoy
  2. Dustin Vannoy. Consultant, Data Engineer. /in/dustinvannoy, @dustinvannoy. Data Engineering SD meetup. Technologies: ➢ Azure & AWS ➢ Apache Spark ➢ Apache Kafka ➢ Azure Synapse Analytics ➢ Python & Scala
  3. Agenda: What is a Data Pipeline?; Technology Overview; Scenario 1: Ingest from Azure Storage; Scenario 2: Ingest from SQL Server; Scenario 3: Ingest streaming data
  4. What is a Data Pipeline?
  5. Defining Data Pipeline (General): A set of jobs that process data from one place to another.
  6. Defining Data Pipeline (Typical Use): The process of bringing data into a data lake or data warehouse, including cleaning, enriching, and transforming data.
  7. Data Lake Defined. Big Data Capable: store first, evaluate and model later. Data Zones: Raw, Enriched, Curated / Certified. Ready for Analysts: query layer, other analytic tools access.
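
The data zones named on this slide are commonly laid out as top-level folders in the lake. The following is a minimal sketch of that idea, not something shown in the talk: the storage account `mydatalake`, the container `lake`, and the lowercase folder names are all hypothetical.

```python
# Hypothetical zone folder names for illustration; not taken from the talk.
ZONES = ("raw", "enriched", "curated")


def zone_path(zone: str, dataset: str,
              account: str = "mydatalake", container: str = "lake") -> str:
    """Build an ADLS Gen2 (abfss) URI for a dataset inside a data zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"abfss://{container}@{account}.dfs.core.windows.net/{zone}/{dataset}"


# The same dataset gets one location per zone as it moves through the pipeline.
for zone in ZONES:
    print(zone_path(zone, "sales/orders"))
```

Keeping the zone in the path makes it easy to grant analysts access to the curated folder only.
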
  8. Data Warehouse Defined. Structured Data: processed and modeled for analytics use. Interactive query: analysts can get answers to questions quickly. BI tool support: reporting tools can query efficiently.
  9. Pipeline stages (diagram): Collect, Clean, Enrich, Curate, Make Available
  10. Data Ingestion Decisions: Do we use Azure Data Factory or Synapse Pipelines? How do we schedule and orchestrate job steps? How do we monitor job success? Do we attempt to validate data quality? Any field level encryption required?
  11. Technology Overview
  12. Data Lake Storage, Gen 2 • Built on Azure Blob Storage • Hadoop compatible access • Optimized for cloud analytics • Low cost: $
  13. Azure Synapse Analytics: Managed Apache Spark; Synapse Pipelines; Serverless & Dedicated SQL; Data Explorer
  14. Synapse Capabilities: Serverless Apache Spark for data processing and exploration; Synapse Pipelines for no-code or low-code data ingestion; Serverless SQL for easy querying; Dedicated SQL for high performance analytic queries using MPP database
  15. Ingest from Azure Storage
  16. Synapse Data Lake Ingest (diagram): Sources, Synapse Spark, Azure Data Lake Storage
  17. Why Spark? Big data and the cloud changed our mindset. We want tools that scale easily as data size grows. ⮚ Fast, general purpose data processing ⮚ Simple code for distributed processing ⮚ Many options to develop and run
  18. Simple code, parallel compute (diagram): one Controller, four Workers
  19. Demo: Azure Storage Ingest
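
The demo itself is not captured in this transcript. As a rough sketch of what a storage-to-lake ingest with Synapse Spark can look like, the function below reads CSV files and lands them as Parquet. The function name, the choice of CSV and Parquet, and the use of `inferSchema` are assumptions for illustration; `spark` is an existing SparkSession such as the one a Synapse notebook provides.

```python
def ingest_csv_to_raw(spark, source_path: str, target_path: str):
    """Read CSV files with Spark and land them as Parquet (hypothetical helper).

    `spark` is an existing SparkSession; paths are typically abfss:// URIs.
    """
    df = (
        spark.read
        .option("header", "true")       # first row holds column names
        .option("inferSchema", "true")  # assumption: fine for a demo, costly on big files
        .csv(source_path)
    )
    # Overwrite keeps the sketch idempotent; a real job may append or partition.
    df.write.mode("overwrite").parquet(target_path)
    return df


# Usage inside a notebook (paths are placeholders):
# ingest_csv_to_raw(spark, "abfss://.../landing/orders/", "abfss://.../raw/orders/")
```

The same few lines run unchanged whether the input is one file or thousands, which is the "simple code, parallel compute" point made earlier in the deck.
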
  20. Ingest from SQL Server
  21. Ingest from SQL Server: How can I keep the table schema? How will I maintain this as new tables get added? How will I deal with new or removed columns? Can I do a full reload of every table for every run? Is it outside of our Azure virtual network? Can a private endpoint be easily configured? Do I need to add specific IPs to an allow list?
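
Two of these questions (new tables, and full reload versus incremental) are often handled with a config-driven table list and a watermark column. The sketch below uses Spark's JDBC data source to show the idea; it is not the approach from the demo. The helper names, the `TABLES` list, the `ModifiedDate` column, and the host and credentials are hypothetical, and a real job would pull the password from a secret store.

```python
def jdbc_read_options(host, database, table, user, password,
                      watermark_column=None, last_value=None):
    """Build Spark JDBC options for a full or incremental SQL Server read."""
    if watermark_column and last_value is not None:
        # Incremental: push the filter down to SQL Server as a subquery.
        # Values are interpolated, so they must come from trusted config.
        dbtable = (f"(SELECT * FROM {table} "
                   f"WHERE {watermark_column} > '{last_value}') AS src")
    else:
        dbtable = table  # full reload of the whole table
    return {
        "url": f"jdbc:sqlserver://{host}:1433;databaseName={database}",
        "dbtable": dbtable,
        "user": user,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }


def read_table(spark, options: dict):
    """Load one table; Spark derives the DataFrame schema from the source table."""
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical table list: adding a table means adding a row, not new code.
TABLES = [
    {"table": "dbo.Customers"},  # small table, full reload
    {"table": "dbo.Orders", "watermark_column": "ModifiedDate",
     "last_value": "2024-01-01"},
]

for entry in TABLES:
    opts = jdbc_read_options("sqlhost.example.com", "sales",
                             user="etl_user", password="<from-secret-store>", **entry)
    print(opts["dbtable"])
```
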
  22. Demo: SQL Server Ingest
  23. Ingest from Event Stream
  24. Synapse Spark Streaming (diagram): Sources, Apache Kafka, Synapse Spark, Data Lake Storage
  25. Why Kafka? Apache Kafka is a scalable message broker / distributed log. Producers can quickly publish and move on while data is persisted for all consumers. Reliable place to stream events; decoupled from destination.
  26. Apache Kafka: Distributed log (message broker); Decouple producer and consumer; Durable storage; Low-latency; High scalability
  27. Hub for streaming data (diagram). Hub: Apache Kafka / Event Hubs. Labels: Post data, User data, User, Data Lake, Dashboard, Real-time report
  28. What is Spark Structured Streaming? "The simplest way to perform streaming analytics is not having to reason about streaming at all" - Tathagata Das “TD”. A table that is constantly appended with each micro-batch. Reference:
  29. Structured Streaming - Read: df = spark.readStream.format("kafka").options(**consumer_config).load()
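
The slide does not show what `consumer_config` contains. A minimal sketch, assuming a plain Kafka broker without authentication: the helper names, broker address, and topic are hypothetical, while the option keys are the ones the Spark Kafka source expects.

```python
def kafka_consumer_config(bootstrap_servers: str, topic: str,
                          starting_offsets: str = "latest") -> dict:
    """Options for spark.readStream.format("kafka") (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,                   # one topic or a comma-separated list
        "startingOffsets": starting_offsets,  # "earliest" or "latest"
    }


def decode_kafka(df):
    """Kafka delivers key and value as binary; cast them to strings."""
    return df.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )


consumer_config = kafka_consumer_config("broker1:9092", "events")
print(consumer_config)
```

A secured cluster, or the Kafka endpoint of Event Hubs, needs additional `kafka.`-prefixed security options that are left out here.
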
  30. Structured Streaming - Write: df.writeStream.format("kafka").options(**producer_config).option("checkpointLocation", "/tmp/cp001").start()
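
When writing to Kafka, Spark expects the DataFrame to carry a `value` column (and optionally `key`), and the target topic can be passed as the `topic` option. The sketch below shows one way to shape the rows and build `producer_config`; the helper names, broker address, topic, and the `user_id` column are hypothetical.

```python
def kafka_producer_config(bootstrap_servers: str, topic: str) -> dict:
    """Options for df.writeStream.format("kafka") (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "topic": topic,
    }


def to_kafka_records(df, key_column: str):
    """Serialize each row to JSON in `value`, keyed by one column."""
    return df.selectExpr(
        f"CAST({key_column} AS STRING) AS key",
        "to_json(struct(*)) AS value",
    )


producer_config = kafka_producer_config("broker1:9092", "events_clean")
print(producer_config)

# Usage with a streaming DataFrame `df` (not created here):
# to_kafka_records(df, "user_id").writeStream.format("kafka") ...
```
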
  31. Structured Streaming - Checkpoint: df.writeStream.format("delta").outputMode("append").option("checkpointLocation", "/chkpnt/dq1").start("/tmp/demo_out")
  32. Structured Streaming - Output Mode: df.writeStream.format("delta").outputMode("append").option("checkpointLocation", "/chkpnt/dq1").start("/tmp/demo_out")
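
The checkpoint location is where Spark records progress and state so a restarted query can resume where it stopped, and each streaming query needs its own location. Append mode writes only the rows added since the last micro-batch. The sketch below wraps the slide's write in a helper that derives a per-query checkpoint path; the helper names and the naming scheme are assumptions, not part of the talk.

```python
def checkpoint_for(base: str, query_name: str) -> str:
    """Give every streaming query its own checkpoint folder."""
    return f"{base.rstrip('/')}/{query_name}"


def start_delta_sink(df, output_path: str, checkpoint_base: str, query_name: str):
    """Start an append-mode Delta sink for a streaming DataFrame."""
    return (
        df.writeStream
        .format("delta")
        .outputMode("append")   # only new rows are written each micro-batch
        .queryName(query_name)  # shows up in the Spark UI and progress logs
        .option("checkpointLocation", checkpoint_for(checkpoint_base, query_name))
        .start(output_path)
    )


print(checkpoint_for("/chkpnt/", "dq1"))

# Usage with a streaming DataFrame `df` (not created here):
# query = start_delta_sink(df, "/tmp/demo_out", "/chkpnt", "dq1")
```

Deleting or reusing a checkpoint folder changes what the query considers already processed, so it is usually treated as part of the job's state.
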
  33. Spark Streaming Benefits ● Re-use Spark batch code ● Stateful streaming and joins ● Mature with many integrations ● Kafka or Event Hubs not required
  34. Demo: Ingest Event Stream
  35. Final Thoughts