
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA


EFSA is the European agency providing independent scientific advice on existing and emerging risks across the entire food chain. On 27 March 2021 a new EU regulation (EU 2019/1381) took effect, requiring EFSA to significantly increase the transparency of its risk assessment processes towards all citizens.

To comply with this new regulation, delaware has been supporting EFSA in a large digital transformation program. We have been designing and rolling out a modern data platform running on Azure and powered by Databricks. The platform acts as a central control tower brokering data between a variety of applications. It is built around modularity principles, making it adaptable and versatile while keeping the overall ecosystem aligned with changing processes and data models. At the heart of the platform lie two important patterns:

1. An Event-Driven Architecture (EDA), enabling an extremely loosely coupled system landscape. Because events are centrally brokered in near real time, consumer applications can react immediately to events from producer applications as they occur. Event producers are decoupled from consumers via a publish/subscribe mechanism.

2. A central data store built around a lakehouse architecture. The lakehouse collects, organizes and serves data across all stages of the data processing cycle, all data types and all data volumes. Event streams from the EDA layer feed into the store as curated data blocks and are complemented by other sources. The store in turn feeds APIs, reporting and applications, including the new Open EFSA portal: a public website developed by delaware hosting all relevant scientific data, updated in near real time. (A minimal ingestion sketch follows this list.)
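The abstract does not name the event broker or show code, so the following is only a minimal sketch of how brokered events could land in the lakehouse. It assumes a Kafka-compatible endpoint (as Azure Event Hubs offers) and a hypothetical topic name; authentication options are omitted, and `spark` is the ambient SparkSession of a Databricks notebook.

```python
# Sketch: stream brokered events into a raw Delta table in the lakehouse.
# Endpoint, topic and paths are illustrative; auth options omitted.
from pyspark.sql import functions as F

events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
         .option("subscribe", "mandate-events")
         .load()
)

# Keep the raw payload plus ingestion metadata; downstream jobs curate it.
(
    events.select(
              F.col("value").cast("string").alias("payload"),
              F.col("timestamp").alias("ingested_at"))
          .writeStream
          .format("delta")
          .option("checkpointLocation", "/mnt/datalake/_checkpoints/mandate_events")
          .outputMode("append")
          .start("/mnt/datalake/raw/mandate_events")
)
```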

At delaware we are very excited about this project and proud of what we have achieved with EFSA so far.



  1. Building the foundations of an intelligent, event-driven data platform at EFSA. Sebastiaan Leysen, Data Platform Lead, delaware BeLux; David De Wilde, Solution Architect, delaware BeLux
  2. About the speakers: Sebastiaan Leysen, Senior Consultant in Data & Analytics, Data Platform Lead; David De Wilde, Ph.D., Lead Expert in Data & Analytics, Solution Architect
  3. About EFSA
  4. About EFSA
  5. About EFSA
  6. About EFSA: independent scientific advice; collects & analyzes research data. Risk assessment feeds risk management: policy making on food safety and legislative decisions.
  7. EFSA’s scientific risk assessment process, step 1: receipt of request. A request for scientific advice (a mandate) comes from the European Commission, the European Parliament or other national food safety bodies. More information: http://www.efsa.europa.eu/en/interactive-pages/scientificprocess/ScientificProcess
  8. Step 1, continued: the mandate is assigned to panels of scientific experts and related working groups, covering pesticides, plant health, genetically modified organisms, food contact materials, food additives, nutrition, contaminants, biological hazards, animal health and welfare, and animal feed.
  9. Step 2: risk assessment. Scientific evidence (literature, data, expertise) and public consultations lead to an opinion adopted by the panel (here, the food contact materials panel).
  10. Step 3: adoption. The opinion is published in the EFSA Journal and sent back to the original requestors: the European Commission, the European Parliament or other national food safety bodies.
  11. A digital transformation opportunity…
  12. A digital transformation opportunity… ensuring more transparency, reliability and objectivity; strengthening the independence of studies; comprehensive risk communication.
  13. Solution Architecture
  14. To help EFSA comply, we worked on three main pillars. DATA PLATFORM: a modern data platform supporting the transparency project and laying the foundation for future data projects. INTEGRATION: an event-driven integration architecture that is extremely loosely coupled and centrally brokers events in real time, following a publish/subscribe mechanism. OPEN EFSA PORTAL: a public website making unprecedented amounts of risk-assessment data publicly accessible and enabling greater scrutiny of EFSA’s work.
  15. Control tower analogy, the airport: the tower communicates with airplanes, tracks airplane movements in real time, monitors changing weather conditions, validates flight plans and gives clearance, schedules arrivals and departures, and orchestrates ground processes.
  16. Control tower analogy, the data platform: it brokers information between applications in real time, harmonizes and consolidates data streams, validates incoming data (quality checks), serves data to other parties (internal or external), monitors data flows and platform health, and orchestrates data movements.
  17. Pillar 1: real-time, event-driven communication between applications (app 1 through app 5).
  18. Pillar 1, real-time event-driven communication, combined with pillar 2: a central knowledge base.
  19. Central knowledge base fed by events: FSCAP and future sources feed the Azure data store, which serves the Open EFSA Portal, the internal dossier viewer, management reporting and the API portal.
  20. Open EFSA portal: https://open.efsa.europa.eu
  21. API developer portal
  22. Making it work: the control tower and the central data store.
  23. Making it work: the central data store is layered. (1) Raw: data in native format, as received from the source system. (2) Curated: data in a cleaned, validated & standardized format; high confidence. (3) Canonical data model: a ‘shelf’ of re-usable pieces of information, constituting the business information model. (4) Views: data ready for consumption by other systems, optimized for reading.
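As an illustration of the raw-to-curated step, here is a minimal sketch; the table paths and column names are hypothetical, not the actual EFSA logic.

```python
# Sketch: promote data from the raw layer to the curated layer in Delta.
from pyspark.sql import functions as F

raw = spark.read.format("delta").load("/mnt/datalake/raw/mandates")

curated = (
    raw
    .dropDuplicates(["mandate_id"])                       # standardize: one row per key
    .filter(F.col("mandate_id").isNotNull())              # validate: reject incomplete rows
    .withColumn("received_on", F.to_date("received_on"))  # clean: enforce types
)

curated.write.format("delta").mode("overwrite").save("/mnt/datalake/curated/mandates")
```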
  24. Technical challenges
  25. Technical challenges: roughly 350 tables → 350 notebooks; quick reload times; cost efficiency; a changing team with different experience levels; keeping a consistent way of working; central logging of the data pipelines; data governance and a data catalog; housekeeping (optimize, initialize, set properties).
  26. Framework
  27. Focus on adding business value by automating the repeatable, non-value-adding activities: logging, ingesting data, housekeeping, orchestration, governance, exporting data, writing to the Delta lake, optimization, lineage, and more. There is a strict split between business logic (the dataframe definition) and everything else (the framework); a minimal sketch of the split follows.
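A minimal sketch of a business-logic-only notebook under this split: it defines a dataframe (an execution plan) and performs no writes, logging or housekeeping. Function and column names are hypothetical.

```python
# Sketch: the developer writes only the transformation; the framework
# wraps the result and handles persistence, logging and optimization.
from pyspark.sql import functions as F

def curated_mandates(spark):
    """Pure transformation: returns a dataframe, triggers no actions."""
    raw = spark.read.format("delta").load("/mnt/datalake/raw/mandates")
    return (raw.filter(F.col("mandate_id").isNotNull())
               .dropDuplicates(["mandate_id"]))
```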
  28. Object-oriented framework built on two classes. DataLakeTable: a dataframe plus extra configuration properties and source information (lineage). DataLake (a group of DataLakeTable objects): a DAG (directed acyclic graph) plus orchestration functionality. Developers create and change only the dataframes, whose execution plans are the ETL steps, i.e. the business logic and the actual value; the framework handles everything else. A sketch of the two classes follows.
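A hypothetical reconstruction of the two classes: the deck names DataLakeTable and DataLake, but the fields and methods below are assumptions for illustration.

```python
# Sketch of the framework's two classes, as described on the slide.
from dataclasses import dataclass, field

@dataclass
class DataLakeTable:
    name: str
    dataframe: object                               # Spark execution plan (business logic)
    sources: list = field(default_factory=list)     # upstream table names (lineage)
    properties: dict = field(default_factory=dict)  # e.g. z-order columns, export targets

class DataLake:
    """A group of DataLakeTable objects plus orchestration over their DAG."""
    def __init__(self):
        self.tables = {}

    def add(self, table: DataLakeTable):
        self.tables[table.name] = table

    def dag(self):
        # Maps each table name to the set of tables it depends on.
        return {name: set(t.sources) for name, t in self.tables.items()}
```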
  29. Metadata-driven: parameters and the actual transformations (the business logic) live in the Databricks notebook environment; orchestration saves (merges) to the data lake, exports save (merge) to SQL DB and other targets, and the data is catalogued.
  30. Executing the non-value-adding activities by calling a single method on the DataLake object. Build phase (CI): all DataLakeTables are created from the notebooks, collected in a single DataLake object and stored to disk (datalake.save). Deploy phase (CD): load the DataLake object (datalake.load) and initialize non-existing Delta tables while updating existing ones (datalake.initialize). Load data: load new data (datalake.run) and export updated data (datalake.export). Housekeeping: run a metadata-driven optimize, including which columns to Z-order by (datalake.optimize). Governance: a helper notebook visualizes the DAG from the DataLake; another exports the data catalog to Excel (backlog: export to Azure Purview). A usage sketch follows.
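A usage sketch tying the lifecycle together. Only the method names come from the slide; the storage path and exact signatures are assumptions.

```python
# Sketch: one method call per lifecycle stage on the DataLake object.
datalake = DataLake.load("/mnt/framework/datalake")  # deploy phase (CD)
datalake.initialize()   # create missing Delta tables, update existing ones
datalake.run()          # load new data through the DAG
datalake.export()       # push updated data to SQL DB, files, APIs, ...
datalake.optimize()     # metadata-driven OPTIMIZE / Z-ORDER housekeeping
```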
  31. Optimize the orchestration process with a directed acyclic graph (DAG).
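One way to realize DAG-based orchestration, sketched with Python's standard library; the table names are illustrative.

```python
# Sketch: load tables in dependency order via a topological sort.
from graphlib import TopologicalSorter

# Each table maps to the set of tables it depends on (its sources).
dag = {
    "curated_mandates": {"raw_mandates"},
    "curated_outputs":  {"raw_outputs"},
    "portal_view":      {"curated_mandates", "curated_outputs"},
}

# Dependencies come first, so each table loads after its sources.
for table in TopologicalSorter(dag).static_order():
    print("load", table)
```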
  32. The challenge: many different event types, roughly 350 tables, and a limited number of events per batch make it inefficient to reload all data for every batch. Data heterogeneity hinders efficient execution.
  33. The solution: data-lineage-based skipping of data loading. Determine all descendants of the received event types (here, types 2 and 4); only those tables need to be reloaded. A sketch follows.
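A sketch of the descendant computation behind lineage-based skipping; the graph shape and names are illustrative.

```python
# Sketch: from the event types in the batch, find every downstream table.
from collections import defaultdict, deque

def descendants(dag, changed):
    """dag maps table -> set of its sources; returns changed plus all dependents."""
    children = defaultdict(set)
    for table, sources in dag.items():
        for src in sources:
            children[src].add(table)
    seen, queue = set(changed), deque(changed)
    while queue:
        for child in children[queue.popleft()] - seen:
            seen.add(child)
            queue.append(child)
    return seen

dag = {
    "t_mandates":  {"event_2"},
    "t_questions": {"event_4"},
    "t_outputs":   {"t_mandates"},
    "t_documents": {"event_1"},
}
# Events 2 and 4 arrive: t_documents is skipped.
print(descendants(dag, {"event_2", "event_4"}))
```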
  34. In practice for EFSA: a new MandateCreated event loads 40 tables instead of 350, thanks to data-lineage-based skipping.
  35. Implementation highlight: exporting data
  36. Exports implemented: Azure SQL DB (merge into, or overwrite); Azure storage account in Spark file formats (txt, csv, parquet, json, …); Azure storage account as PDF (via WeasyPrint in Python); REST API calls based on the data. For example, an HTML column can be exported to PDF and the same data exported via an API call. All exports are implemented in the DataLakeTable class and are metadata-driven; a dispatch sketch follows.
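A sketch of what a metadata-driven dispatch over these export types might look like. The metadata schema, the column names and the helpers merge_into_sqldb, overwrite_sqldb and post_rows are assumptions; only the export types come from the slide.

```python
# Sketch: route a dataframe to its export target based on table metadata.
import weasyprint  # real library; HTML(...).write_pdf(...) is its documented API

def export_table(df, meta):
    if meta["target"] == "sqldb":
        if meta.get("mode") == "merge":
            merge_into_sqldb(df, meta)   # MERGE INTO on Azure SQL DB
        else:
            overwrite_sqldb(df, meta)    # see the overwrite sketch below
    elif meta["target"] == "file":       # txt, csv, parquet, json, ...
        df.write.format(meta["format"]).mode("overwrite").save(meta["path"])
    elif meta["target"] == "pdf":        # render an HTML column to PDF files
        for row in df.select("doc_id", "html").toLocalIterator():
            weasyprint.HTML(string=row["html"]).write_pdf(
                f"/mnt/exports/{row['doc_id']}.pdf")
    elif meta["target"] == "api":        # REST calls based on the data
        post_rows(df, meta["endpoint"])
```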
  37. Code re-usability through class methods: one version and place for the code, and one place for code improvements (normal JDBC → bulk JDBC; pyODBC → Spark’s built-in JDBC driver°) and extended functionality (table switching, bulk copy, metadata-templated SQL execution). Example: export to SQL via overwrite, sketched below. ° https://medium.com/delaware-pro/executing-ddl-statements-stored-procedures-on-sql-server-using-pyspark-in-databricks-2b31d9276811
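A sketch of the overwrite-with-table-switching export using Spark's built-in JDBC writer. The staging convention and the execute_sql helper are assumptions; the linked article covers running DDL via JDBC from PySpark, which is what the helper stands in for.

```python
# Sketch: write to a staging table, then swap it into place so readers
# never observe a half-written table. SQL Server syntax assumed.
def overwrite_sqldb(df, meta):
    staging = meta["table"] + "_staging"
    (df.write
       .format("jdbc")
       .option("url", meta["jdbc_url"])
       .option("dbtable", staging)
       .mode("overwrite")
       .save())
    # Metadata-templated swap, executed via a JDBC DDL helper (hypothetical).
    execute_sql(meta["jdbc_url"],
                f"DROP TABLE IF EXISTS {meta['table']}; "
                f"EXEC sp_rename '{staging}', '{meta['table']}';")
```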
  38. Retake: technical challenges
  39. Retake: the challenges revisited. Roughly 350 tables → 350 notebooks, quick reload times, cost efficiency: DAG-based loading of the affected subtree, running on a single job cluster. A changing team with different experience levels and a consistent way of working: a bare minimum of coding, only the ETL transformations. Central logging, data governance and catalog, housekeeping: all dealt with by the framework through a single, metadata-driven function call. Better code quality, development speed and performance come from splitting the repetitive housekeeping and orchestration (the framework) from the value-adding business logic (the ETL transformations).
  40. Thank you! Sebastiaan Leysen, Data Platform Lead; David De Wilde, Solution Architect. Interested in more? Visit delaware.pro. Takeaways: don’t lose control, organize your data flows; be lazy, split logic from housekeeping for re-usability; enrich an event-driven architecture with a central data store.
  41. Feedback: your feedback is important to us. Don’t forget to rate and review the sessions.
