
CI/CD for Data - Building Data Development Environment with lakeFS



Data pipelines rarely stay still. Instead, there are near-constant updates to some aspect of the infrastructure they run on, or to the logic they use to transform data, to give two examples.

Efficiently applying the necessary changes to a pipeline requires running it in parallel to production to test the effect of each change. Most data engineers would agree that the best way to do this is far from a solved problem.

Most attempts fall at one of two extremes: either tests are executed against overly simplified, hardcoded sample data that lets through errors that will only appear with production data, or they are executed in a maintenance-intensive dev environment that requires duplicating the production data, which also massively increases the risk of a breach or data privacy violation.

The open source project lakeFS lets one find the much-needed middle ground for testing data pipelines by making it possible to instantly clone a data environment through a zero-copy branching operation. This enables a safe and automated development environment for data pipelines that avoids the pitfalls of copying datasets, mocking them, or testing against production pipelines.

In this session, you will learn how to use lakeFS to quickly set up a development environment and use it to develop and test data products without risking production data.


CI/CD for Data - Building Data Development Environment with lakeFS

  1. CI/CD for Data – Building Dev/Test Data Environments With lakeFS. Vino SD, Developer Advocate
  2. About Me ■ I am Vino Duraisamy. ■ SWE -> Data/ML Engineer -> Developer Advocate ■ Open-source contributor @lakeFS @vinodhini_sd @vinodhini-sd vinodhini-sd.medium.com
  3. A Day in the Life of a Data Engineer! Expectation vs. Reality
  4. What Can Go Wrong? ■ EMR/Spark upgrades ■ Schema changes ■ Changes in business logic ■ Infrastructure changes ■ Troubleshooting failed Spark jobs ■ Backfilling historical data ■ Non-idempotent pipelines ■ Maintaining legacy DAGs/pipelines
  5. Effective Testing Strategy ■ Unit Testing ■ Integration Testing ■ E2E Testing ■ Realistic Input Data ■ Mock data ■ Sampled production data ■ Copy of all production data
  6. CI/CD for Data – Apply engineering best practices to data engineering
  7. CI/CD Best Practices. Build ■ Place all data assets under data version control. Test ■ Create isolated data environments on-demand ■ Automate all your tests. Deploy ■ Data quality checks passed ■ Automatic rollbacks are in place
  8. CI/CD Best Practices. Build ■ Place all data assets under data version control. Test (CI) ■ Create isolated data environments on-demand ■ Automate all your tests. Deploy ■ Data quality checks passed ■ Automatic rollbacks are in place
  9. CI/CD Best Practices. Build ■ Place all data assets under data version control. Test (CI) ■ Create isolated data environments on-demand ■ Automate all your tests. Deploy (CD) ■ Data quality checks passed ■ Automatic rollbacks are in place
  10. Build – Version Control
  11. Place All Data Assets Under Version Control ■ Data repo: branches of the whole repository ■ Data versioning at the object level
  12. Git for Data with lakeFS ■ s3://data-repo/collections/foo → lakefs://data-repo/main/collections/foo ■ lakectl branch create lakefs://repo/experiment-1 --source lakefs://repo/main # output: created branch 'experiment-1', pointing to commit ID: 'd1e9adc71c10a'
  13. How lakeFS Works
  14. Test – CI
  15. Safe CI: Create Isolated Data Environments On-Demand ■ Ingestion ■ Experimentation
  16. Automated Tests Using lakeFS Hooks
  17. Demo Time
  18. Deploy – CD
  19. Automating Data Quality Checks Using lakeFS Hooks ■ Conceptually similar to GitHub Actions ■ Hooks run against a standalone webserver that performs tests when triggered ■ Run custom tests, or use test suites like Great Expectations
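As a rough sketch of the hook configuration the slide describes, a lakeFS actions file is committed to the repository under `_lakefs_actions/` and points lakeFS at the webserver that runs the tests. The repository layout is standard, but the branch name and webhook URL below are hypothetical:

```yaml
# _lakefs_actions/pre-merge-quality-check.yaml
# Hypothetical example: run a data-quality webhook before any merge into main.
name: pre merge quality check
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: data_quality_check
    type: webhook
    properties:
      # URL of the standalone hooks webserver (assumed address)
      url: http://hooks-server:8000/quality-check
```

If the webhook returns a failure, the merge into `main` is blocked, which is what keeps bad data out of production.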
  20. CD: Automatic Rollbacks in Place ■ post_merge event: failed hook runs can programmatically revert the changes to a branch ■ Reliability of data in prod. $ lakectl revert main^1
  21. Data Lakes with lakeFS ■ Easily revert corrupted data ■ Time travel to debug a snapshot ■ Safely test your pipelines on a branch isolated from production ■ Production-identical isolated branches and hooks to test your data
  22. .io
  23. Thank You
