SlideShare una empresa de Scribd logo
1 de 21
Descargar para leer sin conexión
Presented By:
Kundan Kumar
Software Consultant
Spark With Delta
Lake
Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
Punctuality
Respect Knolx session timings, you
are requested not to join sessions
after a 5 minutes threshold post
the session start time.
Feedback
Make sure to submit a constructive
feedback for all sessions as it is
very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent
mode, feel free to move out of
session in case you need to attend
an urgent call.
Avoid Disturbance
Avoid unwanted chit chat during
the session.
Agenda
01 What & Why Delta Lake
02 Features Of Delta Lake
03 Delta Lake Transaction
Log
04 Demo
Brings Data Reliability and Performance to Data
Lakes
What is Data Lake?
A Data Lake is a centralized repository that can store large amount of
structured, semi-structured, and unstructured data. It is a place to store
every type of data in its native format with no fixed limits on account size or
file. It offers high data quantity to increase analytic performance and native
integration.
Why Delta Lake?
1. Data reliability challenges with data lake.
2013 2014 2018
Failed production
job
Orphan Data
No Schema
Enforcement
2. ACID Transactions: Critical Feature missing from Spark.
What is Delta Lake?
Delta Lake is basically a open source storage layer that would sit on top of your
existing data lake and is fully compatible with Apache Spark APIs. It brings ACID
transactions to Apache Spark and big data workloads.
Features of Delta Lake
1. ACID Transaction
2. Scalable metedata handling
3. Open Format
4. Time Travel
5. Schema Enforcement & Evolution
6. Updates and Deletes
7. Unified Batch and Streaming
ACID Transactions
Delta Lake brings Atomicity, Consistency, Isolation and Durability (ACID)
transactions to your data lakes. It provides serializability, the strongest level of
isolation level and ensure that readers never see inconsistent data.
Transaction Log
The Delta Lake transaction log (also known as the DeltaLog) is an ordered
record of every transaction that has ever been performed on a Delta Lake
table since its inception. It is a single source of truth.
Optimistic Concurrency Control
Optimistic concurrency control is a method of dealing with concurrent
transactions that assumes that transactions (changes) made to a table by
different users can complete without conflicting with one another.
Time Travel (Data Versioning)
Delta Lake time travel allows us to query an older snapshot of a Delta Lake table. This
time-traveling can be achieved using 2 approaches:
1. Using a version number
2. Using a timestamp
Time travel has many use cases, including:
● Time travel makes it easy to do rollbacks in case of bad writes, playing an important role in fixing
mistakes in our data.
● It helps in re-creating analysis, reports, or outputs (for example, the output of a machine learning
model). This could be useful for debugging or auditing, especially in regulated industries.
● It also simplifies time-series analytics. For instance, in finding out how many new customers were
added over the last week.
Scalable Metadata Handling: In big data, even the metadata itself can be "big
data". Delta Lake treats metadata just like data, leveraging Spark's distributed
processing power to handle all its metadata. As a result, Delta Lake can handle
petabyte-scale tables with billions of partitions and files at ease.
Open Format: All data in Delta Lake is stored in Apache Parquet format enabling
Delta Lake to leverage the efficient compression and encoding schemes that are
native to Parquet. Apache Parquet is column-oriented and designed to bring
efficient columnar storage of data compared to row-based like CSV.
.
id name age
123 xyz 21
321 abc 20
123 xyz 21 321 abc 20
123 321 xyz abc 21 20
Row Orriented
Column Oriented
Schema Enforcement: Schema enforcement, also known as schema validation,
is a safeguard in Delta Lake that ensures data quality by rejecting writes to a
table that do not match the table’s schema. To determine whether a write to a
table is compatible, Delta Lake uses the following rules:
1. Cannot contain any additional columns that are not present in the target table’s schema.
2. Cannot have column data types that differ from the column data types in the target table.
3. Cannot contain column names that differ only by case.
Schema Evolution: Delta Lake enables you to make changes to a table schema
that can be applied automatically.
Updates, Deletes and Merges
Delta Lake supports Scala / Java APIs to merge, update and delete datasets.
This allows you to easily comply with GDPR and CCPA.
Updates: We can update data that matches a predicate in a Delta Lake table.
Deletes: We can remove data that matches a predicate from a Delta Lake
table.
Merges: we can upsert data from a Spark DataFrame into a Delta Lake table
using the merge operation.
Unified Batch and Streaming Source and Sink
A table in Delta Lake is both a batch table, as well as a streaming source and
sink. Streaming data ingest, batch historic backfill, and interactive queries all
just work out of the box.
DEMO
Q/A
References
1. Welcome to the Delta Lake documentation — Delta Lake
Documentation
2. Spark: ACID compliant or not
3. Spark: ACID Transaction with Delta Lake
4. Time Travel: Data versioning in Delta Lake
Thank You !

Más contenido relacionado

La actualidad más candente

Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar ZecevicDataScienceConferenc1
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Getting Started with Databricks SQL Analytics
Getting Started with Databricks SQL AnalyticsGetting Started with Databricks SQL Analytics
Getting Started with Databricks SQL AnalyticsDatabricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Databricks
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkDatabricks
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lakeMykola Zerniuk
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in DeltaDatabricks
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 

La actualidad más candente (20)

Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Getting Started with Databricks SQL Analytics
Getting Started with Databricks SQL AnalyticsGetting Started with Databricks SQL Analytics
Getting Started with Databricks SQL Analytics
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 

Similar a Spark with Delta Lake

Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksKnoldus Inc.
 
Remote DBA Experts 11g Features
Remote DBA Experts 11g FeaturesRemote DBA Experts 11g Features
Remote DBA Experts 11g FeaturesRemote DBA Experts
 
Sql server-dba
Sql server-dbaSql server-dba
Sql server-dbaNaviSoft
 
Oracle to Netezza Migration Casestudy
Oracle to Netezza Migration CasestudyOracle to Netezza Migration Casestudy
Oracle to Netezza Migration CasestudyAsis Mohanty
 
Help! my sql server log file is too big!!! tech republic
Help! my sql server log file is too big!!!   tech republicHelp! my sql server log file is too big!!!   tech republic
Help! my sql server log file is too big!!! tech republicKaing Menglieng
 
Why & how to optimize sql server for performance from design to query
Why & how to optimize sql server for performance from design to queryWhy & how to optimize sql server for performance from design to query
Why & how to optimize sql server for performance from design to queryAntonios Chatzipavlis
 
Migration to Oracle 12c Made Easy Using Replication Technology
Migration to Oracle 12c Made Easy Using Replication TechnologyMigration to Oracle 12c Made Easy Using Replication Technology
Migration to Oracle 12c Made Easy Using Replication TechnologyDonna Guazzaloca-Zehl
 
Sql 2016 - What's New
Sql 2016 - What's NewSql 2016 - What's New
Sql 2016 - What's Newdpcobb
 
Understanding the SQL aspects of Spark - Spark SQL.pptx
Understanding the SQL aspects of Spark - Spark SQL.pptxUnderstanding the SQL aspects of Spark - Spark SQL.pptx
Understanding the SQL aspects of Spark - Spark SQL.pptxKnoldus Inc.
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkDatabricks
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfIlham31574
 
PHP Detroit -- MySQL 8 A New Beginning (updated presentation)
PHP Detroit -- MySQL 8 A New Beginning (updated presentation)PHP Detroit -- MySQL 8 A New Beginning (updated presentation)
PHP Detroit -- MySQL 8 A New Beginning (updated presentation)Dave Stokes
 
Mutable data @ scale
Mutable data @ scaleMutable data @ scale
Mutable data @ scaleOri Reshef
 
Sql server 2016 new features
Sql server 2016 new featuresSql server 2016 new features
Sql server 2016 new featuresAjeet Singh
 
MySQL 8.0 Featured for Developers
MySQL 8.0 Featured for DevelopersMySQL 8.0 Featured for Developers
MySQL 8.0 Featured for DevelopersDave Stokes
 
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...StreamNative
 
Sql interview question part 10
Sql interview question part 10Sql interview question part 10
Sql interview question part 10kaashiv1
 
Data warehousing guidelines for bi and BAM solutions
Data warehousing guidelines for bi and BAM solutionsData warehousing guidelines for bi and BAM solutions
Data warehousing guidelines for bi and BAM solutionsShehap Elnagar
 

Similar a Spark with Delta Lake (20)

Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
 
Remote DBA Experts 11g Features
Remote DBA Experts 11g FeaturesRemote DBA Experts 11g Features
Remote DBA Experts 11g Features
 
Sql server-dba
Sql server-dbaSql server-dba
Sql server-dba
 
Oracle to Netezza Migration Casestudy
Oracle to Netezza Migration CasestudyOracle to Netezza Migration Casestudy
Oracle to Netezza Migration Casestudy
 
Help! my sql server log file is too big!!! tech republic
Help! my sql server log file is too big!!!   tech republicHelp! my sql server log file is too big!!!   tech republic
Help! my sql server log file is too big!!! tech republic
 
Why & how to optimize sql server for performance from design to query
Why & how to optimize sql server for performance from design to queryWhy & how to optimize sql server for performance from design to query
Why & how to optimize sql server for performance from design to query
 
Migration to Oracle 12c Made Easy Using Replication Technology
Migration to Oracle 12c Made Easy Using Replication TechnologyMigration to Oracle 12c Made Easy Using Replication Technology
Migration to Oracle 12c Made Easy Using Replication Technology
 
Sql 2016 - What's New
Sql 2016 - What's NewSql 2016 - What's New
Sql 2016 - What's New
 
Understanding the SQL aspects of Spark - Spark SQL.pptx
Understanding the SQL aspects of Spark - Spark SQL.pptxUnderstanding the SQL aspects of Spark - Spark SQL.pptx
Understanding the SQL aspects of Spark - Spark SQL.pptx
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdf
 
PHP Detroit -- MySQL 8 A New Beginning (updated presentation)
PHP Detroit -- MySQL 8 A New Beginning (updated presentation)PHP Detroit -- MySQL 8 A New Beginning (updated presentation)
PHP Detroit -- MySQL 8 A New Beginning (updated presentation)
 
Mutable data @ scale
Mutable data @ scaleMutable data @ scale
Mutable data @ scale
 
Sql server 2016 new features
Sql server 2016 new featuresSql server 2016 new features
Sql server 2016 new features
 
Sql server 2016 new features
Sql server 2016 new featuresSql server 2016 new features
Sql server 2016 new features
 
MySQL 8.0 Featured for Developers
MySQL 8.0 Featured for DevelopersMySQL 8.0 Featured for Developers
MySQL 8.0 Featured for Developers
 
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
 
Sql interview question part 10
Sql interview question part 10Sql interview question part 10
Sql interview question part 10
 
Ebook10
Ebook10Ebook10
Ebook10
 
Data warehousing guidelines for bi and BAM solutions
Data warehousing guidelines for bi and BAM solutionsData warehousing guidelines for bi and BAM solutions
Data warehousing guidelines for bi and BAM solutions
 

Más de Knoldus Inc.

Supply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptxSupply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptxKnoldus Inc.
 
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingMastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingKnoldus Inc.
 
Akka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On IntroductionAkka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On IntroductionKnoldus Inc.
 
Entity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptxEntity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptxKnoldus Inc.
 
Introduction to Redis and its features.pptx
Introduction to Redis and its features.pptxIntroduction to Redis and its features.pptx
Introduction to Redis and its features.pptxKnoldus Inc.
 
GraphQL with .NET Core Microservices.pdf
GraphQL with .NET Core Microservices.pdfGraphQL with .NET Core Microservices.pdf
GraphQL with .NET Core Microservices.pdfKnoldus Inc.
 
NuGet Packages Presentation (DoT NeT).pptx
NuGet Packages Presentation (DoT NeT).pptxNuGet Packages Presentation (DoT NeT).pptx
NuGet Packages Presentation (DoT NeT).pptxKnoldus Inc.
 
Data Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable TestingData Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable TestingKnoldus Inc.
 
K8sGPTThe AI​ way to diagnose Kubernetes
K8sGPTThe AI​ way to diagnose KubernetesK8sGPTThe AI​ way to diagnose Kubernetes
K8sGPTThe AI​ way to diagnose KubernetesKnoldus Inc.
 
Introduction to Circle Ci Presentation.pptx
Introduction to Circle Ci Presentation.pptxIntroduction to Circle Ci Presentation.pptx
Introduction to Circle Ci Presentation.pptxKnoldus Inc.
 
Robusta -Tool Presentation (DevOps).pptx
Robusta -Tool Presentation (DevOps).pptxRobusta -Tool Presentation (DevOps).pptx
Robusta -Tool Presentation (DevOps).pptxKnoldus Inc.
 
Optimizing Kubernetes using GOLDILOCKS.pptx
Optimizing Kubernetes using GOLDILOCKS.pptxOptimizing Kubernetes using GOLDILOCKS.pptx
Optimizing Kubernetes using GOLDILOCKS.pptxKnoldus Inc.
 
Azure Function App Exception Handling.pptx
Azure Function App Exception Handling.pptxAzure Function App Exception Handling.pptx
Azure Function App Exception Handling.pptxKnoldus Inc.
 
CQRS Design Pattern Presentation (Java).pptx
CQRS Design Pattern Presentation (Java).pptxCQRS Design Pattern Presentation (Java).pptx
CQRS Design Pattern Presentation (Java).pptxKnoldus Inc.
 
ETL Observability: Azure to Snowflake Presentation
ETL Observability: Azure to Snowflake PresentationETL Observability: Azure to Snowflake Presentation
ETL Observability: Azure to Snowflake PresentationKnoldus Inc.
 
Scripting with K6 - Beyond the Basics Presentation
Scripting with K6 - Beyond the Basics PresentationScripting with K6 - Beyond the Basics Presentation
Scripting with K6 - Beyond the Basics PresentationKnoldus Inc.
 
Getting started with dotnet core Web APIs
Getting started with dotnet core Web APIsGetting started with dotnet core Web APIs
Getting started with dotnet core Web APIsKnoldus Inc.
 
Introduction To Rust part II Presentation
Introduction To Rust part II PresentationIntroduction To Rust part II Presentation
Introduction To Rust part II PresentationKnoldus Inc.
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Configuring Workflows & Validators in JIRA
Configuring Workflows & Validators in JIRAConfiguring Workflows & Validators in JIRA
Configuring Workflows & Validators in JIRAKnoldus Inc.
 

Más de Knoldus Inc. (20)

Supply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptxSupply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptx
 
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingMastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
 
Akka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On IntroductionAkka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On Introduction
 
Entity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptxEntity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptx
 
Introduction to Redis and its features.pptx
Introduction to Redis and its features.pptxIntroduction to Redis and its features.pptx
Introduction to Redis and its features.pptx
 
GraphQL with .NET Core Microservices.pdf
GraphQL with .NET Core Microservices.pdfGraphQL with .NET Core Microservices.pdf
GraphQL with .NET Core Microservices.pdf
 
NuGet Packages Presentation (DoT NeT).pptx
NuGet Packages Presentation (DoT NeT).pptxNuGet Packages Presentation (DoT NeT).pptx
NuGet Packages Presentation (DoT NeT).pptx
 
Data Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable TestingData Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable Testing
 
K8sGPTThe AI​ way to diagnose Kubernetes
K8sGPTThe AI​ way to diagnose KubernetesK8sGPTThe AI​ way to diagnose Kubernetes
K8sGPTThe AI​ way to diagnose Kubernetes
 
Introduction to Circle Ci Presentation.pptx
Introduction to Circle Ci Presentation.pptxIntroduction to Circle Ci Presentation.pptx
Introduction to Circle Ci Presentation.pptx
 
Robusta -Tool Presentation (DevOps).pptx
Robusta -Tool Presentation (DevOps).pptxRobusta -Tool Presentation (DevOps).pptx
Robusta -Tool Presentation (DevOps).pptx
 
Optimizing Kubernetes using GOLDILOCKS.pptx
Optimizing Kubernetes using GOLDILOCKS.pptxOptimizing Kubernetes using GOLDILOCKS.pptx
Optimizing Kubernetes using GOLDILOCKS.pptx
 
Azure Function App Exception Handling.pptx
Azure Function App Exception Handling.pptxAzure Function App Exception Handling.pptx
Azure Function App Exception Handling.pptx
 
CQRS Design Pattern Presentation (Java).pptx
CQRS Design Pattern Presentation (Java).pptxCQRS Design Pattern Presentation (Java).pptx
CQRS Design Pattern Presentation (Java).pptx
 
ETL Observability: Azure to Snowflake Presentation
ETL Observability: Azure to Snowflake PresentationETL Observability: Azure to Snowflake Presentation
ETL Observability: Azure to Snowflake Presentation
 
Scripting with K6 - Beyond the Basics Presentation
Scripting with K6 - Beyond the Basics PresentationScripting with K6 - Beyond the Basics Presentation
Scripting with K6 - Beyond the Basics Presentation
 
Getting started with dotnet core Web APIs
Getting started with dotnet core Web APIsGetting started with dotnet core Web APIs
Getting started with dotnet core Web APIs
 
Introduction To Rust part II Presentation
Introduction To Rust part II PresentationIntroduction To Rust part II Presentation
Introduction To Rust part II Presentation
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Configuring Workflows & Validators in JIRA
Configuring Workflows & Validators in JIRAConfiguring Workflows & Validators in JIRA
Configuring Workflows & Validators in JIRA
 

Último

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 

Último (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Spark with Delta Lake

  • 1. Presented By: Kundan Kumar Software Consultant Spark With Delta Lake
  • 2. Lack of etiquette and manners is a huge turn off. KnolX Etiquettes Punctuality Respect Knolx session timings, you are requested not to join sessions after a 5 minutes threshold post the session start time. Feedback Make sure to submit a constructive feedback for all sessions as it is very helpful for the presenter. Silent Mode Keep your mobile devices in silent mode, feel free to move out of session in case you need to attend an urgent call. Avoid Disturbance Avoid unwanted chit chat during the session.
  • 3. Agenda 01 What & Why Delta Lake 02 Features Of Delta Lake 03 Delta Lake Transaction Log 04 Demo
  • 4. Brings Data Reliability and Performance to Data Lakes
  • 5. What is Data Lake? A Data Lake is a centralized repository that can store large amount of structured, semi-structured, and unstructured data. It is a place to store every type of data in its native format with no fixed limits on account size or file. It offers high data quantity to increase analytic performance and native integration.
  • 6. Why Delta Lake? 1. Data reliability challenges with data lake. 2013 2014 2018 Failed production job Orphan Data No Schema Enforcement
  • 7. 2. ACID Transactions: Critical Feature missing from Spark.
  • 8. What is Delta Lake? Delta Lake is basically a open source storage layer that would sit on top of your existing data lake and is fully compatible with Apache Spark APIs. It brings ACID transactions to Apache Spark and big data workloads.
  • 9. Features of Delta Lake 1. ACID Transaction 2. Scalable metedata handling 3. Open Format 4. Time Travel 5. Schema Enforcement & Evolution 6. Updates and Deletes 7. Unified Batch and Streaming
  • 10. ACID Transactions Delta Lake brings Atomicity, Consistency, Isolation and Durability (ACID) transactions to your data lakes. It provides serializability, the strongest level of isolation level and ensure that readers never see inconsistent data.
  • 11. Transaction Log The Delta Lake transaction log (also known as the DeltaLog) is an ordered record of every transaction that has ever been performed on a Delta Lake table since its inception. It is a single source of truth.
  • 12. Optimistic Concurrency Control Optimistic concurrency control is a method of dealing with concurrent transactions that assumes that transactions (changes) made to a table by different users can complete without conflicting with one another.
  • 13. Time Travel (Data Versioning) Delta Lake time travel allows us to query an older snapshot of a Delta Lake table. This time-traveling can be achieved using 2 approaches: 1. Using a version number 2. Using a timestamp Time travel has many use cases, including: ● Time travel makes it easy to do rollbacks in case of bad writes, playing an important role in fixing mistakes in our data. ● It helps in re-creating analysis, reports, or outputs (for example, the output of a machine learning model). This could be useful for debugging or auditing, especially in regulated industries. ● It also simplifies time-series analytics. For instance, in finding out how many new customers were added over the last week.
  • 14. Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files at ease. Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet. Apache Parquet is column-oriented and designed to bring efficient columnar storage of data compared to row-based like CSV. . id name age 123 xyz 21 321 abc 20 123 xyz 21 321 abc 20 123 321 xyz abc 21 20 Row Orriented Column Oriented
  • 15. Schema Enforcement: Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table’s schema. To determine whether a write to a table is compatible, Delta Lake uses the following rules: 1. Cannot contain any additional columns that are not present in the target table’s schema. 2. Cannot have column data types that differ from the column data types in the target table. 3. Cannot contain column names that differ only by case. Schema Evolution: Delta Lake enables you to make changes to a table schema that can be applied automatically.
  • 16. Updates, Deletes and Merges Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA. Updates: We can update data that matches a predicate in a Delta Lake table. Deletes: We can remove data that matches a predicate from a Delta Lake table. Merges: we can upsert data from a Spark DataFrame into a Delta Lake table using the merge operation.
  • 17. Unified Batch and Streaming Source and Sink A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • 18. DEMO
  • 19. Q/A
  • 20. References 1. Welcome to the Delta Lake documentation — Delta Lake Documentation 2. Spark: ACID compliant or not 3. Spark: ACID Transaction with Delta Lake 4. Time Travel: Data versioning in Delta Lake