Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large Datasets with Apache Spark
Jun Ma and Miao Wang
Adobe
Agenda
▪ Background
▪ Platform Data Management Architecture
▪ Challenges
▪ Bloom Filter: Build, Maintain and Apply with Spark
▪ Performance Evaluation
▪ Tradeoffs While Applying Bloom Filter
▪ Future Work
▪ Q & A
Background
• General Data Protection Regulation (GDPR): Effective May 2018.
  • Worldwide scope: applies to all companies that collect, store and process personal data.
  • Guarantees that natural persons can access and erase their data.
  • Penalty of up to 4% of worldwide turnover or €20M, whichever is higher.
• California Consumer Privacy Act (CCPA): Effective 01/01/2020.
Adobe Experience Platform - Data Flow
Our Data Management Architecture for Access & Delete
Data scanning jobs with identity columns as keys
Challenges
Data Size, Cost, Scalability
Challenge - Data Size
Data lake storage account* categories:

Category    | Account data size | Account # of files and folders (metadata)
Small       | < 1 TB            | < 1 M
Medium      | 1-5 TB            | 1-5 M
Large       | 5-50 TB           | 5-50 M
Extra Large | 50-400 TB         | 50-100 M

* A customer may have multiple data lake accounts for different business purposes
Challenge - Compute Cost
Compute cost (Price $) by dataset size: 1 TB ≈ $37, 10 TB ≈ $278, 1 PB ≈ $15,637.
Ref: https://medium.com/adobetech/search-optimization-for-large-data-sets-for-gdpr-7c2f52d4ea1f
Problem - Finding a Needle in a Haystack
Take the Audience Manager dataset as an example:
• Data size: 700 TB
• Number of users: 35 billion
• Avg size per user: 20 KB
A single user's data is roughly 0.0000000027% of the dataset; the other 99.9999999973% belongs to other users.
Data Skipping
Candidate techniques: min/max statistics? Bucketing? Dictionary encoding?
Solution - Bloom Filter
• A probabilistic data structure to test whether an element is a member of a set
• A bit array of m bits; an empty Bloom Filter has all bits set to 0
• k different hash functions
• Key parameters that determine Bloom Filter size and accuracy:
  • Number of Distinct Values (NDV)
  • False Positive Probability (FPP)
• False positives are possible, but false negatives are not
Ref: https://redislabs.com/blog/rebloom-bloom-filter-datatype-redis/
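For concreteness, here is a minimal sketch of the data structure using the BloomFilter implementation that ships with Spark (org.apache.spark.util.sketch). The deck's filters are stored through Iceberg rather than used this way; the snippet only illustrates the NDV/FPP parameters and the no-false-negative property.

  import org.apache.spark.util.sketch.BloomFilter

  // Size the filter for ~2.1M distinct values at a 1% false positive rate.
  val bf = BloomFilter.create(2100000L, 0.01)

  // Insert the identity values seen in a file.
  Seq(42L, 7L, 1234567L).foreach(id => bf.putLong(id))

  // Membership tests: a false positive is possible, a false negative is not.
  bf.mightContainLong(42L)   // always true, 42 was inserted
  bf.mightContainLong(99L)   // false with probability ~0.99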
Key Design Concerns of Using Bloom Filters for Data Skipping

Build Bloom Filter
• When shall we build Bloom Filters?
• At what level of directories shall we build Bloom Filters?
• How and where do we store Bloom Filters to work with different file formats?
• How do we support Bloom Filters on complex types?

Maintain Bloom Filter & Choose Key Parameters
• How do we maintain Bloom Filters while appending new data and deleting existing data?
• How do we choose Bloom Filter key configurations to balance filter file size and accuracy?

Apply Bloom Filter with Spark
• How do we apply Bloom Filters for file skipping within Spark jobs?
• Do we need to apply Bloom Filters to Spark SQL query planning for GDPR/CCPA use cases?
Design Concerns - When shall we build Bloom Filters?
Bloom Filters are built at ingestion time.
Pros:
• Less overhead overall: data is scanned only once.
• Less operational cost: no separate Spark job or scheduling service for Bloom Filters.
• Zero delay.
Cons:
• Adds overhead and a failure point to ingestion.
(Ingestion flow: Data Producer → Batches → Spark Job to Ingest Data → Data Files + Bloom Filters → Data Lake)
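A minimal sketch of this choice, assuming a hypothetical ingestBatch() helper and an id column simply named "id": the same Spark job that writes a batch also derives its Bloom Filter via DataFrame.stat.bloomFilter, so no separate job or scheduling service is needed.

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.util.sketch.BloomFilter

  // Hypothetical per-batch ingestion: write the data and build the filter in one job.
  def ingestBatch(batch: DataFrame, dataDir: String): BloomFilter = {
    batch.cache()                                           // the source batch is read once
    batch.write.mode("append").parquet(dataDir)             // data files land in the data lake
    val bf = batch.stat.bloomFilter("id", 2100000L, 0.01)   // ndv/fpp as in the sizing slide
    batch.unpersist()
    bf                                                      // persisted alongside the data (next slides)
  }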
Design Concerns - At which level shall we build Bloom Filters?
▪ Bloom Filters are built at file level
▪ Stored in a separate metadata directory, partitioned in the same way as the data files
▪ One Bloom Filter per file per column
▪ Can consider one per file to mitigate the small-file problem
(Slide shows the dataset layout with Bloom Filters.)
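The production implementation builds these filters inside the ingestion writer; as a rough post-hoc sketch of the same layout (one filter per data file for a single id column), rows can be grouped by input_file_name() after the files are written. The path, the column name "id", and the ndv/fpp values are illustrative assumptions.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.input_file_name
  import org.apache.spark.util.sketch.BloomFilter

  val spark = SparkSession.builder.getOrCreate()
  import spark.implicits._

  // One serialized Bloom Filter per physical data file, for the "id" column.
  val perFileFilters: Array[(String, Array[Byte])] = spark.read.parquet("/lake/events")
    .select(input_file_name().as("file"), $"id")
    .as[(String, Long)]
    .groupByKey(_._1)                               // group rows by the file they came from
    .mapGroups { (file, rows) =>
      val bf = BloomFilter.create(2100000L, 0.01)   // ndv / fpp from the sizing table
      rows.foreach { case (_, id) => bf.putLong(id) }
      val out = new java.io.ByteArrayOutputStream()
      bf.writeTo(out)                               // bytes to write into the metadata directory
      (file, out.toByteArray)
    }
    .collect()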
Design Concerns - How to build a Bloom Filter on a map type?
The identity data lives in a map-typed column (idMap), whose values contain the id field.
Solution:
▪ Consider key and value as two separate columns
▪ Build the Bloom Filter on idMap.value.id
(Slide shows the schema with the map type.)
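A sketch of this approach with standard Spark SQL functions, assuming df is a DataFrame whose idMap column is a map with struct values carrying an id field (the exact schema from the slide is not reproduced here):

  import org.apache.spark.sql.functions.{col, explode, map_values}

  // Flatten the map: one row per map value, then project the nested id field.
  val ids = df
    .select(explode(map_values(col("idMap"))).as("idEntry"))
    .select(col("idEntry.id").as("id"))             // this is the idMap.value.id stream

  // Build the column's Bloom Filter from the flattened ids.
  val bf = ids.stat.bloomFilter("id", 2100000L, 0.01)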
Design Concerns - How to configure Bloom Filter to balance size and accuracy?
▪ Bloom Filter size is determined by:
  ▪ False Positive Probability (FPP)
  ▪ Number of Distinct Values (NDV)
▪ Formula:
  def optimalNumOfBytes(ndv: Long, fpp: Double) = -ndv / math.log(1 - math.pow(fpp, 1.0 / 8))
▪ Based on data we have in production, NDV ≈ 2.1M:

FPP   | Optimal Bloom Filter Size (MB)
0.1   | 1.5
0.05  | 1.8
0.01  | 2.5
0.001 | 3.9
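The table above follows directly from the formula (a quick check that can be run in a Scala REPL); the 1/8 exponent corresponds to a fixed choice of 8 hash functions. With an NDV of exactly 2.1M the last row comes out near 3.8 MB, so the deck's 3.9 MB presumably reflects a slightly larger NDV.

  // Bytes needed for a Bloom Filter with the given NDV and FPP (8 hash functions).
  def optimalNumOfBytes(ndv: Long, fpp: Double): Double =
    -ndv / math.log(1 - math.pow(fpp, 1.0 / 8))

  val ndv = 2100000L
  Seq(0.1, 0.05, 0.01, 0.001).foreach { fpp =>
    println(f"fpp=$fpp%-5s -> ${optimalNumOfBytes(ndv, fpp) / 1e6}%.1f MB")
  }
  // fpp=0.1   -> 1.5 MB
  // fpp=0.05  -> 1.8 MB
  // fpp=0.01  -> 2.5 MB
  // fpp=0.001 -> 3.8 MB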
Design Concerns - How to apply Bloom Filter within a Spark job? (Write path)
▪ Implemented in Apache Iceberg[0], a light-weight table format for managing table metadata, integrated with Spark
▪ Write path:
  ▪ Pre-define the id columns on which to build Bloom Filters
  ▪ Set ndv & fpp for each column using the Iceberg API:

    table.updateSchema()
      .addBloomFilter(fieldname, fpp, ndv)

  ▪ Write the DataFrame in Iceberg format:

    df.write()
      .format("iceberg")
      .save(<table loc>)

[0] https://iceberg.incubator.apache.org
Design Concerns - How to apply Bloom Filter within a Spark job? (Read path)
▪ Read path:
  ▪ Load the table as an Iceberg table and pass the Bloom Filter query to the Iceberg reader using a Spark option
  ▪ Why use a Spark option?

    spark
      .read
      .format("iceberg")
      .option("iceberg.bloomFilter.input",
        """[{"field": "idMap.value.id", "type": "long", "values": ["1"]}]""")
      .load(<table loc>)
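Conceptually, the reader behind that option does something like the following (a hedged sketch; the real logic lives inside the modified Iceberg reader, and the .bloom file layout here is an assumption): load each candidate file's Bloom Filter and keep only the files that might contain the requested id.

  import java.io.FileInputStream
  import org.apache.spark.util.sketch.BloomFilter

  // Keep only the files whose Bloom Filter might contain the requested id;
  // a negative answer is definitive, so those files are skipped safely.
  def filesToScan(files: Seq[String], bloomDir: String, id: Long): Seq[String] =
    files.filter { f =>
      val in = new FileInputStream(s"$bloomDir/$f.bloom")   // hypothetical layout
      try BloomFilter.readFrom(in).mightContainLong(id)
      finally in.close()
    }

  // Only the surviving handful of files needs to be read for the GDPR/CCPA request.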
Performance Evaluation
▪ Measured with a 1.5 TB dataset
  ▪ One month of obfuscated customer event data
  ▪ File count: 1775
  ▪ Average file size: 833 MB
▪ Bloom Filter built on one id column
Performance – Ingestion Overhead
Duration of ingestion: 1,118 s without Bloom Filter vs. 1,130 s with Bloom Filter.
Ingestion time increased 1.1% to build the Bloom Filter.
Performance – Storage Overhead
Storage used: 1,494 GB of data vs. 15 GB of Bloom Filters.
Storage overhead is 1% of the data size.
Id Distribution
1 month customer event dataset:

Total # of files containing a given id | Percentage of ids | Aggregated percentage
1                                      | 50.8%             | 50.8%
(1, 10]                                | 44.2%             | 95.0%
(10, 31]                               | 4.0%              | 99.0%
(31, 179]                              | 0.9%              | 99.9%

6 month customer event dataset:

Total # of files containing a given id | Percentage of ids | Aggregated percentage
1                                      | 88.1%             | 88.1%
(1, 9]                                 | 10.9%             | 99.0%
(9, 150]                               | 0.9%              | 99.9%
Performance - Scan Dataset
Compute cost to scan the dataset (DBU):

Scenario                                   | Cost (DBU)
Without Bloom Filter                       | 80.80
With Bloom Filter, id appears in 1 file    | 1.34
With Bloom Filter, id appears in 10 files  | 2.39
With Bloom Filter, id appears in 31 files  | 2.98
With Bloom Filter, id appears in 179 files | 16.54

Appearance in files | Percentage | Aggregated percentage
1                   | 50.8%      | 50.8%
(1, 10]             | 44.2%      | 95.0%
(10, 31]            | 4.0%       | 99.0%
(31, 179]           | 0.9%       | 99.9%
Value Bloom Filter Brings to GDPR & CCPA Compliance
Faster processing, reduced cost, and support for larger datasets.
Ongoing Work
▪ Combine Bloom Filter files for the same data file into one, to avoid having many small files
▪ Parallelize Bloom Filter loading by moving it from the driver to the executors
▪ Extend the Bloom Filter use case from GDPR to general queries
Contact
Jun Ma: juma@adobe.com
Miao Wang: miwang@adobe.com