Enriching data vs filtering in Spark for credit card rewards processing


This document compares two approaches to processing data in Spark for a credit card rewards use case at Capital One: filtering data out at each stage versus enriching it. Filtering makes the application hard to debug and makes it difficult to trace data back through the stages. Enriching avoids these issues by keeping the data from every stage together instead of dropping it, which gives more insight into why records drop out at each stage. The enriching approach is now used in Capital One's production Spark job, which processes millions of transactions daily to award customer rewards.


Enriching the data vs Filtering in Spark
Gokul Prabagaren
Master Software Engineer
Capital One

About Me
● Master Software Engineer @ Capital One
● Building Spark applications since Spark 1.2
● Contributor to the @CapitalOneTech Medium blog on Big Data processing
● @gocoolp on Twitter
● @gocool_p on LinkedIn

Agenda
● Rewards use case at Capital One
● Filtering approach
● Issues with the filtering approach
● How the enriching approach solves these issues
● Conclusion & Questions

Rewards Use Case at Capital One
▪ Capital One develops its software open source first, in the cloud.
▪ We will soon be operating fully in the cloud!
▪ We use Apache Spark extensively for a variety of batch, streaming, and machine-learning workloads.

Rewards Use Case at Capital One
▪ Use case
▪ One of the core credit card rewards Spark applications.
▪ Consumes daily credit card transactions and computes the rewards.

Filtering the Data Approach
● This approach uses a Spark inner join at each stage, so rows without a match are dropped from the dataset
Issues with the Filtering Approach
▪ The application is hard to debug after deployment
▪ Tracing data back through the stages is not possible because the computation happens in memory
▪ Counts at each stage only show how many records were processed, not why the remaining records were dropped in that stage
How did we overcome these issues?

Enriching the Data Approach
● This approach uses a Spark left outer join
● Instead of filtering data out of the dataset at each stage, the enriching approach keeps every row and adds data to it from the right-side dataset
Advantages of Enriching over Filtering
● Data from each stage is enriched into the original dataset. This captures the state information and makes it easy to debug and analyse later
● The data columns/flags captured at each stage give more granular detail about why a particular record was dropped at that stage
● No additional, costly count actions are needed at each stage
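
As an illustration of the last two points, the sketch below summarises drop reasons in a single pass over an already enriched dataset, instead of counting after every stage. The rows, flag names, and `drop_reason` helper are hypothetical; in Spark a comparable summary could plausibly be produced with one `groupBy(...).count()` over the flag columns.

```python
from collections import Counter

# Hypothetical enriched output: every input transaction is still present,
# with one boolean flag recorded per stage.
enriched = [
    {"txn_id": 1, "account_eligible": True, "category_eligible": True},
    {"txn_id": 2, "account_eligible": False, "category_eligible": True},
    {"txn_id": 3, "account_eligible": True, "category_eligible": False},
    {"txn_id": 4, "account_eligible": True, "category_eligible": True},
    {"txn_id": 5, "account_eligible": False, "category_eligible": False},
]

# Flags listed in the order the stages run.
STAGE_FLAGS = ["account_eligible", "category_eligible"]


def drop_reason(row):
    """Return the first stage flag that failed, or 'rewarded' if all passed."""
    for flag in STAGE_FLAGS:
        if not row[flag]:
            return f"failed_{flag}"
    return "rewarded"


# One pass over the final dataset replaces a count after every stage.
summary = Counter(drop_reason(row) for row in enriched)
print(dict(summary))
```

Because the flags travel with each row, the same dataset can also be queried later for an individual transaction without rerunning the job.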

Conclusion
● We switched to the enriching approach in our production Spark job.
● It is successfully processing millions of credit card transactions daily.
● It awards millions of miles, cash, and points as rewards to Capital One customers.

