Configuration Driven
Reporting On a Large
Dataset Using Apache
Spark
Arvind Das, Senior Engineer, American Express
Connect with me at http://linkedin.com/in/arvind-das-a8720b49
Zheng Gu, Engineer, American Express
Connect with me at http://linkedin.com/in/zheng-gu-895bb4157
Agenda
• Introduction
• Need for Dynamic Configuration-Based Reporting
• Overall Design
• Components Involved
• Transformation at Scale
• Templating at Scale
Introduction: What is the Reporting Framework?
The reporting framework handles dynamic scheduling of partner-specific reports, transforming, aggregating, and filtering the data into different DataFrames using built-in as well as user-defined functions, leveraging Spark's in-memory and parallel processing capabilities. It also applies business rules and converts the results into different output formats by embedding FreeMarker as the template engine within the framework.
STATISTICS AND GENERAL NEED
PATTERN: Need for Configuration-Based Reporting
Generating different reports & feeds involves a common pattern:
• How the input dataset is read
• Optional enrichment of the dataset with a referential data lookup
• A sequence of transformation rules
• Application of a template to the final data
{ Control the various parameters of reporting as dynamic configuration, external to the actual framework }
Common Reporting Framework
Different reports & feeds differ in:
• Partner/stakeholder configurations
• Frequency of generation
• Input Dataset and schema definition
• Aggregation rules
• Templates
Technical Components
• The configurations driving the reporting are maintained in a config management system outside of the framework
• The core reporting framework is a sequence of activities that runs as a Spark job
• A K8s-based scheduler app manages job scheduling and frequencies based on partner/downstream contracts
• The FreeMarker template engine, embedded into the framework, reads an externally provided template file
• The framework publishes the final report to an S3 object store
A Sample Configuration File
{
  "report-name": {
    "title": "",
    "type": "",
    "id": "",
    "schema": "sample-schema",
    "look-up-dataset": ["", ""],
    "transformation-rule": { "step1": "", "step2": "" },
    "report-template": ["report.ftlh"],
    "sample-schema": [ ]
  }
}
Slide callouts: report metadata · the schema tied to a report · step transform rules · schema elements · report template · lookup dataset
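As a rough illustration of how such a file can be consumed, the sketch below parses the sample configuration with Jackson. The class name and file path are hypothetical; the talk does not show the framework's actual loader.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;

public class ReportConfigLoader {
    public static void main(String[] args) throws Exception {
        // Read the externally managed configuration file (hypothetical path).
        ObjectMapper mapper = new ObjectMapper();
        JsonNode report = mapper.readTree(new File("report-config.json"))
                                .get("report-name");

        // "schema" points at the key holding this report's column list.
        String schemaKey = report.get("schema").asText();      // "sample-schema"
        JsonNode columns = report.get(schemaKey);               // e.g. ["name", "sex"]
        JsonNode lookups = report.get("look-up-dataset");       // lookup dataset names
        JsonNode rules   = report.get("transformation-rule");   // step1..stepN SQL
        String template  = report.get("report-template").get(0).asText();

        System.out.println("columns=" + columns + ", template=" + template);
    }
}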
Deep-Dive
Apply Schema Stage
• Create a Spark SQL query from the schema provided (see the sketch below)
• Filter out columns that are not needed for the report using Spark SQL
• Reduce the size of the DataFrame

DF1:
  ID | Name  | Sex    | Birth
  1  | name1 | male   | 01/01/1970
  2  | name2 | female | 01/01/1970

select name, sex from DF1

DF2:
  Name  | Sex
  name1 | male
  name2 | female
Configuration for this stage (only the relevant field shown):
  "sample-schema": ["name", "sex"]
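A minimal sketch of this stage, assuming the input DataFrame is already registered as the temp view DF1; the select list is assembled from the configured sample-schema columns.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.List;

public class ApplySchemaStage {
    // Build "select <cols> from DF1" from the configured schema elements,
    // pruning every column the report does not need.
    static Dataset<Row> applySchema(SparkSession spark, List<String> schemaCols) {
        return spark.sql("select " + String.join(", ", schemaCols) + " from DF1");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("apply-schema").master("local[*]").getOrCreate();
        spark.sql("select * from values " +
                  "(1, 'name1', 'male', '01/01/1970'), " +
                  "(2, 'name2', 'female', '01/01/1970') as t(ID, Name, Sex, Birth)")
             .createOrReplaceTempView("DF1");

        // DF2 keeps only the columns listed in "sample-schema".
        applySchema(spark, List.of("name", "sex")).show();
        spark.stop();
    }
}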
Data Lookup Stage
• Reports might need data that is not part of the input, for example static data
• Join that data with the input Dataset on a common key (see the sketch below)

DF1:
  ID | Name  | Sex
  1  | name1 | male
  2  | name2 | female

sample-lookup:
  ID | Phone
  1  | 123-456-7890
  2  | 098-765-4321

DF after lookup:
  ID | Name  | Sex    | Phone
  1  | name1 | male   | 123-456-7890
  2  | name2 | female | 098-765-4321
Configuration for this stage (only the relevant field shown):
  "look-up-dataset": ["sample-lookup"]
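A sketch of the lookup join on the sample data above; the join key "ID" follows the example, while the real key would come from configuration.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataLookupStage {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("data-lookup").master("local[*]").getOrCreate();
        Dataset<Row> df1 = spark.sql(
                "select * from values (1, 'name1', 'male'), (2, 'name2', 'female') " +
                "as t(ID, Name, Sex)");
        Dataset<Row> lookup = spark.sql(
                "select * from values (1, '123-456-7890'), (2, '098-765-4321') " +
                "as t(ID, Phone)");

        // Enrich the input Dataset with the configured lookup on the common key.
        Dataset<Row> enriched = df1.join(lookup, "ID");
        enriched.show();   // ID, Name, Sex, Phone
        spark.stop();
    }
}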
Apply Transformation Rules Stage
• Reports need aggregations at different levels, so several transformation rules are configured, and each transformation returns a DataFrame
original DF:
  txn_ID | txn_type | cat_ID | amount
  1      | Credit   | 100    | 50
  2      | Debit    | 102    | 30
  3      | Credit   | 100    | 20
  4      | Credit   | 102    | 10
  5      | Credit   | 105    | 100
  6      | Debit    | 102    | 20
  7      | Credit   | 105    | 30
  8      | Credit   | 105    | 50
  9      | Credit   | 105    | 60
  10     | Debit    | 100    | 10
  11     | Debit    | 104    | 5

step1 DF: the same columns with rows txn_ID 1 through 10 (row 11 filtered out)

step2 DF:
  cat_ID | amount | count
  100    | 80     | 3
  102    | 90     | 4
  105    | 210    | 3

step3 DF:
  txn_type | amount | count
  Credit   | 320    | 7
  Debit    | 60     | 3
Configuration for this stage (the step SQL runs in sequence, as sketched below):
{
  "report-name": {
    "title": "",
    …
    "transformation-rule": {
      "step1": "select * from original_DF where txn_id <= 10",
      "step2": "select cat_id, sum(amount) as amount, count(amount) as count from step1 group by cat_id",
      "step3": "select txn_type, sum(amount) as amount, count(amount) as count from step1 group by txn_type"
    },
    "sample-schema": ["name", "sex"]
  }
}
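A sketch of how such ordered rules can be executed, assuming each step's result is registered as a temp view so later rules can reference earlier steps by name (the iteration strategy is an assumption; the input data here is an abbreviated stand-in for the table above).

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.LinkedHashMap;
import java.util.Map;

public class ApplyTransformationRules {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("transformation-rules").master("local[*]").getOrCreate();
        // Abbreviated stand-in for the original DF shown above.
        spark.sql("select * from values " +
                  "(1, 'Credit', 100, 50), (2, 'Debit', 102, 30), (11, 'Debit', 104, 5) " +
                  "as t(txn_ID, txn_type, cat_ID, amount)")
             .createOrReplaceTempView("original_DF");

        // Ordered rules as they would arrive from "transformation-rule".
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("step1", "select * from original_DF where txn_ID <= 10");
        rules.put("step2", "select cat_ID, sum(amount) as amount, count(amount) as count "
                         + "from step1 group by cat_ID");
        rules.put("step3", "select txn_type, sum(amount) as amount, count(amount) as count "
                         + "from step1 group by txn_type");

        // Run each step and register its result as a temp view, so any later
        // rule can build on an earlier step's DataFrame by name.
        Map<String, Dataset<Row>> stepResults = new LinkedHashMap<>();
        rules.forEach((step, sql) -> {
            Dataset<Row> df = spark.sql(sql);
            df.createOrReplaceTempView(step);
            stepResults.put(step, df);
        });
        stepResults.get("step2").show();
        spark.stop();
    }
}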
Apply Transformation Rules (continued)
Along with Spark SQL functions, UDFs (user-defined functions) provide customized querying abilities. Some sample UDFs (registration is sketched below):
• Decimalize:
  • Calculates the actual transaction amount
  • Parameters: transaction amount, decimalization factor
  select decimalize(amount, decimal_factor) from DF;
• Signage:
  • Applies rules to turn the transaction amount positive or negative
  • The result can be used for further aggregation
  select signage(amount, xx, xx…) from DF;
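Only the decimalize UDF's name and parameters come from the talk; the body below is an assumed implementation of the described semantics (shifting the raw amount by its decimal factor).

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

public class DecimalizeUdf {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("decimalize-udf").master("local[*]").getOrCreate();

        // decimalize(amount, factor): scale the raw integer amount by its
        // decimalization factor, e.g. decimalize(1250, 2) -> 12.50.
        // Assumed body; the real rule may differ.
        spark.udf().register("decimalize",
                (Long amount, Integer factor) -> amount / Math.pow(10, factor),
                DataTypes.DoubleType);

        spark.sql("select decimalize(amount, decimal_factor) as amount " +
                  "from values (1250L, 2), (995L, 2) as t(amount, decimal_factor)")
             .show();   // 12.5, 9.95
        spark.stop();
    }
}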
Apply Template
• Choose a template engine
• The DataFrame must be transformed into a format the template can reference
• Each DataFrame is converted into a List<T> (DataFrame → Dataset<T> → List<T>)

template:
<#list step2 as step2>
State: ${step2.state}
Count: ${step2.count}
</#list>

data: List<T> step2

output (after the template engine runs):
State: CA
Count: 4
State: MA
Count: 3
State: AZ
Count: 3

Configuration for this stage (only the relevant field shown):
  "report-template": ["sample-report.ftlh"]
Template Options: Velocity, Thymeleaf, StringTemplate, FreeMarker
We selected FreeMarker because:
• It is a general-purpose template engine
• It is supported by the Apache Software Foundation (ASF)
• It is widely used across Apache projects
• It has frequent new releases; the latest was in Feb 2021
• Good documentation is available
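A minimal sketch of the templating step: a step DataFrame is collected into a list of maps (standing in for the framework's List<T>) and handed to FreeMarker along with the sample-report.ftlh template; the "templates" directory is an assumption.

import freemarker.template.Configuration;
import freemarker.template.Template;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.io.File;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ApplyTemplateStage {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("apply-template").master("local[*]").getOrCreate();
        Dataset<Row> step2 = spark.sql(
                "select * from values ('CA', 4), ('MA', 3), ('AZ', 3) as t(state, count)");

        // DataFrame -> List: maps stand in for the bean type T, so the
        // template can resolve ${step2.state} and ${step2.count}.
        List<Map<String, Object>> rows = new ArrayList<>();
        for (Row r : step2.collectAsList()) {
            Map<String, Object> m = new HashMap<>();
            m.put("state", r.getAs("state"));
            m.put("count", r.getAs("count"));
            rows.add(m);
        }

        // Render the externally provided .ftlh template with the step data.
        Configuration cfg = new Configuration(Configuration.VERSION_2_3_31);
        cfg.setDirectoryForTemplateLoading(new File("templates")); // assumed location
        Template template = cfg.getTemplate("sample-report.ftlh");
        Map<String, Object> model = new HashMap<>();
        model.put("step2", rows);
        StringWriter out = new StringWriter();
        template.process(model, out);
        System.out.println(out);   // State: CA / Count: 4 / ...
        spark.stop();
    }
}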
Success Metrics
• Time to build a new report reduced from a month to a week
• Report consumers don't have to know the internal nuances of the report build
• A highly skilled technical team is no longer required to build a report
• Processing performance of up to 10 million records per report achieved
Thank you
Find your place in technology on #TeamAmex
https://jobs.americanexpress.com/tech