Configuration Driven
Reporting On a Large
Dataset Using Apache
Spark
Arvind Das, Senior Engineer, American Express
Connect with me at http://linkedin.com/in/arvind-das-a8720b49
Zheng Gu, Engineer, American Express
Connect with me at http://linkedin.com/in/zheng-gu-895bb4157
Agenda
• Introduction
• Need for Dynamic Configuration-Based Reporting
• Overall Design
• Components Involved
• Transformation at Scale
• Templating at Scale
Introduction: What is the Reporting Framework?
The reporting framework handles dynamic scheduling of partner-specific reports, transforming, aggregating, and filtering the data into different DataFrames using built-in as well as user-defined functions, leveraging Spark's in-memory and parallel processing capabilities. It also applies business rules and converts the results into different output formats by embedding FreeMarker as the template engine within the framework.
STATISTICS AND GENERAL NEED
PATTERN: Need for Configuration-Based Reporting
Generating different reports & feeds involves a common pattern:
• How the input dataset is read
• Optional enrichment of the dataset with a referential data lookup
• A sequence of transformation rules
• Application of a template to the final data
{ Control the various parameters of reporting as dynamic configuration, external to the actual framework }
Common Reporting Framework
Different reports & feeds differ in:
• Partner/stakeholder configurations
• Frequency of generation
• Input Dataset and schema definition
• Aggregation rules
• Templates
Technical Components
• The configurations driving the reporting are maintained in a config management system outside of the framework
• The core reporting framework is a sequence of activities that runs as a Spark job
• A K8s-based scheduler app manages job scheduling and frequencies based on partner/downstream contracts
• The FreeMarker template engine, embedded into the framework, reads an externally provided template file
• The framework publishes the final report to an S3 object store
A Sample Configuration File
{
  "report-name": {
    "title": "",
    "type": "",
    "id": "",
    "schema": "sample-schema",
    "look-up-dataset": ["", ""],
    "transformation-rule": { "step1": "", "step2": "" },
    "report-template": ["report.ftlh"],
    "sample-schema": [ ]
  }
}
Slide callouts: report metadata · the schema tied to a report · step transform rules · schema elements · report template · lookup dataset
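As a rough illustration of how such a file can be consumed, the sketch below parses the sample configuration with Jackson. The class name and file path are hypothetical; the talk does not show the framework's actual loader.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;

public class ReportConfigLoader {
    public static void main(String[] args) throws Exception {
        // Read the externally managed configuration file (hypothetical path).
        ObjectMapper mapper = new ObjectMapper();
        JsonNode report = mapper.readTree(new File("report-config.json"))
                                .get("report-name");

        // "schema" points at the key holding this report's column list.
        String schemaKey = report.get("schema").asText();      // "sample-schema"
        JsonNode columns = report.get(schemaKey);               // e.g. ["name", "sex"]
        JsonNode lookups = report.get("look-up-dataset");       // lookup dataset names
        JsonNode rules   = report.get("transformation-rule");   // step1..stepN SQL
        String template  = report.get("report-template").get(0).asText();

        System.out.println("columns=" + columns + ", template=" + template);
    }
}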
Deep-Dive
Apply Schema Stage
• Create a Spark SQL query from the schema provided (see the sketch below)
• Filter out columns that are not needed for the report using Spark SQL
• Reduce the size of the DataFrame

DF1:
  ID | Name  | Sex    | Birth
  1  | name1 | male   | 01/01/1970
  2  | name2 | female | 01/01/1970

select name, sex from DF1

DF2:
  Name  | Sex
  name1 | male
  name2 | female
Configuration for this stage (only the relevant field shown):
  "sample-schema": ["name", "sex"]
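A minimal sketch of this stage, assuming the input DataFrame is already registered as the temp view DF1; the select list is assembled from the configured sample-schema columns.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.List;

public class ApplySchemaStage {
    // Build "select <cols> from DF1" from the configured schema elements,
    // pruning every column the report does not need.
    static Dataset<Row> applySchema(SparkSession spark, List<String> schemaCols) {
        return spark.sql("select " + String.join(", ", schemaCols) + " from DF1");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("apply-schema").master("local[*]").getOrCreate();
        spark.sql("select * from values " +
                  "(1, 'name1', 'male', '01/01/1970'), " +
                  "(2, 'name2', 'female', '01/01/1970') as t(ID, Name, Sex, Birth)")
             .createOrReplaceTempView("DF1");

        // DF2 keeps only the columns listed in "sample-schema".
        applySchema(spark, List.of("name", "sex")).show();
        spark.stop();
    }
}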
Data Lookup Stage
• Reports might need data that is not part of the input, for example static data
• Join that data with the input Dataset on a common key (see the sketch below)

DF1:
  ID | Name  | Sex
  1  | name1 | male
  2  | name2 | female

sample-lookup:
  ID | Phone
  1  | 123-456-7890
  2  | 098-765-4321

DF after lookup:
  ID | Name  | Sex    | Phone
  1  | name1 | male   | 123-456-7890
  2  | name2 | female | 098-765-4321
Configuration for this stage (only the relevant field shown):
  "look-up-dataset": ["sample-lookup"]
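A sketch of the lookup join on the sample data above; the join key "ID" follows the example, while the real key would come from configuration.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataLookupStage {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("data-lookup").master("local[*]").getOrCreate();
        Dataset<Row> df1 = spark.sql(
                "select * from values (1, 'name1', 'male'), (2, 'name2', 'female') " +
                "as t(ID, Name, Sex)");
        Dataset<Row> lookup = spark.sql(
                "select * from values (1, '123-456-7890'), (2, '098-765-4321') " +
                "as t(ID, Phone)");

        // Enrich the input Dataset with the configured lookup on the common key.
        Dataset<Row> enriched = df1.join(lookup, "ID");
        enriched.show();   // ID, Name, Sex, Phone
        spark.stop();
    }
}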
Apply Transformation Rules Stage
• Reports need aggregations at different levels, so several transformation rules are configured, and each transformation returns a DataFrame
original DF:
  txn_ID | txn_type | cat_ID | amount
  1      | Credit   | 100    | 50
  2      | Debit    | 102    | 30
  3      | Credit   | 100    | 20
  4      | Credit   | 102    | 10
  5      | Credit   | 105    | 100
  6      | Debit    | 102    | 20
  7      | Credit   | 105    | 30
  8      | Credit   | 105    | 50
  9      | Credit   | 105    | 60
  10     | Debit    | 100    | 10
  11     | Debit    | 104    | 5

step1 DF: the same columns with rows txn_ID 1 through 10 (row 11 filtered out)

step2 DF:
  cat_ID | amount | count
  100    | 80     | 3
  102    | 90     | 4
  105    | 210    | 3

step3 DF:
  txn_type | amount | count
  Credit   | 320    | 7
  Debit    | 60     | 3
Configuration for this stage (the step SQL runs in sequence, as sketched below):
{
  "report-name": {
    "title": "",
    …
    "transformation-rule": {
      "step1": "select * from original_DF where txn_id <= 10",
      "step2": "select cat_id, sum(amount) as amount, count(amount) as count from step1 group by cat_id",
      "step3": "select txn_type, sum(amount) as amount, count(amount) as count from step1 group by txn_type"
    },
    "sample-schema": ["name", "sex"]
  }
}
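A sketch of how such ordered rules can be executed, assuming each step's result is registered as a temp view so later rules can reference earlier steps by name (the iteration strategy is an assumption; the input data here is an abbreviated stand-in for the table above).

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.LinkedHashMap;
import java.util.Map;

public class ApplyTransformationRules {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("transformation-rules").master("local[*]").getOrCreate();
        // Abbreviated stand-in for the original DF shown above.
        spark.sql("select * from values " +
                  "(1, 'Credit', 100, 50), (2, 'Debit', 102, 30), (11, 'Debit', 104, 5) " +
                  "as t(txn_ID, txn_type, cat_ID, amount)")
             .createOrReplaceTempView("original_DF");

        // Ordered rules as they would arrive from "transformation-rule".
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("step1", "select * from original_DF where txn_ID <= 10");
        rules.put("step2", "select cat_ID, sum(amount) as amount, count(amount) as count "
                         + "from step1 group by cat_ID");
        rules.put("step3", "select txn_type, sum(amount) as amount, count(amount) as count "
                         + "from step1 group by txn_type");

        // Run each step and register its result as a temp view, so any later
        // rule can build on an earlier step's DataFrame by name.
        Map<String, Dataset<Row>> stepResults = new LinkedHashMap<>();
        rules.forEach((step, sql) -> {
            Dataset<Row> df = spark.sql(sql);
            df.createOrReplaceTempView(step);
            stepResults.put(step, df);
        });
        stepResults.get("step2").show();
        spark.stop();
    }
}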
Apply Transformation Rules (continued)
Along with Spark SQL functions, UDFs (user-defined functions) provide customized querying abilities. Some sample UDFs (registration is sketched below):
• Decimalize:
  • Calculates the actual transaction amount
  • Parameters: transaction amount, decimalization factor
  select decimalize(amount, decimal_factor) from DF;
• Signage:
  • Applies rules to turn the transaction amount positive or negative
  • The result can be used for further aggregation
  select signage(amount, xx, xx…) from DF;
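Only the decimalize UDF's name and parameters come from the talk; the body below is an assumed implementation of the described semantics (shifting the raw amount by its decimal factor).

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

public class DecimalizeUdf {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("decimalize-udf").master("local[*]").getOrCreate();

        // decimalize(amount, factor): scale the raw integer amount by its
        // decimalization factor, e.g. decimalize(1250, 2) -> 12.50.
        // Assumed body; the real rule may differ.
        spark.udf().register("decimalize",
                (Long amount, Integer factor) -> amount / Math.pow(10, factor),
                DataTypes.DoubleType);

        spark.sql("select decimalize(amount, decimal_factor) as amount " +
                  "from values (1250L, 2), (995L, 2) as t(amount, decimal_factor)")
             .show();   // 12.5, 9.95
        spark.stop();
    }
}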
Apply Template
• Choose a template engine
• The DataFrame must be transformed into a format the template can reference
• Each DataFrame is converted into a List<T> (DataFrame → Dataset<T> → List<T>)

template:
<#list step2 as step2>
State: ${step2.state}
Count: ${step2.count}
</#list>

data: List<T> step2

output (after the template engine runs):
State: CA
Count: 4
State: MA
Count: 3
State: AZ
Count: 3

Configuration for this stage (only the relevant field shown):
  "report-template": ["sample-report.ftlh"]
Template Options: Velocity, Thymeleaf, StringTemplate, FreeMarker
We selected FreeMarker because:
• It is a general-purpose template engine
• It is supported by the Apache Software Foundation (ASF)
• It is widely used across Apache projects
• It has frequent new releases; the latest was in Feb 2021
• Good documentation is available
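A minimal sketch of the templating step: a step DataFrame is collected into a list of maps (standing in for the framework's List<T>) and handed to FreeMarker along with the sample-report.ftlh template; the "templates" directory is an assumption.

import freemarker.template.Configuration;
import freemarker.template.Template;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.io.File;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ApplyTemplateStage {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("apply-template").master("local[*]").getOrCreate();
        Dataset<Row> step2 = spark.sql(
                "select * from values ('CA', 4), ('MA', 3), ('AZ', 3) as t(state, count)");

        // DataFrame -> List: maps stand in for the bean type T, so the
        // template can resolve ${step2.state} and ${step2.count}.
        List<Map<String, Object>> rows = new ArrayList<>();
        for (Row r : step2.collectAsList()) {
            Map<String, Object> m = new HashMap<>();
            m.put("state", r.getAs("state"));
            m.put("count", r.getAs("count"));
            rows.add(m);
        }

        // Render the externally provided .ftlh template with the step data.
        Configuration cfg = new Configuration(Configuration.VERSION_2_3_31);
        cfg.setDirectoryForTemplateLoading(new File("templates")); // assumed location
        Template template = cfg.getTemplate("sample-report.ftlh");
        Map<String, Object> model = new HashMap<>();
        model.put("step2", rows);
        StringWriter out = new StringWriter();
        template.process(model, out);
        System.out.println(out);   // State: CA / Count: 4 / ...
        spark.stop();
    }
}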
Success Metrics
• Time to build a new report reduced from a month to a week
• Report consumers don't have to know the internal nuances of the report build
• A highly skilled technical team is no longer required to build a report
• Processing performance of up to 10 million records per report achieved
Thank you
Find your place in technology on #TeamAmex
https://jobs.americanexpress.com/tech