Extensible JSON validation at scale with Cerberus and PySpark
Bartosz Konieczny
@waitingforcode
First things first
Bartosz Konieczny
#dataEngineer #ApacheSparkEnthusiast #AWSuser
#waitingforcode.com #becomedataengineer.com
#@waitingforcode
#github.com/bartosz25 /data-generator /spark-scala-playground ...
Cerberus
API
● from cerberus import Validator
○ def __init__(..., schema, ignore_none_values, allow_unknown, purge_unknown, error_handler)
○ def validate(self, document, schema=None, update=False, normalize=True)
● from cerberus.errors import BaseErrorHandler
○ def __call__(self, errors)
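A minimal sketch of this API in use (the schema and document here are illustrative, not taken from the talk):

from cerberus import Validator

schema = {'id': {'type': 'integer', 'min': 1}}
# allow_unknown lets fields absent from the schema pass instead of failing validation
validator = Validator(schema, allow_unknown=True)

print(validator.validate({'id': 3, 'comment': 'extra field'}))  # True
print(validator.errors)  # {} when the document is valid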
Validation rules
● min/max
'id': {'type': 'integer', 'min': 1}
● RegEx
'email': {'type': 'string', 'regex': '^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'}
● empty, contains (all values)
'items': {'type': 'list', 'items': [{'type': 'string'}], 'empty': False, 'contains': ['item 1', 'item 2']}
● allowed, forbidden
{'role': {'type': 'string', 'allowed': ['agent', 'client', 'supplier']}}
{'role': {'forbidden': ['owner']}}
● required
'first_order': {'type': 'datetime', 'required': False}
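A short sanity check combining a few of the rules above (the failing document is illustrative and the error messages are approximate):

from cerberus import Validator

schema = {
    'id': {'type': 'integer', 'min': 1},
    'role': {'type': 'string', 'allowed': ['agent', 'client', 'supplier']},
    'first_order': {'type': 'datetime', 'required': False},
}
validator = Validator(schema)

print(validator.validate({'id': 0, 'role': 'owner'}))  # False
print(validator.errors)
# roughly: {'id': ['min value is 1'], 'role': ['unallowed value owner']}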
Custom validation rule
class ExtendedValidator(Validator):
    def _validate_productexists(self, lookup_table, field, value):
        if lookup_table == 'memory':
            existing_items = ['item1', 'item2', 'item3', 'item4']
            not_existing_items = list(filter(
                lambda item_name: item_name not in existing_items, value))
            if not_existing_items:
                self._error(field, "{} items don't exist in the lookup table"
                            .format(not_existing_items))
● extend Validator
● prefix custom rules with _validate_{rule_name} methods
● call def _error(self, *args) to add errors
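A sketch of how this rule could be wired into a schema (the 'products' field name and the sample document are illustrative; 'memory' selects the in-memory lookup shown above):

schema = {'products': {'type': 'list', 'productexists': 'memory'}}
validator = ExtendedValidator(schema)

validator.validate({'products': ['item1', 'item5']})
print(validator.errors)
# roughly: {'products': ["['item5'] items don't exist in the lookup table"]}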
Validation process
schema:
{
    'id': {'type': 'integer', 'min': 1},
    'first_order': {'type': 'datetime', 'required': False},
    # ...
}
Validator
#validate({"id": -3, "amount": 30.97, ...}) → True / False
#errors
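Roughly the same flow in code, using the schema fragment above and the document from the slide; allow_unknown is an assumption added here only to keep the fragment self-contained:

from cerberus import Validator

schema = {
    'id': {'type': 'integer', 'min': 1},
    'first_order': {'type': 'datetime', 'required': False},
}
# allow_unknown=True so fields like "amount" that are not in this fragment do not fail
validator = Validator(schema, allow_unknown=True)

print(validator.validate({'id': -3, 'amount': 30.97}))  # False: -3 violates 'min': 1
print(validator.errors)  # roughly: {'id': ['min value is 1']}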
Cerberus and PySpark
PySpark integration - whole pipeline
dataframe_schema = StructType(fields=[
    # ...
    StructField("source", StructType(fields=[
        StructField("site", StringType(), True),
        StructField("api_version", StringType(), True)
    ]), False)
])

def sum_errors_number(errors_count_1, errors_count_2):
    merged_dict = {dict_key: errors_count_1.get(dict_key, 0) + errors_count_2.get(dict_key, 0)
                   for dict_key in set(errors_count_1) | set(errors_count_2)}
    return merged_dict

spark = SparkSession.builder.master("local[4]") \
    .appName("...").getOrCreate()

errors_distribution = spark.read \
    .json(input, schema=dataframe_schema, lineSep='\n') \
    .rdd.mapPartitions(check_for_errors) \
    .reduceByKey(sum_errors_number) \
    .collectAsMap()

slide callouts:
● potential data quality issue
● collect data to the driver /!\ (collectAsMap() brings the whole result back)
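Once collectAsMap() returns, the result is a plain Python dict on the driver; assuming the (field, {error_code: count}) pairs produced by check_for_errors on the next slide, it can be inspected like this (keys and counts are illustrative):

# e.g. errors_distribution == {'source.site': {333: 2}, 'id': {66: 5}}
for field, codes in errors_distribution.items():
    for error_code, count in codes.items():
        print('{}: error code {} seen {} time(s)'.format(field, error_code, count))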
PySpark integration - extended Cerberus
UNKNOWN_NETWORK = ErrorDefinition(333, 'network_exists')

class ExtendedValidator(Validator):
    def _validate_networkexists(self, allowed_values, field, value):
        if value not in allowed_values:
            self._error(field, UNKNOWN_NETWORK, {})

class ErrorCodesHandler(SchemaErrorHandler):
    def __call__(self, validation_errors):
        def concat_path(document_path):
            return '.'.join(document_path)
        output_errors = {}
        for error in validation_errors:
            if error.is_group_error:
                for child_error in error.child_errors:
                    output_errors[concat_path(child_error.document_path)] = child_error.code
            else:
                output_errors[concat_path(error.document_path)] = error.code
        return output_errors

slide callouts:
● extended Validator with a custom validation rule (not the same rule as previously)
● custom output for the #errors call
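A quick, illustrative check of this pair outside Spark: the 'ad_network' field, its schema and the allowed values are assumptions; only ExtendedValidator, ErrorCodesHandler and the 333 code come from the slide.

schema = {'ad_network': {'type': 'string', 'networkexists': ['network_1', 'network_2']}}

validator = ExtendedValidator(schema, error_handler=ErrorCodesHandler())
validator.validate({'ad_network': 'network_3'}, normalize=False)
print(validator.errors)
# expected shape: {'ad_network': 333}  (the code defined by UNKNOWN_NETWORK)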
PySpark integration - .mapPartitions function
def check_for_errors(rows):
    validator = ExtendedValidator(schema, error_handler=ErrorCodesHandler())

    def default_dictionary():
        return defaultdict(int)

    errors = defaultdict(default_dictionary)
    for row in rows:
        validation_result = validator.validate(row.asDict(recursive=True), normalize=False)
        if not validation_result:
            for error_field, error_code in validator.errors.items():
                errors[error_field][error_code] += 1
    return [(k, dict(v)) for k, v in errors.items()]

slide callout:
● disabled normalization (normalize=False)
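The disabled normalization refers to Cerberus' normalization step described in the editor's notes at the end; a tiny illustration of what is being skipped, taken from those notes:

from cerberus import Validator

# normalization can rename fields, purge unknown ones, apply defaults and run coercions
# before validation - an extra pass over every document that this pipeline avoids
v = Validator({'foo': {'rename': 'bar'}})
print(v.normalized({'foo': 0}))  # {'bar': 0}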
Resources
● Cerberus: http://docs.python-cerberus.org/en/stable/
● GitHub Cerberus + PySpark demo: https://github.com/bartosz25/paris.py-cerberus-pyspark-talk
● GitHub data generator: https://github.com/bartosz25/data-generator
● PySpark + Cerberus series: https://www.waitingforcode.com/tags/cerberus-pyspark
Thank you !
@waitingforcode / waitingforcode.com
Editor's notes

1. SchemaErrorHandler is a child of BasicErrorHandler; BasicErrorHandler is a child of BaseErrorHandler.
2. Define the field type (datetime, int, string, ...); mapping ⇒ type, items, empty ⇒ validator types: https://docs.python-cerberus.org/en/stable/validation-rules.html
3. Explain why normalize=False. Normalization calls __normalize_mapping, which:
   * can rename an attribute if the schema asks for it: v = Validator({'foo': {'rename': 'bar'}}); v.normalized({'foo': 0}) returns {'bar': 0}
   * purges all unknown fields when Validator({'foo': {'type': 'string'}}, purge_unknown=True) is specified
   * applies the defaults
   * can also call a coercion, i.e. a callable (given as an object or as the name of a custom coercion method) applied to a value before the document is validated, for instance to convert a field's type
   Check whether validator.clear_caches can help.