This short presentation shows one way to integrate Cerberus and PySpark. It was initially given at the Paris.py meetup (https://www.meetup.com/Paris-py-Python-Django-friends/events/264404036/)
10. PySpark integration - extended Cerberus
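from cerberus import Validator
from cerberus.errors import ErrorDefinition, SchemaErrorHandler

# Custom error definition with the error code 333
UNKNOWN_NETWORK = ErrorDefinition(333, 'network_exists')


class ExtendedValidator(Validator):
    def _validate_networkexists(self, allowed_values, field, value):
        # Custom rule: the value must be one of the allowed networks
        if value not in allowed_values:
            self._error(field, UNKNOWN_NETWORK, {})


class ErrorCodesHandler(SchemaErrorHandler):
    def __call__(self, validation_errors):
        def concat_path(document_path):
            # str() because paths into lists contain integer indexes
            return '.'.join(str(part) for part in document_path)

        # Maps the dotted path of each invalid field to its error code
        output_errors = {}
        for error in validation_errors:
            if error.is_group_error:
                for child_error in error.child_errors:
                    output_errors[concat_path(child_error.document_path)] = child_error.code
            else:
                output_errors[concat_path(error.document_path)] = error.code
        return output_errors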
* extended Validator, custom validation rule
* not the same as previously
* custom output for #errors call
11. PySpark integration - .mapPartitions function
from collections import defaultdict


def check_for_errors(rows):
    # One validator per partition; `schema` is assumed to be defined elsewhere
    validator = ExtendedValidator(schema, error_handler=ErrorCodesHandler())

    def default_dictionary():
        return defaultdict(int)

    # field path -> error code -> number of occurrences
    errors = defaultdict(default_dictionary)
    for row in rows:
        validation_result = validator.validate(row.asDict(recursive=True), normalize=False)
        if not validation_result:
            for error_field, error_code in validator.errors.items():
                errors[error_field][error_code] += 1
    return [(k, dict(v)) for k, v in errors.items()]
* disabled normalization
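Since .mapPartitions calls the function once per partition, the same field can show up in the output of several partitions, and the counts still have to be summed to get totals for the whole dataset. A minimal sketch of that merge in plain Python, assuming the per-partition results were already gathered on the driver (for instance with collect()). The helper name merge_partition_errors and the sample data are hypothetical and not part of the talk; only the code 333 comes from the slides.

```python
from collections import defaultdict


def merge_partition_errors(partition_results):
    """Sum (field, {error_code: count}) pairs produced by several partitions."""
    totals = defaultdict(lambda: defaultdict(int))
    for field, code_counts in partition_results:
        for code, count in code_counts.items():
            totals[field][code] += count
    # Convert the nested defaultdicts back to plain dictionaries
    return {field: dict(codes) for field, codes in totals.items()}


# Made-up output of two partitions; "id" was invalid in both of them
collected = [
    ("id", {2: 3}),
    ("network.short_name", {333: 1}),
    ("id", {2: 1, 36: 4}),
]
merged = merge_partition_errors(collected)
print(merged)
```

On a real cluster the same summing could probably be pushed to the executors, e.g. with reduceByKey on the RDD returned by .mapPartitions, so that only the final totals reach the driver.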
SchemaErrorHandler ⇒ child of BasicErrorHandler; BasicErrorHandler ⇒ child of BaseErrorHandler
define what the type is (datetime, int, string, ...)
mapping ⇒ type, items, empty ⇒ validator types
https://docs.python-cerberus.org/en/stable/validation-rules.html
explain why normalize=False
normalization calls __normalize_mapping:
* you can specify a new attribute name; normalization will rename the field if needed:
>>> v = Validator({'foo': {'rename': 'bar'}})
>>> v.normalized({'foo': 0})
{'bar': 0}
* later it purges all unknown fields if the validator is created with purge_unknown=True, e.g. Validator({'foo': {'type': 'string'}}, purge_unknown=True)
* it also applies the defaults
* it can also call a coercion, a callable that can, for instance, convert the type of a field. Coercion allows you to apply a callable (given as object or the name of a custom coercion method) to a value before the document is validated.
check if validator.clear_caches can help