
GDPR and Hadoop

The European General Data Protection Regulation (GDPR) will apply from May 2018 and will affect all organizations that store or process personal data of EU citizens. The European Commission is exporting European data protection principles to the rest of the world while widening the definition of personal data and enforcing privacy by design. These changes will affect not only organizations but also the software used for data processing. How does it affect the Hadoop ecosystem?

Distributed data processing at scale is one of Hadoop’s core features and we will explore how the GDPR could potentially affect it. We will also take a look at the technical aspects of the rights of data subjects and see if and how we can address those, with a particular focus on open-source technologies.

This talk will give you an overview of the key themes of the GDPR including the rights of the data subject and will investigate the technical implications for data processing within the Hadoop ecosystem.


  1. GDPR and Hadoop: The elephant in the room. Janosch Woschitz, 2017-09-27
  2. Agenda • GDPR Overview • Rights of the data subject • Challenges within Hadoop ecosystem • Technical considerations
  3. Disclaimer: Take it with a grain of salt • Complex and detailed topic • This is NOT legal advice • A lot of opinions and interpretations about GDPR • Talk is not covering all aspects of GDPR • Process matters, documentation is your friend
  4. General Data Protection Regulation: GDP what? “Regulation (EU) 2016/679 of the European Parliament [...] on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)” (official title of the GDPR, http://eur-lex.europa.eu/eli/reg/2016/679/oj) • Establishes data protection as a fundamental right • Creates unified data protection law for all EU member states • Enables EU citizens to be in control of their personal data
  5. General Data Protection Regulation: Who is affected? • Applies if the data controller or processor (organization) or the data subject (person) is based in the EU • Applies to organizations based outside the European Union if they process or monitor personal data of EU citizens • Employees might be EU citizens as well
  6. General Data Protection Regulation: When does it happen? • Officially published on May 4th 2016 • Applicable from May 25th 2018 across the EU (including UK) • “Regulation” instead of “Directive” → no need for national implementing legislation, directly applicable in all EU countries • To be evaluated and reviewed by May 25th 2020
  7. General Data Protection Regulation: Why should I care? • Better data protection and portability for consumers • Fines for non-compliance: up to €10M or 2% of worldwide annual turnover for minor violations, up to €20M or 4% for major violations, in each case whichever amount is higher • Any individual has the right to lodge a complaint about any organisation with a supervisory authority (Art. 77)
  8. Privacy by design: Better data protection, you said? • Privacy by design and by default, essential data protection • Breach notification within 72 hours • Data minimization and access limitation • Data Protection Officer (DPO) and Data Privacy Impact Assessments (DPIAs) • Active, specific and unambiguous consent “the controller shall [...] implement appropriate technical and organisational measures [...] in an effective manner [...] in order to meet the requirements of this Regulation and protect the rights of data subjects.” (Article 25, GDPR)
  9. Personal data? [Image: https://pixabay.com/en/family-drawing-children-cat-paper-879432/]
  10. Personal data (examples): It all depends on context • Location or web surfing data • Video surveillance and images • Personal interests or behavioural patterns • A child's drawing depicting its family • Publication of x-ray plates together with the patient's first name • Damage caused by graffiti in public transportation • X1234 drinks a glass of wine more than 3 times a week, drives a Bentley and has a Windows 10 phone
  11. Data citizen rights: Rights of the data subject • Right of access and data portability: free of charge; structured, commonly used and machine readable • Right to erasure: “without undue delay” • Right to object, to restrict, to rectify, ... (Image source: Facebook)
  12. GDPR and Hadoop
  13. Hadoop ecosystem & beyond: The known Hadoopverse (excerpt), and much more ...
  14. Data processing on Hadoop: Bird’s eye view • Various data sources and ingestion tools • Diverse input formats, structured & unstructured • Diverse processing tools • Liberal data access, local data science • Write-append and immutable data structures • Redundant data [Diagram stages: Ingest, Process, Access]
  15. Challenges by example • Customer data from RDBMS to HDFS • Streaming device location data to Kafka
  16. Challenges by example: Ingest table from RDBMS [Diagram, labelled “Smaller Data” and “Big Data”: a daily import (e.g. via sqoop) copies the record “userId”: 123, “firstName”: “Janosch”, “dateOfBirth”: “1984-01-01”, so the same record appears in the snapshots for today, -1 day and -2 days]
  17. Problems & Solution approaches. Problems: • Right to be forgotten • Access limitation • Bound to consent • ... Solution approaches: • Anonymization • Hashing • Encryption • ...
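The "Hashing" approach on slide 17 can be sketched with a keyed hash. This is a minimal illustration, not the speaker's implementation: it assumes HMAC-SHA256 with a secret key, so that identifiers drawn from a small input space cannot simply be re-hashed and matched by someone without the key. `SECRET_KEY`, `pseudonymize` and `pseudonymize_record` are hypothetical names. Note that a keyed hash yields pseudonymized rather than anonymized data, which most readings of the GDPR still treat as personal data.

```python
import hashlib
import hmac

# Assumption: in practice the secret would come from a key management system.
SECRET_KEY = b"example-secret-do-not-use"


def pseudonymize(value, key=SECRET_KEY):
    """Return a stable keyed hash (HMAC-SHA256, hex encoded) of a value."""
    return hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_record(record, fields, key=SECRET_KEY):
    """Copy a record, replacing the listed fields with their keyed hashes."""
    return {
        name: pseudonymize(val, key) if name in fields else val
        for name, val in record.items()
    }


if __name__ == "__main__":
    # Record layout taken from the example record shown in the slides.
    record = {"userId": 123, "firstName": "Janosch", "dateOfBirth": "1984-01-01"}
    print(pseudonymize_record(record, {"firstName", "dateOfBirth"}))
```

Because the hash is deterministic for a given key, joins and counts on the hashed column keep working, which is usually the reason to prefer hashing over dropping the column.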
  18. Challenges by example: Encrypt, a.k.a. Lost Key Pattern [Diagram: the daily import (e.g. via sqoop) again produces snapshots for today, -1 day and -2 days of the record “userId”: 123, “firstName”: “Janosch”, “dateOfBirth”: “1984-01-01”; one copy is shown with encrypted fields, “userId”: 123, “firstName”: “54DCF13E4...”, “dateOfBirth”: “D3DFBCE...”, next to a key labelled 123]
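The lost key pattern from slide 18 could look roughly like the sketch below: personal fields are encrypted with a per-user key before they land in immutable snapshots, and erasure is done by destroying the key rather than rewriting every copy. Everything here is an assumption for illustration. `KeyStore`, `protect` and `reveal` are hypothetical names, and the cipher is a toy XOR stream built from HMAC-SHA256 so that the example runs with the standard library only; a real system would use a vetted authenticated cipher such as AES-GCM and a proper key management service.

```python
import hashlib
import hmac
import secrets

PROTECTED_FIELDS = ("firstName", "dateOfBirth")


def _keystream(key, nonce, length):
    """Toy keystream: HMAC-SHA256 over nonce + counter. NOT for production."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256)
        out += block.digest()
        counter += 1
    return bytes(out[:length])


def _xor(data, stream):
    return bytes(a ^ b for a, b in zip(data, stream))


def encrypt(key, text):
    """Encrypt a string; the random nonce is prepended to the ciphertext."""
    nonce = secrets.token_bytes(16)
    data = text.encode("utf-8")
    return nonce + _xor(data, _keystream(key, nonce, len(data)))


def decrypt(key, blob):
    nonce, data = blob[:16], blob[16:]
    return _xor(data, _keystream(key, nonce, len(data))).decode("utf-8")


class KeyStore:
    """Hypothetical per-user key store, standing in for Vault or a KMS."""

    def __init__(self):
        self._keys = {}

    def key_for(self, user_id):
        if user_id not in self._keys:
            self._keys[user_id] = secrets.token_bytes(32)
        return self._keys[user_id]

    def get(self, user_id):
        return self._keys.get(user_id)

    def forget(self, user_id):
        """Erasure request: destroy the key, leave the snapshots untouched."""
        self._keys.pop(user_id, None)


def protect(record, store):
    """Encrypt the protected fields of a record with the user's key."""
    key = store.key_for(record["userId"])
    return {k: encrypt(key, v) if k in PROTECTED_FIELDS else v
            for k, v in record.items()}


def reveal(record, store):
    """Decrypt a record, or return None if the user's key no longer exists."""
    key = store.get(record["userId"])
    if key is None:
        return None
    return {k: decrypt(key, v) if k in PROTECTED_FIELDS else v
            for k, v in record.items()}


if __name__ == "__main__":
    store = KeyStore()
    record = {"userId": 123, "firstName": "Janosch", "dateOfBirth": "1984-01-01"}
    snapshots = [protect(record, store) for _ in range(3)]  # today, -1 day, -2 days
    print(reveal(snapshots[0], store))
    store.forget(123)
    print(reveal(snapshots[0], store))
```

The design trade-off is that the key store becomes the single mutable component: it has to be backed up, access controlled and itself purged reliably, otherwise the encrypted copies remain recoverable.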
  19. Challenges by example: Deletion in log based systems [Diagram: an edge device with deviceId 123 pushes messages such as “deviceId”: 123, “lat”: 52.510781, “lon”: 13.371735 to a Kafka topic; the topic (offsets 2 to 6) holds the keyed messages 456/A, 123/B, 123/C, 123/D and 123/∅; the consumer reads B, C, D, ∅]
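The ∅ record on slide 19 reads like a tombstone: on a compacted Kafka topic, a record with a null value marks its key for removal when log compaction runs. The sketch below only simulates that behaviour in plain Python and does not use a Kafka client. `compact` is a hypothetical helper, the order of the messages within offsets 2 to 6 is assumed for illustration, and real compaction additionally depends on segment rolling and keeps tombstones around for a configurable retention period.

```python
def compact(log):
    """Simulate log compaction on a list of (offset, key, value) records.

    Keeps only the latest record per key and drops keys whose latest
    value is a tombstone (None). Surviving records keep their offsets.
    """
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, value)
    survivors = [
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    ]
    return sorted(survivors)


# Assumed ordering of the messages shown on the slide.
LOG = [
    (2, 456, "A"),
    (3, 123, "B"),
    (4, 123, "C"),
    (5, 123, "D"),
    (6, 123, None),  # tombstone for key 123
]

if __name__ == "__main__":
    print(compact(LOG[:-1]))  # before the tombstone was written
    print(compact(LOG))       # after the tombstone was written
```

As the diagram suggests, a consumer that already read B, C and D still holds those values, so the tombstone has to be propagated to downstream stores as a delete as well.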
  20. Challenges by example: Encrypt on write [Diagram: an edge device with deviceId 123 pushes messages such as “deviceId”: 123, “lat”: 52.510781, “lon”: 13.371735 to a Kafka topic; the topic (offsets 1 to 5) holds the keyed, encrypted messages 123/D4, 123/Z3, 456/T3, 123/6H and 123/N7; the consumer reads A, B, C, D; a key labelled 123 is shown with a “?”]
  21. Vendor recommendations: Distributions to the rescue! • Hortonworks - "GDPR: The Good, Bad and Ugly", Jun 20 2017 • Cloudera - "Simplify your response to GDPR", Aug 24 2017 • GDPR compliance via partner solutions • Only partial answers (Image source: Cloudera Inc.)
  22. GDPR recommendations simplified [Diagram of components: Kudu, Sentry, Navigator, Data Science Workbench, HDFS / ..., Ranger, Atlas, Zeppelin, + lots of partner solutions]
  23. Data privacy and open source: Pragmatic considerations • Secured cluster • Raw data in encryption zones with very limited access • Anonymize for further processing wherever possible • Proper retention policies, batch delete requests and perform regular clean-ups • Integrate with Atlas and Ranger → tagging, filtering and masking • Custom solutions for glue and missing pieces
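The "retention policies, batch delete requests and regular clean-ups" item on slide 23 can be sketched as a periodic job over date-partitioned snapshots. This is a simplified assumption of how such a job might be structured: the `snapshots` dictionary stands in for date-partitioned directories on HDFS, and `apply_retention_and_erasure` is a hypothetical name. On a real cluster the same logic would run as a batch job that rewrites the affected partitions.

```python
from datetime import date, timedelta


def apply_retention_and_erasure(snapshots, today, retention_days, erased_user_ids):
    """Return cleaned snapshots.

    Snapshots older than the retention window are dropped entirely;
    in the remaining snapshots, records of users with a pending erasure
    request are filtered out in a single pass.
    """
    cutoff = today - timedelta(days=retention_days)
    cleaned = {}
    for snapshot_date, records in snapshots.items():
        if snapshot_date < cutoff:
            continue  # retention policy: partition is past its lifetime
        cleaned[snapshot_date] = [
            r for r in records if r["userId"] not in erased_user_ids
        ]
    return cleaned


if __name__ == "__main__":
    today = date(2017, 9, 27)
    snapshots = {
        today: [{"userId": 123}, {"userId": 456}],
        today - timedelta(days=1): [{"userId": 123}, {"userId": 456}],
        today - timedelta(days=40): [{"userId": 123}],
    }
    # Erasure requests collected since the last run are applied together.
    print(apply_retention_and_erasure(snapshots, today, 30, {123}))
```

Batching the requests means each partition is rewritten once per run instead of once per request, which matters on write-append storage where every deletion is a rewrite.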
  24. Summary: The road ahead • No comprehensive open-source solution available • Proprietary services target specific problem domains, integration still necessary • It will take some time until the legal dust has settled • Idea: Avro (logical types) + Vault (or similar) + Ranger + Atlas?
  25. © 2017 Teradata
  26. Hadoop Security Primer: In just one slide • Authentication: Kerberos • Authorization: Ranger, Sentry, ACLs • Auditing / Monitoring: Ranger, Navigator, ... • Encryption of data in motion: KMS, Navigator, ... • Encryption of data at rest: Encryption zones, SEDs, ... Further reading: • Hadoop Security (Ben Spivey, Joey Echeverria) • Hadoop and Kerberos: The Madness beyond the Gate
  27. Personal data: According to GDPR “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.” (Article 4, GDPR)
