Carta helps companies manage and secure their cap tables and equity plans, which means handling highly sensitive data. In a post-GDPR world, data engineers play a critical role in protecting data and limiting access at each step of a data pipeline. In this session, Troy will walk through the steps Carta’s data team has taken to secure its data pipeline using open source tools. You will leave with a checklist of things to consider when building a data lake or data warehouse, or when deploying a data orchestration system. Technologies covered include Apache Airflow, dbt, Docker, S3, Redshift, and Looker. Become a better steward of your customers’ data.
3 Big Ideas
1. Privacy has a strange history.
2. Privacy-first systems are designed by people with a professional ethic.
3. Privacy can be automated away.
Automating security in your data pipeline
privacy

“The actio iniuriarum was, in Roman law, a delict which served to protect the non-patrimonial aspects of a person's existence – who a person is rather than what a person has.”
GDPR
EU General Data Protection Regulation
● Right of access
● Pseudonymisation
● Right of erasure
● Records of processing activities
● Privacy by design
CCPA
California Consumer Privacy Act
● Know what personal information is being collected
● Right to erasure
● Know whether their personal information is being shared, and if so, with whom
● Opt-out of the sale of their personal information
Privacy Regulation
● Airflow DAGs to move data into S3 and Redshift
● DAG: Directed Acyclic Graph
● Operator/Task: A node in the graph
● Airflow runs dbt
Workflow manager from Airbnb
Apache Airflow
Apache Airflow
● Open source boilerplate for running Airflow in Docker
● Used at Carta
Dockerized Airflow
How do we keep up with the sensitive columns being added in source data?
Automating the blacklist updates
Stale Blacklist
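One way to keep the blacklist from going stale is to flag every source column that has not yet been classified as either sensitive or safe. A minimal stdlib sketch, assuming hypothetical names (this is not Carta's actual code):

```python
# Flag source columns that nobody has reviewed yet, so new PII columns
# cannot slip into the pipeline unnoticed.
def unreviewed_columns(source_columns, blacklist, allowlist):
    """Columns that are neither blacklisted (PII) nor allowlisted (reviewed, safe)."""
    reviewed = set(blacklist) | set(allowlist)
    return sorted(set(source_columns) - reviewed)

# In practice, source_columns would be read from e.g. information_schema.columns.
cols = ["id", "email", "birth_date", "referral_code"]
new_cols = unreviewed_columns(cols,
                              blacklist={"email", "birth_date"},
                              allowlist={"id"})
print(new_cols)  # -> ['referral_code']
```

A scheduled task (or CI check) can then fail whenever this list is non-empty, forcing someone to classify the new column.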
● dbt tests fail when the result set is not empty.
● The records returned by dbt test are the offending records.
Automated data tests
dbt test
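dbt's testing convention, a test is a SELECT that returns the offending records, and it fails when the result set is non-empty, can be simulated with a stdlib in-memory database:

```python
import sqlite3

# Simulate a dbt not-null test: the query returns offending rows;
# a non-empty result set means the test fails.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "a@example.com"), (2, None)])

offending = conn.execute(
    "SELECT id FROM users WHERE email IS NULL"
).fetchall()
print(offending)  # -> [(2,)]: non-empty, so this test would fail in dbt
```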
We have a custom access management system called Gatekeeper.
Tools for requesting and granting access
Automating Access
This example uses our IAM Service Account custom Terraform module to create a new Revenue Service account user with access to a single S3 data lake bucket.
Automate Data Lake access
Terraform Modules
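A module invocation of that shape might look like the fragment below. The module name, source path, input names, and bucket ARN are all illustrative assumptions, not Carta's actual module interface:

```hcl
# Hypothetical usage of a custom IAM service-account module.
module "revenue_service_account" {
  source     = "./modules/iam-service-account"
  name       = "revenue-service"
  s3_buckets = ["arn:aws:s3:::revenue-data-lake"]
}
```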
40. Data Warehouse Migrations
● sql-migrate: excellent CLI and migrations library written in Go.
● Extended to support Jinja templating.
We can rebuild the Warehouse from code.
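A sql-migrate migration is a plain SQL file split by `-- +migrate Up` / `-- +migrate Down` markers (that part is stock sql-migrate); the `{{ }}` variable below illustrates the kind of Jinja templating the talk describes adding, and the schema/table names are hypothetical:

```sql
-- +migrate Up
CREATE SCHEMA IF NOT EXISTS {{ schema }};
CREATE TABLE {{ schema }}.users (
    id    BIGINT,
    email VARCHAR(256)
);

-- +migrate Down
DROP TABLE {{ schema }}.users;
```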
Pseudonymity: Obfuscation
👍 Easy to do in any language.
👍 No impact to downstream systems.
👎 Can be unscrambled.
Scrambling or mixing up data
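A minimal sketch of scrambling, here just a character shuffle. The seed exists only to make the demo reproducible, and it also hints at the downside: anything this mechanical can be unscrambled.

```python
import random

def scramble(value: str, seed: int) -> str:
    """Obfuscate a value by shuffling its characters (toy example only)."""
    rng = random.Random(seed)
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)

scrambled = scramble("555-867-5309", seed=42)
# Same characters, new order: easy in any language, no schema changes needed.
assert sorted(scrambled) == sorted("555-867-5309")
```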
Pseudonymity: Masking
👍 Simple.
👍 Owner can verify the last 4 digits.
👎 Some pieces of the real data are stored.
Obscure part of the data
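Masking in a few lines, using a hypothetical SSN formatter: everything but the last four digits is replaced, so the owner can still verify the tail while most of the real value is hidden.

```python
def mask_ssn(ssn: str) -> str:
    """Mask all but the last 4 digits, preserving separators like '-'."""
    digits = [c for c in ssn if c.isdigit()]
    masked = ["*"] * (len(digits) - 4) + digits[-4:]
    it = iter(masked)
    return "".join(next(it) if c.isdigit() else c for c in ssn)

print(mask_ssn("123-45-6789"))  # -> ***-**-6789
```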
Pseudonymity: Tokenization
👍 Popular libraries like Faker.
👍 All original data is replaced.
👎 No way to recover the original data.
Replace real data with fake data
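The talk points to libraries like Faker for generating the fake values; the stdlib stand-in below only illustrates the shape of the technique. Note that deriving tokens from a hash (as here) is guessable for low-entropy inputs, so real tokenization typically uses random tokens or a library, everything below is an illustrative assumption.

```python
import hashlib

def tokenize_email(email: str) -> str:
    """Replace a real email with a deterministic fake one.

    Deterministic tokens keep joins across tables working, while no
    fragment of the original value survives in the output.
    """
    token = hashlib.sha256(email.encode()).hexdigest()[:12]
    return f"user_{token}@example.com"

a = tokenize_email("jane@corp.com")
b = tokenize_email("jane@corp.com")
assert a == b            # deterministic: joins keep working
assert "jane" not in a   # original value is fully replaced
```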
Pseudonymity: Blurring
👍 95% of this image is left unblurred.
👎 Possible to reverse blurring.
Blur a subset of the data
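The slide's example is an image, but for tabular data the analogous move (my generalization, not the talk's example) is coarsening precision on just the sensitive part of a value:

```python
from datetime import date

def blur_birth_date(d: date) -> int:
    """'Blur' a birth date down to its year: most analytic value survives,
    but the exact day is no longer stored."""
    return d.year

print(blur_birth_date(date(1984, 6, 15)))  # -> 1984
```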
Pseudonymity: Encryption
👍 The original data can be recovered.
👍 Manage fewer permissions downstream.
👎 Asymmetric vs Symmetric trade-offs.
Two-way transformation of the data
AWS Key Management Service
● Generate a new data key for encrypting and decrypting data protected by a master key.
● Or manually rotate the master key and re-encrypt the data.
Automate key creation and rotation
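The pattern behind KMS data keys is envelope encryption: the master key never leaves KMS and only wraps small per-dataset data keys, which is why rotating the master key can mean re-wrapping one key instead of re-encrypting all the data. The toy sketch below uses XOR purely to show the shape, it is not real cryptography and must never protect actual data (KMS uses AES under the hood):

```python
import secrets

def xor(data: bytes, key: bytes) -> bytes:
    # Toy stand-in for a real cipher; illustration only.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

master_key = secrets.token_bytes(32)      # in real life: lives inside KMS
data_key = secrets.token_bytes(32)        # per-dataset data key
wrapped_key = xor(data_key, master_key)   # stored alongside the data

ciphertext = xor(b"1984-06-15", data_key)

# Rotating the master key only re-wraps the small data key;
# the bulk data stays untouched.
new_master = secrets.token_bytes(32)
rewrapped = xor(xor(wrapped_key, master_key), new_master)

recovered = xor(ciphertext, xor(rewrapped, new_master))
assert recovered == b"1984-06-15"
```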
Encrypted Columns
● pgcrypto allows us to encrypt sensitive columns before the data lands in our S3 data lake.
● This example is encrypting the birth_date column in Postgres.
Postgres pgcrypto
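A hedged reconstruction of the kind of call described; `pgp_sym_encrypt` is pgcrypto's symmetric encryption function, while the table name and key placeholder here are illustrative:

```sql
-- Encrypt birth_date before the extract lands in the data lake.
SELECT
    id,
    pgp_sym_encrypt(birth_date::text, :encryption_key) AS birth_date_enc
FROM users;
```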
“Last Mile” Decryption
● Access to encrypted columns is limited to analysts with the encryption key.
● This example is decrypting the birth_date column in Redshift.
Decrypt sensitive data at query time
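In Postgres the inverse call is `pgp_sym_decrypt`; Redshift does not ship pgcrypto, so the Redshift version presumably goes through a UDF or equivalent (an assumption, the talk does not show the mechanism). The Postgres form of query-time decryption:

```sql
-- Only analysts who hold the key can read the plaintext.
SELECT
    id,
    pgp_sym_decrypt(birth_date_enc, :encryption_key) AS birth_date
FROM users;
```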
Encrypted Column Problems
Some things to consider...
1. Symmetric or Asymmetric encryption scheme?
2. Should we manually rotate our master key?
3. How many keys should we use and how should they be organized?
4. Should our analysts and data scientists need to think about keys?
5. When and how do we re-encrypt data? When an employee with access to keys leaves the company?
3 Big Ideas
1. Privacy has a strange history.
2. Privacy-first systems are designed by people with a professional ethic.
3. Privacy can be automated away.
Automating security in your data pipeline
privacy