Carta helps companies manage and secure their cap tables and equity plans, which means handling highly sensitive data. In a post-GDPR world, data engineers play a critical role in protecting data and limiting access at each step of a data pipeline. In this session, Troy will walk through the steps Carta’s data team has taken to secure its data pipeline using open source tools. You will leave with a checklist of things to consider when building a data lake or data warehouse, or when deploying a data orchestration system. Technologies covered include Apache Airflow, dbt, Docker, S3, Redshift, and Looker. Become a better steward of your customers’ data.
3 Big Ideas
1. Privacy has a strange history.
2. Privacy-first systems are designed by people with a professional ethic.
3. Privacy can be automated away.
Automating security in your data pipeline
privacy

“The actio iniuriarum was, in Roman law, a delict which served to protect the non-patrimonial aspects of a person's existence – who a person is rather than what a person has.”
GDPR
EU General Data Protection Regulation
● Right of access
● Pseudonymisation
● Right of erasure
● Records of processing activities
● Privacy by design
CCPA
California Consumer Privacy Act
● Know what personal information is being collected
● Right to erasure
● Know whether their personal information is being shared, and if so, with whom
● Opt-out of the sale of their personal information
Privacy Regulation
● Airflow DAGs to move data into S3 and Redshift
● DAG: Directed Acyclic Graph
● Operator/Task: A node in the graph
● Airflow runs dbt
Workflow manager from Airbnb
Apache Airflow
Apache Airflow
● Open source boilerplate for running Airflow in Docker
● Used at Carta
Dockerized Airflow
How do we keep up with the sensitive columns being added in source data?
Automating the blacklist updates
Stale Blacklist
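One way to keep the blacklist from going stale is to flag every source column that has not yet been classified as either sensitive or safe. A minimal stdlib sketch, assuming hypothetical names (this is not Carta's actual code):

```python
# Flag source columns that nobody has reviewed yet, so new PII columns
# cannot slip into the pipeline unnoticed.
def unreviewed_columns(source_columns, blacklist, allowlist):
    """Columns that are neither blacklisted (PII) nor allowlisted (reviewed, safe)."""
    reviewed = set(blacklist) | set(allowlist)
    return sorted(set(source_columns) - reviewed)

# In practice, source_columns would be read from e.g. information_schema.columns.
cols = ["id", "email", "birth_date", "referral_code"]
new_cols = unreviewed_columns(cols,
                              blacklist={"email", "birth_date"},
                              allowlist={"id"})
print(new_cols)  # -> ['referral_code']
```

A scheduled task (or CI check) can then fail whenever this list is non-empty, forcing someone to classify the new column.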
● dbt tests fail when the result set is not empty.
● The records returned by dbt test are the offending records.
Automated data tests
dbt test
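dbt's testing convention, a test is a SELECT that returns the offending records, and it fails when the result set is non-empty, can be simulated with a stdlib in-memory database:

```python
import sqlite3

# Simulate a dbt not-null test: the query returns offending rows;
# a non-empty result set means the test fails.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "a@example.com"), (2, None)])

offending = conn.execute(
    "SELECT id FROM users WHERE email IS NULL"
).fetchall()
print(offending)  # -> [(2,)]: non-empty, so this test would fail in dbt
```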
We have a custom access management system called Gatekeeper.
Tools for requesting and granting access
Automating Access
This example uses our IAM Service Account custom Terraform module to create a new Revenue Service account user with access to a single S3 data lake bucket.
Automate Data Lake access
Terraform Modules
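A module invocation of that shape might look like the fragment below. The module name, source path, input names, and bucket ARN are all illustrative assumptions, not Carta's actual module interface:

```hcl
# Hypothetical usage of a custom IAM service-account module.
module "revenue_service_account" {
  source     = "./modules/iam-service-account"
  name       = "revenue-service"
  s3_buckets = ["arn:aws:s3:::revenue-data-lake"]
}
```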
40. Data Warehouse Migrations
● sql-migrate: excellent CLI and migrations library written in Go.
● Extended to support Jinja templating.
We can rebuild the Warehouse from code.
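A sql-migrate migration is a plain SQL file split by `-- +migrate Up` / `-- +migrate Down` markers (that part is stock sql-migrate); the `{{ }}` variable below illustrates the kind of Jinja templating the talk describes adding, and the schema/table names are hypothetical:

```sql
-- +migrate Up
CREATE SCHEMA IF NOT EXISTS {{ schema }};
CREATE TABLE {{ schema }}.users (
    id    BIGINT,
    email VARCHAR(256)
);

-- +migrate Down
DROP TABLE {{ schema }}.users;
```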
Pseudonymity: Obfuscation
👍 Easy to do in any language.
👍 No impact to downstream systems.
👎 Can be unscrambled.
Scrambling or mixing up data
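A minimal sketch of scrambling, here just a character shuffle. The seed exists only to make the demo reproducible, and it also hints at the downside: anything this mechanical can be unscrambled.

```python
import random

def scramble(value: str, seed: int) -> str:
    """Obfuscate a value by shuffling its characters (toy example only)."""
    rng = random.Random(seed)
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)

scrambled = scramble("555-867-5309", seed=42)
# Same characters, new order: easy in any language, no schema changes needed.
assert sorted(scrambled) == sorted("555-867-5309")
```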
Pseudonymity: Masking
👍 Simple.
👍 Owner can verify the last 4 digits.
👎 Some pieces of the real data are stored.
Obscure part of the data
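Masking in a few lines, using a hypothetical SSN formatter: everything but the last four digits is replaced, so the owner can still verify the tail while most of the real value is hidden.

```python
def mask_ssn(ssn: str) -> str:
    """Mask all but the last 4 digits, preserving separators like '-'."""
    digits = [c for c in ssn if c.isdigit()]
    masked = ["*"] * (len(digits) - 4) + digits[-4:]
    it = iter(masked)
    return "".join(next(it) if c.isdigit() else c for c in ssn)

print(mask_ssn("123-45-6789"))  # -> ***-**-6789
```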
Pseudonymity: Tokenization
👍 Popular libraries like Faker.
👍 All original data is replaced.
👎 No way to recover the original data.
Replace real data with fake data
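The talk points to libraries like Faker for generating the fake values; the stdlib stand-in below only illustrates the shape of the technique. Note that deriving tokens from a hash (as here) is guessable for low-entropy inputs, so real tokenization typically uses random tokens or a library, everything below is an illustrative assumption.

```python
import hashlib

def tokenize_email(email: str) -> str:
    """Replace a real email with a deterministic fake one.

    Deterministic tokens keep joins across tables working, while no
    fragment of the original value survives in the output.
    """
    token = hashlib.sha256(email.encode()).hexdigest()[:12]
    return f"user_{token}@example.com"

a = tokenize_email("jane@corp.com")
b = tokenize_email("jane@corp.com")
assert a == b            # deterministic: joins keep working
assert "jane" not in a   # original value is fully replaced
```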
Pseudonymity: Blurring
👍 95% of this image is left unblurred.
👎 Possible to reverse blurring.
Blur a subset of the data
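The slide's example is an image, but for tabular data the analogous move (my generalization, not the talk's example) is coarsening precision on just the sensitive part of a value:

```python
from datetime import date

def blur_birth_date(d: date) -> int:
    """'Blur' a birth date down to its year: most analytic value survives,
    but the exact day is no longer stored."""
    return d.year

print(blur_birth_date(date(1984, 6, 15)))  # -> 1984
```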
Pseudonymity: Encryption
👍 The original data can be recovered.
👍 Manage fewer permissions downstream.
👎 Asymmetric vs Symmetric trade-offs.
Two-way transformation of the data
AWS Key Management Service
● Generate a new data key for encrypting and decrypting data protected by a master key.
● Or manually rotate the master key and re-encrypt the data.
Automate key creation and rotation
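The pattern behind KMS data keys is envelope encryption: the master key never leaves KMS and only wraps small per-dataset data keys, which is why rotating the master key can mean re-wrapping one key instead of re-encrypting all the data. The toy sketch below uses XOR purely to show the shape, it is not real cryptography and must never protect actual data (KMS uses AES under the hood):

```python
import secrets

def xor(data: bytes, key: bytes) -> bytes:
    # Toy stand-in for a real cipher; illustration only.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

master_key = secrets.token_bytes(32)      # in real life: lives inside KMS
data_key = secrets.token_bytes(32)        # per-dataset data key
wrapped_key = xor(data_key, master_key)   # stored alongside the data

ciphertext = xor(b"1984-06-15", data_key)

# Rotating the master key only re-wraps the small data key;
# the bulk data stays untouched.
new_master = secrets.token_bytes(32)
rewrapped = xor(xor(wrapped_key, master_key), new_master)

recovered = xor(ciphertext, xor(rewrapped, new_master))
assert recovered == b"1984-06-15"
```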
Encrypted Columns
● pgcrypto allows us to encrypt sensitive columns before the data lands in our S3 data lake.
● This example is encrypting the birth_date column in Postgres.
Postgres pgcrypto
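A hedged reconstruction of the kind of call described; `pgp_sym_encrypt` is pgcrypto's symmetric encryption function, while the table name and key placeholder here are illustrative:

```sql
-- Encrypt birth_date before the extract lands in the data lake.
SELECT
    id,
    pgp_sym_encrypt(birth_date::text, :encryption_key) AS birth_date_enc
FROM users;
```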
“Last Mile” Decryption
● Access to encrypted columns is limited to analysts with the encryption key.
● This example is decrypting the birth_date column in Redshift.
Decrypt sensitive data at query time
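In Postgres the inverse call is `pgp_sym_decrypt`; Redshift does not ship pgcrypto, so the Redshift version presumably goes through a UDF or equivalent (an assumption, the talk does not show the mechanism). The Postgres form of query-time decryption:

```sql
-- Only analysts who hold the key can read the plaintext.
SELECT
    id,
    pgp_sym_decrypt(birth_date_enc, :encryption_key) AS birth_date
FROM users;
```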
Encrypted Column Problems
Some things to consider...
1. Symmetric or Asymmetric encryption scheme?
2. Should we manually rotate our master key?
3. How many keys should we use and how should they be organized?
4. Should our analysts and data scientists need to think about keys?
5. When and how do we re-encrypt data? When an employee with access to keys leaves the company?
3 Big Ideas
1. Privacy has a strange history.
2. Privacy-first systems are designed by people with a professional ethic.
3. Privacy can be automated away.
Automating security in your data pipeline
privacy