The NotPetya, SolarWinds, and Kaseya attacks were all carried out by injecting malicious code into software that vendors shipped to thousands of companies. These attacks have made the public more aware of the importance of secure software supply chains, but the path from awareness to an actually secure supply chain is long. Developers have grown used to the convenience of downloading third-party software into containers, and it is challenging to tighten supply chain security in a company with a sprawl of open source components.
Scling is a small data engineering startup, and since we ask our customers to entrust us with their data, we must take security seriously. We have been securing our software supply chain since the company was founded. We have no venture capital, and our customers expect quick development iteration cycles, so we have solved supply chain security with minimal effort and minimal impact on developer productivity. In this presentation, we share how we have addressed the different supply chain attack vectors, such as Python and JVM packages, with technical solutions. We will present how we automate third-party software upgrades to stay up to date with security patches while minimising the risk of downloading rogue code.
3. www.scling.com
What do we contribute?
● Internet, digitalisation + many good little things
● Ability to measure and manipulate populations at scale
● Monetising bad security
○ Stolen CPU cycles → money
○ Ransomware
https://spinbackup.com/blog/24-biggest-ransomware-attacks-in-2019/
https://blog.chainalysis.com/reports/2022-crypto-crime-report-preview-ransomware/
https://www.theguardian.com/news/2018/mar/17/cambridge-analytica-facebook-influence-us-election
4. www.scling.com
vs
Risk-management rarely wins
Employees have conflicting definitions of success
Security vs productivity
Revenue-generation
Features
Delivery speed
Security reviews
Pentests
Password reauthentication
Phishing campaigns
Firewalls
…
5. www.scling.com
A simple recipe for application security:
- While we value items on the right, we value items on the left more.
- Invent alternatives that are aligned with speed
- Give employees aligned definitions of success
Security AND productivity
SSO
Password managers
Infrastructure as code
Hardware MFA
Ephemeral containers
…
Security reviews
Pentests
Password reauthentication
Phishing campaigns
Firewalls
…
7. www.scling.com
Quality and ops
Aligning quality with speed
TDD
Continuous delivery
Agile
Dev-friendly ops tooling
Test automation
XP
Cross-functional teams
DevOps
Trunk-based
Continuous integration
Containers
8. www.scling.com
Manual, mechanised, industrialised
Manual
● Muscle-powered
● Few tools
● Human touch for every step
Mechanised
● Direct human control
● Machine tools
● Low investment, direct return
Industrialised
● Scaled processes
● Machine tools
● Challenges: scale, logistics, legal, organisation, faults, ...
9. www.scling.com
IT craft to factory
Security
Waterfall
Application delivery
Traditional operations
Traditional QA
Infrastructure
DevSecOps
Agile
Containers
DevOps
CI/CD
Infrastructure as code
10. www.scling.com
Quality, speed - choose two
Quality vs Speed
Quality AND Speed
● Toyota: Low defect rates AND high margins per vehicle
● State of DevOps report: High reliability AND high deployment rate
○ We have industrialised software engineering
1000x span in availability metrics
11. www.scling.com
Themes of good presentations, IMHO
● We have seen lots of X / X from a different angle. Here are some patterns.
● We have context Y. Here is how we work.
● We did a thing Z. Here is what we learnt.
We need to share how we work
in order to make faster progress.
13. www.scling.com
Data industrialisation
DW
~10 year capability gap
"data factory engineering"
Enterprise big data failures
"Modern data stack" - traditional workflows, new technology
4GL / UML phase of data engineering
Data engineering education
14. www.scling.com
How data leaders work
Data processed offline
Online
Data factory
Data platform & lake
data
Data innovation & functionality
100+K daily datasets
30% staff BigQuery daily users
Value from data!
16. www.scling.com
Efficiency is sacred
● Productivity is our unique selling point
○ Client value from data is unpredictable
○ Clients don't know what they want
○ Quick experiments & pivot
● Minimal operational overhead
○ Pipelines / person
○ Datasets / day / person
● Nothing must undermine our USP
17. www.scling.com
Our security strategy
● Invest where it improves productivity
○ Cloud single sign on
○ Cloud identity management
○ Workload identities over secret tokens
○ Hardware multifactor authentication
○ Infrastructure as code
○ Patch management *
● Homogeneity over autonomy
○ Few technologies
○ Few processes
○ Processes encoded in code *
● Minimal attack surface *
● Strict asset management
○ Digital assets as code
○ Process to align assets with code
○ Explicit manual asset management
● Lean on Google
18. www.scling.com
Minimising attack surfaces
● Few ecosystems
○ Ubuntu
○ Scala + Spark
○ Python
● Few components
○ Reuse over perfect match
● Few versions
○ Single version per third party component
○ Opens gates to dependency hell *
■ Control or autonomous cells
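The "single version per third party component" rule can be checked mechanically. The following is a minimal sketch, assuming each project's pins are available as a plain dictionary; the function name, project names, and versions are made up for illustration and are not taken from Scling's tooling.

```python
from collections import defaultdict


def find_version_conflicts(pins_by_project):
    """Return components that are pinned to more than one version.

    pins_by_project maps a project name to {component: version}.
    """
    versions = defaultdict(set)
    for pins in pins_by_project.values():
        for component, version in pins.items():
            versions[component].add(version)
    # Keep only components where the projects disagree.
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}


# Hypothetical example data.
pins = {
    "pipeline_a": {"requests": "2.31.0", "numpy": "1.26.4"},
    "pipeline_b": {"requests": "2.31.0", "numpy": "1.24.0"},
}
conflicts = find_version_conflicts(pins)
print(conflicts)
```

A check like this could fail the build whenever `conflicts` is non-empty, which keeps the number of versions in play at one per component.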
20. www.scling.com
Which version?
● Version specifications
○ Exact version
■ Good for application stability
○ Range
○ Latest
■ Good for patch latency
● Specification choice tradeoffs
○ Provider trust
○ Patch latency
● Upgrade tradeoffs
○ Vulnerability patching
○ Rogue code
○ Bugs fixed
○ Bugs introduced
○ Necessary work
● Our goal:
○ Exact version
○ Transitive dependencies locked
○ Automatically updated
● Let's pursue!
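The three kinds of version specification can be told apart with a few lines of code. This is a rough sketch for pip-style requirement lines only; `classify_spec` is a hypothetical helper, and the regular expressions are a simplification that does not cover the full requirement syntax (markers, URLs, hashes).

```python
import re


def classify_spec(requirement):
    """Classify a pip-style requirement line as 'exact', 'range' or 'latest'."""
    # Split off the package name (with optional extras) from the version part.
    match = re.match(r"^\s*[A-Za-z0-9._-]+(?:\[[^\]]*\])?\s*(.*)$", requirement)
    if match is None:
        raise ValueError(f"not a requirement line: {requirement!r}")
    spec = match.group(1).strip()
    if not spec:
        return "latest"  # no constraint: whatever is newest at install time
    if re.fullmatch(r"==\s*[^*,\s]+", spec):
        return "exact"  # a single == pin without wildcards
    return "range"  # anything else: >=, <, ~=, ==1.2.*, combinations


for line in ["requests==2.31.0", "numpy>=1.24,<2", "pandas", "django==4.2.*"]:
    print(line, "->", classify_spec(line))
```

A build could reject anything that is not classified as `exact`, which matches the stated goal of exact, locked versions.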
21. www.scling.com
Levels of up to date
● No new version of A exists
● New A version exists. Application verified ok with upgrade.
● New A version exists. Unclear whether upgrade breaks application.
● New A version exists. Upgrade breaks application.
○ We use a deprecated API.
○ New version has bug.
● New A version exists. Upgrade breaks dependency B.
○ New version of B exists.
○ No new version of B exists.
○ A and B must atomically upgrade
22. www.scling.com
A bot friendly task
● There is some order that moves us forward through hell
● Slow trial and error cycle
○ Compile or test takes minutes
● There are bots
○ Dependabot, Scala Steward
■ Way too complex (100/20 KLOC, 1000s of lines of docs / examples)
○ Do not cover our needs
■ Application correctness
■ Our ecosystems
23. www.scling.com
With a strong process
● we can reason and automate
○ Trial and error forward
● Process strength
○ Faulty change is detected before prod
○ Non-code changes unlikely to affect correctness
○ Self-bootstrapping
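"Trial and error forward" can be sketched as a loop that tries one upgrade at a time and keeps it only if the build and tests still pass. This is an illustration under assumptions, not Scling's bot: `upgrade_forward` and `fake_verify` are hypothetical, and `verify` stands in for whatever the strong process uses to detect a faulty change before production.

```python
def upgrade_forward(pins, candidates, verify):
    """Try one upgrade at a time; keep it only if verify() accepts the result.

    pins: {component: version} currently locked.
    candidates: {component: newer version} on offer.
    verify: callable taking a pins dict, True if build and tests pass.
    Returns (new_pins, log).
    """
    accepted = dict(pins)
    pending = dict(candidates)
    log = []
    progress = True
    # Repeat while something was accepted: an upgrade that failed earlier
    # may pass once another component has moved forward.
    while pending and progress:
        progress = False
        for component in sorted(pending):
            trial = dict(accepted)
            trial[component] = pending[component]
            if verify(trial):
                accepted = trial
                del pending[component]
                log.append((component, "success"))
                progress = True
    log.extend((component, "test failure") for component in sorted(pending))
    return accepted, log


def fake_verify(trial):
    # Made-up rules: a 2.0 needs b 2.0, and c 9.9 is always broken.
    if trial["c"] == "9.9":
        return False
    if trial["a"] == "2.0" and trial["b"] != "2.0":
        return False
    return True


pins = {"a": "1.0", "b": "1.0", "c": "1.0"}
candidates = {"a": "2.0", "b": "2.0", "c": "9.9"}
new_pins, log = upgrade_forward(pins, candidates, fake_verify)
print(new_pins)
print(log)
```

The sketch finds an order in which single upgrades succeed, but it does not handle the case where two components must be upgraded atomically; that would need trials with more than one change at a time.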
24. www.scling.com
Strong process challenges
● Everything not covered by tests
● Test infrastructure / setup defined by code
○ How to test?
○ How to bootstrap?
● Non-deterministic processes / components
○ Mostly deterministic is ok
Extended test suite:
● Testsuite bootstrap
● Continuous deployment testsuite
● Non-production functionality
○ Dev tooling
○ Web
○ …
25. www.scling.com
Our build process
● Monorepo + trunk-based
○ Platforms + all client code and pipelines
○ Single version of platform
● All tests verified* for every change
○ Tests do not require cloud resources
● Build + test speed challenging
○ Spark → seconds of startup time → slow tests
● Simple recipe for speed:
○ Avoid doing things → caching
○ Do things in parallel
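The "avoid doing things → caching" half of the recipe can be illustrated with a content-addressed test cache: if the inputs of a test have not changed, its previous result is reused. This is a toy sketch of the general idea, not how any particular build tool implements it; `input_digest`, `run_cached`, and the in-memory `cache` are assumptions made for the example.

```python
import hashlib


def input_digest(sources):
    """Digest of a test target's inputs, given as {path: content bytes}."""
    digest = hashlib.sha256()
    for path in sorted(sources):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(sources[path])
        digest.update(b"\0")
    return digest.hexdigest()


def run_cached(sources, run_test, cache):
    """Run run_test only if these exact inputs have not been tested before."""
    key = input_digest(sources)
    if key not in cache:
        cache[key] = run_test(sources)
    return cache[key]


calls = []


def fake_test(sources):
    # Stand-in for a slow test; records that it actually ran.
    calls.append(1)
    return "passed"


cache = {}
sources = {"job.py": b"print('v1')"}
run_cached(sources, fake_test, cache)
run_cached(sources, fake_test, cache)  # unchanged inputs: served from cache
run_cached({"job.py": b"print('v2')"}, fake_test, cache)  # changed: runs again
print(len(calls))
```

Caching like this is only reliable when the digest covers every input the test can see, which is why isolation matters for it.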
26. www.scling.com
Bazel
● Designed for monorepos & strong process
○ Lazy tree evaluation
○ Isolated sandboxes
● Unmatched performance features
○ Isolation → reliable caching
○ Test result caching
○ Remote caching
○ Parallelism
○ Remote execution
● Great for stuff used by Google
● Catching up on
○ Docker
○ Scala
○ Third-party dependencies
27. www.scling.com
Dependency version control
● Transitive, locked
○ Python
○ JVM
○ Lock files in version control
● Not transitive, locked
○ Direct downloads
○ Bazel plugins
○ Container base images
○ version.bzl file
■ → bazel, python, bash
● Apt packages
○ Latest*
● Some Google components
○ VM base images, misc
○ Latest
● Employee devices
○ Manual
● Unmanaged leftovers
○ SaaS
○ Otherwise minimal exposure
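One way a single version file could feed Bazel, Python, and bash is to keep it to plain `NAME = "value"` assignments, which are valid Starlark and can also be read with Python's `ast` module. The sketch below assumes that restricted format; the actual contents of Scling's version.bzl are not shown in the slides, and the variable names and values here are invented.

```python
import ast


def read_versions(bzl_text):
    """Collect top-level NAME = "string" assignments from Starlark-like text."""
    versions = {}
    for node in ast.parse(bzl_text).body:
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        ):
            versions[node.targets[0].id] = node.value.value
    return versions


def to_shell(versions):
    """Render the same pins as shell variable assignments."""
    return "\n".join(f'{name}="{value}"' for name, value in sorted(versions.items()))


# Hypothetical file contents; names and values are made up for illustration.
EXAMPLE = '''
BASE_IMAGE_DIGEST = "sha256:0123abcd"
TOOL_VERSION = "1.2.3"
'''

versions = read_versions(EXAMPLE)
print(to_shell(versions))
```

Keeping the pins in one file means an upgrade bot has a single place to edit for everything that is locked but not transitively resolved.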
32. www.scling.com
Can we make apt install deterministic?
● apt-get typically provides latest
○ Determined by Packages.gz
○ Download during build breaks determinism & caching?
● Distroless bazel package_manager:
○ Exact Packages.gz specification
○ Debian: Versioned Packages.gz
○ Ubuntu: Only latest Packages.gz
● Compromise on determinism
○ Download Packages.gz before build
○ Caching still ok
● Not running apt scripts seemed to work. For a while.
○ Subtle low-level container failures
○ Abandoned
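The "download Packages.gz before build" compromise can be sketched as: read the package index once, extract the exact versions on offer, and hand the build explicit `name=version` pins. The code below parses the standard stanza layout of a Debian-style Packages index; the sample index, package versions, and helper names are made up, and fetching the real file is left out.

```python
import gzip


def parse_packages(text):
    """Map package name to version from a Debian-style Packages index."""
    versions = {}
    for stanza in text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            # Lines starting with whitespace continue a multi-line field.
            if line[:1] in (" ", "\t") or ":" not in line:
                continue
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        if "Package" in fields and "Version" in fields:
            # If a package is listed more than once, the last stanza wins.
            versions[fields["Package"]] = fields["Version"]
    return versions


def pin_lines(versions, wanted):
    """Exact pins in the name=version form accepted by apt-get install."""
    return [f"{name}={versions[name]}" for name in wanted]


# Made-up sample index, compressed the way Packages.gz is.
SAMPLE = """Package: curl
Version: 7.81.0-1ubuntu1.15
Description: command line tool
 for transferring data

Package: jq
Version: 1.6-2.1ubuntu3
"""
packages_gz = gzip.compress(SAMPLE.encode("utf-8"))

index = parse_packages(gzip.decompress(packages_gz).decode("utf-8"))
pins = pin_lines(index, ["curl", "jq"])
print(pins)
```

Because the pins are computed before the build starts, the build itself sees fixed inputs and its results remain cacheable, even though the index is not versioned on the Ubuntu side.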
33. www.scling.com
Scling collaboration models
● Single unified platform
○ Monorepo + trunk-based process
○ Separate instance per client
○ All test suites run on every change
● Factories are adapted to constraints and important properties
○ Ok: Security, risk, quality, availability, compliance
○ No: Preferred technology, work processes
Refinement factory
● Raw data in
● Valuable data out
● Non-technical clients
● "Easy" domain
Joint factory
● Hybrid teams
● Domain experts
● Data apprentices
● Scling runs data platform
Client factory
● Start as joint factory
● Goal: Client independent
34. www.scling.com
Divided, multi-tenant platform
Orion: base data platform
GCP (but portable to other clouds)
Isolated client instance (×3)
Saturn: non-essential operational tooling
ion CLI tool
scli CLI tool
40. www.scling.com
Resolution classifications
● No new version of A exists
● New A version exists. Application verified ok with upgrade.
● New A version exists. Unclear whether upgrade breaks application.
● New A version exists. Upgrade breaks application.
○ We use a deprecated API.
○ New version has bug.
● New A version exists. Upgrade breaks dependency B.
○ New version of B exists.
○ No new version of B exists.
○ A and B must atomically upgrade
Resolution labels: not found, success, test failure, transient
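The four resolution labels lend themselves to a small classification function that a bot could apply to each upgrade attempt. This is a sketch under assumptions: the slide does not spell out what "transient" covers, so it is taken here to mean a failure unrelated to the upgrade itself (for example a network error) that is worth retrying later. `Resolution`, `TransientError`, and `classify` are hypothetical names.

```python
from enum import Enum


class Resolution(Enum):
    NOT_FOUND = "not found"
    SUCCESS = "success"
    TEST_FAILURE = "test failure"
    TRANSIENT = "transient"


class TransientError(Exception):
    """Hypothetical marker for failures unrelated to the upgrade itself."""


def classify(current, latest, run_tests):
    """Classify one upgrade attempt from current to latest.

    run_tests(version) returns True or False, or raises TransientError.
    """
    if latest is None or latest == current:
        return Resolution.NOT_FOUND
    try:
        passed = run_tests(latest)
    except TransientError:
        return Resolution.TRANSIENT
    return Resolution.SUCCESS if passed else Resolution.TEST_FAILURE


def flaky(version):
    # Stand-in for a test run that dies for reasons outside the upgrade.
    raise TransientError("network unavailable")


print(classify("1.0", "1.0", lambda v: True))
print(classify("1.0", "1.1", lambda v: True))
print(classify("1.0", "1.1", lambda v: False))
print(classify("1.0", "1.1", flaky))
```

With only these four outcomes, the bot never needs to know why a test failed; deciding which test failure to address remains a human task.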
47. www.scling.com
Google SLSA evaluation
● Supply-chain Levels for Software Artifacts
○ Maturity model
● SLSA 1: yes
● SLSA 2: yes
● SLSA 3: some
○ Prioritising speed over Ephemeral Environment, Isolated, Non-Falsifiable
● SLSA 4: some
○ Parameterless
○ Dependencies complete (except apt)
48. www.scling.com
Concluding remarks
● Challenges?
○ Operational tuning to balance rate vs €
○ Google cloud_sql_proxy patch update took us down
○ Diva dependencies need custom solutions
○ Which test failure to address?
● Future?
○ Upgrade conditional on container scanning?
○ Dead dependency detection?
● Open source? No.
○ Specific to our environment
○ Bot is easy. Just do it.
○ Strong process challenging. But rewarding.
○ Offer: A copy of the code for a C-level lunch date. :-)