Google has refined its engineering processes over decades of building software at massive scale. Its practices include continuous integration and delivery, automated testing of every code change, containerization, and Site Reliability Engineering. Many of the internal tools behind these processes now have public counterparts: Borg inspired the open-source Kubernetes, and TensorFlow is open source, with managed versions offered on Google Cloud Platform. Migrating to Google Cloud lets companies use the same infrastructure Google uses to build software securely and reliably at large scale.
DevOps & SRE at Google Scale
1. Google Cloud Platform 1
By Kaushik Bhattacharya, Customer Engineer
Google Cloud, the Netherlands
kbhattacTweets
How Google does it and how you can benefit from it
2. 2
Engineering at Google
1. How the engineering processes at Google work
2. Our learnings, how we contribute back to open source
3. From open source to Google Cloud for enterprises
12. Google Cloud Platform 12
Code Development
Product idea
Writing code
public class foo {}
13. Google Cloud Platform 13
What it takes to be a Google engineer
Working on problems with SPEED AND SCALE is a challenge.
Engineers keep raising the bar on the tools and infrastructure.
Google Culture:
• Collaboration and co-development
• Sharing between products and teams (tools, libraries, services)
• Engineers have autonomy.
• Agile/Scrum, daily stand-up meetings
15. Google Repository statistics
As of Jan 2015:
Total number of files: 1+ billion
Number of source files: 9 million
Lines of code: 2+ billion
Depth of history: 35 million commits
Size of content: 86 terabytes
17. Google Cloud Platform 17
Advantages of monolithic repo
● Unified versioning - One source of truth
● Extensive code sharing and reuse
● Collaboration across teams
● Simplified dependency management
● Large scale refactoring
● Flexible team boundaries & code ownership
● Code visibility
18. Google Cloud Platform 18
Google uses its own version control system, called Piper.
Workflow: Sync workspace → Write code → Code Review → Commit
Labels on the workflow diagram:
● Create personal copy
● Read/Write access per folder
● Code quality & syntax check (by humans and by tooling)
● Automated test / analysis
● Auto rollback if needed
● MANDATORY
A single code tree, with fast access to the code through tooling.
All programming languages live in it.
Everyone works in trunk; branches are for releases.
22. Google Cloud Platform 22
Build systems
Why do we need build systems?
Code has many dependencies, and you do not want to compile and link them all manually.
The steps of a general build system:
1. Loading
2. Analysis
3. Execution by build system
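The three steps above can be sketched in a few lines of Python. This is a toy illustration only, not how Google's build system is implemented: the `BUILD_FILE` dictionary, the target names, and the `load`, `analyze`, and `execute` functions are all hypothetical, and "compiling" is just a string.

```python
from graphlib import TopologicalSorter

# Hypothetical build description: target -> list of targets it depends on.
BUILD_FILE = {
    "app": ["lib_net", "lib_util"],
    "lib_net": ["lib_util"],
    "lib_util": [],
}

def load(build_file):
    """Loading: read the target declarations (here, just copy the dict)."""
    return {target: list(deps) for target, deps in build_file.items()}

def analyze(targets):
    """Analysis: derive an order in which every dependency comes first."""
    return list(TopologicalSorter(targets).static_order())

def execute(order):
    """Execution: run one build action per target, in the analyzed order."""
    actions = []
    for target in order:
        actions.append(f"compile {target}")  # stand-in for a real compiler call
    return actions

actions = execute(analyze(load(BUILD_FILE)))
print(actions)
```

Separating analysis from execution is what lets a build tool decide up front what must be rebuilt and what can run in parallel.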
23. Google Cloud Platform 23
Google’s continuous build and test system
Google has its own continuous build and test system.
Remember, at Google we develop everything at HEAD in the repo.
Cloud computing provides effectively unlimited CPU and cross-user caching.
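One way to picture cross-user caching is a shared cache keyed by a hash of the build action and its inputs, so a second user with identical inputs reuses the first user's result. The sketch below assumes that design; the `BuildCache` class and `fake_compile` function are hypothetical and do not describe Google's actual system.

```python
import hashlib

class BuildCache:
    """Toy shared cache: results are keyed by a hash of the action and its inputs."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(action, inputs):
        # Hash the action name plus every input file name and its content.
        digest = hashlib.sha256(action.encode() + b"\0")
        for name in sorted(inputs):
            digest.update(name.encode() + b"\0")
            digest.update(inputs[name] + b"\0")
        return digest.hexdigest()

    def run(self, action, inputs, build_fn):
        k = self.key(action, inputs)
        if k in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[k] = build_fn(inputs)
        return self._store[k]

def fake_compile(inputs):
    # Stand-in for a compiler: upper-case the concatenated sources.
    return b"".join(inputs[name] for name in sorted(inputs)).upper()

cache = BuildCache()
sources = {"foo.c": b"int main(){}"}
first = cache.run("compile", sources, fake_compile)         # first user: miss
second = cache.run("compile", dict(sources), fake_compile)  # second user, same inputs: hit
third = cache.run("compile", {"foo.c": b"int main(){return 1;}"}, fake_compile)  # changed input: miss
print(cache.hits, cache.misses)
```

Because everyone builds from the same repository at HEAD, many users share identical inputs, which is what makes such a cache pay off.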
24. Google Cloud Platform 24
Devops at Google
Product idea
Writing code
Testing
Building
Deploying
25. Each week Google launches over 4 billion containers.
Google has been using container technology for more than 10 years.
26. Enter the container
Three deployment models, each drawn as the same stack (Application Code, Dependencies, OS, Hardware):
● Bare-metal server
● Virtual machine
● Container
27. Google Cloud Platform 27
So, you mean Docker?
Timeline: 2004 to 2016
● Docker is a popular software container platform.
● Containers are a way to package software in a format that can run isolated on a shared operating system.
28. Enter the container… and new challenges
● Scheduling, scaling across clusters of servers
● Networking and connectivity
● Security and Access control
● Logging, Monitoring, and Debugging
● Health checks and uptime preservation
● ...
29. Google Cloud Platform 29
Large-scale cluster management at Google with Borg
● Borg is software that manages all production machines at Google and runs the jobs (binaries) that engineers submit to it.
● Borg runs almost everything inside the company, including Google Search, Gmail, Google Maps, Google Docs...
● These binaries run in a container environment.
● When tasks die, they are automatically restarted, possibly on a different machine.
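The restart behavior in the last bullet can be illustrated with a small simulation. This is a simplified sketch, not Borg's algorithm: the `Cluster` class, the machine names, and the "least-loaded healthy machine" placement rule are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    machines: dict = field(default_factory=dict)   # machine name -> healthy?
    placement: dict = field(default_factory=dict)  # task name -> machine name

    def schedule(self, task):
        """Place a task on the least-loaded healthy machine (ties broken by name)."""
        healthy = [m for m, ok in self.machines.items() if ok]
        if not healthy:
            raise RuntimeError("no healthy machines available")
        load = {m: 0 for m in healthy}
        for other, machine in self.placement.items():
            if other != task and machine in load:
                load[machine] += 1
        self.placement[task] = min(healthy, key=lambda m: (load[m], m))
        return self.placement[task]

    def fail_machine(self, machine):
        """Mark a machine unhealthy and restart its tasks on other machines."""
        self.machines[machine] = False
        dead = [t for t, m in self.placement.items() if m == machine]
        for task in dead:
            self.schedule(task)
        return dead

cluster = Cluster(machines={"m1": True, "m2": True, "m3": True})
for name in ["search", "mail", "maps"]:
    cluster.schedule(name)

restarted = cluster.fail_machine("m1")
print(restarted, cluster.placement)
```

The engineer who submitted the job never names a machine; the cluster manager owns placement, so it is free to move work when hardware fails.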
30. Google Cloud Platform 30
Site Reliability Engineering
Product idea
Writing code
Testing
Building
Deploying
SRE
31. “Hope is not a strategy.
Engineering solutions to design, build, and run large-scale
systems scalably, reliably and efficiently is a strategy,
and a good one.”
32. 32
Site Reliability Engineering
● Site Reliability Engineering is a specialized job function that focuses on the reliability and maintainability of large systems.
● SRE is also a mindset, and a set of engineering approaches to running better production systems.
● Google has teams of site reliability engineers, each responsible for keeping a service globally available.
https://landing.google.com/sre/book.html
38. Google Cloud Platform 38
From Google to OSS
Internal Google → Open Source
● Borg (container orchestration) → Kubernetes
● Machine Learning → TensorFlow
● Go → Go
● Google Chrome → Chromium
● Stubby → gRPC
● Dapper → Zipkin
● GFS/BigTable → HDFS/HBase
39. 39
TensorFlow
TensorFlow is what we use for our own internal machine learning projects, and now it's available to you!
Google made it open source.
● More than 480 contributors
● 10,000 commits in a year
● 53k GitHub stars
Tutorials to get started at
https://www.tensorflow.org
40. 40
Kubernetes
Kubernetes abstracts away the hardware infrastructure and exposes your whole data center as a single enormous computing resource.
● Multiple container engines (Docker, rkt, Windows)
● Cloud and bare-metal environments
● Container Engine = Managed Kubernetes in Google Cloud
https://kubernetes.io
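To make "a single enormous computing resource" concrete, here is a minimal sketch of first-fit placement: workloads state only how much CPU they need, and a scheduler picks the node. It is an illustration under simplifying assumptions, not the Kubernetes scheduler, which filters and scores nodes on many more criteria. The `place` function, node names, and CPU figures are hypothetical.

```python
def place(pods, nodes):
    """First-fit placement: put each pod on the first node with enough free CPU.

    pods:  dict of pod name -> CPU request in millicores
    nodes: dict of node name -> CPU capacity in millicores
    Returns a dict of pod name -> node name, or None if no node has room.
    """
    free = dict(nodes)  # remaining capacity per node
    assignment = {}
    for pod, request in pods.items():
        assignment[pod] = None
        for node in free:
            if free[node] >= request:
                free[node] -= request
                assignment[pod] = node
                break
    return assignment

nodes = {"node-a": 1000, "node-b": 2000}
pods = {"web": 800, "api": 500, "batch": 1500, "cron": 300}
assignment = place(pods, nodes)
print(assignment)
```

A pod that fits nowhere stays unassigned (`None`) until capacity is added, which mirrors the idea that users talk to the cluster as a whole rather than to individual machines.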
41. 41
Istio (A Service Mesh)
● A complete framework for connecting, securing, managing and monitoring services
● Secure and monitor traffic for microservices and legacy services without requiring any changes to application code
● An open platform with key contributions from Google, IBM, Lyft and others
● Allows developers to authenticate and secure the communications between different applications using a TLS connection
● Multi-environment and multi-platform, but Kubernetes first
43. Google Cloud Platform 43
From OSS to Google Cloud
Open Source
● Kubernetes
● Istio
● TensorFlow
● MySQL / PostgreSQL
● Spark / Hadoop
● Apache Beam
● Spinnaker
Google Cloud
● Google Kubernetes Engine
● ML Engine / AutoML
● Cloud SQL
● Dataproc
● Dataflow
47. 47
Conclusion
● Google has two decades of experience with building secure software at large scale.
● Your company can make use of the same infrastructure that Google uses. Scalable, Secure and Open.
● The learnings are shared through whitepapers and contributed back through open source.