The document summarizes SafetyCulture's journey migrating from Google App Engine to Amazon Web Services. It discusses moving a large monolithic codebase of 500,000+ lines of code to 12 microservices built with Node.js on AWS. It describes rebuilding the client API to maintain backwards compatibility while improving performance and scaling. It also details the two-stage process used to migrate data from Google Datastore to Couchbase Server, including validation steps. The switchover took 28 hours over a weekend with planned downtime, and the business has seen growth and infrastructure cost savings since launching on AWS.
4. Large monolith - 500,000+ LOC
We outgrew Google App Engine
➔ Poor documentation and support
➔ Immature APIs (everything is beta!)
➔ Bumping into limitations
➔ Proprietary technology
Feature development and fixes slow
Needed more flexibility
Scaling the engineering team was hard
SafetyCloud Google App Engine
5. Google Cloud Platform
Web Frontend
Web API
Email
Client API
Binary Data
SafetyCloud
‘The Monolith’
SafetyCloud Monolithic Architecture
6. SafetyCulture The Goal
Improve our product
➔ More reliable and performant syncing
➔ New modern user interface
➔ Feature equivalent
➔ Full backwards compatibility with iAuditor
Address the problems a large monolithic codebase brings
Scalable, flexible, open technologies
Strong partner for infrastructure
7. SafetyCulture The Solution
10 Microservices built with Node.js
Single Page App built with Ember.js
Document store with Couchbase
Document indexing with Elasticsearch
Scalable cloud based infrastructure with Amazon Web Services
8. Amazon Web Services
Web API
Email
Client API
Binary Data
Web Frontend
SafetyCulture Microservice Architecture
10. Client API
HTTP API for SafetyCulture iOS and Android Applications
● Authentication
● Document Synchronisation
● User Management
● Document Permissions
11. Client API Change Considerations
Consumed by over 500,000 devices
Many users on legacy versions of consuming clients:
● 2% of users on a version older than 1 year
● 8.5% of users on a version older than 6 months
● 25.3% of users on a version older than 1 month
Consuming clients relied on undocumented quirks and edge cases to function - these needed to be maintained
12. Client API Rebuild
                 | Original API          | Rebuilt API
Language         | Python                | CoffeeScript
Server Framework | webapp2               | Hapi.js
Database         | Google Datastore      | Couchbase Server
Query Engine     | SQL-Like Queries      | MapReduce Indexes
Binary Storage   | Google Blobstore      | Amazon S3
Scaling          | Vertical + Horizontal | Horizontal
13. Client API Maintaining Backwards Compatibility
API Specification-based implementation
External specification of the original API became the internal implementation specification of the new API.
Manual and Automated testing
Automated unit and integration tests.
Production device testing with large scale, multi-hour real-world tests.
Replay-based Regression testing
Production device traffic was observed, recorded and replayed with a custom-built tool. Allowed us to identify request/response behaviours.
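The replay idea can be illustrated with a minimal sketch. This is not the custom-built tool from the slides: the `Request`/`Response` types, the `record` and `replay` helpers, and the two stand-in handlers are all hypothetical, and real traffic would be captured over HTTP rather than through in-process function calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Request:
    method: str
    path: str
    body: str = ""


@dataclass(frozen=True)
class Response:
    status: int
    body: str


Handler = Callable[[Request], Response]
Recording = List[Tuple[Request, Response]]


def record(handler: Handler, requests: List[Request]) -> Recording:
    """Capture what the original API returns for each observed request."""
    return [(req, handler(req)) for req in requests]


def replay(handler: Handler, recording: Recording) -> List[Tuple[Request, Response, Response]]:
    """Replay recorded requests against another API and collect mismatches."""
    mismatches = []
    for req, expected in recording:
        actual = handler(req)
        if actual != expected:
            mismatches.append((req, expected, actual))
    return mismatches


# Hypothetical stand-ins for the original and rebuilt APIs.
def original_api(req: Request) -> Response:
    if req.path == "/ping":
        return Response(200, "pong")
    return Response(404, "")  # quirk: unknown paths return an empty body


def rebuilt_api(req: Request) -> Response:
    if req.path == "/ping":
        return Response(200, "pong")
    return Response(404, "not found")  # deviates from the original quirk


traffic = [Request("GET", "/ping"), Request("GET", "/missing")]
recording = record(original_api, traffic)
mismatches = replay(rebuilt_api, recording)
for req, expected, actual in mismatches:
    print(f"{req.method} {req.path}: expected {expected}, got {actual}")
```

Each mismatch points at a behaviour, documented or not, that the rebuilt API would need to reproduce for legacy clients to keep working.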
14. Client API Rebuild Outcomes
Built, tested, deployed in under 9 engineering-months
Client API Codebase: 10,000+ LOC
Regression Test Codebase: 22,000+ LOC
Seamlessly continued working with legacy clients
Horizontally scales to easily meet peak demand
Now serves 1,200,000 requests/day
16. Google Datastore
Non-relational key-value store
Proprietary software
Eventually consistent
1MB value limit
Basic indexing and querying
Couchbase Server
Non-relational document store
Open-source project
Eventually consistent
Configurable document limit
MapReduce-based indexing
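To show what MapReduce-based indexing means in general terms, here is a small in-memory sketch. It is not Couchbase's API (Couchbase views are defined differently and run server-side); `build_index`, `audits_by_owner`, and the sample documents are hypothetical names and data chosen only for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple

Document = Dict[str, Any]
MapFn = Callable[[Document], Iterable[Tuple[Any, Any]]]
ReduceFn = Callable[[List[Any]], Any]


def build_index(documents: Iterable[Document], map_fn: MapFn, reduce_fn: ReduceFn) -> Dict[Any, Any]:
    """Run map_fn over every document, group emitted values by key, then reduce each group."""
    grouped: Dict[Any, List[Any]] = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


def audits_by_owner(doc: Document) -> Iterable[Tuple[Any, Any]]:
    """Map step: emit (owner, 1) for audit documents and nothing for other types."""
    if doc.get("type") == "audit":
        yield doc["owner"], 1


# Hypothetical sample documents.
documents = [
    {"type": "audit", "owner": "alice"},
    {"type": "audit", "owner": "bob"},
    {"type": "audit", "owner": "alice"},
    {"type": "template", "owner": "alice"},
]

# Reduce step: sum the emitted 1s to count audits per owner.
index = build_index(documents, audits_by_owner, sum)
print(index)
```

The map function decides which documents appear in the index and under which key, which is the main difference from declaring an index on a fixed property.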
22. Google Datastore
129,607,422 KV Entities
121 Query Indexes
1900 Ops/sec average
Couchbase Server
2,596,011 Documents
25 MapReduce Indexes
260 Ops/sec average
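As a rough worked comparison of the figures above, the ratios can be computed directly. This assumes both sets of numbers describe the same production workload before and after the migration, which the slide implies but does not state; the dictionary keys are labels chosen here.

```python
# Figures copied from the slide.
datastore = {"records": 129_607_422, "indexes": 121, "ops_per_sec": 1900}
couchbase = {"records": 2_596_011, "indexes": 25, "ops_per_sec": 260}

# Ratio of each Datastore figure to its Couchbase counterpart.
ratios = {name: datastore[name] / couchbase[name] for name in datastore}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

Under that assumption, Couchbase holds roughly one document for every 50 Datastore entities, with about a fifth as many indexes and about a seventh of the average operations per second.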
24. Instant Switchover
“Google App Engine one day, Amazon Web Services the next”
28 Hour Switchover Process
➔ Downtime required
➔ Minimum Load Period - Saturday to Sunday
➔ Required 15 engineering staff
➔ Additional support staff
SafetyCulture Moving Clouds
26. 12 microservices
➔ Unique scaling requirements for each
➔ Stateless and fault tolerant
Infrastructure
➔ 30+ Virtual machines serving simultaneously
➔ 14 Load balancers
Use AWS services where possible
➔ DynamoDB, ELBs, ASGs, CloudWatch, Route53...
SafetyCulture The Infrastructure
27. SafetyCulture Development
Continuous integration and delivery
➔ ~500 deploys in under five months
➔ Zero downtime deploys
Better team workflow
➔ Agile development methodology
➔ Every pull request gets reviewed and tested
➔ Microservices allow for faster and isolated development
➔ Features hidden behind feature flags
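Hiding a feature behind a flag can be sketched as follows. This is a generic illustration, not SafetyCulture's implementation: the `FeatureFlags` class, the `new-dashboard` flag, and the user IDs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class FeatureFlags:
    """In-memory flag store; a real service would load this from shared configuration."""
    enabled_globally: Set[str] = field(default_factory=set)
    enabled_for_users: Dict[str, Set[str]] = field(default_factory=dict)

    def is_enabled(self, flag: str, user_id: Optional[str] = None) -> bool:
        # A flag is on if it is enabled for everyone or for this specific user.
        if flag in self.enabled_globally:
            return True
        return user_id in self.enabled_for_users.get(flag, set())


# The new code path ships to production but is visible to one staff account only.
flags = FeatureFlags(enabled_for_users={"new-dashboard": {"staff-1"}})


def render_home(user_id: str) -> str:
    if flags.is_enabled("new-dashboard", user_id):
        return "new dashboard"
    return "classic dashboard"


print(render_home("staff-1"))
print(render_home("customer-9"))
```

Because the unfinished feature is deployed but switched off for most users, releasing it later is a configuration change rather than a deploy, which fits the frequent, zero-downtime deploys described above.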
28. SafetyCulture The Business
A better product for customers
➔ Faster and more reliable
➔ Clean and modern UI
➔ More features and fixes being released
In the five months since launch
➔ 100% growth in database records
➔ 50% user growth
➔ 40% saving in infrastructure costs
29. May the safe be with you...
safetyculture.io
@safetycultrehq