Building a multi-tenant PaaS is not a walk in the part, certainly if the platform you are building on is constantly changing.
In this session I'll walk you through the adventure we've been on where you'll learn about the challenges we've had and how we approached them and whether or not our decisions worked out or not.
– How to design for scale
– How to operate the platform
– How to grow a platform mindset and force ownership
– How to run tests for your whole platform
– How to design for multi-tenancy
– How to approach constant change
– etc
Cloud projects are never finished so you'd better come prepared.
2. Azure Architect
Microsoft Azure MVP & Advisor
Writes on blog.tomkerkhove.be
Creator of Promitor.io
Hi, I’m Tom Kerkhove
2
@TomKerkhove
tomkerkhove
3. Agenda
| Scalability
| Tenancy
| Monitoring
| Shipping Features
| Webhooks
| Embrace Change
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 3
6. Scale up/down
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 6
Scale
| Easiest way of scaling is to get a bigger box
| The only trade-off is that it means your app will be unavailable for a while
| At some point you’ll run out of “bigger boxes”
7. Scale out / in
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 7
Scale
| Provide multiple copies of your application based on your workload
| No impact on your uptime, but more complex
| My preferred way of scaling, but your application needs to be designed for it
8. Choose the right compute infrastructure
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 8
Functions
Functions
Container
Instances
Cloud
Services
Service
Service
Fabric
Mesh
App
Service
Fabric
Kubernetes
Cluster
VM Scale
Sets
VMs
Bare Metal
| As control increases, so does complexity
| Every service has it’s own characteristics
| How you run your application
| How you package your application
| How you scale your application
9. Designing for scale
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 9
Scale
Order Function
Order Function
Order Function
Order Function
Order Function
Order Function
Azure Functions
10. Designing for scale with serverless
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 10
Scale
| The good
| The service handles scaling for you
| The bad
| The service handles scaling for you
| Does not provide a lot of awareness
| The ugly
| Dangerous to burn a lot of money
11. Designing for scale with PaaS
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 11
Scale
Instance
Orders Role
Cloud Services
Instance
InstanceInstance
InstanceInstance
InstanceInstance
Autoscaler
Message
Count > 1,
Add instance
Scale!
12. Designing for scale with PaaS
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 12
Scale
| The good
| You need to define how it scales
| Provides you with scaling awareness
| The bad
| You need to define how it scales
| Hard to determine the perfect scaling rules
| The ugly
| Be aware of “flapping” (http://bit.ly/monitor-autoscale-best-practices)
| Be aware of infinite scaling loops
Use an Azure Monitor Autoscale
13. 13
Scale
Node 1
Cluster
Node 2 Node 3
Pod
Pod
Pod
Pod
Pod
Pod
Pod
Pod
Pod
Pod
Pod
Pod
Pod
Pod
Pod
Designing for scale with CPaaS
Custom Metric
Provider(s)
Horizontal Pod
Autoscaler(s)
Cluster
Autoscaler
14. Designing for scale with CPaaS
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 14
Scale
| The good
| Share resources across different teams
| Serverless scaling capabilities are coming with Virtual Kubelet & Virtual Nodes
| The bad
| You are in charge of providing enough resources
| With great power, comes great responsibilities
| No autoscaling out-of-the-box
| Scaling on different levels
| The ugly
| Takes a lot of effort to ramp up on how to scale
| There’s a lot to manage
15. Create awareness around your autoscaling
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 15
Scale
| Avoid burning money, get notified before it’s too late!
| Gain insights in your autoscaling rules
| Either configured by you or managed by Azure (ie. Azure Functions)
| Learn from them and tweak them
| Detect autoscaling loops in TEST instead of during live-site issue
| Choose the approach that fits your needs
| Configure Azure Monitor notifications
| Use built-in metrics to visualize and alert on
| Provide your own tooling around it
16. February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 16
Source: http://blog.tdwright.co.uk/2018/09/06/beware-runonstartup-in-azure-functions-a-serverless-horror-story/
17. Create awareness around your autoscaling
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 17
Scale
18. Tips
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 18
Scale
| Resource consolidation pattern does not play nice with autoscaling
| Configure maximum instance count for your autoscaling
| Provide representable metrics of your remaining work
| Azure Monitor Autoscale is a hidden gem in Azure, use it!
| Does all the great things an autoscaler should do
| Learn the scalability capabilities & trade-offs of your technologies
| Kubernetes and Service Fabric are harder to scale, but more transparent than Functions
| Provide budget alerts, if feasible
20. Multi-tenancy is all about choices
20
Tenancy
| Multi-tenancy is more than data sharding
| How will you deploy your application?
| How much isolation does it require between tenants?
| How much customization will we allow?
| Do our tenants need access to their data?
| Will it run in multiple regions? Will one region require multiple deployments?
| What is our pricing model? Do we need to reflect this in our tenancy?
| You will need to make big decisions up front, which can turn out to be bad
ones – Be prepared!
21. Choosing a tenancy model
21
Tenancy
App Stamp A Stamp B
Stamp C Stamp D
App
Tenant A
App
Tenant B
#n…#1 #n…#1BA
Full isolation between tenants by
deploying everything for every tenant
Run a multi-tenant application, but use
sharded data layer
App deployed in multiple stamps &
geographies with sharded data layer
22. Choosing a sharding strategy
22
Tenancy
| Spread all your data across multiple smaller databases instead of one big one
| Good example of scaling out to handle load
| A shard key is used to determine the shard based on the chosen strategy:
| Lookup, Range & Hash strategy
| Choose your strategy wisely and think about your query patterns
| Does your customer need access to it? Then you should shard per tenant!
| You cannot easily change your strategy later on
| More information: http://bit.ly/sharding-pattern
23. Locating shards
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 23
Tenancy
Shard
Manager
How do I connect to “Sello”?
Order
Processor
#5#4#3#2#1 #6 #7 #8
Use this connection string
Get Secret
“Sql-Tenant-Sello”
24. Using shard managers
24
Tenancy
| Provide catalog of all shard in the platform
| Determine current shard based on shard key & chosen approach
| Metadata is stored in a store of choice
| Be careful where you store your secrets
| Choosing a good shard manager
| Build your own, ie on top of Azure Key Vault
| Use existing tool, ie Azure SQL Database Elastic Tools
25. Cost-efficient sharding
25
Tenancy
#5#4#3#2#1 #6 #7 #8
Elastic Pool
| In a PaaS world you need to pay for every data store instance you have.
S1–5%
S1–15%
S1–5%
S1–25%
S1–50%
S1–7%
S1–21%
S1–41%
Use an Azure SQL Elastic Pools
| We pay ~€200 for 160 DTU, but only use ~20 % of it
26. Cost-efficient sharding
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 26
Tenancy
#5#4#3#2#1 #6 #7 #8
Elastic Pool
| Enforce resource limitation on a per-shard level
7DTU
15DTU
8DTU
25DTU
24DTU
7DTU
40DTU
21DTU
93DTU
I need more resources!!!
We ran out, we only
have 200 DTU
25DTU
27. Cost-efficient sharding
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 27
Tenancy
#5#4#3#2#1 #6 #7 #8
Elastic Pool
| Enforce resource limitation on a per-shard level
7DTU
15DTU
8DTU
25DTU
24DTU
7DTU
40DTU
21DTU
50DTU
I need more resources!!!
You’ve had enough
37DTU
28. | Provide multiple resource pools to reduce impact of noisy neighbors
Cost-efficient sharding
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 28
Tenancy
#5#4#3#2#1 #6 #7 #8
Resource Pool A Resource Pool B Resource Pool C
7DTU
15DTU
8DTU
24DTU
7DTU
21DTU
50DTU
37DTU
7DTU
15DTU
8DTU
24DTU
7DTU
37DTU
29. | Reflect your pricing model in your resource pooling
Cost-efficient sharding
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 29
Tenancy
#5#4#3#2#1 #6 #7 #8
Basic Pool
37DTU
24DTU
5DTU
51DTU
63DTU
77DTU
158DTU
121DTU
Standard Pool Premium Pool
Provision 250 Standard DTUs , capped at 100Provision 100 Basic DTUs , capped at 50 Provision 500 Premium DTUs,
capped at 250
30. Cost-efficient sharding
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 30
| Consider moving all shards in a resource pool
| Configure maximum consumption per database
| Consider using multiple resource pools to reduce impact of noisy neighbor
| Resource pools are a great way to reflect your pricing model
| Monitor your pools as you would do for individual databases
Tenancy
31. Determining tenants
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 31
| Determining the tenant that is consuming your service via your API gateway
| Map the authentication key to the registered application and use its context
Tenancy
API
Gateway
API
DB
POST api/v1/orders
X-API-Key ABC
POST api/v1/orders
X-Tenant Sello
Shard
Manager
What shard is “Sello”?
Bill Bracket owns ABC.
Part of “Sello” group
32. Determining tenants
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure
Tenancy
<policies>
<inbound>
<base/>
<set-header name="X-Tenant" exists-action="override">
<value>
@(string.Join(";", (from item in context.User.Groups where
item.Name.ToLower().Contains("sello -") select item.Name.Replace("Sello - ", String.Empty).Trim())))
</value>
</set-header>
</inbound>
<backend>
<base/>
</backend>
<outbound>
<base/>
</outbound>
<on-error>
<base/>
</on-error>
</policies>
34. Monitoring
| Azure Monitor is your one-stop-shop for monitoring in Azure
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 34
Karl Ots gives a nice overview in his “Monitoring real-life Azure applications: When to use what and why” talk - https://www.youtube.com/watch?v=93-4soieZzI
35. Correlate your telemetry
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 35
Monitoring
Frontend API
Order
Processor
DB
Get Products
Create Order
Operation ABC
Operation DEF
Session XYZ
36. Correlate your telemetry
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 36
Monitoring
Frontend API
Order
Processor
DB
Get Products
Create Order
Operation ABC
Operation DEF
Session XYZ
Cycle 123
37. Correlate your telemetry
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 37
| Use different layers of correlation ids & make sure the terminology is
consistent
| Ideally you have “session id”, “operation id” & “cycle id”
| Always return your correlation ids to your consumers
| Easier to troubleshoot in case of support cases
| Correlated all your telemetry to provide a logical flow, not just traces
Monitoring
38. Enriching your telemetry
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 38
| Provide app-specific information to all telemetry
| Think about service name, instance name, tenant name, business-specific ids, …
| Use telemetry initializers to automatically add information
Monitoring
39. GDPR
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 39
| Never track personal identifiable information
| Think email address, username, first & last name, etc
| You can
| Stop tracking any information
| Obfuscate/hash all information by using a custom Telemetry Processor
Monitoring
40. Application overview with App Map
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 40
| Great way to see all your dependencies, how their health is
| Living documentation of your platform
Monitoring
41. Health checks
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 41
| Report status of your application - Is it still healthy? Is it ready?
| This can be used for liveness & readiness probes when autoscaling
| Go as far as you want – Simple HTTP 200 or list of all dependencies
| Be wary of performance impact
| Always provide throttling to block noisy consumers
| Think about your connection management
| No direct business value, until it’s too late
Monitoring
43. Availability monitoring
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 43
| Good way to monitor availability and latency across the world
| Configure alerts when availability drops
| Health checks are a good fit
Monitoring
44. Handling alerts
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 44
| Always automate alert creation, they are part of your infrastructure as well
| Build a centralized alert handling process
| Azure Logic Apps is a good fit for this
| Unless an alert needs different alert handling
| Different alerts have different contracts
| Use adapters to receive notifications
| Map to internal metric contract
| Handle via centralized alert handler
Monitoring
Azure Monitor
Classic Alert
Adapter
Azure Monitor
Metrics Alert
Adapter
Centralized
Alert Handler
...
Time to move it! Azure Classic Alerts will be deprecated by June 2019.
https://docs.microsoft.com/en-gb/azure/azure-monitor/platform/monitoring-classic-retirement
45. Write Root Cause Analysis (RCA)
45
| Train your team for PROD outages, write RCAs in all environments
| Provides a structured way of analysing your platform
| Acts as a knowledge transfer to customers & team
| Not only what happened and how it was mitigated, also how you analysed it
| Use them to detect recurring issues
| Define action points and follow-up on them
| How could this have been prevented?
| Did we have enough telemetry & alerts?
| Who was impacted? Do we need to send out communications?
Growing a platform mindset
47. Automate everything
47
| Infrastructure-as-code is a must
| Every aspect of your infrastructure should be automatically provisioned; including alerts,
scaling rules, etc.
| Living documentation of your infrastructure
| Everything should be in source control, reviewed and versioned; it is part of your platform.
| Manual interventions are evil, but not unavoidable
| Use “Deployment Notes” to track these and make it part of your release process
| Tenant onboarding should be automated from the start
| Provide APIs to manage your platform
| We all know “We’ll do it later” means “This never going to happen”
Shipping Features
49. Deployment rings
49
| A way of releasing to production environments in “rings”.
| Allows you to have early adopters before rolling out to the masses
| More information: bit.ly/deployment-rings
Shipping Features
50. Deployment rings
50
Shipping Features
App Stamp A Stamp B
Stamp C Stamp D
App
Tenant A
App
Tenant B
#n…#1 #n…#1BA
Full isolation between tenants by
deploying everything for every tenant
Run a multi-tenant application, but use
sharded data layer
App deployed in multiple stamps &
geographies with sharded data layer
51. Deployment rings
51
| What is defined as a “Ring” is up to you
| This can be per tenant, a group of tenant of category ie. early adaptors
| Depends on your tenancy-approach as well
Shipping Features
52. Tracking releases in Application Insights
52
| Annotate metrics with new releases
Shipping Features
53. Tracking releases in Application Insights
53
| Free extension available in Marketplace - http://bit.ly/ai-release-annotations
Shipping Features
54. Verifying deployments
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 54
| Use Deployment Gates to verify deployments
| Checking for Azure Alerts kicking off is a must
Shipping Features
55. Verifying deployments
55
| Use Deployment Gates to verify deployments
| Re-use your health checks and call via REST
Shipping Features
56. Verifying deployments
56
| Swap deployment slots when instance is warmed up
| Careful with slots, resources are shared which can impact production slot
Shipping Features
58. Consuming webhooks
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 58
| Generated URLs are evil, provide good DNS names of your services
| And this goes for everything, not only webhooks
Webhooks
API
(12345.provider.com)
3rd
Party
POST https://12345.provider.com/api/webhooks
Where did
123456 go?!
API
(67890.provider.com)
Use an API Gateway
59. Consuming webhooks
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 59
| Do not reduce your API security because of your 3rd Party
Webhooks
API
(12345.provider.com)
3rd
Party
POST http://12345.provider.com/api/webhooks
Where is
your cert?
Use an API Gateway
60. Consuming webhooks
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 60
| Always route webhooks through an API gateway
| This decouples the webhook from your internal architecture
Webhooks
API
Gateway
API
3rd
Party
POST http://sello.com/api/v1/webhooks POST https://sello.provider.com/api/v1/webhooks
Authentication: Certificate
61. Consuming webhooks
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 61
| Always route webhooks through an API gateway
| This decouples the webhook from your internal architecture
Webhooks
API
Gateway
Orders
API
3rd
Party
POST http://sello.com/api/v1/webhooks
POST https://orders.sello.provider.com/api/v1/webhooks
Authentication: Certificate
Stock
APISome gateways support auth
key via query parameters
62. Provide user-friendly webhooks
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 62
| Provide a way for consumers to provide context during registration
| This context will be delivered along with the effective payload
| Allows them
| Provide a self-service CRUD API to register new subscriptions
| Registering is one thing, opting-out or getting an overview is another story
| This can be important for migration reasons or in case of bugs causing bad subscriptions
| Pass your correlation id via Request-Id header
| Provide an invocation history
| This helps consumers troubleshoot why events were not delivered
| Allow them to search on correlation id
Webhooks
63. Spaghetti infrastructure 2.0?
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 63
Webhooks
Warehouse
Service
Order
Service
Stock
Service
Shipping
Service
Invoice
Service
Payment
Service
64. Spaghetti infrastructure 2.0?
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 64
| Long-term this can start to become a burden
| A lot of bookkeeping to know who to update, how we should authenticate, etc
| No central place to route all webhooks through
| Your platform needs to be robust
| What if subscriber II is not responding? Let’s build a retry mechanism!
| Who says subscriber II owns foo.bar.com?
| Webhooks should use a “I don’t care, here’s an update” approach
Webhooks
65. Event Routers
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 65
Webhooks
Warehouse
Service
Order
Service
Stock
Service
Shipping
Service
Invoice
Service
Payment
Service
Event
Router
66. Event Routers
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 66
| Event Routers do all the heavy lifting for you
| Provide a centralized hub for all things events
| Bookkeeping of whom subscribes to what webhooks and events
| They will retry sending events when they did not reply
| They will perform webhook validations
| Publishers can publish events to the event router and takes it from there
| Great for for internal usage, but harder to use with 3rd parties
| Webhook validation is not always easy to setup
Webhooks
67. Tips
February 2019 Adventures of building a (multi-tenant) PaaS on Microsoft Azure 67
| Webhooks are not durable, if you are not around you will miss it.
| If you need to ensure at-least-once delivery, consider using a broker instead
| Store audit entry of webhook that are being pushed
| Can be important in case of a dispute
| Optionally even include the response of the consumer
| Don’t only allow global registrations, consider serving more granular updates
| For example, I want updates of one flight instead of all flights
| Provide rate limiting on your webhook endpoints
| Don’t let your platform go down by your 3rd party provider
| Webhooks are contracts as well
| Provide good documentation and version them
Webhooks
70. How we used to ship
70
Embrace Change
| Major releases were done every 1-3 years
| Feature requests were in the works for long time
| Not easy to shift product focus based on market change
| Agile to the rescue!
| Move faster and learn from your users more quickly
| New features are rolled out ever sprint, week, day
| Easier to make decisions
71. We live in a world of constant change
71
Embrace Change
| Our underlying infrastructure is constantly moving & changing
| Cloud vendors are competing to offer unique services
| Get it out early and see if customers want this; if not, kill it
| Staying up to date is a lot harder
72. The lifecycle of a service
72
Embrace Change
Private Preview
• Rough version of
product
• Shared under NDA to
limited group
Public Preview
• Available to the masses
General Available
• Covered by SLA
• Supported version
The End
• Deprecation
• Silent Sunsetting
• Reincarnation in 2.0
73. The end of the road
73
Embrace Change
| Official deprecation
| Officially announced as deprecated
| Migration is required before service shutdown
| Reincarnation
| A new version of the service arises in a new version
| Can be part of service or service in total
| Silent deprecation
| No further development in the product
| Service is still running smoothly
| Does not mean you should stop using it
Azure Access Control Service,
Azure Scheduler
Azure Data Factory (Service)
Azure Functions (Host)
Can also be tooling cfr.
Azure DevOps Cloud-
based Loadtesting
74. Choosing an Alternative
74
Embrace Change
| Be careful with the latest shiny technology
| “This new technology fixes our problem!” and potentially gives you a lot of new ones
| Make sure it’s already stable
| Questions you should ask yourself
| Is it operable?
| What is the learning curve? Is it worthwhile?
| Does it have a vendor lock-in?
| Build or buy
| There is no silver bullet
75. Cloud platforms are never finished
75
Embrace Change
| There is no such thing as a cloud project, use a product mindset
| Nothing is written in stone, dare to revise decisions you’ve made
| Prepare for your migrations
| Make sure you have enough time
| Create a small POC to see if the technology is the right fit
| Change is coming, so you’d better be prepared
77. Conclusion
| Learn the scalability capabilities & trade-offs of your technologies
| Provide user-friendly webhooks & route them via API gateways as a consumer
| Define and roll out a good monitoring strategy, and make sure you have
enough information
| Automate everything, it will save you one day
| We live in a world of constant change, so be prepared
| You build it, you run it