Dynamic talks Silicon Valley: Microservices made easier on Google Cloud. In this talk, we'll cover the various technologies available on Google Cloud that help you find time for other things. We will cover Cloud Service Mesh and the built-in technologies within GCP such as Cloud Trace, Debugger, Logging, Monitoring, and Service Topology that make your life easier. Finally, we will cover capabilities of Istio that help you prepare for and deal with failures, latency, "CrashLoopBackOff" and other things that go bump in the night.
About Salmaan Rashid:
Solutions Architect at Google covering a wide variety of topics ranging from Kubernetes, GCP's serverless suite, networking, security, system integration and client library usability. He spent the last 11 years at Google working in various technical roles and the last 5 years in Google Cloud. An editor of the Google Cloud blog over at Medium.com, where you can find him writing about things he's wanted to learn or explore. He's a proud owner of an original (and still continuously working) Raspberry Pi and a cat, Gigi.
8. ● Cloud Logging
○ Structured (jsonPayload, protoPayload)
○ Unstructured (textPayload)
● Container Logs
○ just write to stdout/stderr 😊
○ Write via the Logging API 😞* (sketch below)
○ Logs grouped by resource type, source
○ gke_cluster, pod, container
● Request->Log correlation
○ "parent->child"
● Logs to Metrics
○ User-defined, alertable metrics derived from logs
log.Printf("Found ENV lookup backend ip: %v port: %vn",
backendHost, backendPort)
Logging
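The plain log.Printf above ends up in Cloud Logging as an unstructured textPayload. A minimal sketch of the API route instead, using the Go client, where the map payload becomes a filterable jsonPayload (the project ID, log name and backend values are placeholders for this sketch):

package main

import (
    "context"
    "log"

    "cloud.google.com/go/logging"
)

func main() {
    ctx := context.Background()

    // Placeholder project ID and log name.
    client, err := logging.NewClient(ctx, "my-project")
    if err != nil {
        log.Fatalf("logging.NewClient: %v", err)
    }
    defer client.Close() // flushes buffered entries on exit

    backendHost, backendPort := "10.0.0.2", "8080" // stand-ins for the ENV lookup

    // A map (or any JSON-marshalable struct) is written as a structured
    // jsonPayload, so each field is individually filterable in Cloud Logging.
    client.Logger("backend-lookup").Log(logging.Entry{
        Severity: logging.Info,
        Payload: map[string]interface{}{
            "message":     "Found ENV lookup backend",
            "backendHost": backendHost,
            "backendPort": backendPort,
        },
    })
}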
9. ● What can you monitor?
● Application Monitoring
○ Your app metrics, request metrics
● System Monitoring:
○ GKE (cluster, node), Load Balancer, GCE (VM), GAE
● Built-in metrics by type, e.g. Cloud Run requests
○ "type": "run.googleapis.com/request_count"
○ Metric counts each request
○ How do you break requests down by response_code? Use the metric's Labels to filter (see the sketch after the descriptor below)
● Labels
○ Filter a subset (e.g. "response_code=500, for route=66")
Monitoring
{
"name": "projects//metricDescriptors/run.googleapis.com/request_count",
"labels": [
{
"key": "response_code",
"description": "Response code of a request."
},
{
"key": "response_code_class",
"description": "Response code class of a request."
},
{
"key": "route",
"description": "Route name that forwards a request."
}
],
"metricKind": "DELTA",
"valueType": "INT64",
"unit": "1",
"description": "Number of requests reaching the revision.",
"displayName": "Request Count",
"type": "run.googleapis.com/request_count",
}
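To break request_count down by its labels programmatically, a rough sketch using the v3 Go Monitoring client (the project ID and one-hour window are placeholders; the filter just combines the metric type with one of the label keys from the descriptor above):

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    monitoring "cloud.google.com/go/monitoring/apiv3"
    "github.com/golang/protobuf/ptypes/timestamp"
    "google.golang.org/api/iterator"
    monitoringpb "google.golang.org/genproto/googleapis/monitoring/v3"
)

func main() {
    ctx := context.Background()
    projectID := "my-project" // placeholder

    client, err := monitoring.NewMetricClient(ctx)
    if err != nil {
        log.Fatalf("monitoring.NewMetricClient: %v", err)
    }
    defer client.Close()

    now := time.Now()
    it := client.ListTimeSeries(ctx, &monitoringpb.ListTimeSeriesRequest{
        Name: "projects/" + projectID,
        // Combine the metric type with one of its label keys, e.g. only 5xx responses.
        Filter: `metric.type="run.googleapis.com/request_count" AND metric.labels.response_code_class="5xx"`,
        Interval: &monitoringpb.TimeInterval{
            StartTime: &timestamp.Timestamp{Seconds: now.Add(-1 * time.Hour).Unix()},
            EndTime:   &timestamp.Timestamp{Seconds: now.Unix()},
        },
    })
    for {
        ts, err := it.Next()
        if err == iterator.Done {
            break
        }
        if err != nil {
            log.Fatalf("ListTimeSeries: %v", err)
        }
        fmt.Println(ts.GetMetric().GetLabels(), "points:", len(ts.GetPoints()))
    }
}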
10. ● What do you want to monitor?
● Service Level (Objective | Indicator | Agreement)
○ SLI: measure metrics for user happiness :)
○ SLO: SLI + target goal over window
○ Tighter SLO → more $ to operate
○ SLA: lawyer stuff
○ SRE Fundamentals
● Set up a Dashboard
● Set up Alerts based on Dashboards/SL*
○ PagerDuty, Email, Phone, Slack, etc.
● Incident Dashboard to ACK/Resolve/Track
● UptimeChecks:
○ Send HTTP requests to your external IP
○ Check latency, response_code from datacenters around the world!
Monitoring + Alerts
● Creating Dashboard with Istio+Stackdriver
Create a monitoring dashboard
1. Head over to Stackdriver Monitoring and create a Stackdriver Workspace.
2. Navigate to Dashboards > Create Dashboard in the left sidebar.
3. In the new Dashboard, click Add Chart and add the following metric:
● Metric: Server Response Latencies (istio.io/service/server/response_latencies)
● Group By: destination_workload_name
● Aligner: 50th percentile
● Reducer: mean
● Alignment Period: 1 minute
● Type: Line
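The same chart can be reproduced against the Monitoring API. A sketch of just the request, reusing the client setup from the request_count example above; the group-by label path is an assumption and may need adjusting depending on how the Istio metrics land in your project:

// Extra import on top of the sketch above:
//   "github.com/golang/protobuf/ptypes/duration"

// latencyChartRequest mirrors the chart configured above: p50 server latency
// per destination workload, aligned over one-minute windows.
func latencyChartRequest(projectID string, interval *monitoringpb.TimeInterval) *monitoringpb.ListTimeSeriesRequest {
    return &monitoringpb.ListTimeSeriesRequest{
        Name:     "projects/" + projectID,
        Filter:   `metric.type="istio.io/service/server/response_latencies"`,
        Interval: interval,
        Aggregation: &monitoringpb.Aggregation{
            AlignmentPeriod:    &duration.Duration{Seconds: 60},              // Alignment Period: 1 minute
            PerSeriesAligner:   monitoringpb.Aggregation_ALIGN_PERCENTILE_50, // Aligner: 50th percentile
            CrossSeriesReducer: monitoringpb.Aggregation_REDUCE_MEAN,         // Reducer: mean
            // Assumed label path; verify against the metric's descriptor.
            GroupByFields: []string{"metric.labels.destination_workload_name"}, // Group By
        },
    }
}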
11. ● Trace an HTTP/gRPC request end-to-end*
○ User → yourService
○ yourService → yourOtherService
○ yourService → GCP APIs
● Trace _WITHIN_ a GCP request:
○ What went on within the GCP API request
○ What query did my Spanner system invoke and how long did it take?
● Make it generic!
○ OpenCensus: run it anywhere, add your own tracers (sample helloworld in the reference section!) — see the sketch below
Tracing
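A minimal OpenCensus sketch in Go with the Stackdriver exporter (project ID and span names are placeholders); the parent and child spans show up stitched together as a single end-to-end trace:

package main

import (
    "context"
    "log"
    "time"

    "contrib.go.opencensus.io/exporter/stackdriver"
    "go.opencensus.io/trace"
)

func main() {
    // Export OpenCensus spans to Stackdriver Trace.
    exporter, err := stackdriver.NewExporter(stackdriver.Options{ProjectID: "my-project"})
    if err != nil {
        log.Fatalf("stackdriver.NewExporter: %v", err)
    }
    trace.RegisterExporter(exporter)
    // Sample everything for the demo; use a probability sampler in production.
    trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})

    ctx, span := trace.StartSpan(context.Background(), "frontend.handleRequest")
    callBackend(ctx)
    span.End()

    exporter.Flush() // make sure spans are sent before the demo exits
}

// callBackend creates a child span; the parent->child relationship is what
// makes the request traceable across services.
func callBackend(ctx context.Context) {
    _, span := trace.StartSpan(ctx, "frontend.callBackend")
    defer span.End()
    time.Sleep(50 * time.Millisecond) // stand-in for the downstream call
}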
12. ● Need to use the Logging API to tie traces and logs together :(
● Trick is to embed the parent trace ID in the entry's "trace" field.
// Get the trace and span IDs from the current OpenCensus span context.
ctx := span.SpanContext()
tr := ctx.TraceID.String()

lg := client.Logger("spannerlab")
// Fully qualified trace name lets Logging group this entry with the trace.
trace := fmt.Sprintf("projects/%s/traces/%s", projectId, tr)
lg.Log(logging.Entry{
    Severity: severity,
    Payload:  fmt.Sprintf(format, v...),
    Trace:    trace,
    SpanID:   ctx.SpanID.String(),
})
Tracing+Logging
13. ● Live Heap, CPU, Thread info
● Collects metrics and emits to GCP
● Memory issues, CPU, etc
● Stackdriver CPU statistics and Profiler: identify over/under-provisioned systems.
● Profile and iterate code; use traffic splitting to A/B test!
Profiling
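Enabling the profiling agent is a one-liner at startup; a minimal sketch with a placeholder service name and version:

package main

import (
    "log"
    "net/http"

    "cloud.google.com/go/profiler"
)

func main() {
    // Start the agent once at startup; it samples CPU and heap in the
    // background and uploads profiles to Cloud Profiler.
    if err := profiler.Start(profiler.Config{
        Service:        "hello-service", // placeholder
        ServiceVersion: "1.0.0",         // compare versions to spot regressions
    }); err != nil {
        log.Fatalf("profiler.Start: %v", err)
    }

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}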
14. ● Live Debug of your running app
● Does NOT _stop_ your application at a breakpoint (just not how it works!)
● Observe parameters at any breakpoint given a reference to the source code (on GitHub, Cloud Source Repositories, Bitbucket).
● Insert logpoints (extra log statements) into the running app without redeploying.
● Need to start the application as instrumented; do not enable by default! (only canary/test with a small % of traffic)
● Java, Python :) .... golang :(
Debug
18. Service to Service Communication: how do you manage all this?
[Diagram: a Service (Caller) invoking a Service (Provider), annotated with the questions both sides raise: Which instance? Which version (1.0 or 2.0)? Wait for the response? Retry on failure? Who's calling? Authorized? Quota exhausted? Secure? Are my services healthy? All without changing the service implementation!]
19. Service Management
[Diagram: an outbound proxy next to the caller and an inbound proxy next to the provider sit on the in/out paths between the two services, handling lookup, routing, timeouts, circuit breaking, policy enforcement, TLS termination and throttling, all driven by a central Management & Configuration plane.]
Service proxies intercept outbound and inbound service calls transparently to the service implementation.
The outbound proxy manages routing and error handling strategies, such as retries and circuit breakers.
The inbound proxy validates the service call based on credentials, available quota, etc.
Both proxies are configured centrally from the Management & Configuration plane.
*depending on their existing application, can either shoot through this or spend more time*
But what are microservices exactly?
We can think of them as isolated, autonomous services that work together. Typically, communication between these services happens via network calls, so that we avoid the tight coupling that led us to adopt this architecture in the first place.
A more concrete rule of thumb would be one codebase per service, thoroughly discussed in the 12 Factor app methodology. This gives companies two standout benefits: a.) the ability to release features rapidly and independent of the rest of the codebase and b.) the opportunity to organize people and teams according to business boundaries.
Generally, when trying to decide what belongs together in a service, we want to follow the Single Responsibility Principle: a service should not have more than one reason to change, giving us a clean system design with independently deployable services.
The mixed technology landscape of a microservices implementation makes handling resilience issues in application code unsustainable: writing those solutions for each and every programming language in your stack is time consuming and hard to maintain.
As many people talk about services and microservices, service-to-service communication seems simple enough: one service calls another service that provides a useful function.
However, there are quite a few things to think about:
What if the response doesn't come right away? How long should the caller wait before giving up?
Should the caller retry the operation after a request times out? Retries are useful but can also burden a system that’s already overloaded.
Likely there’s more than one instance of the service, e.g. to provide resilience. Which one should you call?
Worse yet, there are likely different versions of the service: someone may be soft-launching a new version or maintaining backwards compatibility. Which version should you be calling? This could change at any time, e.g. when the soft launch transitions into a full launch.
The service provider will also have quite a few questions:
It may need to know which service is calling.
It’ll want to check whether the caller is authorized to call the service.
Even when the caller is authorized, it may have exhausted the number of calls it’s allowed to make in a specific time period.
After all these checks pass, communication between the services should be secured.
And last but not least, we’d like to know what’s going on with our services: are they healthy, are there a lot of errors, e.g. because the service provider has issues or because the caller makes invalid requests?
All these things need to be configured and managed for a large set of services. That needs to be done centrally - otherwise we have a giant mess.
And, in most cases you can’t modify the services to do so because you may not have the source. Even if you do, you would not want to make a code change and redeploy just to change the operational setup.
The answer lies in adding a service management layer that’s connected to, but independent of the services:
Service proxies intercept outbound and inbound service calls transparently to the service implementation.
The outbound proxy manages routing and error handling strategies, such as retries and circuit breakers, based on information from the management center.
The inbound proxy validates the service call based on credentials, available quota, etc., which can be centrally configured.