Microsoft Windows Azure - Cloud Computing Hosting Environment Presentation
2. Under the Hood: Inside The Cloud Computing Hosting Environment (ES19). Erick Smith, Development Manager, Microsoft Corporation; Chuck Lenzmeier, Architect, Microsoft Corporation
3. Purpose Of This Talk / Agenda
- Introduce the fabric controller
- Introduce the service model
- Give some insight into how it all works
- Describe the workings at the data center level
- Then zoom in to a single machine
4. Deploying A Service Manually
- Resource allocation: machines must be chosen to host the roles of the service (fault domains, update domains, resource utilization, hosting environment, etc.); procure additional hardware if necessary; IP addresses must be acquired
- Provisioning: machines must be set up, virtual machines created, applications configured, DNS set up, and load balancers programmed
- Upgrades: locate the appropriate machines, update the software/settings as necessary, and bring down only a subset of the service at a time
- Maintaining service health: software faults must be handled; hardware failures will occur; logging infrastructure is provided to diagnose issues
- This is ongoing work: you're never done
5. Highly-Available Fabric Controller (diagram)
- The Windows Azure Fabric Controller communicates out-of-band with the hardware (switches, load balancers) for hardware control, and in-band with a control agent on each node for software control
- Each node runs service roles on Windows Server 2008 (with hypervisor); a node can be a VM or a physical machine
6. Windows Azure Automation
- The Fabric Controller (FC) maps declarative service specifications ("what" is needed) to available resources ("make it happen": fabric, switches, load balancers)
- Manages the service life cycle starting from bare metal
- Maintains system health and satisfies SLAs
- What's special about it: model-driven service management, enables a utility-model shared fabric, automates hardware management
7. Fabric Controller
- Owns all the data center hardware and uses the inventory to host services, similar to what a per-machine operating system does with applications
- Provisions the hardware as necessary and maintains its health
- Deploys applications to free resources and maintains the health of those applications
8. Modeling Services (diagram)
- A template automatically maps to a service model: traffic from the public internet reaches a front-end web role through a load balancer, and the web role communicates with a background process role
- Fundamental service-model primitives: load balancer, channel, endpoint, interface, directory, resource
9. What You Describe In Your Service Model
- The topology of your service: the roles and how they are connected
- Attributes of the various components: operating system features required, configuration settings
- The exposed interfaces
- Required characteristics: how many fault/update domains you need, how many instances of each role
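The kinds of facts a service model captures can be sketched as plain data. This is a hypothetical illustration in Python; the field names are illustrative and not Azure's actual schema.

```python
# Hypothetical sketch of a declarative service model: topology (roles and
# channels), per-role attributes, and required characteristics.
service_model = {
    "roles": {
        "web_frontend": {
            "instances": 3,
            "endpoints": [{"name": "http_in", "port": 80, "external": True}],
            "os_features": ["web_server"],
        },
        "background_worker": {
            "instances": 2,
            "endpoints": [],
            "os_features": [],
        },
    },
    # how the roles are connected
    "channels": [("web_frontend", "background_worker")],
    # required characteristics
    "fault_domains": 2,
    "update_domains": 2,
}
```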
10. Fault/Update Domains
- Allow you to specify what portion of your service can be offline at a time
- Fault domains are based on the topology of the data center (e.g., switch failure) and are statistical in nature
- Update domains are determined by what percentage of your service you will take out at a time for an upgrade
- You may experience outages from both at the same time
- The system considers fault domains when allocating service roles (example: don't put all roles in the same rack) and update domains when upgrading a service
- Allocation is spread across fault domains
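The "allocation is spread across fault domains" idea can be sketched as a simple round-robin; this is a hypothetical illustration (the function name and round-robin policy are assumptions, not the FC's actual algorithm).

```python
# Hypothetical sketch: spread role instances across fault domains so a
# single rack/switch failure takes out at most ceil(n/d) instances.
def allocate_across_fault_domains(instance_count, fault_domains):
    placement = {fd: [] for fd in fault_domains}
    for i in range(instance_count):
        fd = fault_domains[i % len(fault_domains)]  # round-robin over domains
        placement[fd].append(f"instance-{i}")
    return placement

placement = allocate_across_fault_domains(5, ["rack_a", "rack_b", "rack_c"])
```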
11. Dynamic Configuration Settings
- Purpose: communicate settings to service roles; there is no "registry" for services
- Application configuration settings: declared by the developer, set by the deployer
- System configuration settings: pre-declared, the same kinds for all roles (instance ID, fault domain ID, update domain ID), assigned by the system
- In both cases, settings are accessible at run time, with call-backs when values change
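The "accessible at run time via call-backs" pattern can be sketched as follows. This is a hypothetical illustration; the class and method names are assumptions, not an Azure API.

```python
# Hypothetical sketch: a settings store that notifies registered
# listeners when the deployer or the system changes a value.
class ConfigStore:
    def __init__(self, settings):
        self._settings = dict(settings)
        self._listeners = []

    def get(self, key):
        return self._settings[key]

    def on_change(self, callback):
        self._listeners.append(callback)

    def update(self, key, value):
        # deployer/system pushes a new value; listeners get a call-back
        old = self._settings.get(key)
        self._settings[key] = value
        for cb in self._listeners:
            cb(key, old, value)

changes = []
cfg = ConfigStore({"instance_id": "0", "update_domain": "1"})
cfg.on_change(lambda k, old, new: changes.append((k, old, new)))
cfg.update("update_domain", "2")
```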
12. Windows Azure Service Lifecycle
- The goal is to automate the life cycle as much as possible
- (Diagram: the developer/deployer is involved only at development and deployment time; the remaining lifecycle stages are automated)
13. Lifecycle Of A Windows Azure Service
- Resource allocation: nodes are chosen based on constraints encoded in the service model (fault domains, update domains, resource utilization, hosting environment, etc.); VIPs/load balancers are reserved for each external interface described in the model
- Provisioning: allocated hardware is assigned a new goal state, and the FC drives the hardware into that goal state
- Upgrades: the FC can upgrade a running service
- Maintaining service health: software faults must be handled; hardware failures will occur; logging infrastructure is provided to diagnose issues
14. Resources Come From Our Shared Pool
- Primary goal: find a home for all role instances; essentially a constraint satisfaction problem
- Allocate instances across fault domains
- Example constraints: only roles from a single service can be assigned to a node; only a single instance of a role can be assigned to a node; the node must have a compatible hosting environment; the node must have enough resources remaining (the service model allows simple hints about the resources a role will use); the node must be in the correct fault domain; only healthy nodes are considered
- A machine can be sub-partitioned into VMs
- Allocation is performed as a transaction
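The example constraints above can be treated as a filter over candidate nodes. This is a hypothetical sketch; the names are illustrative, and the real FC's constraint solver is of course far more involved.

```python
# Hypothetical sketch of the allocation constraints as an eligibility
# predicate, plus a trivial first-fit node picker.
def node_eligible(node, role, assignments):
    assigned = assignments.get(node["id"], [])
    if not node["healthy"]:
        return False
    # only roles from a single service can live on a node
    if any(a["service"] != role["service"] for a in assigned):
        return False
    # only a single instance of a given role per node
    if any(a["name"] == role["name"] for a in assigned):
        return False
    # compatible hosting environment and enough resources remaining
    return node["env"] == role["env"] and node["free_cores"] >= role["cores"]

def pick_node(nodes, role, assignments):
    candidates = [n for n in nodes if node_eligible(n, role, assignments)]
    return candidates[0]["id"] if candidates else None
```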
15. Key FC Data Structures (diagram)
- A service description decomposes into role descriptions and role instance descriptions
- These map onto logical services, logical roles, and logical role instances, which are assigned to logical nodes backed by physical nodes
16. Maintaining Node State (diagram)
- Each logical node (backed by a physical node) tracks its assigned logical role instances, a goal state, and a current state
17. The FC Provisions Machines…
- The FC maintains a state machine for each node; various events cause a node to move into a new state
- The FC maintains a cache of the state it believes each node to be in, reconciled with the true node state via communication with the agent
- The goal state is derived from the node's assigned role instances
- On a heartbeat event, the FC tries to move the node closer to its goal state (if it isn't already there)
- The FC tracks when the goal state is reached; certain events clear the "in goal state" flag
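The heartbeat-driven convergence toward a goal state can be sketched as follows. This is a hypothetical illustration: the state names and one-step-per-heartbeat policy are assumptions, not the FC's actual state machine.

```python
# Hypothetical sketch: each heartbeat advances a node one provisioning
# step toward its goal state, and the FC records when it gets there.
STATES = ["bare", "os_installed", "vm_created", "app_configured", "running"]

def on_heartbeat(node):
    cur, goal = STATES.index(node["current"]), STATES.index(node["goal"])
    if cur < goal:
        node["current"] = STATES[cur + 1]   # one step closer to goal state
    node["in_goal_state"] = node["current"] == node["goal"]

node = {"current": "bare", "goal": "running", "in_goal_state": False}
for _ in range(4):
    on_heartbeat(node)
```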
18. …And Other Data Center Resources
- Virtual IPs (VIPs) are allocated from a pool
- Load balancer (LB) setup: VIP and dedicated IP (DIP) pools are programmed automatically; DIPs are marked in/out of service as the FC's belief about the state of role instances changes
- LB probing is set up to communicate with the agent on the node, which has real-time information on the health of the role; traffic is routed only to roles ready to accept it
- Routing information is sent to the agent to configure routes based on the network configuration
- Redundant network gear is in place for high availability
19. The FC Keeps Your Service Running
- The Windows Azure FC monitors the health of roles: it detects if a role dies, and a role can indicate that it is unhealthy
- Upon learning a role is unhealthy, the current state of the node is updated appropriately and the state machine kicks in again to drive the node back into its goal state
- The FC also monitors the health of the host: if a node goes offline, the FC tries to recover it
- If a failed node can't be recovered, the FC migrates its role instances to a new node: a suitable replacement location is found and existing role instances are notified of the configuration change
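The recovery path for an unrecoverable node can be sketched as follows. This is a hypothetical illustration; the function name, data shapes, and first-healthy-node policy are assumptions (the real FC applies its full allocation constraints when picking a replacement).

```python
# Hypothetical sketch: move role instances off a failed node to a
# healthy replacement and notify surviving nodes of the config change.
def migrate_off_failed_node(failed, nodes, notifications):
    replacement = next(
        (n for n in nodes if n["healthy"] and n["id"] != failed["id"]), None)
    if replacement is None:
        return None                      # no capacity left to recover into
    replacement["instances"].extend(failed["instances"])
    failed["instances"] = []
    for peer in nodes:
        if peer["healthy"]:              # tell surviving instances
            notifications.append((peer["id"], "config changed"))
    return replacement["id"]
```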
20. How Upgrades Are Handled
- The FC can upgrade a running service; resources are deployed to all nodes in parallel
- The upgrade proceeds one update domain at a time; update domains are logical and don't need to be tied to fault domains
- The goal state for a given node is updated when its update domain is reached
- Two modes of operation: manual and automatic
- Rollbacks are achieved with the same basic mechanism
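The one-update-domain-at-a-time ordering can be sketched as a simple loop. This is a hypothetical illustration; the names are assumptions, and the health wait between domains is elided.

```python
# Hypothetical sketch: upgrade update domains in order, so only one
# domain's worth of instances is offline at any time.
def rolling_upgrade(instances_by_domain, apply_update):
    for domain in sorted(instances_by_domain):
        for inst in instances_by_domain[domain]:
            apply_update(inst)          # take down, update, restart
        # a real system would wait for this domain to report healthy
        # before moving to the next one

order = []
rolling_upgrade({1: ["w2", "w3"], 0: ["w0", "w1"]}, order.append)
```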
21. Behind The Scenes Work
- Windows Azure provisions and monitors hardware elements: compute nodes, TOR/L2 switches, load balancers, access routers, and node out-of-band (OOB) control elements
- Hardware life cycle management: burn-in tests, diagnostics, and repair; failed hardware is taken out of the pool, automatic diagnostics are applied, and failed hardware is physically replaced
- Capacity planning: ongoing node and network utilization measurements, and a proven process for bringing new hardware capacity online
22. Service Isolation And Security
- Your services are isolated from other services: they can access only the resources declared in the model (local node resources such as temp storage, and network endpoints)
- Isolation is enforced using multiple mechanisms
- Windows security patches are applied automatically, with rolling operating system image upgrades
23. Windows Azure FC Is Highly Available
- The FC is a cluster of 5-7 replicas with replicated state and automatic failover; a new primary picks up seamlessly from a failed replica
- Even if all FC replicas are down, services continue to function
- Rolling upgrade of the FC itself is supported; the FC cluster is modeled and controlled by a utility "root" FC
- (Diagram: a primary FC node replicates committed state over the replication system to secondary FC nodes' disks, with uncommitted state on the primary; each FC node runs an FC core and object model, and client nodes run an FC agent)
24. Windows Azure Fabric Is Highly Available
- The network has redundancy built in: redundant switches, load balancers, and access routers
- Services are deployed across fault domains, and load balancers route traffic only to active nodes
- Windows Azure FC state is check-pointed periodically and can be rolled back to previous checkpoints, guarding against corrupted FC state, loss of all replicated state, and operator errors
- FC state is stored on multiple replicas across fault domains
25. Service Life-cycle
- PDC release: automated service deployment; three service templates; support for changing the number of running instances; simple service upgrades/downgrades; automated service failure discovery and recovery; an external VIP address/DNS name per service; service network isolation enforcement; automated hardware management, including automated network load-balancer management
- For 2009: the ability to model more complex applications, richer service life-cycle management, and richer network management
26. Summary
- Windows Azure automates most functions: the system takes care of running services and keeping them up
- The service owner stays in control through a self-management portal
- Secure and highly available platform
- Built-in data center management: capacity planning, hardware and network management
28. Virtual Computing Environment
- Multi-tenancy with security and isolation
- Improved performance/watt/$ ratio
- Increased operations automation
- Hypervisor-based virtualization: highly efficient and scalable, leverages hardware advances
29. High-Level Architecture (diagram)
- The host partition runs Server Core as the host OS, with drivers and the virtualization stack provider (VSP)
- Guest partitions run Server Enterprise as the guest OS, with applications and virtualization stack consumers (VSCs)
- Partitions communicate over the VMBus; the hypervisor sits between the partitions and the hardware (CPU, NIC, disks)
30. Image-Based Deployment
- Images are virtual hard disks (VHDs)
- Offline construction and servicing of images
- Separate operating system and service images
- The same deployment model applies to the root partition
31. Image-Based Deployment (diagram)
- On an HV-enabled server: the host partition stacks a maintenance OS and a host-partition differencing VHD on a Server Core base VHD
- Each guest partition stacks an application package and application VHD on a guest-partition differencing VHD, backed by a shared base VHD (Server Enterprise or Server Core)
32. Rapid And Reliable Provisioning
- Deployment of images is just file copy: no installation, done as a background process, using multicast
- Image caching enables quick update and rollback
- Servicing is an offline process
- Dynamic allocation based on business needs
- Net: high availability at lower cost
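The differencing-VHD layering behind this deployment model can be sketched with plain dictionaries. This is a hypothetical illustration of the layering idea only (real VHDs operate at the block level, and all names here are made up).

```python
# Hypothetical sketch of differencing-VHD style layering: reads hit the
# service-specific diff layer first and fall through to the shared base
# image, so deploying a service only copies the (small) diff.
def effective_image(base_layer, diff_layer):
    merged = dict(base_layer)       # shared base VHD contents
    merged.update(diff_layer)       # service-specific differencing VHD wins
    return merged

base = {"os/kernel": "server-core", "config": "default"}
diff = {"config": "service-tuned", "app/service.dll": "v1"}
image = effective_image(base, diff)
```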
33. Windows Azure Compute Instance
- The Tech Preview offers one virtual machine type
- Platform: 64-bit Windows Server 2008; CPU: 1.5-1.7 GHz x64 equivalent; memory: 1.7 GB; network: 100 Mbps; transient local storage: 250 GB; Windows Azure storage also available: 50 GB
- The full service model supports more virtual machine types; expect to see more options post-PDC
34. Windows Azure Virtualization
- Hypervisor: efficient (exploits the latest processor virtualization features, e.g., SLAT and large pages), scalable (NUMA-aware), and small (consumes few resources)
- Host/guest operating system: Windows Server 2008 compatible, optimized for a virtualized environment
- I/O performance is shared equally between virtual machines
35. Second-Level Address Translation
- SLAT requires less hypervisor intervention than expensive shadow page tables (SPT), allowing more CPU cycles to be spent on real work and releasing the memory allocated for SPTs
- SLAT supports large page sizes (2 MB and 1 GB)
36. NUMA Support
- The system is divided into small groups of processors (NUMA nodes)
- Each node has dedicated (local) memory
- Nodes can access memory residing in other nodes (remote), but with extra latency
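A NUMA-aware allocator's preference for local memory can be sketched as follows. This is a hypothetical illustration; the function and policy (fill local first, spill to remote) are assumptions about what "NUMA-aware" means in practice, not the hypervisor's actual allocator.

```python
# Hypothetical sketch: satisfy a VM's memory request from its own NUMA
# node first, spilling to remote nodes only when local memory runs out.
def place_memory(vm_node, needed_mb, free_mb_by_node):
    allocation = {}
    local = min(needed_mb, free_mb_by_node[vm_node])
    if local:
        allocation[vm_node] = local     # prefer low-latency local memory
    remaining = needed_mb - local
    for node, free in free_mb_by_node.items():
        if remaining == 0:
            break
        if node == vm_node or free == 0:
            continue
        take = min(remaining, free)     # remote memory: works, but slower
        allocation[node] = take
        remaining -= take
    return allocation
```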
40. More Hypervisor Optimizations
- Scheduler: tuned for datacenter workloads (ASP.NET, etc.), with more predictability and fairness, and tolerance of heavy I/O loads
- Intercept reduction: spin lock enlightenments, reduced TLB flushes
- VMBus bandwidth improvements
41. Summary
- Automated, reliable deployment: streamlined and consistent, verifiable through offline provisioning
- Efficient, scalable hypervisor: maximizes CPU cycles spent on customer applications, optimized for datacenter workloads
- Reliable and secure virtualization: compute instances are isolated from each other, with predictable and consistent behavior
42. Related Content
- Related PDC sessions: A Lap Around Cloud Services; Architecting Services For The Cloud; Cloud Computing: Programming In The Cloud
- Related PDC labs: Windows Azure Hands-on Labs; Windows Azure Lounge
- Web site: http://www.azure.com/windows
43. Evals & Recordings
- Please fill out your evaluation for this session
- This session will be available as a recording at www.microsoftpdc.com
47. Stay Updated
- Know more about Windows Azure: http://www.microsoft.com/windowsazure/
- Know more about Microsoft Cloud Services: http://www.microsoft.com/india/cloud/
- Request an Enterprise Cloud Assessment workshop: email us at azurepro@microsoft.com
- Follow us