This is a slightly dated deck of our learnings; I keep getting requests for it. I have removed one slide on access permissions (RBAC, which is now available).
2. About me
- @govindk
- http://govindkanshi.wordpress.com
- Databases and applications are the focus.
- MTC India
3. Agenda
• Most common issues
• Lift & Shift – or start blaming everybody else
• DR & Backup – there is no clustering?
• Performance – why is my disk so slow
• Network – what does a CIDR mean
• Some services – what they do and how you can use them
4. Migration issues (top issues we get)
• Sticky session (ARR) – fixed now (use a PowerShell command to create the tuple)
• Isolation (machine should not go out of subnet) – fixed
• Multiple IPs/NICs – fixed (NICs fixed, IPs coming)
• Management NW
• Disk performance
• Provisioned (fixed)
• SSD (fixed)
• OS – X – needs an exception; talk to the vendor
• Oracle/SAP/DB2 – need to go for support from them
• No multicast allowed - Java based App servers can use JGroups
• SNMP not present – in most public clouds
5. Practical issues
• Issue (Operations)
• Sprawl of subscriptions, VMs (monitor)
• Running out of cores or storage accounts, or skewed account usage (monitor)
• Granular billing (+ tags – coming)
• Better security mechanism (RBAC + operations log across services, getting there)
• Running out of network (properly allocate CIDR)
• Naming conventions
• Name_of_proj_imageName_purpose_region (no need of tag)
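A minimal sketch of planning CIDR ranges up front so you do not run out of network later. The address ranges and subnet names are hypothetical; it only uses Python's standard `ipaddress` module and does not talk to Azure.

```python
import ipaddress

# Hypothetical address space for one virtual network; pick a range that
# does not overlap your on-premises networks.
vnet = ipaddress.ip_network("10.10.0.0/16")

# Carve the space into /24 subnets and reserve a few by purpose.
subnets = list(vnet.subnets(new_prefix=24))
plan = {
    "web": subnets[0],
    "app": subnets[1],
    "db": subnets[2],
}

for name, net in plan.items():
    print(f"{name}: {net} ({net.num_addresses} addresses)")

# A second network must not overlap the first if the two are ever connected.
other = ipaddress.ip_network("10.10.128.0/17")
print("overlap:", vnet.overlaps(other))
```

Keeping the plan in a small script like this makes it easy to check a proposed range for overlap before anything is provisioned.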
7. ISV and enterprise - cloud
ISV | Enterprise
Agility for change | Stability with some agility
Shared Capex | Shared Capex across stakeholders
SaaS | Maintain balance (old data, old systems)
Elasticity depends on customer | Elasticity well defined for workloads – in general
Cost/margins are a big factor | Established firms know costs of people/sw and optimize
Provisioning | Provisioning with control
Need to exploit cloud infra to gain efficiencies around cost
9. Lift and test - Enterprise
• Issue – In my DC/Colo/…..
• Resources are throttled in public cloud
• Storage – throttled (you can catch storage throttling)
• Your network bandwidth is throttled so as to be nice to others.
• Your VM CPU is throttled so as to be nice to the neighbor.
• Services are throttled (shared resources)
• Exception is O365 – dedicated client, or you go for the largest machine (compute)
• Mismanaged expectations
• OS support, vendor support, network, storage IOPS requirements
• Special clustering requirement for HA
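Since throttling can be caught, the usual reaction is to back off and retry. A minimal sketch, assuming a hypothetical `ThrottledError` that your client wrapper raises when the service asks you to slow down; real SDKs signal throttling in their own ways.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical error raised when the service asks the caller to slow down."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying with exponential backoff plus jitter on throttling."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # 0.5s, 1s, 2s, ... plus up to 100% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

The jitter keeps many clients from retrying in lockstep, which would only trigger the throttle again.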
10. “Forklift” – with care
• Challenge is applications are very deeply integrated with each other
11. Decision matrix
Input
• Data size
• Adaptation to cloud cost (storage/nw/monitoring)
• Badly performing app on-premise will perform worse on the cloud
• Security implications – store data outside, auditing requirements
• Workload complexity – comes with BizTalk and MQ Series and Solaris/SGI apps
• Availability – nothing like availability sets is present on-premise
• Location – y people will access apps from x
Output
• Retire, Do Not Migrate, Replace with SaaS (work commercials), Optimize (refactor, utilize cloud offerings), Lift and shift (weigh in approaches)
13. What does Azure provide
• http://azure.microsoft.com/en-in/support/trust-center/security/
• Security Development Lifecycle (SDL).
• Operational Security Assurance (OSA).
• Assume Breach.
• Incident Response
• 24 hour monitored physical security.
• Monitoring and logging.
• Antivirus/Antimalware protection.
• Intrusion detection and DDoS.
• Zero standing privileges.
• Encrypted communications.
• Penetration testing …….
• ISO/IEC 27001:2005
• SOC 1 and SOC 2 SSAE 16/ISAE 3402
• Cloud Security Alliance Cloud Controls Matrix
• FISMA
• FedRAMP
• PCI/DSS- I
• United Kingdom G-Cloud
• HIPAA
• Life Sciences GxP
• FERPA
• FIPS
14. Security & isolation
• Isolate using virtual network/subnet – always use VNets to host
• Create proper subnets
• Use network acls
• Use network security groups, ACLs, firewalls
• Other services
• SQL Azure – connection string
• DocDB/Search - Keys
• Storage- SAS/Regenerate keys and the list goes on
• Others
• Use AD accounts, MFA
15. Security
• All connection endpoints(gateway/network permissions)
• Who manages them
• Who uses them
• Traditional monitoring (SNMP) does not work
• Data at rest encryption is your responsibility (for now)
• SQLAzure …
• SQLAzure has auditing too
• Do your own on SQL on VM or storage
• Key management is an issue – have a process of attribution and checks
• RBAC across services – starting
• Auditing – log available
17. Availability
• Issue
• On premise we use cluster of some kind
• We do not think of Datacenter/Racks today
• Our admins do that
• DB on VMs
• SQL Server – AlwaysOn (don’t compromise availability for cost)
• Oracle – DG, ADG, GG
• MySQL, PG – master-slave
• Mongo – master-slave
• Look at every service availability (it varies)
19. Availability
• Notification of downtime
• No single machine SLA – use an availability set with at least 2 instances
• Need to work on SLA by replicating data and settings
• Generally 2 pair of app + db works fine
• Cache etc require re-building
• VPN connectivity availability
• DNS/NW
• Use 3rd party DNS
• VPN connectivity availability vs expressroute
• Services (example)
• Redis cache
• Master Slave(auto failover – hopefully more transparency in future)
• SQL Azure/Queue/Storage
• 3 replicas + RO + geo replication (Where applicable)
• Monitor from external endpoints, inside apps, inside Azure
• Think about availability at all levels
20. Availability sets
• Compute needs to be in an availability set
• Some workloads do not lend themselves to an availability set
• Plan for DR in another region by
• Pushing configuration changes
• Pushing data changes using data tech
• Pushing cache – invalidation
• Traffic Manager is great but backend data needs to be in sync
21. Availability
• Test it (develop your own chaos monkey)
• Hosted services do not have failure mode so you need to go back
• Kill the connection or connect to wrong/non-existing machine.
• Measure everything – tools time, data restore time, verification time, people interaction time – literally have a log book which keeps improving over time to include other events
• Use Hystrix and similar approaches – circuit breakers to overcome service issues
• Canaries across the services(applies to perf too)
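A minimal sketch of the circuit breaker idea mentioned above. This is not Hystrix's API; the class name, thresholds, and injected clock are assumptions chosen to keep the example short and testable.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        return result
```

Wrapping each remote dependency in its own breaker means a failing service is skipped quickly instead of tying up threads and connections while it recovers.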
23. Performance
• What is better D or A series –
• Do the test
• Cpu/io at least
• Choose right vm – try scale out & scale up
• It all depends – DBs like scaleup
• Reiterate - Choose right storage
• Local disk, SSD, ephemeral disks, shared, and persistent disk from Azure blob
• Provisioned vs standard
• Standard-decoupling-scale-individual-pieces (SDSIP)
• DB – scale up/R-W-Shard
• Session data – cache/NoSQL or choose the right store
• Front end assets – use CDN, use Varnish
• Load balancer – Internal-External or nginx, HAProxy
• Auto scale –– plan for it and test it
24. Do not forget basics
• Use perf tools
• NW – iperf
• Disk – iozone
• Memory – stream
• Load balancer
• You don’t have control over size/notifications (in a way good )
• Myth - LB is ROUND ROBIN - nope
• Operations – no logs yet, can’t install monitoring agents or see the stats (coming)
• Operations – SSL termination does not happen on LB (coming)
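On the round-robin myth: the Azure load balancer is documented as distributing connections by a hash over the connection tuple rather than by taking turns. A minimal sketch of that idea follows; the hash function and backend names are assumptions for illustration and are not what the platform actually runs.

```python
import hashlib

BACKENDS = ["vm-0", "vm-1", "vm-2"]  # hypothetical backend pool


def pick_backend(src_ip, src_port, dst_ip, dst_port, protocol, backends=BACKENDS):
    """Map a connection's 5-tuple to a backend with a stable hash."""
    key = f"{src_ip}:{src_port}:{dst_ip}:{dst_port}:{protocol}".encode()
    digest = hashlib.sha256(key).digest()
    # Same tuple always lands on the same backend; a new source port may not.
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]
```

Because the source port is part of the key, two connections from the same client can land on different machines, which is why session stickiness needs a narrower tuple.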
25. Performance – things you will find
• NW
• Machines have BW barrier – which keeps going up
• NW gateways have barrier – 200 Mbps
• Even though internal nw could be GB hookup
• For enterprise scenarios
• Location based pipes to VNETs (use ExpressRoute)
• Use the new regional VNETs to ensure assets are close by
• Use the new SQL image, pre-striped with a storage pool, available for SQL transactional workloads
26. Performance
• Monitor
• Reachability, latency, throughput
• Within app telemetry – Boundary/New Relic/App Insights/Errorception for JS etc.
• Latency
• App-stack monitoring
• OpsInsights or agent-based software – Boundary/SCOM/Datadog etc.
• Perfmon counters, error logs, app logs
• Monitor logs – error/syslog – Logstash is simplest but YMMV
• Collectd/StatsD + favourite collection tool (Flume to x) + visualization (Graphite to x) – identifying issues
• Monitor services
• Request for API based pull of data so that your “app” can have 360 view
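A minimal sketch of emitting in-app telemetry in the plain-text StatsD line format (`name:value|type` over UDP). The metric names, host, and port are assumptions; 8125 is the port StatsD daemons commonly listen on.

```python
import socket


def statsd_line(name, value, metric_type):
    """Format one metric in the plain-text StatsD line protocol."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget a metric over UDP to a StatsD daemon."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    send_metric(statsd_line("app.checkout.latency_ms", 42, "ms"))  # timer
    send_metric(statsd_line("app.checkout.requests", 1, "c"))      # counter
```

UDP keeps the instrumentation cheap: if the collector is down the app does not block, at the price of silently losing those samples.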
27. Save Money
• Issue
• Ran out of budget in days/weeks/months(ran large machine)
• Other side of pay as you go
• You pay even if you do not use but keep services on
• Do custom provisioning and de-provisioning to take care of growth and lag – you need to think through “quiescing”
• Think through excessive disk space usage – you pay by “storage”
• Switch off unused/unwanted vm instances and orphan storage disks
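A worked example of the "other side of pay as you go". The hourly rate is a made-up number, not an Azure price; the point is only the ratio between leaving a machine on and running it when it is needed.

```python
def monthly_cost(hourly_rate, hours_per_day, days_per_month=30):
    """Compute cost for one machine billed per running hour."""
    return hourly_rate * hours_per_day * days_per_month


RATE = 0.50  # hypothetical hourly price for one VM

always_on = monthly_cost(RATE, 24)
office_hours = monthly_cost(RATE, 10, days_per_month=22)

print(f"always on:    {always_on:.2f}")
print(f"office hours: {office_hours:.2f}")
print(f"saved:        {1 - office_hours / always_on:.0%}")
```

Storage for the disks is billed separately and keeps accruing while the machine is off, so this covers only the compute part of the bill.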
28. Exploit Azure to get cost efficiencies
• Exploit Azure
• Don’t just move compute and storage
• It requires rework on part of software
• Can I do without a full-fledged relational DB
• Can I pre-generate reports and store them in low cost storage
• Can I use smaller machines
• Can I start using lower cost services for search/cache/JSON or NoSQL store
• Look at long term (3 year) for ROI
• Azure EA is a great steal (if you have SA, you will sleep a lot better)
• Don't forget your HVAC, real estate, people, rent, provisioning, cost of DR-HA, licensing
• Look at agility and the cost of not having it
• Always get Azure support - it is small price to pay for the peace
30. Rough guide
Area | On-premise | Azure
NW | MPLS, VPN | MPLS, VPN/ExpressRoute, Virtual Network (dynamic advertisements of routes coming)
Storage | SSD/Violin/NAS/SAN/DAS | Local/SSD/Ephemeral/VHDs from storage, availability/rr/geo
Compute | Raw/VM on Hyper-V/VMware/amzn | InMage tool to convert, Azure IaaS or PaaS
CDN | CDN | CDN
LB | F5, custom sw | External load balancer, internal, run your own
Monitoring | SCOM, Nagios or just tail log file | SCOM, New Relic, Boundary, Gomez, Keynote, Nagios, Cacti, Azure metrics (PaaS/IaaS – Linux coming)
Data | Relational – NoSQL | Azure Table/DocDB, SQL Azure/SQL on VM, all other DBs on VM
DW | PDW? | Do not migrate – but fresh approach, BitYota
Ingestion, integration and messaging | BizTalk, MSMQ, Workflow, RabbitMQ, Camel, ZeroMQ | BizTalk as service, Azure Queue, Azure Event Hub, Notification Hub, API mgmt, custom sw
31. Rough guide
Area | On-premise | Azure
CEP | StreamInsight | Streaming Analytics
Batch | Jobs | Azure Automation
Caching | memcache, AppFabric, Redis | Hosted Redis, DocumentDB
Identity | AD | AD, Azure AD (EMS)
RMS | RMS | Azure RMS (EMS)
Management of assets | Intune, System Center | Intune (EMS)
Access to apps on BYOD | EMS | EMS
Backup | Tapes, custom SW | Azure StorSimple, Backup Vault
Monetization/Mobility | - | Azure Mobile Services/API Management
Dynamics/CRM | On-premise | On-Azure or hosted
ES/Solr/Caching | On-premise | Hosted Azure services for Redis/Search/DocDB
32. Services
• Always plan for “moving out”
• Your own datacenter, co-lo
• Applications have some abstraction layer to plug in services
• Storage for example – plug in behind at least an interface to allow “pluggable” storage.
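A minimal sketch of what "pluggable" storage behind an interface can look like. `BlobStore`, the two implementations, and `save_report` are all hypothetical names; a cloud-backed implementation would be one more subclass with the same two methods.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BlobStore(ABC):
    """Hypothetical storage interface the application codes against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class InMemoryStore(BlobStore):
    """Handy for tests; nothing is persisted."""

    def __init__(self):
        self._items = {}

    def put(self, key, data):
        self._items[key] = data

    def get(self, key):
        return self._items[key]


class LocalDiskStore(BlobStore):
    """Stores each key as a file under a root directory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key, data):
        (self.root / key).write_bytes(data)

    def get(self, key):
        return (self.root / key).read_bytes()


def save_report(store: BlobStore, name: str, body: str) -> None:
    """Application code depends only on the interface, not on a provider."""
    store.put(name, body.encode("utf-8"))
```

With this shape, moving out (to your own datacenter or a co-lo) means writing one new implementation rather than touching every caller.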
HA – update – reboot less
Guest agent – only in PaaS
Order guarantee – Nope
PaaS – role - the order and number of instances of a role that get rebooted concurrently during host updates can vary. That’s because the placement of instances on servers can prevent the FC from rebooting the servers on which all instances of a UD are hosted at the same time, or even in UD-order
How much time – depends but - Another difference between host updates and Cloud Service updates is that when the update is to the host, however, the FC must ensure that one instance doesn’t indefinitely stall the forward progress of server updates across the datacenter. The FC therefore allots instances at most five minutes to shut down before proceeding with a reboot of the server into a new host OS and at most fifteen minutes for a role instance to report that it’s healthy from when it restarts. It takes a few minutes to reboot the host, then restart VMs, GAs and finally the role instance code, so an instance is typically offline anywhere between fifteen and thirty minutes depending on how long it and any other instances sharing the server take to shut down, as well as how long it takes to restart.
Azure Storage (queue, blob, table, block)
Thrice replicated
Blocks – geo-replicated
Aspera for file transfer
Backup of VM, data,
Availability of db, app, services.
http://msdn.microsoft.com/en-us/library/windowsazure/ee924680.aspx
Kevin’s blog excerpt as it is
HostOS - The Host OS update can take several days for the fabric to coordinate the upgrades across all of the different hosted services and upgrade domains within a datacenter. It is not uncommon for different instances of your deployment to be updated several hours apart from each other.
Guest OS. Once the Host OS has finished upgrading across the datacenter then the Guest OS will be upgraded for services which are configured to use automatic Guest OS versions and this upgrade will proceed using standard upgrade domain rules for your service. Your VM will be rebooted and the Windows Partition (the D drive) will be reimaged with the upgraded OS. The Guest OS update process is much faster than the Host OS update since the fabric only has to coordinate the update within your hosted service and your upgrade domains. The duration of the Guest OS update process for your service will largely depend on how many instances you have, how many upgrade domains you have, and how long your service takes to shut down (Stopping/OnStop events) and start up (startup tasks and OnStart event).
Guest Agent. The Azure guest agent is updated on a roughly monthly basis. When the guest agent is updated the host process running your role (typically WaWorkerHost or WaWebHost) will be gracefully shutdown, then the guest agent will update itself, then the host process will start again. See http://blogs.msdn.com/b/kwill/archive/2011/05/05/windows-azure-role-architecture.aspx for more information about the guest agent process and how it interacts with your service.
Approximately every month, expect your instances to reboot once for the Host OS update. If you have automatic guest OS updates, expect your instances to reboot again. These reboots are typically several hours apart, but this time frame can change depending on the makeup of different services within a datacenter.
Your role needs to adhere to the rules around host OS updates, in particular instances should reach the Ready state within 15 minutes of starting the Startup tasks. For more information about this limitation see http://msdn.microsoft.com/en-us/library/hh543978.
Your role instances should be able to handle a Reboot, a Reimage, and a Recycle. The Host OS upgrade will cause a Reboot of your instance, and the Guest OS upgrade will cause the equivalent of a Reimage of your instance. See the common issues below for more information.
Disk
Windows – use a storage pool or stripe disks with RAID 0
Linux – use mdadm at minimum
Go for the largest disk at the beginning – do not think “will add disk capacity later” unless you have a disk space monitoring tool
SSD great for Cassandra/elasticsearch/solr
Local disk – os
VHDs – take instance snapshot for non-db workload or where state is clear otherwise not very useful
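Why striping helps: with RAID 0, consecutive chunks of the logical volume rotate across the member disks, so sequential and parallel I/O is spread over all of them. A minimal sketch of the address mapping, assuming a 64 KiB stripe unit (an illustrative value, not a recommendation):

```python
def locate(offset, disks, stripe_size=64 * 1024):
    """Return (disk index, offset on that disk) for a logical byte offset in RAID 0."""
    stripe = offset // stripe_size   # which stripe unit overall
    disk = stripe % disks            # stripe units rotate across the disks
    row = stripe // disks            # full rows of stripe units before this one
    return disk, row * stripe_size + offset % stripe_size
```

For example, with 4 disks the first four 64 KiB chunks land on disks 0 to 3, and the fifth wraps back to disk 0. RAID 0 has no redundancy of its own, so durability relies on the underlying storage.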
Better throughput sw –
Orleans – ak..ka – the actors for .net world
Support - http://azure.microsoft.com/en-in/support/plans/