
Serverless operations for the iRobot fleet



An overview of how we operate and monitor serverless enterprise applications.



  1. Serverless Operations for the iRobot Fleet (2017). Aaron Kammerer, AWS Platform Manager
  2. About iRobot: We are THE robot company
      • Founded in 1990
      • Defense and security: circa 2000
      • Roomba: 2002
      • Roomba 900 = cloud connectivity: 2015
      • Migrated to AWS: 2016
      • Now exclusively focused on consumer robots
  3. About Aaron: He is THE AWS platform manager
      • Founded in 1976
      • IT consulting for 15+ years
      • Hopped over to iRobot in 2015
      • Manage the AWS implementation across iRobot
      • Primary focus on the cloud-connected robot ecosystem
      • Contact me: akammerer@irobot.com
  4. About Operations
      • Embodying good ops:
      • Good situational awareness
      • Ability to navigate dynamic, challenging landscapes with agility
      • Can fix anything with the tools available
      • A steady hand, calm and collected
  5. Our Team: Well, you go to war with the army you have (well, we’re actually not too shabby)
  6. Why serverless on AWS? Outsource servers, OS, and mid-tier applications to the pros, so we can…
      • Build faster
      • POCs, testing, etc. fly
      • Operate leaner
      • Skip the pain of learning to scale
      • Important for a historically hardware-oriented company – we LIKE to build stuff here!
      • Cost saving:
      • Perhaps net-neutral between tightly managed servers and AWS managed services
      • Huge savings in internal operations, development, and monitoring effort
      Serverless increases our agility
  7. Why serverless on AWS? No need to reinvent any wheels. Prime example – AWS IoT
      • Provides Rules Engine, Device Gateway, Certs, Authentication/Auth, Registry, Shadows
      • Tons of infrastructure supporting these features that we rely on AWS to maintain for us
      • Just one of the 25 services we utilize
  8. So that we can focus on our apps:
      • Add photo of missions
  9. iRobot Scale: Currently running and managing lots of stuff!
      • Millions of robots sold per year
      • Not all are connected, but majority soon
      • iRobot Home production application:
      • 100+ Lambda functions
      • 25 AWS services
      • 0 unmanaged EC2 instances
      • Development and internal AWS footprint:
      • ~50 accounts, growing constantly
      • 1000s of Lambda deploys per day
      • Low single digit FTE supporting operations
  10. Luckily, serverless means NoOps, right? Bueller?
  11. DiffOps: No such thing as a free lunch
      • Moving from servers to serverless is a bit like the change from on-prem to cloud
      • It’s easier, in many respects, but it’s not without its own idiosyncratic issues
      • You stand on the shoulders of giants (Tim Wagner is pretty tall), through outsourcing these operations
      • But outsourcing doesn’t mean you do zero work
      • Being clear about this organizationally is important
  12. iRobot stack: Production ecosystem – Deployment
      • Red/black deployment paradigm
      • Proprietary CloudFormation deployments
      • A deployment comprises a complete application stack: API Gateway, Lambda, CloudFront, Kinesis, etc.
      • Data sources are maintained separately and protected from accidental updates
  13. iRobot stack: Production ecosystem – Monitoring
      • Sumo Logic
      • Essential for log sleuthing
      • Get all data associated with an artifact immediately across all accounts
      • Provides quantitative metrics on fleet health
      • Alarms and notifications
      • Of course, we use CloudWatch as well
  14. iRobot stack – multi-account considerations: Bits and pieces
      • ADFS – both our AWS console and command line point of entry
      • Ensures ease of access across environments for developers
      • Removes reliance on long-lived access keys
      • Multi-region backup using Data Pipeline and S3 cross-region replication
      • S3 as a cross-account data messenger, or hub in a hub-and-spoke data sharing model
      • Multi-account/region rollouts of foundational architectures
      • Standardized IAM roles, policies
      • CloudTrail implementation
      • Logging infrastructure (Sumo Logic pumpers, etc.)
  15. iRobot stack – S3 cross-account data transfer: Easily integrating applications
      • S3 has good bucket policy support for cross-account interaction
      • Simply throw data to an accepting bucket on the other account, where it can listen for the object events
      • Primarily for very loosely coupled applications
      • Our CloudTrail data is aggregated into one bucket, then processed by Sumo Logic
      • Have also used a Lambda client/server model for more tightly coupled use cases
      • Central ‘server’ Lambda can be called by ‘client’ Lambdas in other accounts, limiting scope in the ‘server’ account, without requiring APIs, etc.
      (Diagram labels: Account 1, Account 2)
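
The cross-account "accepting bucket" idea on this slide can be sketched as a bucket policy. This is only an illustration, not iRobot's actual policy: the bucket name, the account ID, and the helper `cross_account_put_policy` are hypothetical, and the ACL condition is one common way to make sure the receiving account ends up owning the objects.

```python
import json


def cross_account_put_policy(bucket: str, sender_account_id: str) -> dict:
    """Build a bucket policy that lets another AWS account write into `bucket`.

    Both arguments are placeholders chosen for this sketch.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCrossAccountPut",
                "Effect": "Allow",
                # Trust the whole sending account; its own IAM policies
                # decide which roles inside it may actually write.
                "Principal": {"AWS": f"arn:aws:iam::{sender_account_id}:root"},
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # Only accept uploads that grant the bucket owner full
                # control of the uploaded object.
                "Condition": {
                    "StringEquals": {"s3:x-amz-acl": "bucket-owner-full-control"}
                },
            }
        ],
    }


policy = cross_account_put_policy("example-receiving-bucket", "111122223333")
print(json.dumps(policy, indent=2))
```

The receiving account would attach this policy to its bucket and then subscribe a Lambda function or queue to the bucket's object-created events, which is the "listen for the object events" half of the pattern.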
  16. iRobot stack – multi-account considerations: How to manage all 50+ accounts
      • Use ADFS to run scripts on all accounts
      • Foundational roles, limit checking, support utilization
      • Maintain a data structure of all ADFS and other foundational IAM roles/policies
      • Tracked in source control
      • Can be run idempotently in any account
      • New accounts can be provisioned quickly
      • Roll out standardized logging infrastructure
      • Sumo Logic Lambda infrastructure
      • CloudTrail implementation
      • API Gateway/IoT logging parameters
      • Consolidate billing
      • Then run summation to Sumo Logic via cron’d Lambda, for billing alerts, granular reports, trends
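
One way to read "maintain a data structure of roles, run idempotently in any account" is as a reconcile loop: compare the desired roles with what an account already has and act only on the difference. The sketch below is an assumption about the shape of such a script, not the tooling described in the talk. The role names and the `plan_role_changes` and `apply_plan` helpers are hypothetical, and plain dictionaries stand in for the IAM calls a real script would make.

```python
from typing import Dict, List, Tuple

Roles = Dict[str, dict]  # role name -> role definition (hypothetical shape)


def plan_role_changes(desired: Roles, existing: Roles) -> List[Tuple[str, str]]:
    """Return the (action, role_name) steps needed to match `desired`."""
    plan = []
    for name, spec in sorted(desired.items()):
        if name not in existing:
            plan.append(("create", name))
        elif existing[name] != spec:
            plan.append(("update", name))
    return plan


def apply_plan(plan: List[Tuple[str, str]], desired: Roles, existing: Roles) -> Roles:
    """Simulate applying the plan; a real script would call IAM here."""
    result = dict(existing)
    for _action, name in plan:
        result[name] = desired[name]
    return result


# Desired state as it might be tracked in source control (made-up roles).
desired_roles = {
    "adfs-developer": {"max_session_hours": 8},
    "adfs-readonly": {"max_session_hours": 1},
}
account_roles = {"adfs-readonly": {"max_session_hours": 12}}

first_plan = plan_role_changes(desired_roles, account_roles)
account_roles = apply_plan(first_plan, desired_roles, account_roles)
second_plan = plan_role_changes(desired_roles, account_roles)

print(first_plan)
print(second_plan)
```

Running the plan a second time yields no steps, which is the property that makes it safe to run the same script against every account, including brand-new ones.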
  17. DeveloperOperations can help with visibility
      • Same granularity in the platform as production
      • But orders of magnitude more churn
      • Exercises the account limits
      • Tests metrics to determine relevance and meaning
      • Bonus – developer activity provides additional visibility into how the platform is currently behaving
      • Higher volume of deployments in many different AWS accounts means problems are found quickly
      • This can alert us prior to problems hitting prod
      Developers can be platform testers, canaries, and guinea pigs
  18. The cloud has weather
      • No provider is immune to problems
      • Small effects are more common than big outages
      • More services = blips could be encountered more frequently
      • This comes with the territory
      • Setting expectations organizationally is important
      • Architecting robustly is key: event-based, async, microservices
  19. Reacting to incidents: Errors abound, what do we do?
      • First, do no harm, gather data
      • What is actually impacted? Current transactions or new deployments?
      • Contact AWS Enterprise Support
      • Start the ball rolling toward the service teams if it turns out this has a platform component
      • Additionally consult the big board, as well as the Twitterverse, to gauge whether many customers are affected
      • Start working the diagnosis –
      • Our code or platform?
  20. Reacting to incidents, cont’d: It’s not you, it’s me
      • Dig in:
      • Execute runbooks, consult CloudWatch, Sumo Logic, CWLogs
      • Root cause, etc.
      • From Enterprise Support:
      • Get updates on platform health
      • Gain insights into more opaque aspects of services – hot partitions on DynamoDB, for instance
      • Take direct action when possible –
      • Ex. Kinesis stream iterator age increasing? Re-shard.
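
The "iterator age increasing? Re-shard." rule can be sketched as a small decision function. The one-minute threshold, the shard cap, and the name `target_shard_count` are assumptions made for this illustration and are not figures from the talk. In practice the age would come from the stream's `GetRecords.IteratorAgeMilliseconds` CloudWatch metric, and the result could be passed to the Kinesis `UpdateShardCount` API.

```python
def target_shard_count(
    current_shards: int,
    iterator_age_ms: float,
    max_age_ms: float = 60_000,  # assumed threshold: consumers lag by 1 minute
    max_shards: int = 64,  # assumed cap for this sketch
) -> int:
    """Suggest a shard count given how far behind the consumers are."""
    if iterator_age_ms <= max_age_ms:
        # Consumers are keeping up; leave the stream alone.
        return current_shards
    # Double capacity, stay under the cap, and never suggest shrinking.
    return max(current_shards, min(current_shards * 2, max_shards))


print(target_shard_count(4, 10_000))
print(target_shard_count(4, 120_000))
print(target_shard_count(48, 120_000))
```

Doubling is a deliberately conservative step here. A runbook would normally also check account shard limits before scaling, which is where the limit checking mentioned on earlier slides comes in.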
  21. Reacting to platform outages: When it’s a cloud provider problem
      • Serverless requires a change in mindset
      • These incidents can be opaque
      • Feeling out of control of your own destiny can be frustrating
      • But the truth: you’d probably not do a better job
      • And in fact, you would likely do a lot worse
      • And actions still need to be taken:
      • Alert management to potential impact
      • Proactively reach out to customer base
      • Activate cross-region failover, etc.
      When it’s the platform’s problem, we still have work to do
  22. Visibility
      • Biggest operational downside: visibility
      • You only know what the provider tells you
      • Architecture
      • Security
      • Operations
      • How do they actually do all of the stuff they do?
      • Many known unknowns and unknown unknowns
      • Unknown unknown unknowns: what you don’t know that they don’t know they don’t know
  23. Visibility: Metrics are our portal. Example – AWS IoT
      • AWS IoT today has 30+ metrics
      • At launch, it had <10
      • Without throttling metrics, thing shadow updates, or web socket metrics, it was hard to debug issues
      • Especially early on with small numbers of robots
      • Can I connect? How many publishes?
      • Load scale, are we over our limits?
      More is better
  24. Visibility: AWS Enterprise Support
      • Enterprise Support has been a valuable resource
      • They are our eyes and ears within AWS
      • Engage with them to run load tests, understand account limits
      • Our AWS Support team has made the effort to understand our technology choices
      • All of our AWS users, company-wide, benefit from being able to create tickets
      AWS Enterprise Support, thumbs up!
  25. The future of improved AWS visibility: Looking toward the horizon
      • Personal Health Dashboard
      • When performance is degraded, status is important for ops to show evidence that it isn’t a problem with our software
      • Per-account service health means AWS can update those affected customers more directly
      • Metrics, metrics, metrics
      • Service teams are always on the lookout for which new metrics to include – connect with them and share your requests!
      • Kinesis shard-level metrics and Lambda iterator ages were all added with user input and make a real difference in understanding system performance
  26. So, is serverless worth it?
      • Absolutely
      • Without serverless in general and AWS in particular, iRobot would not have been able to build and run a scalable, low-cost production cloud application as efficiently as we have today
      Serverless is manageable and it works for us
  27. Questions?
