My name is Nick Turner and I’m the Director of IT Operations at Zenoss.
Feel free to follow me on Twitter to converse during or after the conference, I’m @nickclarkturner
Today we’ll be talking about Migrating IT to the Cloud, specifically Zenoss’ experience partnering with Amazon Web Services.
Just to be clear, my usage of “the cloud” refers to IaaS providers like AWS, Azure, and Google Cloud, but this presentation is primarily focused on AWS terminology and concepts.
Today the cloud is almost 10 years old: Amazon EC2 was introduced in August of 2006, so the concept and benefits of cloud computing are no longer new. Azure is becoming more competitive and Google Cloud has recently entered the market. However, there are still companies out there sitting on the sidelines and choosing to avoid the cloud.
The question that companies need to ask themselves is: does it make sense to move X to the cloud? X being their total infrastructure, a specific workload, an initiative, a product, departmental infrastructure, etc…
And some major driving questions are: Is running items with variable use in my own infrastructure the most efficient use of resources? What happens when those resources need to scale, temporarily or permanently? And is my infrastructure as resilient, ubiquitous, secure, and compliant as I would like?
The cloud provides unprecedented scalability of compute and storage across different availability zones (clusters of datacenters) and across different regions throughout the globe, with integrated, scalable networking and supporting systems for managing networks, name resolution, load balancing, storage tiers, recovery, and automation.
Of course, while some cloud redundancy is included, some is a-la-carte and has to be architected into your cloud solution. You can still put all of your eggs in one basket in the cloud if you don’t architect in the use of multiple availability zones, cross-region replication of data, or elastic load balancing, to name a few.
Additionally, the international footprint, or the availability of services in international regions, may be a hindrance to global expansion, especially if you want the same ability to dynamically scale internationally as it makes sense to do so.
Finally, does your datacenter meet regulatory and compliance needs? What Tier is your datacenter, meaning how redundant is its infrastructure?
The most common objections to cloud adoption I encounter have to do with a lack of cloud experience or with financial concerns.
In regards to lack of experience… a lot of the concepts that exist with physical datacenters are transferable to their cloud counterparts.
Additionally, the same is true from one cloud provider to another: different terminology, but essentially the same meaning. Getting up to speed is a small up-front investment hurdle to overcome. Otherwise, you can always pay others to manage your AWS for you. Partners such as...
You may also have a false sense of security in your existing operations while carrying far more risk than you realize, with a few key individuals holding the keys to the kingdom. What happens when key individuals leave? What happens if you buy a new company or your company is bought? How is your...
When it comes to assessing financial feasibility, it really comes down to the needs of the business. Is there available capital? Does the company care more about EBITDA today than the long-term viability of the company? Is the time value of money important, and do you want to have your cloud cake and eat it too?
A great strategy is to level the playing field as much as possible, if both options are on the table and feasible, with analyses like a discounted cash flow, for example.
If you go the CapEx route, will the realities of reaching max capacity or equipment end of life survive 5 years of financial and budget plans?
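To make the discounted cash flow idea concrete, here is a minimal sketch of comparing a CapEx plan against a pay-as-you-go cloud plan by net present value. All dollar figures and the discount rate are invented for illustration, not real Zenoss or AWS numbers:

```python
def npv(cash_flows, discount_rate):
    """Net present value of yearly cash flows (year 0 first)."""
    return sum(cf / (1 + discount_rate) ** year
               for year, cf in enumerate(cash_flows))

# Hypothetical 5-year plans: CapEx pays heavily up front, cloud pays as it goes.
capex_plan = [-500_000, -50_000, -50_000, -50_000, -50_000]      # purchase + maintenance
cloud_plan = [-140_000, -140_000, -140_000, -140_000, -140_000]  # yearly cloud spend

rate = 0.08  # assumed cost of capital
print(f"CapEx NPV: {npv(capex_plan, rate):,.0f}")
print(f"Cloud NPV: {npv(cloud_plan, rate):,.0f}")
```

With numbers like these, the time value of money narrows the apparent gap between the two options, which is exactly the point of leveling the playing field before deciding.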
<Experience with company acquisition...> Expesite vs. 360Facility
As each department hit an inflection point where the feasibility and benefits of cloud migration became apparent, it was moved.
We went with a model of isolating each departmental use case in its own account for logical segmentation, a contained “blast radius”, and architectural freedom. Even with different accounts we could still have consolidated billing, with the segmentation helping departmental chargebacks.
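As a small illustration of the chargeback side of this, consolidated billing plus per-department accounts (or tags) lets you roll billing line items up by department. The data shape here is a simplified, hypothetical stand-in for a billing export:

```python
from collections import defaultdict

def chargeback_by_department(line_items):
    """Roll up (account_id, department, cost) billing line items
    into per-department totals for a chargeback report."""
    totals = defaultdict(float)
    for account_id, department, cost in line_items:
        totals[department] += cost
    return dict(totals)

# Hypothetical line items from a consolidated bill:
items = [("111111", "engineering", 1200.0),
         ("222222", "engineering", 300.0),
         ("333333", "sales", 450.0)]
print(chargeback_by_department(items))
```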
However, not enough work was done up front to centralize tool utilization or to adjust as our product architecture incorporated Docker and Control Center, so depending on the moment, and the team performing our cloud adoption, we are using a cocktail of different solutions that were compatible with our supported OS. Tools such as…
When architecting each solution the level of automation varied based on the needs of the environment…
Of course, in addition to using the cloud to host our ZaaS infrastructure we are able to use Zenoss to monitor the health of that infrastructure.
Which, in my obviously biased opinion, is the best reason to choose ZaaS. We have complete visibility into the health of the infrastructure, the application, and the ZenPacks. There is uniformity in environment deployments, which speeds up mean time to resolution since we can leverage comparative analysis instead of wasting time blaming the infrastructure. And little mistakes that could lead to hours or days of downtime can be resolved quickly by our team of experts, who can interface directly with the engineers.
Here are some NOC dashboards we use to display global deployed collectors, and track critical events for our AWS or distributed collector infrastructure.
Other graphs give us comparative visibility into CPU, RAM, and Disk health on all customer environments.
We rely on a few ZenPacks to monitor the health of our Zenoss environments and they are the following…
When it comes to deploying Zenoss in AWS, choosing the right instance type is important for performance. With 4.x it was a matter of aligning instance resources with the number of managed resources being monitored.
With 5.x, on the other hand, it is a matter of assigning instance resources to resource pools. Even with the architectural change, I’ve seen that the costs of running 4.x ZSD and 5.x ZSD are very similar.
I’ve listed some recommendations on what we’ve had success selecting from AWS in the past…
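The 4.x-style sizing logic can be sketched as a simple tiered lookup from managed-resource count to EC2 instance type. The thresholds and instance choices below are illustrative placeholders, not official Zenoss sizing guidance:

```python
def suggest_instance(managed_resources):
    """Pick an EC2 instance type by the number of managed resources
    being monitored. Tier boundaries here are hypothetical."""
    tiers = [
        (250,  "m4.xlarge"),
        (1000, "m4.2xlarge"),
        (5000, "m4.4xlarge"),
    ]
    for limit, instance_type in tiers:
        if managed_resources <= limit:
            return instance_type
    # Beyond the largest tier, consider distributed collectors instead
    # of simply scaling the instance up.
    return "m4.10xlarge"

print(suggest_instance(600))
```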
Some customers on 5.x are choosing to offload HBase like we’d previously offloaded MySQL to RDS, which I’ll touch on a bit more later.
For an easy tool on comparing EC2 instance types, I recommend…
Additionally to simplify AWS deployment for customers, Zenoss is currently working on deploying an AMI to the marketplace to automate a lot of the process for standing up Zenoss successfully in the cloud. For more details please reach out to me and I can put you in touch with our Product Manager driving that initiative.
When it comes to determining how many cloud service accounts to establish, my recommendation is the Goldilocks approach: not too big or too small, but somewhere in the middle.
Maybe base it on product, platform, initiative, or department.
- When accounts are scoped too small, you can have too many cloud accounts to manage, which may push the bounds of what cloud governance tools can accommodate, and you will likely have automation issues if you want to dynamically create accounts and tie them to other account networks or billing.
- When accounts are scoped too big, you might be putting items that don’t require lockdown, or are severely inhibited by it, alongside systems that need to be locked down. You will likely start to hit your head against account soft limits on the number of running instances, storage buckets, networks, etc… While those can often be lifted easily, you could potentially hit the upper bounds of what is allowed. You might also be putting all your eggs in one basket and have reliability issues by being bound to a single AZ or Region.
- If you can’t decide on your own, cloud providers will generally be happy to assist you with the architecture that makes the most sense for you, and will try to steer you toward something more manageable if possible.
Your migration strategy can determine the success or failure of cloud adoption. Planning a mass exodus or migration can overly complicate cloud adoption, scare away the risk-averse, or cause analysis paralysis that prevents the initiative from ever moving past the design phase. Similarly, if cloud adoption is put on hold until product or platform re-architecture can occur, it may be shelved indefinitely due to the complexity or level of investment required. Meanwhile, existing architectures could get a performance boost simply from running as architected in a cloud environment.
A major challenge with the lift-and-shift model, as mentioned earlier, is that effort can be duplicated, and processes used by disparate teams can vary wildly or become antiquated as technology evolves. One team may choose to use Chef and Ruby, and another comes along and chooses Ansible and Python. Centralizing these efforts by team makes sense for knowledge retention and for tool or process standardization. Conversely to the all-at-once strategy, some efficiencies discovered while analyzing the migration of one workload might not get applied to a workload that was previously migrated.
If there is willingness and capability to re-architect to be cloud-optimized via microservices or other best practices, then avoid lift and shift if possible.
Some challenges we’ve experienced running Zenoss in AWS need to be thought through when planning out your deployment.
- Networking…
- Security Groups/DNS/Traffic flow…
Offloading processes for greater availability, historical record keeping, or managing multiple environments.
Do you need to manage changes by running them through a test environment prior to hitting a production one? Should those environments be in different AZs? Different Regions?
Make sure you allocate enough resources to allow for performance growth (CPU, RAM, disk I/O) as well as storage consumption. Isolate backup processes on their own volume.
So why didn’t we choose Azure? Well, mostly it was timing, but in the last year, when we began testing with Azure and a customer chose Azure to deploy on, they ran into stability challenges. On paper the environment looked like it was architected with more than enough resources, but the distributed infrastructure of Azure was not as seamless as advertised, and occasional breakdowns in communication would cause havoc with our application. These were issues we’d never experienced with higher or lower classes of service on AWS.
Cloud administration is not simple or easy and at scale can become unworkable if you don’t plan in advance.
Some considerations for cost control tool selection really center around the concept of purchasing reserved instances…
- Complexity…
- What if your users switch from one instance type to another? One AZ to another? Sometimes there are challenges migrating from one instance type to another when factoring in changes like PV to HVM.
Different Accounts/AZs
What RI purchase cocktail makes the most sense based on my usage and my available budget? What if you choose poorly for a certain AZ or a certain instance type? The marketplace. How am I performing against my purchase once I make one?
How much assistance do they provide in making this decision, in getting visibility into the health of that decision, or in helping you manage changes to your decision after the fact?
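One simple way to sanity-check an RI purchase is a break-even calculation: how many months of steady usage before the upfront payment pays for itself? The prices below are made up for illustration, not actual AWS rates:

```python
def ri_breakeven_months(on_demand_hourly, ri_upfront, ri_hourly,
                        hours_per_month=730):
    """Months of steady 24/7 usage before a reserved instance
    becomes cheaper than staying on-demand."""
    monthly_savings = (on_demand_hourly - ri_hourly) * hours_per_month
    if monthly_savings <= 0:
        return float("inf")  # the RI never pays off at this usage
    return ri_upfront / monthly_savings

# Hypothetical pricing for one instance type:
months = ri_breakeven_months(on_demand_hourly=0.20, ri_upfront=800, ri_hourly=0.08)
print(f"Break-even after about {months:.1f} months")
```

If the break-even point lands well inside the RI term for an instance you are confident will stay running, the purchase is probably safe; if it lands near the end of the term, the AZ or instance-type risk mentioned above starts to matter.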
The best thing you can do for your environments is to tag everything you can as part of your automation. Cloud governance tools are so much more powerful and insightful when tags help them reduce noise, add clarity to reporting, and guide any business decisions that have to be made.
Additionally, automate stopping instances that are not in use, and removing storage, snapshots, etc… that are no longer in use.
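The detection half of that cleanup automation can be kept as pure logic so it is easy to test before wiring it up to real API calls. Here is a sketch that flags old, unattached volumes for review; the dict shape is a simplified, hypothetical subset of what a `describe_volumes`-style call returns:

```python
from datetime import datetime, timedelta, timezone

def unused_volume_candidates(volumes, min_age_days=30):
    """Return IDs of volumes with no attachments that are older than
    the age threshold -- candidates for review and removal, not for
    automatic deletion."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=min_age_days)
    return [v["VolumeId"] for v in volumes
            if not v.get("Attachments") and v["CreateTime"] < cutoff]
```

Feeding this list into a report (rather than deleting immediately) keeps a human in the loop for anything the tagging discipline above did not catch.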
After extensive evaluation, we found Cloud Health Tech to be a superior offering: far more capable, and showing more promise, than the others.
Reserved Instance recommendations have a budget-based modifier, and the system proactively sends reports on the health and utilization of previous RI purchases.
The reporting feature is very strong, giving you the capacity to automate reports and custom-configure them to clear out the noise that Reserved Instances can introduce when reviewing your data.
Such as…
Not only does it provide historical data, it can view that data from different perspectives of use, be it by service or by custom-tagged perspectives like environment, customer, and department.
The consistent value it provides is around automated RI modifications and health checks.
Etc… on slide.