1) Machine learning is a branch of artificial intelligence that allows systems to learn and improve automatically from experience without being explicitly programmed. It can play a significant role in improving IT operations through incident management, root cause analysis, and avoiding future problems.
2) Most enterprises have begun introducing machine learning and AI to automate aspects of IT operations. Over 80% of businesses view AI as a strategic priority and over 60% see it as a way to reduce costs. While humans currently handle most critical operations, an AI-enabled future is possible with machines playing a larger role and humans in a supporting function.
3) For AI to be effective in IT operations, enterprises must focus on data management including what data to collect,
Machine Learning in IT Operations - Sampath Manickam
Machine learning is a branch of Artificial Intelligence (AI) that gives systems the ability to learn and
improve automatically from experience without being explicitly programmed. It can play a significant
role in improving IT operations through incident management, root cause analysis, run-book automation,
and avoidance of future problems, all in order to maintain the highest IT service availability for
end customers.
Many enterprises have begun introducing machine learning and artificial intelligence platforms and
automation as part of their IT operations journey. According to a study by the Boston Consulting
Group and MIT Sloan Management Review, 83% of businesses say AI is a strategic priority today.
Additionally, 63% of businesses say pressure to reduce costs will require them to use AI. While
humans currently hold significant responsibility for critical operations, an AI-enabled future is
possible in which machines play a more critical role and humans support them. Humans will be
empowered to use a system at scale, leaving the autonomous system to handle routine IT operations.
In the context of this article, artificial intelligence can be defined as the use of Big Data
analytics, machine learning, and other artificial intelligence technologies to automate daily IT
operations. Such an autonomous system will require us to create safety nets in case of incidents.
It also helps to monitor, correlate, and gain deep insight into the data and problems on which the
system has been tuned (machine-learned) over time, helping to identify, resolve, and prevent the
issues that come up.
Machine Learning in IT Operations
Machine learning, a subset of artificial intelligence, comprises analytics and algorithms that use
sample data to make predictions or decisions without being explicitly programmed. In IT operations it
can automate various tasks, including event correlation to arrive at root cause analysis; analysis of
tickets, alerts, and change execution; validation of planned changes against actual changes; and
correlation with received logs, alerts, present and past events, and history from multiple sources
within IT systems and tools.
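
As an illustration of the event correlation mentioned above, here is a minimal sketch in Python. It is not
taken from any particular AIOps product. It assumes that events arriving close together in time are related
and that the earliest event in a group is a reasonable first candidate for the root cause. The Event fields,
the function names, and the 60-second window are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    timestamp: float  # seconds since some reference point
    source: str       # hypothetical host or tool name
    message: str


def correlate_by_time(events: List[Event], window_seconds: float = 60.0) -> List[List[Event]]:
    """Group events so that each one is within window_seconds of the previous event in its group."""
    groups: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if groups and event.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


def root_cause_candidate(group: List[Event]) -> Event:
    """Assumption for this sketch: the earliest event in a group is the first place to look."""
    return group[0]


# Hypothetical sample data: three related alerts and one unrelated alert much later.
sample_events = [
    Event(150.0, "web01", "HTTP 500 rate high"),
    Event(100.0, "db01", "disk latency high"),
    Event(130.0, "app01", "query timeout"),
    Event(900.0, "web02", "certificate expiring"),
]
sample_groups = correlate_by_time(sample_events)
```

In practice, time proximity alone is a weak signal; a real platform would likely also use topology,
change records, and historical patterns, as the paragraph above describes.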
The concerted use of ML in IT operations is still at a nascent stage and has yet to mature. However,
many large enterprises and startups are taking steps on this journey. Gartner predicts that large
enterprises' exclusive use of AIOps and digital experience monitoring tools to monitor applications
and infrastructure will rise from 5% in 2018 to 30% in 2023. End-to-end automation that predicts
problems and takes corrective action as part of day-to-day IT operations may be years in the making,
and the methods will vary for each organization and industry.
AI and ML are only as good as the data made available to the platform. Hence, one of the biggest
challenges for an enterprise is data management: what type of data to collect, where to collect it,
whether to process it in real time or in batches, where to store it, how to establish basic
relationships between the collected data sources, and how engineers feed the right information at the
initial stage to tune the system as part of the machine learning exercise. Because the data is
unstructured to varying degrees, the correlations are not obvious. This is a natural task for a data
science or data engineering team: create rules between different data sources, and determine how to
correlate or group them and when it makes sense to do so. It requires enterprises to put great effort
into enterprise data governance and into maintaining and managing the complete platform, including
the huge amount of performance and other data it produces.
Next comes choosing the right machine learning (ML) algorithms as part of creating the automation
platform. These algorithms serve as the baseline for the ML behavior needed to achieve the desired
business goals and meet objectives in an automated way. Once the algorithms have been tuned on sample
data over time and the system knows how to deliver results, we can work out what needs to be
automated; that is, the machine learns by itself and performs as designed. ML makes use of all
available data sources, aggregating and organizing the output data. Each data set can be collected,
formatted, and cleaned for relevant information, with noise and unnecessary data reduced, to find
trends, patterns, and problems.
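
The cleaning and noise-reduction step can be sketched as follows. This is only a minimal illustration, assuming
that alert messages which differ only in case, spacing, or numeric values describe the same underlying problem.
The function names, the "<n>" placeholder, and the sample messages are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def normalize(message: str) -> str:
    """Lower-case the message, replace digit runs with a placeholder, and collapse whitespace."""
    message = message.lower().strip()
    message = re.sub(r"\d+", "<n>", message)
    return re.sub(r"\s+", " ", message)


def summarize(messages: Iterable[str]) -> List[Tuple[str, int]]:
    """Drop blank messages and count the remaining ones by normalized form, most frequent first."""
    counts = Counter(normalize(m) for m in messages if m.strip())
    return counts.most_common()


# Hypothetical raw alerts from several monitoring tools.
raw_alerts = [
    "Disk usage 91% on vol1",
    "disk usage  95% on vol1 ",
    "",
    "Login failed for user 42",
]
alert_summary = summarize(raw_alerts)
```

Collapsing near-duplicates this way reduces the volume a downstream model or engineer has to look at,
at the cost of hiding the specific values, so a real pipeline would usually keep the raw records as well.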
With ML, IT operations become more proactive than reactive, automatically anticipating, identifying,
and resolving in real time issues that a human might not have detected across multiple systems,
dashboards, and metrics.
AI & ML Capabilities
A more proactive approach helps to detect issues at an early stage and makes root cause analysis
faster and easier. Even if the data set is vast, AI can quickly get an overview and detect the
relationships between events and issues, which allows for faster troubleshooting. This is especially
useful for security, as AI can monitor for and detect unusual processes or activities, then
prioritize and address possible malware. Not only do the algorithms flag unusual activity faster,
they also help to detect system capacity issues, predict system failures, and so on. When properly
implemented, AI frees IT operations staff from routine tasks and processes, allowing them to focus on
more complex tasks.
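
One simple way to flag unusual activity is to compare a new measurement against a baseline built from recent
history. The sketch below uses a z-score test, which is a basic statistical technique and only one of many
possible approaches; it is not presented as what any specific AIOps tool does. The function name, the
threshold of 3.0, and the sample CPU readings are assumptions made for this example.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], value: float, threshold: float = 3.0) -> bool:
    """Return True if value lies more than threshold sample standard deviations from the history mean."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # flat baseline: any deviation counts as unusual
    return abs(value - mu) / sigma > threshold


# Hypothetical CPU utilisation readings (percent) forming the baseline.
cpu_history = [50, 52, 49, 51, 50, 48, 51, 50, 49]
spike_flagged = is_anomalous(cpu_history, 95)
normal_flagged = is_anomalous(cpu_history, 51)
```

A z-score assumes the metric is roughly stable; metrics with daily or weekly cycles would need a baseline
that accounts for seasonality.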
AI and ML can automate the management of IT infrastructure by scaling to forecasted demand and
anticipating requirements for storage, memory, and processing power based on historical data. By
mapping the workload, the AI is able to recommend the right configuration and improve agility,
productivity, and efficiency. An additional benefit is insight into the IT environment, along with
streamlined communication between teams and business units.
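
Anticipating requirements from historical data can be illustrated with a very small forecast. The sketch below
fits a straight line to past usage by least squares and estimates how many periods remain before a capacity
limit is reached. A linear trend is an assumption chosen for simplicity; real demand may be seasonal or
bursty. The function names and sample figures are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_trend(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line through (0, values[0]), (1, values[1]), ...; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def periods_until_full(values: Sequence[float], capacity: float) -> Optional[float]:
    """Periods after the last observation until the trend line reaches capacity, or None if usage is not growing."""
    slope, intercept = fit_trend(values)
    if slope <= 0:
        return None
    x_full = (capacity - intercept) / slope
    return max(0.0, x_full - (len(values) - 1))


# Hypothetical storage usage in GB, one reading per week, against a 200 GB volume.
weekly_usage_gb = [100, 110, 120, 130, 140]
weeks_left = periods_until_full(weekly_usage_gb, capacity=200)
```

An estimate like this could feed a recommendation to resize a volume ahead of time rather than after an alert fires.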
Conclusion
As the world continues its digital transformation, operations skills will continue to be needed, but
team sizes will shrink even as scale grows. Companies are adopting these techniques and technologies
to stay competitive, cost-effective, and efficient. Managing large distributed systems with smaller
teams will make organizations much more efficient. An organization can optimize its platforms with
the right workload sizes and as little user intervention as possible. Instead of having to manage a
crisis, humans can play a supervisory role and leave the AI to determine the course of action
required based on the supporting data and metrics. With many such products in the industry, ever more
innovation is taking place to integrate artificial intelligence and machine learning platforms with
existing IT operations tools, and the whole IT industry is being transformed towards autonomous
systems that provide seamless IT operations.