Practical Machine Learning: discerning differences and selecting the best approach
TABLE OF CONTENTS
Executive summary
Introduction
Concepts
Process and Practicalities
Accessible to Data Scientists & Business Users
Accessible to Developers & BI/DW Professionals
Key Takeaways
References and Resources
Table of Abbreviations
About Lynn Langit
About Mark Tabladillo
Executive summary
The formal definition of Machine Learning is this: the ability of computing systems to gain
knowledge from experience. Practical ML enables your organization to answer business
questions more effectively because of that experience. Machine Learning solutions consist of
your input data built into models which combine that data with statistical and data mining
algorithms.
Until relatively recently, applied ML (as contrasted with ML for research) was simply too
specialized, difficult and expensive to have broad adoption outside of the academic community
and a few commercial domains (finance, ad serving). However, improvements in languages and
libraries, as well as new commercial offerings (including cloud-only products), have greatly
increased the practicality of implementing ML applications. Demand has also been fueled by Big
Data: more data encourages more powerful methods of processing to gain understanding from
that data.
This report will discuss technologies and implementation approaches for creating enterprise
data solutions that include one or more machine learning components. The report will also detail
the tradeoffs of each solution and determine which approach best fits organizational needs.
Introduction
The term ‘Predictive Analytics’ is used somewhat interchangeably with Machine Learning. The
central idea is that Machine Learning enables the creation of important business insights by
analyzing a set of input data with one or more data mining or statistical algorithms.
Where Machine Learning is used
In some sectors, particularly academic research, statistical analysis and data mining have been
standard analytical techniques for years. These sectors tend to use open source languages, tools
and libraries. Academics commonly use specialty coding languages such as R, or Python libraries
(SciPy/NumPy/Pandas), rather than enterprise languages such as Java, for their ML research
projects. Researchers also tend to work with wide (many attributes) and shallow (relatively small
sample sizes) datasets. This academic dataset size is significant because many of the commonly
used tools, such as RStudio or even Weka, are designed for small (albeit rich) datasets and they
are limited to working with datasets that can fit in the memory of an analyst’s desktop computer
rather than requiring server or even cloud-scale processing power.
In a few commercial sectors, such as finance (for example, credit scoring) and security (for
example, email spam detection), use of ML (via data mining) is not a new approach. In these
areas, highly specialized tools and specially trained professionals have supported these types of
solutions. These vertical-specific ML solution development cycles cost hundreds of
thousands or even millions of dollars to implement. These costs include software licenses,
powerful hardware, proprietary development and management tools and consulting fees. These
types of projects have also commonly taken months or even years to implement.
However, the ML market landscape is rapidly changing with the availability of Big Data/cloud
storage, processing and data pipelines. These new services enable faster and cheaper data
collection, storage and processing. The growth of IoT (mostly sensor) data is also increasing the
volumes of available data for analysis. These market changes are making the overall ‘entry point’
for ML projects less risky, i.e. cheaper and faster. Another driver of adoption is the effort that
commercial vendors are putting into creating usable ML tooling, most of which runs on that
particular vendor’s cloud infrastructure (such as IBM Watson on Bluemix, Microsoft Azure ML
on Azure or Amazon ML). ML projects are increasingly seen as a realistic possibility given the
larger market landscape. Simply put, more data means a need for more powerful methods of
deriving meaning from the increasingly large and complex datasets. Enter the
democratization of Machine Learning.
Challenges to Adoption
Although tools are reducing the complexity of applying the power of statistical and data mining
techniques to increasingly larger data sets, the enterprise market is in the early stages of ML
adoption. One of the key blockers is complexity -- creating useful predictive analytics or ML
differs substantially from the more traditional business analytics.
Because the demand for technical professionals skilled in applied statistics and
data mining had traditionally been small, we are faced with a lack of trained, working
professionals who can produce useful results in this area. Specifically, we lack those who have
experience performing the tasks needed in the enterprise ML solution lifecycle, such as
cleaning and grooming the input data, selecting appropriate techniques and algorithms, building and
evaluating models and supporting the move of their work to production.
Vendors are stepping in to reduce this gap. Several major commercial vendors have launched
general-purpose machine learning suites this year. As mentioned, the majority of these new
offerings are cloud-based. Some solutions offer you the ability to train, test and deploy either in the
cloud or on premises, while other solutions, such as BigML, are cloud-only.
Concepts
Taxonomies and terms for Machine Learning solutions have important and nuanced differences
in meaning. Proper understanding is key to differentiating the products and solutions available in
the ML space. We’ll start by providing definitions of associated technologies.
What is the difference between business analytics and predictive analytics?
Business Analytics is defined as finding answers to business questions by querying data and
producing a definite result or result set. For example: “What are the top five items that are
found in a shopping basket for a 38 year old man from California who is shopping on a
Saturday at 5pm at a major grocery chain?” The answer to this question (via a query to source
data) produces a deterministic result set, usually shown as a report or a dashboard. For many
organizations, this is the only type of analytics that they have available. Stated differently,
business analytics are used to analyze “what has happened” for past events.
Predictive Analytics is defined as finding answers to business questions by applying one or
more probabilistic algorithms to some set of input data and producing one or more
probabilistic results. For example: “Consider the items which appear together in the
shopping baskets of all 38 year old men from California who are shopping on a Saturday at
5pm at any of the major grocery chain stores for which we have data and predict how many of
a given item from this set the stores should have on hand to ensure proper supply for this type
of customer.” In this case, the type of algorithm is regression because it is used to predict a
future value or set of values. To get a result, one or more regression algorithms (for example,
linear regression) are applied to the source data. Because the results are probabilistic, i.e. a
percentage or score of likelihood of a result, it is common to use more than one
algorithm and then to evaluate the quality of each result. This process is called ‘evaluating the
model.’ The best result from the models is selected and is presented either via statistical output
(probability) or via a customized visualization. Stated differently, predictive analytics are used to
analyze “what will happen” for potential or future events. The graphic below illustrates and
contrasts sample results in business and predictive analytics.
Figure 1 - Two Types of Analytics
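The contrast between the two types of analytics can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sales figures, the item names and the helper functions top_items, fit_line and forecast_units are hypothetical and do not come from any product discussed in this report. The first function answers a “what has happened” question with a deterministic query; the second fits a least-squares line (a basic form of linear regression) to estimate a future value.

```python
from collections import Counter

# Hypothetical purchase history: (week, item, units sold to the customer segment).
sales = [
    (1, "coffee", 40), (1, "bread", 55), (1, "milk", 70),
    (2, "coffee", 44), (2, "bread", 52), (2, "milk", 75),
    (3, "coffee", 47), (3, "bread", 58), (3, "milk", 79),
    (4, "coffee", 52), (4, "bread", 54), (4, "milk", 86),
]


def top_items(rows, n):
    """Business analytics: a deterministic query over what has already happened."""
    totals = Counter()
    for _week, item, units in rows:
        totals[item] += units
    return [item for item, _units in totals.most_common(n)]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_units(rows, item, week):
    """Predictive analytics: estimate a future value from past observations."""
    points = [(w, u) for w, i, u in rows if i == item]
    slope, intercept = fit_line([w for w, _ in points], [u for _, u in points])
    return slope * week + intercept


print(top_items(sales, 2))                 # the same answer on every run
print(forecast_units(sales, "coffee", 5))  # an estimate for a week not yet observed
```

The query result is exact for the data at hand, while the forecast is only as good as the fitted model, which is why the text stresses evaluating the model before relying on it.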
What is the difference between data mining and predictive analytics?
Data Mining encompasses a broader set of tasks than predictive analytics. In
addition to regression algorithms, data mining also includes other types of predictive analysis.
Specifically, finding groupings in the source data by matching new data to existing labeled (or
categorized) data is called classification. Classification algorithms are characterized
as ‘supervised’ because, in addition to the algorithm, an authoritative set of labeled data
is used to process the input data. For example: “In a set of data
there are examples of pictures or drawings of objects that we’ve identified and labeled as
particular animals, i.e. ‘this is a picture of a dog and that is a picture of a cat.’” A classification
task is to evaluate the likelihood of a new picture being a dog or a cat based on pattern matching
to the set of known states. An example of a classification algorithm is decision trees. Of note is
that regression is also ‘supervised’ because a data set with ‘known values’ is used in conjunction
with the application of the regression algorithm when evaluating the probability of a result using
new input data.
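To make the idea of supervised classification concrete, here is a minimal sketch of a one-level decision tree (often called a decision stump), the simplest member of the decision tree family mentioned above. The animal measurements, the labels and the function names train_stump and classify are assumptions made up for illustration; production tools build much deeper trees and handle many more cases.

```python
def train_stump(samples, labels):
    """Find the (feature, threshold) split that misclassifies the fewest training samples."""
    best = None
    for feature in range(len(samples[0])):
        for threshold in sorted({s[feature] for s in samples}):
            left = [lab for s, lab in zip(samples, labels) if s[feature] <= threshold]
            right = [lab for s, lab in zip(samples, labels) if s[feature] > threshold]
            if not left or not right:
                continue  # a split must put samples on both sides
            left_label = max(set(left), key=left.count)     # majority label on each side
            right_label = max(set(right), key=right.count)
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    _errors, feature, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def classify(stump, sample):
    """Apply the learned split to a new, unlabeled sample."""
    feature, threshold, left_label, right_label = stump
    return left_label if sample[feature] <= threshold else right_label


# Hypothetical labeled examples: (weight in kg, ear length in cm).
animals = [(4.0, 6.5), (3.5, 7.0), (5.0, 6.0), (20.0, 12.0), (30.0, 10.0), (8.0, 9.5)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

stump = train_stump(animals, labels)
print(classify(stump, (4.2, 6.8)))
print(classify(stump, (25.0, 11.0)))
```

The labeled list plays the role of the ‘authoritative set of data’: without it, the algorithm would have nothing to match new samples against.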
Discovering natural groupings in source data for which there are no known states or labels is
called clustering. Since there are no known states when clustering algorithms are used, this
type of machine learning is called ‘unsupervised’. An example of this technique is ‘here are some
pictures, group them into subsets based on characteristics (or labels) that are discovered
during the process of running the algorithm.’ As with the other types of ML, when
implementing clustering it is common to use multiple clustering algorithms, such as k-means,
then to evaluate the model results and finally to select the top performing algorithm and model
for the particular business problem.
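The k-means algorithm named above can be sketched as follows. This is a simplified version under stated assumptions: the sample points are invented, the function name kmeans is hypothetical, and the centroids are seeded with the first k points so that the result is repeatable, whereas real tools usually use randomized or smarter seeding.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) on tuples of numbers."""
    centroids = [tuple(p) for p in points[:k]]  # assumption: seed with the first k points
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments


# Hypothetical unlabeled observations with two numeric attributes each.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```

No labels are supplied anywhere; the group numbers in assignments are discovered by the algorithm, which is what makes the technique ‘unsupervised’.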
What is the difference between predictive analytics and machine learning?
Machine Learning is evolving to support the increasing volumes, varieties and velocities of Big
Data projects, rather than the smaller, simpler datasets that typified data mining projects,
particularly in academia. Another way to understand ML is as the next generation of data
mining. Machine learning is a superset of predictive analytics because it involves more than
application of one or more predictive analytic techniques (and associated algorithms) to sets of
input data. Another consideration is the current push toward commercial ‘productization’ of
machine learning applications. Although data mining and statistical analysis have been widely
used in particular domains, the broadest application, academic research, is implemented
quite differently from commercial applications.
Specifically there are many steps in data preparation for predictive analytics (or ML) projects
that are different from data preparation common for business analytics projects. Steps to
prepare input data for predictive analytics include such tasks as the following:
• Evaluating data types and detecting or creating labels (for classification)
• Evaluating number / ratio of null values
• Evaluating quality/ usefulness of input data based on statistical analysis (mean, mode,
etc…)
• Removing outlier values (exceptions)
• Creating groupings (called ‘bucketing’)
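Several of the preparation tasks listed above can be sketched with the Python standard library alone. The column of ages, the two-standard-deviation outlier rule, the bucket edges and the function names are all assumptions chosen for illustration; the right thresholds depend on the data and the business question.

```python
import statistics

# Hypothetical raw column of customer ages; None marks a missing value.
ages = [34, 38, None, 41, 29, 38, 120, None, 36, 38]


def null_ratio(values):
    """Share of entries that are missing."""
    return sum(v is None for v in values) / len(values)


def remove_outliers(values, max_z=2.0):
    """Drop values more than max_z sample standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= max_z * stdev]


def bucket(value, edges):
    """Return a label for the range (bucket) that the value falls into."""
    for lower, upper in zip(edges, edges[1:]):
        if lower <= value < upper:
            return f"{lower}-{upper - 1}"
    return "other"


present = [v for v in ages if v is not None]
print("null ratio:", null_ratio(ages))
print("mean:", statistics.mean(present), "mode:", statistics.mode(present))

cleaned = remove_outliers(present)
print("after outlier removal:", cleaned)
print("buckets:", [bucket(v, [20, 30, 40, 50]) for v in cleaned])
```

Note that a single extreme value inflates both the mean and the standard deviation, so a rule like this one is a starting point for inspection rather than a substitute for looking at the data.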
Commercial tools provide data visualizers, which assist with data quality assessment at this stage
and also facilitate easy modification of the input data. After the data preparation tasks have
been completed, there is a 3-step process to implement a machine learning solution or model. It
is quite common for the modeling process to be iterative (because the outputs are probabilistic)
during the model creation phase. Iterations often include returning to the data preparation
phase because adjusting the quality of the input data impacts outputs. The need for iteration
over increasingly large data sets marries nicely with the scalability of cloud-based ML solutions.
These steps include the following:
• Input Data
    o Ingest – in this step you ingest source data. Common ingest methods are file-based and
      database-based. Increasingly, accepting streaming input is a requirement.
    o Evaluate & Clean – in this step you review the input data (often done using
      statistical analysis) and tune that data, so as to be prepared for inclusion in one or
      more ML models
• Model
    o Select ML Algorithm and Initialize Model(s) – in this step you match the
      business question and input data to an ML technique (regression, classification or
      clustering) and one or more algorithms from within that technique (such as linear
      regression, decision trees, k-means clustering) to evaluate the possibility of
      building a useful model with this information
    o Train Model(s) – in this step you create the model and load it with data. You
      then process the model and view the output
    o Score Model(s) – in this step you evaluate the effectiveness of model results vs.
      the ‘random guess’ line to understand the potential use of the model(s) for future
      predictions, classifications and clustering tasks
• Predict
    o Perform Prediction – in this step you evaluate new data against the model in
      order to predict the likelihood of selected results.
These steps are often performed iteratively, as model scoring differentiates between
multiple models. You may decide to repeat some or all of the cycle with slightly different
input data, different algorithms, different algorithm parameters, etc. in order to produce one or
more ‘useful’ models. Wizards and visualization tools found in ML products speed up these
iterative cycles.
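The Input Data, Model and Predict steps described above can be walked through end to end in a short sketch. Everything specific here is an assumption made for illustration: the customer data, the column names, the simple hold-out split, and the use of a nearest-centroid classifier as a stand-in for the algorithms named in the text. Scoring uses the area under the ROC curve (AUC), where 0.5 corresponds to the ‘random guess’ line.

```python
import csv
import io

# Hypothetical customer data; a real project would read a file, database or stream.
RAW = """visits,tickets,churned
12,0,0
2,3,1
9,1,0
,2,1
1,4,1
10,1,0
3,5,1
8,0,0
4,2,1
11,2,0
5,2,0
5,1,1
"""


def ingest(text):
    """Input Data / Ingest: parse CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Input Data / Evaluate & Clean: drop incomplete rows and convert types."""
    cleaned = []
    for row in rows:
        if any(value == "" for value in row.values()):
            continue
        features = (float(row["visits"]), float(row["tickets"]))
        cleaned.append((features, int(row["churned"])))
    return cleaned


def train(examples):
    """Model / Train: compute one centroid per class (nearest-centroid classifier)."""
    centroids = {}
    for label in {lab for _features, lab in examples}:
        members = [features for features, lab in examples if lab == label]
        centroids[label] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids


def churn_score(model, features):
    """Higher scores mean the record sits closer to the churn centroid (label 1)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return sq_dist(features, model[0]) - sq_dist(features, model[1])


def auc(model, examples):
    """Model / Score: share of (churned, retained) pairs ranked correctly."""
    pos = [churn_score(model, f) for f, lab in examples if lab == 1]
    neg = [churn_score(model, f) for f, lab in examples if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


examples = clean(ingest(RAW))
train_set, test_set = examples[:7], examples[7:]  # simple hold-out split
model = train(train_set)
print("AUC on held-out data:", auc(model, test_set))  # compare against 0.5
# Predict: evaluate a new, unseen customer against the trained model.
print("Churn predicted for new customer:", churn_score(model, (3.0, 4.0)) > 0)
```

If the held-out score is close to 0.5, that is the signal to iterate: revisit the input data, try another algorithm or adjust its parameters, and score again.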
Shown below is an open source project for RStudio called Shiny. Shiny is used by many R
developers because it allows them to quickly and easily visualize (and query) models they created
in the R programming language. Note the use of input parameters via slider bars and text boxes.
These controls allow the ML developer to ‘try out’ different values in evaluating the usefulness of
their model. Lightweight visualization tools for rapid iteration are particularly
valuable for ML scenarios.
Figure 2 - Visualization of R results using Shiny
Is data science the same thing as machine learning?
Data science is a superset of Machine Learning in that, in addition to all of the tasks described in
the previous section, data science also includes hypothesis formation or, more simply, ‘asking the
right question(s).’ Data science, as shown in the graphic, involves domain expertise, healthy
curiosity, scientific thinking and an understanding of math, statistics, algorithms, data input sets and
visualization. Increasingly, a team of people in the enterprise is responsible for data science
projects, because the skill sets needed are simply not found in any one or two people. These
teams also benefit from using enterprise-grade tools, which facilitate communication and other
enterprise needs, such as security, source control and others.
Figure 3 - Skills Needed for Data Science
What is Artificial Intelligence and how does it relate to machine learning?
An AI (Artificial Intelligence) solution contains one or more intelligent agents. AI intelligent
agents automate tasks that would normally require a highly trained person to do. An example of
this type of task is speech recognition and translation. An AI system is one that responds to
complex problems in a human-like way. A well-known AI success of late is the celebrated win of
the IBM Watson AI system against two top human players in the TV trivia game show Jeopardy!
In some ways, AI has more to do with process automation than learning because AI systems
ingest vast amounts of source data and perform iterative ML processes, often over a period of
years. In practice AI includes a number of ML components, so that the system and its processes
can be increasingly optimized or can learn over time. You can see commercial application of AI
systems in domains as disparate as medical diagnostics, self-driving cars, face and speech
recognition and bank fraud detection.
What is Deep Learning and how does it relate to machine learning?
Deep Learning is a relatively new aspect of Machine Learning. It’s a set of algorithms in ML that
attempt to model high-level abstractions in data by using multiple non-linear transformations.
Deep Learning focuses on improving the efficiency of unsupervised or semi-supervised
feature learning algorithms. It’s based on research in human neuroscience, such as
neural coding. The algorithms are deep neural networks, and problem sets include computer vision,
natural language processing and speech recognition. Deep Learning has also been called the new
definition of the ‘neural networks’ data-mining algorithm.
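The phrase ‘multiple non-linear transformations’ can be illustrated with a very small network. In this sketch the weights are picked by hand rather than learned, and the function names relu, dense and xor_network are hypothetical. It stacks two layers with a non-linear step between them to compute XOR (exclusive or), a function that no single linear layer can represent. Real deep networks use many more layers and learn their weights from data.

```python
def relu(x):
    """A simple non-linear activation: negative inputs become zero."""
    return max(0.0, x)


def identity(x):
    """No activation; used for the final output unit."""
    return x


def dense(inputs, weights, biases, activation):
    """One layer: a weighted sum per output unit followed by an activation."""
    return [
        activation(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def xor_network(a, b):
    """Two stacked layers with hand-picked (not learned) weights that compute XOR."""
    hidden = dense([a, b], weights=[[1.0, 1.0], [1.0, 1.0]],
                   biases=[0.0, -1.0], activation=relu)
    output = dense(hidden, weights=[[1.0, -2.0]], biases=[0.0], activation=identity)
    return output[0]


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_network(a, b))
```

Removing the relu step would collapse the two layers into one linear transformation, which is why the non-linearity between layers matters.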
Advances in hardware, particularly in GPU computational capabilities, have facilitated the use
of Deep Learning, as they have enabled model-processing times to shrink from weeks or days to a
more practical level, i.e. minutes. However, given the computational intensity, it is still the case
that computational (processing time) requirements limit the widespread application of Deep
Learning algorithms.
Deep Learning is sometimes associated with ‘strong AI’ because of its potential to disrupt a large
number of processes. Major software companies are investing millions of dollars in research on
improving the usability of Deep Learning in their own core products (such as their voice recognition
systems: Google Now, Microsoft Cortana, Apple Siri and other products). Although the
potential of Deep Learning is exciting, the reality is that, due to the time, cost, complexity and
skills needed, its broad application is still limited to experimental and (mostly) research
projects at a small subset of companies, such as Google, IBM and Microsoft.
What is the importance of real-time analytics?
Broader adoption of technologies such as in-memory databases and streaming Hadoop (Spark
Streaming, Storm and Samza), along with new types of data providers, e.g. IoT data input
devices, are increasing the demand for real-time analytics as a category. In addition, the creation of
cloud-based data pipeline libraries and products enables the creation of more complex conduits
for incoming data, including through multiple processing pipelines.
Along with these advances in real-time Big Data technologies in general comes demand for
products, which can enable rapid creation of solutions that also include real-time predictive
analytics. Major software vendors are creating consumer products and services, such as adaptive
voice input (Google Now, Microsoft Cortana and Apple Siri) that use real-time predictive
analytics. These types of applications are igniting consumer imagination and fueling demand in
general.
Process and Practicalities
Let’s take a deeper look at the processes involved in creating commercial machine learning
solutions. We do so because, as mentioned, the process for creating useful commercial
predictive analytics is quite different from that of creating business analytics. Digging into the
detailed processes involved will help in our understanding of the usability of the libraries, tools
and products currently available.
Business data projects are driven by the need to gain more or better business insights. Given
that, what are the types of use cases that machine learning solutions can address? Remembering
the core functionality of ML, i.e. predicting one or more discrete, future values, classifying or
labeling new data into known groups and/or detecting natural groups in new data, here is a short
list of some types of common use cases:
• Facilities & Manufacturing -- Smart Buildings, Predictive Maintenance
• Sales & Marketing -- Demand Forecasting, Churn Analysis, Target Advertising
• Biomedical -- Life Science Research, Healthcare outcomes (patient re-admission rates)
• Security -- Fraud Detection, Network Intrusion Detection
• Logistics -- Routing
As mentioned, the steps involved in creating an end-to-end machine learning solution include a
number of considerations. Before the advent of cloud-based data storage, pipelines and machine
learning model tooling, the costs involved in creating what were then called data mining solutions
blocked many enterprises. These costs included high hardware and software license fees: often
well over $100k, and up to $1 million simply to start what was often a multi-year project was not
unheard of. Additionally, the costs of re-training staff or hiring specialty consultants to
implement the data mining projects added to the project costs and complexity. Prior to cloud-based
data storage and cloud-based data pipeline products, the costs associated with unearthing
enterprise data from the various (and often proprietary) on-premises data silos added to the adoption
blockers. Yet another blocker to implementing traditional data mining was that business analysts
(or, in some cases, statisticians) were wholly separated from the developers who
would be charged with creating application interfaces for the results of the data mining work
produced by those analysts.
Cloud storage combined with new types of Big Data storage has driven overall enterprise data
volumes up dramatically. Increasingly large and complex data sets are becoming progressively
more difficult to analyze in a meaningful way for the enterprise. Driven by particular sectors,
such as the ML analysis of massive amounts of behavioral data collected in social gaming (Angry
Birds, Halo, etc…), the enterprise appetite for getting started with ML projects has increased
sharply over the last 12 months.
Although the landscape is improving due to the release of improved open source libraries and tools,
as well as new commercial tools, for most enterprises ML projects are a new type of analytics.
Given that, for traditional enterprises, the newly released set of cloud-based ML tools and
services, such as Azure ML, IBM Watson, Predixion Software, AWS ML, BigML and others, are a
welcome complement to the existing (mostly open source) languages, libraries and tools.
Another new item in the emerging ecosystem of enterprise tools and products designed to
support enterprise ML projects is the emergence of commercial data markets. IBM, Microsoft
and Predixion Software all include the ability to directly ‘publish’ the results of one or more
useful ML experiments into their cloud-based repository or marketplace. Technically, most
enable the ML experiment to be published as a REST-based web service endpoint.
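As a rough illustration of what ‘published as a REST-based web service endpoint’ means in practice, here is a minimal sketch built only on the Python standard library. It is not the API of any vendor named in this report: the /predict path, the MODEL coefficients, the JSON field names and the function names are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical trained model: coefficients of a linear demand forecast.
MODEL = {"intercept": 36.0, "per_week": 3.9}


def predict(payload):
    """Score one request body (already parsed from JSON) against the model."""
    week = float(payload["week"])
    forecast = MODEL["intercept"] + MODEL["per_week"] * week
    return {"week": week, "forecast_units": forecast}


class PredictionHandler(BaseHTTPRequestHandler):
    """Answers POST /predict with a JSON prediction."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            result = predict(json.loads(self.rfile.read(length)))
        except (ValueError, KeyError, TypeError):
            self.send_error(400, 'expected JSON like {"week": 5}')
            return
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the endpoint on localhost; call this from your own entry point."""
    HTTPServer(("127.0.0.1", port), PredictionHandler).serve_forever()
```

Once serve() is running, any client that can send an HTTP POST with a JSON body can request a prediction, which is what allows developers to consume a model without knowing how it was built. Commercial services add authentication, scaling and billing on top of this basic pattern.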
Interestingly, cloud vendors are leveraging integration with their own cloud services. For
example, Amazon ML includes the ability to enable real-time ML via a one-button click as shown
in the screenshot below. This real-time capability is integrated with AWS S3 storage. AWS ML
integrates with S3, RDS or Redshift at this time.
Figure 4 - Amazon ML Model Usage Options
This functionality not only facilitates quick and easy deployment to production of commercial
ML services, but also has the interesting implication of providing the enterprise a commercial
platform from which they can monetize the results of their ML experiments by making those
results available as a commercial offering.
Shown below is a chart that lists many of the major offerings – either commercial or open source.
Phase | Azure | AWS | Google | Commercial | Open Source
Ingest | Stream Insight | Kinesis | Big Query | Data Torrent | Flume
Pipeline | Data Pipeline | Data Pipeline | Data Pipeline | Data Torrent | Kafka
Storage | BLOB, Document DB, SQLAzure, HDInsight | S3, Dynamo DB, RDS – SQL, Redshift, EMR | BLOB, H/R Datastore, MySQL, Hadoop on GCE | SAS | NoSQL, Hadoop
Create Predictive Models | Azure ML, Revolution Analytics for R Language | AWS ML | Prediction API | SAS, IBM Watson, Predixion Software, BigML, Matlab, Mathematica, PredictionIO… | R, Mahout, Python Pandas, Weka
Predictive Results Publication and/or Visualization | Excel, Power BI Gateway, PowerView, Azure Data Market | AWS Lambdas, Partners | Google Charts | BigML, Dato, Predixion Marketplace, Tableau, Wolfram Language | D3
In some verticals, such as biomedical, it is common to have some form of academic data mining
or statistics work (data sets and / or data mining models) to use as a basis for creating
commercial machine learning solutions. One example is when you are turning that academic
research into commercial biomedical products. Given that, we’ll list data mining languages,
libraries and tools, which are commonly used in academic research. Also, traditional
statistical tools and languages, e.g. Matlab and Mathematica, have high adoption in
the research sector.
ML Academic Languages, Tools and Libraries – some are open source and most have free
versions for academic research. Shown below is a chart that summarizes many of these items.
We have included the Communities category because academic data science communities are at
the leading edge of work on improving open source tools and libraries and bear watching when you
are assessing the state of ML tools and products.
Category | Objects | Notes
Languages | R Language | Stats Language
Languages | SciPy/NumPy/Pandas | Python Libraries for ML
Languages | Matlab | Stats Language
Languages | Mathematica | Stats Language
Languages | Julia | Scalable Stats Language
Languages | Mahout | ML for Hadoop
Languages | Weka | Research Stats Language
Tools | RStudio | IDE for R
Tools | Shiny for R | Visualization for R
Tools | Weka Studio | IDE for Weka
Tools | PyCharm | IDE for Python
Tools | Sublime | IDE for Python and more
Communities | KDNuggets | Website
Communities | Kaggle | Competition
Communities | DataKind | Community
Communities | Open Gov/Open Data | Community
Communities | Code for America | Community
Accessible to Data Scientists & Business Users
A key question around the practicality of ML solutions for the enterprise is this: Who exactly will
develop the ML solutions in the enterprise? Given the diverse set of skills needed to successfully
implement any type of data science solution, much less the smaller (and even more
complex) subset around ML, the first part of the answer is the most critical: ML projects are best
implemented by a team of skilled professionals. Our answer to the common question “Do I just need
to hire a statistician to implement an ML project?” is an unqualified “No!” Commercial ML
differs substantially from ML for academic research. While the image of the lone scientist,
toiling away in his/her lab and carefully analyzing the results via complex statistical calculations,
is the heritage of ML, this image bears little relationship to the practicalities of implementing
ML in the enterprise.
While there is definitely a place for a dedicated statistician on an enterprise ML team, this is no
longer a requirement for all ML projects. That being said, ML tools complement (but do not
substitute for) statistical and data mining domain expertise. What has changed with the advent
of these tools is the ability for your key team members to work with others (business analysts,
decision makers, developers, DevOps, etc.) because the tools use common interfaces and
well-designed dataflow visualizations. Also, most tools are cloud-based, which means zero installation
and configuration and quick environment start-up time. Additionally, commercial tools are designed
to scale storage and processing via cloud capacity, enabling faster movement from small dataset
experiments to full-scale production deployments. Cloud-based tools are particularly well
suited for building quick proof-of-concept projects for the enterprise.
Given the democratization of tooling, you may be wondering whether this new tooling is
sophisticated enough for classically trained data scientists and academics to be able to make full
use of their complete skill sets. The answer is a conditional yes: some, but not all, commercial
products, such as Azure ML, integrate with commonly used statistical languages (R
Language and Python libraries) and allow re-use of scripts created in these languages.
Additionally, it’s important for researchers to have visibility into algorithms and algorithm
parameters. This is important for reproducibility of published experiment results. Shown below
is an Azure ML model, which uses a two-class support vector machine to perform
classification (of Tweets in this sample). Also of note is the ability to use R Language scripts in an
ML workflow:
Figure 5 - Azure ML Experiment
Model evaluation is a key component of an ML experiment. Here is sample output from the Azure
ML model evaluation visualization. You’ll note that both score information (table) and graphical
output are included in the visualization:
Figure 6 - Azure ML Model Evaluation Output
For comparison, shown below is output from a sample Amazon ML model evaluation:
Figure 7 - Amazon ML Model Evaluation Visualization
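The score tables in evaluation views like those above generally summarize a handful of standard binary-classification metrics. As a minimal sketch (not any vendor's implementation), the following computes confusion-matrix counts, accuracy, precision, recall and F1. The labels, scores, 0.5 threshold and the function name `binary_metrics` are all assumptions made for illustration.

```python
def binary_metrics(labels, scores, threshold=0.5):
    """Return confusion-matrix counts and common metrics for a binary classifier."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1

    total = tp + fp + tn + fn
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical true labels and model scores (not taken from the figures above).
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]

metrics = binary_metrics(labels, scores)
for name, value in metrics.items():
    print(name, value)
```

Changing `threshold` trades precision against recall, which is the same trade-off the graphical part of an evaluation view is meant to expose.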
Accessible to Developers & BI/DW
Professionals
An interesting and somewhat unexpected aspect of ML enterprise projects is that having one or
more Big Data repositories is in no way a requirement for undertaking this type of project.
Because ML originated in academic research using statistics and data mining, some of the
most useful ML projects are, in fact, based on applying these techniques to LOB data. You
can think of it as being able to ask different kinds of questions of your current data.
Understanding when to use ML (and when not to) relates directly to the definitions of business
and predictive analytics. Simply put, use ML when you want to ask business questions that will
result in probabilistic answers.
The ability to ask predictive questions of LOB data often yields useful results. For example, it
has been quite common to begin ML projects in sales and marketing departments, using CRM
data as the source for ML experiments that answer business questions like ‘what are the
characteristics of the customers who produce the most revenue?’ (Clustering) and ‘what type of
cross-sell opportunities can we introduce on our website based on known customer purchase
patterns?’ (Classification).
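A clustering question such as the revenue example above can be sketched in a few lines. The following is a minimal one-dimensional k-means, assuming hypothetical per-customer revenue figures and a deterministic initialization chosen for reproducibility. It illustrates the technique only and is not the algorithm any particular vendor ships.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Cluster numbers into k groups with a basic k-means loop (requires k >= 2)."""
    if k < 2:
        raise ValueError("k must be at least 2")
    low, high = min(values), max(values)
    # Deterministic start: spread the initial centers evenly between min and max.
    centers = [low + (high - low) * i / (k - 1) for i in range(k)]
    clusters = [[] for _ in range(k)]

    for _ in range(iterations):
        # Assignment step: each value joins the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)

        # Update step: move each center to the mean of its members.
        new_centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers

    return centers, clusters


# Hypothetical annual revenue per customer, e.g. exported from a CRM system.
revenue = [120, 150, 130, 900, 950, 1000]

centers, clusters = kmeans_1d(revenue, k=2)
print("centers:", centers)
print("clusters:", clusters)
```

In practice the clusters would be built from several customer attributes at once, and the characteristics of the high-revenue cluster would then be inspected to answer the business question.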
Another common ‘entry point’ for ML solutions in the enterprise is IT (log) data.
Regulatory (access auditing) and compliance requirements, as well as general security concerns,
drive ML experiments such as ‘at what day / time can I expect network bandwidth usage to
spike to a particular level (value) for a particular segment of my corporate users?’ (Regression).
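A regression question like the bandwidth example above can be sketched with ordinary least squares. The hourly readings below are hypothetical, and the single-feature straight-line fit is an assumption chosen for brevity; real usage patterns would likely need richer features such as day of week or user segment.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical log-derived readings: hour of day and bandwidth usage in Mbps.
hours = [8, 9, 10, 11, 12]
mbps = [102.0, 118.0, 141.0, 159.0, 180.0]

slope, intercept = fit_line(hours, mbps)


def hour_when_usage_reaches(level):
    """Invert the fitted line to estimate when usage hits a given level."""
    return (level - intercept) / slope


print("estimated hour for 200 Mbps:", hour_when_usage_reaches(200.0))
```

The output is an estimate with uncertainty, not a guarantee, which is what distinguishes this kind of predictive question from a conventional report over the same log data.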
In general, the enterprise can find value in appropriately applying predictive analytics via ML
solutions to a broad spectrum of domains. In addition to sales and marketing or DevOps,
enterprises can apply ML to other scenarios for which probabilistic analysis would yield useful
results. For example, questions such as these can now be addressed:
• Which employee attributes correlate most closely with the highest revenue
production of that employee’s team?
• At what future point (value) in time do our customers in a certain segment (e.g.
demographic, geographic…) tend to make a subsequent purchase?
• What groups (trial or free items) of our public resources (website, GitHub, YouTube…)
tend to be used by browsers who become our customers?
As mentioned, integrated tooling provided by commercial vendors enables simpler deployment
and embedding of ML model results into enterprise applications via ‘publish as a web
service’ functionality. Given that relatively few enterprise application developers have familiarity
with, much less expertise in, ML languages, tools and libraries, using commercial ML tools that
include ‘click to publish’ functionality significantly shortens time to market.
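Once a model is published as a web service, an application developer typically consumes it with an ordinary HTTP call. The sketch below builds such a request using only the Python standard library. The endpoint URL, the API key, the bearer-token header and the `{"inputs": [...]}` payload shape are all hypothetical placeholders; each vendor defines its own request schema, so consult the documentation for the service you publish.

```python
import json
import urllib.request

# Hypothetical values: replace with the endpoint and key your vendor issues.
SCORING_URL = "https://example.com/ml/score"
API_KEY = "REPLACE_WITH_YOUR_KEY"


def build_scoring_request(rows, url=SCORING_URL, api_key=API_KEY):
    """Build (but do not send) a JSON POST request for a scoring service."""
    # The {"inputs": [...]} envelope is an assumption, not a vendor schema.
    body = json.dumps({"inputs": rows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


# Hypothetical feature row for one customer.
request = build_scoring_request([{"customer_id": 42, "days_since_purchase": 12}])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request is then a matter of passing it to `urllib.request.urlopen` and decoding the JSON response, which is why no ML-specific expertise is needed on the consuming side.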
Another advantage of using commercial ML tools for the enterprise is the built-in connectors to
disparate incoming data sources. Given that it is increasingly common to use a broad variety of
data sources as ML ingest sources, the availability of pre-built connectors once again speeds
development cycles. Tools commonly include connectors for LOB data, i.e. RDBMS systems
(both on-premises and cloud-based), as well as for some of the newer NoSQL databases, Hadoop,
and one or more types of incoming data stream.
Also useful are the quick statistical snapshots that most commercial ML tools provide of datasets
in your ML project. For example, the AWS ML dataset console view includes the visualization
shown below:
Figure 8 - AWS ML Datasources Attribute Information
The AWS viewer allows the ML team to ‘see’ not only the attribute names, but also the
correlations, uniqueness of data, and most/least frequent categories. It also includes an inline
‘Preview’ visualization of the uniqueness of the data.
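To make the idea of a per-attribute snapshot concrete, here is a minimal sketch of the kind of summary such a viewer presents: missing-value count, number of distinct values, and most and least frequent categories. The rows and attribute names are hypothetical, and this is an illustration of the concept rather than how AWS ML computes its statistics.

```python
from collections import Counter


def summarize_attribute(values):
    """Summarize one categorical attribute; None marks a missing value."""
    present = [value for value in values if value is not None]
    counts = Counter(present)
    ranked = counts.most_common()  # sorted from most to least frequent
    return {
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_frequent": ranked[0][0] if ranked else None,
        "least_frequent": ranked[-1][0] if ranked else None,
    }


# Hypothetical dataset rows, e.g. loaded from a CSV export of LOB data.
rows = [
    {"region": "west", "plan": "basic"},
    {"region": "west", "plan": "pro"},
    {"region": "east", "plan": "basic"},
    {"region": "west", "plan": None},
    {"region": "east", "plan": "pro"},
    {"region": "north", "plan": "basic"},
]

snapshot = {
    name: summarize_attribute([row[name] for row in rows])
    for name in rows[0]
}
for name, summary in snapshot.items():
    print(name, summary)
```

A quick look at such a summary before training often reveals data-quality problems (missing values, near-constant attributes) that would otherwise surface only as poor model performance.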
As mentioned, integrated commercial ML tooling that includes ‘one-click’ deployment capabilities
increases usability for developers and BI professionals. Additionally, marketplaces that
essentially advertise published ML web services, such as Microsoft Azure Data Market, provide
additional discoverability and usability; commerce opportunities for published services are
also emerging. An example is shown below.
Figure 9 - Azure Machine Learning Test Harness
Visualization of results is another element of ML solution usability. To that end, we’ve included
a sample from IBM Watson Analytics. This service includes flexible visualizations at all phases
of the ML process (i.e. data discovery, modeling, etc.). An example is shown below.
Figure 10 - IBM Watson ML Visualization
Our last example of model visualization is from the commercial cloud-based vendor BigML and
is shown below. Also interesting is how vendors such as BigML foster community by providing
a platform for their users to get more value from their ML models. Note that BigML allows
users to upload, share, rate and also sell models for use by others in their own ML scenarios.
Figure 11 - BigML Model Visualization
Key Takeaways
Incorporating the results of machine learning experiments into production data solutions adds
significant complexity to the overall project. Given this, a solid understanding of technology
choices around machine learning solutions is essential for designing and delivering solutions
that provide business value to the organization.
• Use commercial machine learning products when team members new to
machine learning processes are creating your solution. Due to fundamental
differences at every stage in the data pipeline, i.e. data preparation, hypothesis formation,
algorithm selection, model training and evaluation, ML projects introduce a set of
complex processes into the enterprise. If your data paradigm consists of an OLTP store
alone, you would be best served by leveraging commercial ML development suites, rather
than attempting to cobble together solutions based on tools and libraries that were built
primarily for statisticians.
• Select tools or coding libraries that perform at the data ingest and
processing speed and scale needed by the types of machine learning methods
that your business problems require. Enterprises will benefit from leveraging cloud
storage and processing of Big Data workloads as sources for ML solutions because their
data volumes are generally significantly larger than those of academic research. Also,
in-memory streams are increasingly relevant, particularly with the advent of more and more
IoT scenarios.
• Teams that have already implemented pure open source data solutions are
most capable of adding pure open source machine learning solutions.
Domains where data mining and/or statistics may have already been in use, such as
academic research, will have more success using open source tools and libraries, so long as
their input data does not overrun the capabilities of those tools.
• Plan for and test your model deployment topology to ensure ML experiments
deliver production business value. Given the common challenges around deployment
of ML models, commercial vendors are incorporating one-click deployment functionality
in their ML studio environments; such functionality enables faster time to market for
production solutions. Also consider the vendor path to implementing streaming or
near-real-time ML solutions if that is part of your requirements.
• Select tools or plan for coding appropriate types of visualization solutions.
ML outputs are unfamiliar to many business users. Standard reports and
dashboards have not been designed to display ML results in a meaningful way. Selecting
ML vendors whose results integrate easily into other commercial solutions or common
libraries leads to broader usability for ML solutions.
References and Resources
This section lists the references and resources referred to in this article.
Data Science graphic -- http://civicscience.com/data-science-a-visual-guide/
Shiny for R-Studio -- http://shiny.rstudio.com/gallery/movie-explorer.html
Deep Learning and the Hololens -- https://technoptimist.wordpress.com/2015/01/25/deep-learning-and-the-hololens
Collection of papers on how IBM Watson works -- http://www.andrew.cmu.edu/user/ooo/watson/
What is AI? -- http://www.techopedia.com/definition/190/artificial-intelligence-ai
How Google is Teaching Computers to See -- https://gigaom.com/2012/06/25/how-google-is-teaching-computers-to-see/
Need Deep Learning? Here are 4 Lessons from Google -- https://gigaom.com/2015/01/29/new-to-deep-learning-here-are-4-easy-lessons-from-google/
Getting started with AWS ML -- http://docs.aws.amazon.com/machine-learning/latest/dg/tutorial.html
AzureML on Windows Azure DataMarket / Binary Classifier Sample -- https://datamarket.azure.com/dataset/aml_labs/log_regression
BigML Sample Model -- https://bigml.com/user/ashikiar/gallery/model/53b2f21ec8db635905000d33
Kaggle Community -- https://www.kaggle.com/
DataKind Community -- http://www.datakind.org/
Table of Abbreviations
Abbreviation   Full Term
AI             Artificial Intelligence
AWS            Amazon Web Services
BI             Business Intelligence
CRM            Customer Relationship Management
DW             Data Warehouse
GPU            Graphics Processing Unit
IoT            Internet of Things
LOB            Line of Business
ML             Machine Learning
NoSQL          No SQL
OLAP           Online analytical processing
OLTP           Online transactional processing
POC            Proof-of-concept
RDBMS          Relational Database Management System
About Lynn Langit
Lynn Langit is a Big Data and Cloud Architect who has been working with database solutions for
more than 15 years. Over the past 4 years, she’s been working as an independent architect using
these technologies, mostly in the biotech, education, manufacturing and facilities verticals. Lynn
has done POCs and has helped teams build solutions on the AWS, Azure, Google and Rackspace
Clouds. She has done work with SQL Server, MySQL, AWS Redshift, AWS MapReduce, Cloudera
Hadoop, MongoDB, Neo4j, Aerospike and many other database systems. In addition to building
solutions, Lynn also partners with all major cloud vendors, providing early technical
feedback into their Big Data and Cloud offerings. She is an AWS Community Hero, Google
Developer Expert (Cloud), Microsoft MVP (SQL Server) and a MongoDB Master. Lynn is also a
Cloudera certified instructor (for MapReduce Programming).
Prior to re-entering the consulting world 3 years ago, Lynn spent over 10 years as a
Microsoft Certified instructor and Microsoft vendor, and then 4 years as a Microsoft employee. She’s
published 3 books on SQL Server Business Intelligence and has most recently worked with the
SQL Azure team at Microsoft. She continues to write and screencast and hosts a BigData channel
on YouTube (http://www.youtube.com/SoCalDevGal) with over 150 different technical videos
on Cloud and BigData topics. Lynn is also a committer on several open source projects
(http://github.com/lynnlangit).
About Mark Tabladillo
Mark Tabladillo is a Senior Data Scientist at midtown Atlanta's Predictix/LogicBlox. He has used
and promoted Microsoft Azure Machine Learning, Microsoft SQL Server Data Mining, Microsoft
BI Stack, Power BI, SAS, SPSS, R, and Julia. He is a SQL Server MVP and has a research
doctorate (PhD) from Georgia Tech. He is the chapter leader for the PASS Data Science Virtual Chapter,
which has periodic live meetings and its own YouTube channel.