Practical Machine Learning:
discerning differences and selecting
the best approach
Lynn Langit
Reviewed by Mark Tabladillo
TABLE OF CONTENTS
Executive summary
Introduction
Concepts
Process and Practicalities
Accessible to Data Scientists & Business Users
Accessible to Developers & BI/DW Professionals
Key Takeaways
References and Resources
Table of Abbreviations
About Lynn Langit
About Mark Tabladillo
Executive summary
The formal definition of Machine Learning is this: the ability of computing systems to gain
knowledge from experience. Practical ML enables your organization to answer business
questions more effectively because of that experience. Machine Learning solutions consist of
your input data built into models which combine that data with statistical and data mining
algorithms.
Until relatively recently, applied ML (as contrasted with ML for research) was simply too
specialized, difficult and expensive to see broad adoption outside the academic community
and a few commercial domains (finance, ad serving). However, improvements in languages and
libraries, as well as new commercial offerings (including cloud-only products), have greatly
increased the practicality of implementing ML applications. Demand has also been fueled by Big
Data: more data encourages more powerful methods of processing to gain understanding from
that data.
This report will discuss technologies and implementation approaches for creating enterprise
data solutions that include one or more machine learning components. The report will also detail
the tradeoffs of each solution and determine which approach best fits organizational needs.
Introduction
The term ‘Predictive Analytics’ is used somewhat interchangeably with Machine Learning. The
central idea is that Machine Learning enables the creation of important business insights by
analyzing some set of input data with one or more data mining or statistical algorithms.
Where Machine Learning is used
In some sectors, particularly academic research, statistical analysis and data mining have been
standard analytical techniques for years. These sectors tend to use open source languages, tools
and libraries. Academics commonly use specialty languages such as R, or Python libraries
(SciPy/NumPy/Pandas), rather than enterprise languages such as Java, for their ML research
projects. Researchers also tend to work with wide (many attributes) and shallow (relatively small
sample sizes) datasets. This academic dataset size is significant because many of the commonly
used tools, such as R Studio or even Weka, are designed for small (albeit rich) datasets and
are limited to datasets that fit in the memory of an analyst’s desktop computer,
rather than requiring server or even cloud-scale processing power.
In a few commercial sectors, such as finance (for example, credit scoring) and security (for
example, email spam detection), use of ML (via data mining) is not new. In these
areas, highly specialized tools and specially trained professionals have supported these types of
solutions. These vertical-specific ML solution development cycles cost hundreds of
thousands or even millions of dollars to implement. These costs include software licenses,
powerful hardware, proprietary development and management tools and consulting fees.
These types of projects have also commonly taken months or even years to implement.
However, the ML market landscape is rapidly changing with the availability of Big Data/cloud
storage, processing and data pipelines. These new services enable faster and cheaper data
collection, storage and processing. The growth of IoT (mostly sensor) data is also increasing the
volume of data available for analysis. These market changes are making the overall ‘entry point’
for ML projects less risky, i.e. cheaper and faster. Another driver of adoption is the effort that
commercial vendors are putting into creating usable ML tooling, most of which runs on that
particular vendor’s cloud infrastructure (such as IBM Watson on Bluemix, Microsoft Azure ML
on Azure or Amazon ML). ML projects are increasingly seen as a realistic possibility given the
larger market landscape. Simply put, more data means a need for more powerful methods of
deriving meaning from increasingly large and complex datasets. Enter the
democratization of Machine Learning.
Challenges to Adoption
	
  
Although tools are reducing the complexity of applying statistical and data mining
techniques to increasingly large data sets, the enterprise market is in the early stages of ML
adoption. One of the key blockers is complexity -- creating useful predictive analytics or ML
differs substantially from more traditional business analytics.
Because demand for technical professionals skilled in applied statistics and data mining
has traditionally been small, we are faced with a lack of trained, working
professionals who can produce useful results in this area. Specifically, we lack those who have
experience performing the tasks needed in the enterprise ML solution lifecycle, such as
cleaning and grooming the input data, selecting appropriate techniques and algorithms, building
and evaluating models, and supporting the move of their work to production.
Vendors are stepping in to reduce this gap. Several major commercial vendors have launched
general-purpose machine learning suites this year. As mentioned, the majority of these new
offerings are cloud-based. Some solutions offer you the ability to train, test and deploy either
in the cloud or on premises, while other solutions, such as BigML, are cloud-only.
 
Concepts
Taxonomies and terms for Machine Learning solutions have important and nuanced differences
in meaning; a proper understanding of them is key to differentiating the products and solutions
available in the ML space. We’ll start by defining the associated technologies.
What is the difference between business analytics and predictive analytics?
Business Analytics is defined as finding answers to business questions by querying data and
producing a definite result or result set. For example: “What are the top five items that are
found in a shopping basket for a 38 year old man from California who is shopping on a
Saturday at 5pm at a major grocery chain?” The answer to this question (via a query to source
data) is a deterministic result set, usually shown as a report or a dashboard. For many
organizations this is the only type of analytics available. Stated differently, business
analytics are used to analyze “what has happened” for past events.
Predictive Analytics is defined as finding answers to business questions by applying one or
more probabilistic algorithms to some set of input data and producing one or more
probabilistic results. For example: “Consider the items which appear together in the
shopping baskets of all 38 year old men from California who are shopping on a Saturday at
5pm at any of the major grocery chain stores for which we have data, and predict how many of
a given item from this set the stores should have on hand to ensure proper supply for this type
of customer.” In this case, the type of algorithm is regression because it is used to predict a
future value or set of values. To get a result, one or more regression algorithms (for example,
linear regression) are applied to the source data. Because the results are probabilistic, i.e. a
percentage or score of likelihood of a result, it is common to use more than one
algorithm and then to evaluate the quality of each result. This process is called ‘evaluating the
model.’ The best result from the models is selected and is presented either via statistical output
(probability) or via a customized visualization. Stated differently, predictive analytics are used to
analyze “what will happen” for potential or future events. The graphic below illustrates and
contrasts sample results in business and predictive analytics.
Figure 1 - Two Types of Analytics
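As a minimal sketch of this contrast, the example below answers a “what has happened” question
with a deterministic query and a “what will happen” question with a simple linear regression.
The sales data, item names and function names are all invented for illustration, and the
least-squares fit is hand-written only to keep the sketch free of external libraries.

```python
from collections import Counter

# Hypothetical purchase history: (week, item, units sold). All values are invented.
sales = [
    (1, "milk", 10), (1, "bread", 7), (1, "eggs", 5),
    (2, "milk", 12), (2, "bread", 8), (2, "eggs", 5),
    (3, "milk", 14), (3, "bread", 7), (3, "eggs", 6),
    (4, "milk", 16), (4, "bread", 9), (4, "eggs", 6),
]

def top_items(rows, n):
    """Business analytics: a deterministic query over past data."""
    totals = Counter()
    for _week, item, units in rows:
        totals[item] += units
    return [item for item, _ in totals.most_common(n)]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def forecast(rows, item, week):
    """Predictive analytics: estimate a future value from past values."""
    points = [(w, u) for w, i, u in rows if i == item]
    slope, intercept = fit_line([w for w, _ in points], [u for _, u in points])
    return slope * week + intercept

print(top_items(sales, 2))          # what has happened
print(forecast(sales, "milk", 5))   # an estimate of what will happen
```

The query always returns the same answer for the same data, whereas the forecast is only an
estimate whose quality still has to be evaluated against real outcomes.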
What is the difference between data mining and predictive analytics?
Data Mining encompasses a broader set of tasks than predictive analytics. In
addition to regression algorithms, data mining also includes other types of predictive analysis.
Specifically, assigning new data to groups by matching it to existing labeled (or
categorized) data is called classification. Classification algorithms are characterized
as ‘supervised’ because, in addition to the algorithm, an authoritative set of labeled data
is used to process the input data. For example: “In a set of data
there are pictures or drawings of objects that we’ve identified and labeled as
particular animals, i.e. ‘this is a picture of a dog and that is a picture of a cat.’” A classification
task is to evaluate the likelihood of a new picture being a dog or a cat based on pattern matching
to the set of known states. An example of a classification algorithm is decision trees. Of note is
that regression is also ‘supervised’ because a data set with ‘known values’ is used in conjunction
with the regression algorithm when evaluating the probability of a result using
new input data.
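To make the idea of a supervised classifier concrete, here is a minimal sketch of a one-level
decision tree (often called a ‘decision stump’). It is a deliberate simplification: it uses a single
invented feature (weight) instead of pictures, and it assumes that cats are the lighter class. The
data and function names are hypothetical.

```python
# Labeled training data (the 'known states'): (weight_kg, label). Values are invented.
training = [(3.5, "cat"), (4.0, "cat"), (5.2, "cat"), (4.8, "cat"),
            (9.0, "dog"), (12.5, "dog"), (20.0, "dog"), (7.5, "dog")]

def train_stump(rows):
    """Find the weight threshold that misclassifies the fewest training rows.

    Rule learned: weight <= threshold -> 'cat', otherwise 'dog'.
    """
    values = sorted(w for w, _ in rows)
    # Candidate thresholds are the midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(("cat" if w <= threshold else "dog") != label
                   for w, label in rows)

    return min(candidates, key=errors)

def classify(threshold, weight):
    """Apply the learned rule to a new, unlabeled observation."""
    return "cat" if weight <= threshold else "dog"

threshold = train_stump(training)
print(threshold, classify(threshold, 4.4), classify(threshold, 15.0))
```

A full decision tree repeats this kind of split recursively across many features; the labeled
training rows are what make the approach ‘supervised’.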
Discovering natural groupings in source data for which there are no known states or labels is
called clustering. Since there are no known states when clustering algorithms are used, this
type of machine learning is called ‘unsupervised’. An example of this technique is ‘here are some
pictures; group them into subsets based on characteristics (or labels) that are discovered
during the process of running the algorithm.’ As with the other types of ML, when
implementing clustering it is common to run multiple clustering algorithms, such as k-means,
then to evaluate the model results and finally to select the top-performing algorithm and model
for the particular business problem.
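As an illustration of unsupervised learning, the following is a minimal sketch of k-means on
two-dimensional points. The points are invented, and the initialization (taking the first k points
as starting centroids) is a simplifying assumption; production libraries use more careful
initialization and convergence checks.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; returns (centroids, assignments)."""
    centroids = list(points[:k])  # naive initialisation: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments

# Invented, unlabeled observations that happen to form two natural groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```

No labels are supplied; the group numbers in `assignments` are discovered by the algorithm and
carry no meaning until an analyst interprets them.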
What is the difference between predictive analytics and machine learning?
Machine Learning is evolving to support the increasing volumes, varieties and velocities of Big
Data projects, rather than the smaller, simpler datasets that typified data mining projects,
particularly in academia. Another way to understand ML is as the next generation of data
mining. Machine learning is a superset of predictive analytics because it involves more than
the application of one or more predictive analytic techniques (and associated algorithms) to sets
of input data. Another consideration is the current push toward commercial ‘productization’ of
machine learning applications. Although data mining and statistical analysis have been widely
used in particular domains, their broadest application, academic research, is implemented
quite differently from commercial applications.
Specifically there are many steps in data preparation for predictive analytics (or ML) projects
that are different from data preparation common for business analytics projects. Steps to
prepare input data for predictive analytics include such tasks as the following:
• Evaluating data types and detecting or creating labels (for classification)
• Evaluating the number / ratio of null values
• Evaluating the quality / usefulness of input data based on statistical analysis (mean, mode, etc.)
• Removing outlier values (exceptions)
• Creating groupings (called ‘bucketing’)
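The preparation tasks listed above can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the column of ages is invented, the plausible range used
for outlier removal (0 to 100) is an arbitrary choice, and the function names are hypothetical.

```python
import statistics

# Hypothetical raw column of customer ages; None marks a missing value.
ages = [34, 38, None, 41, 38, 29, None, 38, 45, 120]

def null_ratio(values):
    """Share of missing entries in the column."""
    return sum(v is None for v in values) / len(values)

def summarize(values):
    """Basic statistics over the values that are present."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "mode": statistics.mode(present),
    }

def remove_outliers(values, low, high):
    """Keep only present values inside an assumed plausible range."""
    return [v for v in values if v is not None and low <= v <= high]

def bucket(value, width=10):
    """Bucketing: map a value to a range label such as '30-39'."""
    start = (value // width) * width
    return f"{start}-{start + width - 1}"

print(null_ratio(ages))
print(summarize(ages))
cleaned = remove_outliers(ages, 0, 100)
print([bucket(age) for age in cleaned])
```

Comparing the mean with the median and mode is a quick way to spot a value such as 120 that
distorts the column before it reaches a model.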
Commercial tools provide data visualizers, which assist with data quality assessment at this stage
and also facilitate easy modification of the input data. After the data preparation tasks have
been completed, there is a 3-step process to implement a machine learning solution or model. It
is quite common for the model process to be iterative (because the outputs are probabilistic)
during the model creation phase. Iterations often include returning to the data preparation
phase because adjusting the quality of the input data impacts outputs. The need for iteration
over increasingly large data sets marries nicely with the scalability of cloud-based ML solutions.
These steps include the following:
• Input Data
o Ingest – in this step you ingest source data. Common ingest methods are file-based and
database-based; increasingly, accepting streaming input is a requirement.
o Evaluate & Clean – in this step you review the input data (often using
statistical analysis) and tune that data so that it is ready for inclusion in one or
more ML models.
• Model
o Select ML Algorithm and Initialize Model(s) – in this step you match the
business question and input data to an ML technique (regression, classification or
clustering) and to one or more algorithms from within that technique (such as linear
regression, decision trees or k-means clustering) to evaluate the possibility of
building a useful model with this information.
o Train Model(s) – in this step you create the model and load it with data; you
then process the model and view the output.
o Score Model(s) – in this step you evaluate the effectiveness of model results vs.
the ‘random guess’ line to understand the potential use of the model(s) for future
prediction, classification and clustering tasks.
• Predict
o Perform Prediction – in this step you evaluate new data against the model in
order to predict the likelihood of selected results.
These steps are often performed iteratively, as model scoring results in differentiation between
multiple models. You may decide to repeat some or all of the cycle with slightly different
input data, different algorithms, different algorithm parameters, etc. in order to produce one or
more ‘useful’ models. Wizards and visualization tools found in ML products speed up these
iterative cycles.
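The train, score and predict cycle described above can be sketched as follows. The churn data is
invented, the two candidate ‘models’ are intentionally trivial, and a majority-class baseline
stands in here for the ‘random guess’ line; all names are hypothetical.

```python
# Hypothetical labeled rows: (hours_of_usage_per_week, churned). Values are invented.
rows = [(1, True), (2, True), (3, True), (4, True), (9, False), (10, False),
        (11, False), (12, False), (2, True), (11, False), (3, True), (10, False)]
train, holdout = rows[:8], rows[8:]   # keep some rows aside for scoring

def train_majority(data):
    """Baseline: always predict the most common label seen in training."""
    positives = sum(label for _, label in data)
    majority = positives * 2 >= len(data)
    return lambda x: majority

def train_threshold(data):
    """Candidate: predict churn when usage is below the midpoint of the class means."""
    churn = [x for x, label in data if label]
    stay = [x for x, label in data if not label]
    cut = (sum(churn) / len(churn) + sum(stay) / len(stay)) / 2
    return lambda x: x < cut

def accuracy(model, data):
    """Score: fraction of held-out rows the model labels correctly."""
    return sum(model(x) == label for x, label in data) / len(data)

# Train every candidate, score each on the holdout rows, then keep the best.
candidates = {"majority baseline": train_majority, "usage threshold": train_threshold}
scores = {name: accuracy(fit(train), holdout) for name, fit in candidates.items()}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name](train)

print(scores)
print(best_name, best_model(1.5))   # predict for a new, unseen customer
```

If no candidate clearly beats the baseline, that is the signal to return to the data preparation
or algorithm selection steps and iterate.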
Shown below is an open source project for RStudio called Shiny. Shiny is used by many R
developers because it allows them to quickly and easily visualize (and query) models they created
in the R programming language. Note the use of input parameters via slider bars and text boxes.
These controls allow the ML developer to ‘try out’ different values in evaluating the usefulness of
their model. Lightweight visualization tools for rapid iteration are particularly
valuable for ML scenarios.
 
Figure 2 - Visualization of R results using Shiny
Is data science the same thing as machine learning?
Data science is a superset of Machine Learning in that, in addition to all of the tasks described in
the last section, data science also includes hypothesis formation or, more simply, ‘asking the
right question(s)’. Data science, as shown in the graphic, involves domain expertise, healthy
curiosity, scientific thinking, and an understanding of math, statistics, algorithms, data input sets
and visualization. Increasingly, a team of people in the enterprise is responsible for data science
projects, because the skill sets needed are simply not found in any one or two people. These
teams also benefit from using enterprise-grade tools, which facilitate communication and other
enterprise needs, such as security and source control.
Figure 3 - Skills needed for Data Science
What is Artificial Intelligence and how does it relate to machine learning?
An AI (Artificial Intelligence) solution contains one or more intelligent agents. AI intelligent
agents automate tasks that would normally require a highly trained person to do. Examples of
this type of task are speech recognition and translation. An AI system is one that responds to
complex problems in a human-like way. A well-known AI success of late is the celebrated win of
the IBM Watson AI system against two top human players in the TV trivia game show Jeopardy.
 
In some ways, AI has more to do with process automation than learning because AI systems
ingest vast amounts of source data and perform iterative ML processes, often over a period of
years. In practice AI includes a number of ML components, so that the system and its processes
can be increasingly optimized or can learn over time. You can see commercial application of AI
systems in domains as disparate as medical diagnostics, self-driving cars, face and speech
recognition and bank fraud detection.
What is Deep Learning and how does it relate to machine learning?
Deep Learning is a relatively new aspect of Machine Learning. It’s a set of algorithms in ML that
attempt to model high-level abstractions in data by using multiple non-linear transformations.
Deep Learning focuses on improving the efficiency of unsupervised or semi-supervised
feature learning algorithms. It’s based on research in human neuroscience, such as human
neural coding. The algorithms are deep neural networks, and problem sets include computer vision,
natural language processing and speech recognition. Deep Learning has also been called the new
definition of the ‘neural networks’ data-mining algorithm.
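The phrase ‘multiple non-linear transformations’ can be illustrated with a tiny forward pass
through a stack of layers. This is a sketch only: the weights are hand-picked, hypothetical
numbers (a real network would learn them from data), tanh is just one possible non-linearity,
and no training is shown.

```python
import math

def layer(inputs, weights, biases):
    """One non-linear transformation: weighted sum followed by tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    """Stack several layers; each consumes the previous layer's output."""
    activation = inputs
    for weights, biases in layers:
        activation = layer(activation, weights, biases)
    return activation

# Hypothetical, hand-picked weights; each entry is (weight rows, biases).
network = [
    ([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]], [0.0, 0.1, -0.1]),  # 2 inputs -> 3 units
    ([[0.7, -0.5, 0.2], [0.3, 0.9, -0.4]], [0.05, -0.05]),        # 3 units -> 2 units
    ([[1.0, -1.0]], [0.0]),                                       # 2 units -> 1 output
]

output = forward([1.0, 2.0], network)
print(output)
```

The ‘depth’ in Deep Learning refers to the number of such stacked layers; the computational
cost of training them is what the hardware advances discussed next help to address.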
Advances in hardware, particularly in GPU computational capabilities, have facilitated the use
of Deep Learning, as they have enabled model-processing times to shrink from weeks or days to a
more practical level, i.e. minutes. However, given the computational intensity, it is still the case
that computational (processing time) requirements limit the widespread application of Deep
Learning algorithms.
Deep Learning is sometimes also called ‘strong AI’ because of its potential to disrupt a large
number of processes. Major software companies are investing millions of dollars in research on
improving the usability of Deep Learning in their own core products (such as their voice
recognition systems: Google Now, Microsoft Cortana, Apple Siri and other products). Although the
potential of Deep Learning is exciting, the reality is that, due to the time, cost, complexity and
skills needed, the broad application of its results is still limited to experimental and (mostly)
research projects at a small subset of companies, such as Google, IBM and Microsoft.
 
What is the importance of real-time analytics?
Broader adoption of technologies such as in-memory databases and streaming Hadoop (Spark
Streaming, Storm and Samza), along with new types of data providers, e.g. IoT data input
devices, is increasing the demand for real-time analytics as a category. In addition, the creation
of cloud-based data pipeline libraries and products enables more complex conduits
for incoming data, including routes through multiple processing pipelines.
Along with these advances in real-time Big Data technologies in general comes demand for
products that enable rapid creation of solutions that also include real-time predictive
analytics. Major software vendors are creating consumer products and services, such as adaptive
voice input (Google Now, Microsoft Cortana and Apple Siri), that use real-time predictive
analytics. These types of applications are igniting consumer imagination and fueling demand in
general.
	
  
 
Process and Practicalities
	
  
Let’s take a deeper look at the processes involved in creating commercial machine learning
solutions. We do so because, as mentioned, the process for creating useful commercial
predictive analytics is quite different from that of creating business analytics. Digging into the
detailed processes involved will help in our understanding of the usability of the libraries, tools
and products currently available.
Business data projects are driven by the need to gain more or better business insights. Given
that, what are the types of use cases that machine learning solutions can address? Remembering
the core functionality of ML, i.e. predicting one or more discrete, future values, classifying or
labeling new data into known groups and/or detecting natural groups in new data, here is a short
list of some types of common use cases:
• Facilities & Manufacturing -- Smart Buildings, Predictive Maintenance
• Sales & Marketing -- Demand Forecasting, Churn Analysis, Target Advertising
• Biomedical -- Life Science Research, Healthcare outcomes (patient re-admission rates)
• Security -- Fraud Detection, Network Intrusion Detection
• Logistics -- Routing
As mentioned, the steps involved in creating an end-to-end machine learning solution include a
number of considerations. Before the advent of cloud-based data storage, pipelines and machine
learning model tooling, the costs involved in creating what were then called data mining solutions
blocked many enterprises. These costs included high hardware and software license fees (often
well over $100k; up to $1 million simply to start what was often a multi-year project was not
unheard of). Additionally, the costs of re-training staff or hiring specialty consultants to
implement the data mining projects added to project costs and complexity. Prior to cloud-based
data storage and cloud-based data pipeline products, the costs of unearthing
enterprise data from the various (and often proprietary) on-premises data silos added to the
adoption blockers. Yet another blocker to implementing traditional data mining was that
business analysts (or, in some cases, statisticians) were wholly separated from the developers who
would be charged with creating application interfaces for the results of the data mining work
produced by the business analysts.
Cloud storage combined with new types of Big Data storage has driven overall enterprise data
volumes up dramatically. Increasingly large and complex data sets are becoming progressively
more difficult to analyze in a meaningful way for the enterprise. Driven by particular sectors,
such as the ML analysis of massive amounts of behavioral data collected in social gaming (Angry
Birds, Halo, etc…), the enterprise appetite for getting started with ML projects has increased
sharply over the last 12 months.
Although the landscape is improving due to the release of improved open source libraries and tools
as well as new commercial tools, for most enterprises ML projects are a new type of analytics.
Given that, for traditional enterprises, the newly released set of cloud-based ML tools and
services, such as Azure ML, IBM Watson, Predixion Software, AWS ML, BigML and others, is a
welcome complement to the existing (mostly open source) languages, libraries and tools.
Another new item in the ecosystem of tools and products designed to
support enterprise ML projects is the emergence of commercial data markets. IBM, Microsoft
and Predixion Software all include the ability to directly ‘publish’ the results of one or more
useful ML experiments into their cloud-based repository or marketplace. Technically, most
enable the ML experiment to be published as a REST-based web service endpoint.
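From a developer’s point of view, consuming a published experiment usually amounts to sending
an HTTP request with the input values and reading back a prediction. The sketch below only
builds such a request and does not send it. The URL, the API key, the ‘inputs’ payload layout and
the bearer-token header are placeholders invented for illustration, not any vendor’s actual API;
each service documents its own request format.

```python
import json
import urllib.request

def build_scoring_request(endpoint_url, api_key, features):
    """Build (but do not send) a JSON POST request for a hypothetical scoring endpoint."""
    body = json.dumps({"inputs": [features]}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
    )

request = build_scoring_request(
    "https://example.invalid/ml/score",   # placeholder URL, not a real service
    "YOUR-API-KEY",                       # placeholder credential
    {"age": 38, "state": "CA", "day": "Saturday"},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To call a real service, pass the request to urllib.request.urlopen(request).
```

Because the model sits behind a plain web endpoint, any application that can issue an HTTP
request can use it, regardless of the language the model was built in.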
Interestingly, cloud vendors are leveraging integration with their own cloud services. For
example, Amazon ML includes the ability to enable real-time ML via a one-button click as shown
in the screenshot below. This real-time capability is integrated with AWS S3 storage. AWS ML
integrates with S3, RDS or Redshift at this time.
 
Figure 4 - Amazon ML Model Usage Options
This functionality not only facilitates quick and easy deployment to production of commercial
ML services, but also has the interesting implication of providing the enterprise a commercial
platform from which they can monetize the results of their ML experiments by making those
results available as a commercial offering.
 
Shown below is a chart that lists many of the major offerings, either commercial or open source,
for each project phase.
Phase: Ingest
• Azure: Stream Insight
• AWS: Kinesis
• Google: Big Query
• Commercial: Data Torrent
• Open Source: Flume
Phase: Pipeline
• Azure: Data Pipeline
• AWS: Data Pipeline
• Google: Data Pipeline
• Commercial: Data Torrent
• Open Source: Kafka
Phase: Storage
• Azure: BLOB, Document DB, SQLAzure, HDInsight
• AWS: S3, Dynamo DB, RDS – SQL, Redshift, EMR
• Google: BLOB, H/R Datastore, MySQL, Hadoop on GCE
• Commercial: SAS
• Open Source: NoSQL, Hadoop
Phase: Create Predictive Models
• Azure: Azure ML, Revolution Analytics for R Language
• AWS: AWS ML
• Google: Prediction API
• Commercial: SAS, IBM Watson, Predixion Software, BigML, Matlab, Mathematica, PredictionIO…
• Open Source: R, Mahout, Python Pandas, Weka
Phase: Predictive Results Publication and/or Visualization
• Azure: Excel, Power BI Gateway, PowerView, Azure Data Market
• AWS: AWS Lambdas, Partners
• Google: Google Charts
• Commercial: BigML, Dato, Predixion Marketplace, Tableau, Wolfram Language
• Open Source: D3

In some verticals, such as biomedical, it is common to have some form of academic data mining
or statistics work (data sets and / or data mining models) to use as a basis for creating
commercial machine learning solutions. One example is when you are turning that academic
research into commercial biomedical products. Given that, we’ll list the data mining languages,
libraries and tools that are commonly used in academic research. Traditional statistical tools
and languages, e.g. Matlab and Mathematica, also have high adoption in
the research sector.
 
ML Academic Languages, Tools and Libraries – some are open source and most have free
versions for academic research. Shown below is a chart that summarizes many of these items.
We have included the ‘Communities’ category because academic data science communities are at
the leading edge of work on improving open source tools and libraries and bear watching when you
are assessing the state of ML tools and products.
Each entry below is listed by category as Object: Notes.
Category: Languages
• R Language: Stats Language
• SciPy/NumPy/Pandas: Python Libraries for ML
• Matlab: Stats Language
• Mathematica: Stats Language
• Julia: Scalable Stats Language
• Mahout: ML for Hadoop
• Weka: Research Stats Language
Category: Tools
• R Studio: IDE for R
• Shiny for R: Visualization for R
• Weka Studio: IDE for Weka
• PyCharm: IDE for Python
• Sublime: IDE for Python and more
Category: Communities
• KDNuggets: Website
• Kaggle: Competition
• DataKind: Community
• Open Gov/Open Data: Community
• Code for America: Community
 
Accessible to Data Scientists & Business Users
	
  
A key question around the practicality of ML solutions for the enterprise is this: Who exactly will
develop the ML solutions in the enterprise? Given the diverse set of skills needed to successfully
implement any type of data science solution, much less the smaller (and even more
complex) subset around ML, the first part of the answer is the most critical: ML projects are
best implemented by a team of skilled professionals. Our answer to the common question “Do I
just need to hire a statistician to implement an ML project?” is an unqualified “No!” Commercial ML
differs substantially from ML for academic research. While the image of the lone scientist,
toiling away in his/her lab and carefully analyzing the results via complex statistical calculations,
is the heritage of ML, this image bears little relationship to the practicalities of implementing
ML in the enterprise.
While there is definitely a place for a dedicated statistician on an enterprise ML team, this is no
longer a requirement for all ML projects. That being said, ML tools complement (but do not
substitute for) statistical and data mining domain expertise. What has changed with the advent
of these tools is the ability of your key team members to work with others (business analysts,
decision makers, developers, DevOps, etc.) because the tools use common interfaces and
well-designed dataflow visualizations. Most tools are also cloud-based, which means no installation
or configuration and quick environment start-up time. Additionally, commercial tools are designed
to scale storage and processing via cloud capacity, enabling faster movement from small dataset
experiments to full-scale production deployments. Cloud-based tools are particularly well
suited for building quick proof-of-concept projects for the enterprise.
Given the democratization of tooling, you may be wondering whether this new tooling is
sophisticated enough for classically trained data scientists and academics to be able to make full
use of their complete skill sets. The answer is a conditional yes: some, but not all, commercial
products, such as Azure ML, integrate with commonly used statistical languages (R
Language and Python libraries) and allow re-use of scripts created in these languages.
 
Additionally, it’s important for researchers to have visibility into algorithms and algorithm
parameters, because this supports reproducibility of published experiment results. Shown below
is an Azure ML model that uses a two-class support vector machine to classify Tweets. Also of
note is the ability to use R language scripts in an ML workflow:
Figure 5 - Azure ML Experiment
	
  
 
Model evaluation is a key component of an ML experiment. Here is sample output from the Azure
ML model evaluation visualization. You’ll note that both score information (table) and graphical
output are included in the visualization:
Figure 6 - Azure ML Model Evaluation Output
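
The score table in an evaluation view of this kind is typically built from a confusion matrix. As a minimal sketch, assuming a two-class model that emits a probability per row and a cut-off of 0.5 (the function name, labels and scores are invented for illustration and are not taken from the figure), the usual metrics can be computed like this:

```python
def evaluate_binary(labels, scores, threshold=0.5):
    """Return confusion-matrix counts and derived metrics for a two-class model.

    labels: true classes (1 = positive, 0 = negative)
    scores: predicted probability of the positive class
    threshold: hypothetical cut-off; a score at or above it counts as positive
    """
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1

    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented sample data: eight rows with true labels and model scores.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.7, 0.8, 0.1]
metrics = evaluate_binary(labels, scores)
print(metrics)
```

Moving the threshold trades precision against recall, which is what the graphical part of an evaluation view usually lets you explore.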
 
For comparison, shown below is output from a sample Amazon ML model evaluation:
Figure 7 - Amazon ML Model Evaluation Visualization
	
  
 
Accessible to Developers & BI/DW Professionals
An interesting and somewhat unexpected aspect of enterprise ML projects is that having one or
more Big Data repositories is in no way a requirement for undertaking them. Because ML
originated in academic research using statistics and data mining, some of the most useful ML
projects are, in fact, based on applying these techniques to LOB data. You can think of it as
being able to ask different kinds of questions of your current data.
Understanding when to use ML (and when not to) relates directly to the definitions of business
and predictive analytics. Simply put, use ML when you want to ask business questions that will
result in probabilistic answers.
The ability to ask predictive questions of LOB data often yields useful results. For example, it
has been quite common to begin ML projects in sales and marketing departments, using CRM
data as a source for ML experiments that answer business questions like ‘what are the
characteristics of the customers who produce the most revenue?’ (Clustering) and ‘what type of
cross-sell opportunities can we introduce on our website based on known customer purchase
patterns?’ (Classification).
Another common ‘entry point’ for ML solutions in the enterprise is IT (log) data. Regulatory
(access auditing) and compliance requirements, as well as general security concerns, drive ML
experiments such as ‘at what day / time can I expect that network bandwidth usage will
spike to a particular level (value) for a particular segment of my corporate users?’ (Regression).
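
The bandwidth question is a regression problem: predict a numeric value from known inputs. The sketch below is a deliberately simplified illustration, assuming a single input (hour of day) and a straight-line relationship fitted by ordinary least squares. The function names and the usage figures are invented; real usage data would rarely be this tidy.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def hour_reaching(level, slope, intercept):
    """Return the hour at which the fitted line reaches the given usage level."""
    return (level - intercept) / slope


# Invented morning readings for one user segment: hour of day vs. Mbps used.
hours = [8, 9, 10, 11, 12]
mbps = [100, 140, 180, 220, 260]

slope, intercept = fit_line(hours, mbps)
predicted_at_13 = slope * 13 + intercept
spike_hour = hour_reaching(300, slope, intercept)
print(slope, intercept, predicted_at_13, spike_hour)
```

Commercial ML tools wrap this kind of fit (and far richer models) behind a visual workflow, but the question being asked of the data is the same.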
 
In general, the enterprise can find value in appropriately applying predictive analytics via ML
solutions to a broad spectrum of domains. In addition to sales and marketing or DevOps,
enterprises can apply ML to other scenarios for which probabilistic analysis would yield useful
results. For example, questions such as these can now be addressed:
• Which employee attributes correlate most closely with the highest revenue
production by that employee’s team?
• At what future point in time (value) do our customers in a certain segment (e.g.
demographic, geographic…) tend to make a subsequent purchase?
• What groups (trial or free items) of our public resources (website, GitHub, YouTube…)
tend to be used by browsers who become our customers?
As mentioned, integrated tooling provided by commercial vendors simplifies deploying and
embedding ML model results into enterprise applications via ‘publish as a web service’
functionality. Given that relatively few enterprise application developers have familiarity with,
much less expertise in, ML languages, tools and libraries, using commercial ML tools that include
‘click to publish’ functionality significantly shortens time to market.
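
From the application developer’s side, a published model usually looks like any other HTTP endpoint that accepts JSON. The following sketch shows the general shape of such a call using only the Python standard library. The endpoint URL, the API key, the `inputs` payload layout and the feature names are all hypothetical placeholders and do not correspond to any particular vendor’s API; consult the vendor’s documentation for the real request format.

```python
import json
import urllib.request


def build_scoring_request(endpoint_url, api_key, features):
    """Build (but do not send) a POST request for a published scoring service.

    The payload layout {"inputs": [features]} is an assumption made for
    illustration only.
    """
    body = json.dumps({"inputs": [features]}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


# Hypothetical endpoint and feature values for a single prediction.
request = build_scoring_request(
    "https://example.com/score",
    "YOUR_API_KEY",
    {"region": "west", "prior_purchases": 3},
)
print(request.get_method(), request.full_url)

# Sending the request would look like this (left commented out because the
# endpoint above is a placeholder):
# with urllib.request.urlopen(request) as response:
#     prediction = json.loads(response.read())
```

The point is that the consuming application needs no ML libraries at all, only the ability to make a web request.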
Another advantage of using commercial ML tools for the enterprise is the built-in connectors to
disparate incoming data sources. Given that it is increasingly common to use a broad variety of
data sources as ML ingest sources, the availability of pre-built connectors once again speeds
development cycles. It is common to include connectors for LOB data, i.e. RDBMS systems (both
on-premises and cloud-based), as well as for some of the newer NoSQL databases, Hadoop, and
one or more types of incoming data stream.
 
Also useful are the quick statistical snapshots that most commercial ML tools provide of datasets
in your ML project. For example, the AWS ML dataset console view includes the visualization
shown below:
Figure 8 - AWS ML Datasources Attribute Information
The AWS viewer allows the ML team to ‘see’ not only the attribute names but also the
correlations, the uniqueness of the data, and the most/least frequent categories. It also includes
an inline ‘Preview’ visualization of the uniqueness of the data.
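
To make the idea of a statistical snapshot concrete, here is a minimal sketch of the kind of per-attribute summary such a viewer presents. It is not how the AWS console computes its figures; the function name, the `plan` attribute and its values are invented for illustration.

```python
from collections import Counter


def summarize_attribute(values):
    """Summarize one categorical attribute: missing count, distinct values,
    and the most and least frequent categories."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    ranked = counts.most_common()  # sorted from most to least frequent
    return {
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_frequent": ranked[0][0] if ranked else None,
        "least_frequent": ranked[-1][0] if ranked else None,
    }


# Invented sample column: a subscription plan recorded per customer row.
plan = ["basic", "pro", "basic", None, "enterprise", "basic", "pro"]
summary = summarize_attribute(plan)
print(summary)
```

A snapshot like this is often enough to spot an attribute that is mostly missing or that has too many distinct values to be useful before any model is trained.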
As mentioned, integrated commercial ML tooling that includes ‘one-click’ deployment capabilities
increases usability for developers and BI professionals. Additionally, marketplaces that
essentially advertise published ML web services, such as Microsoft Azure Data Market, make
those services more discoverable and usable; commerce opportunities for published services are
also emerging. An example is shown below.
 
Figure 9 - Azure Machine Learning Test Harness
 
Visualization of results is another element of ML solution usability. To that end, we’ve included
a sample from IBM Watson Analytics. This service includes flexible visualizations at all phases
of the ML process (e.g. data discovery, modeling, etc.). An example is shown below.
	
  
Figure 10 - IBM Watson ML Visualization
	
   	
  
 
Our last example of model visualization is from the commercial cloud-based vendor BigML and
is shown below. Also interesting is how vendors such as BigML foster community by providing
a platform for their users to get more value from their ML models. You’ll note that BigML allows
users to upload, share, rate and also sell models for use by others in their own ML scenarios.
	
  
Figure 11 - BigML Model Visualization
	
  
 
Key Takeaways
Incorporating the results of machine learning experiments into production data solutions adds
significant complexity to the overall project. Given this, a solid understanding of technology
choices around machine learning solutions is essential for designing and delivering solutions
that provide business value to the organization.
• Use commercial machine learning products when team members new to
machine learning processes are creating your solution. ML projects introduce a
set of complex processes into the enterprise because they differ fundamentally at every
stage of the data pipeline, i.e. data preparation, hypothesis formation, algorithm
selection, model training and evaluation. If your data paradigm consists of an OLTP store
alone, you would be best served by leveraging commercial ML development suites, rather
than attempting to cobble together solutions based on tools and libraries that were built
primarily for statisticians.
• Select tools or coding libraries that perform at the speed and scale of data
ingest and processing that the machine learning methods for your business
problems require. Enterprises will benefit from leveraging cloud storage and
processing of Big Data workloads as sources for ML solutions because their data
volumes are generally significantly larger than those of academic research. Also,
in-memory streams are increasingly relevant, particularly with the advent of more and more
IoT scenarios.
• Teams that have already implemented pure open source data solutions are
most capable of adding pure open source machine learning solutions.
Domains where data mining and/or statistics may have already been in use, such as
academic research, will have more success using open source tools and libraries, so long as
their input data does not overrun the capabilities of those tools.
• Plan for and test your model deployment topology to ensure ML experiments
deliver production business value. Given the common challenges around deployment
of ML models, commercial vendors are incorporating one-click deployment functionality
into their ML studio environments; such functionality enables faster time to market for
production solutions. Also consider the vendor path to implementing streaming or
near-real-time ML solutions if that is part of your requirements.
• Select tools or plan for coding appropriate types of visualization solutions.
ML outputs are unfamiliar to many business users. Standard reports and
dashboards have not been designed to display ML results in a meaningful way. Selecting
ML vendors that integrate results easily into other commercial solutions or common
libraries results in broader usability for ML solutions.
	
  
 
References and Resources
This section lists the references and resources referred to in this article.

Data Science graphic -- http://civicscience.com/data-science-a-visual-guide/
Shiny for R-Studio -- http://shiny.rstudio.com/gallery/movie-explorer.html
Deep Learning and the Hololens -- https://technoptimist.wordpress.com/2015/01/25/deep-learning-and-the-hololens
Collection of papers on how IBM Watson works -- http://www.andrew.cmu.edu/user/ooo/watson/
What is AI? -- http://www.techopedia.com/definition/190/artificial-intelligence-ai
How Google is Teaching Computers to See -- https://gigaom.com/2012/06/25/how-google-is-teaching-computers-to-see/
Need Deep Learning? Here are 4 Lessons from Google -- https://gigaom.com/2015/01/29/new-to-deep-learning-here-are-4-easy-lessons-from-google/
Getting started with AWS ML -- http://docs.aws.amazon.com/machine-learning/latest/dg/tutorial.html
AzureML on Windows Azure DataMarket / Binary Classifier Sample -- https://datamarket.azure.com/dataset/aml_labs/log_regression
BigML Sample Model -- https://bigml.com/user/ashikiar/gallery/model/53b2f21ec8db635905000d33
Kaggle Community -- https://www.kaggle.com/
DataKind Community -- http://www.datakind.org/
 
Table of Abbreviations
Abbreviation   Full Term
AI             Artificial Intelligence
AWS            Amazon Web Services
BI             Business Intelligence
CRM            Customer Relationship Management
DW             Data Warehouse
GPU            Graphics Processing Unit
IoT            Internet of Things
LOB            Line of Business
ML             Machine Learning
NoSQL          No SQL
OLAP           Online Analytical Processing
OLTP           Online Transactional Processing
POC            Proof-of-concept
RDBMS          Relational Database Management System
 
About Lynn Langit
Lynn Langit is a Big Data and Cloud Architect who has been working with database solutions for
more than 15 years. Over the past 4 years, she’s been working as an independent architect using
these technologies, mostly in the biotech, education, manufacturing and facilities verticals. Lynn
has done POCs and has helped teams build solutions on the AWS, Azure, Google and Rackspace
Clouds. She has done work with SQL Server, MySQL, AWS Redshift, AWS MapReduce, Cloudera
Hadoop, MongoDB, Neo4j, Aerospike and many other database systems. In addition to building
solutions, Lynn also partners with all major cloud vendors, providing early technical
feedback into their Big Data and Cloud offerings. She is an AWS Community Hero, Google
Developer Expert (Cloud), Microsoft MVP (SQL Server) and a MongoDB Master. Lynn is also a
Cloudera certified instructor (for MapReduce Programming).
Prior to re-entering the consulting world 3 years ago, Lynn spent over 10 years as a
Microsoft Certified instructor and Microsoft vendor, followed by 4 years as a Microsoft employee. She’s
published 3 books on SQL Server Business Intelligence and has most recently worked with the
SQL Azure team at Microsoft. She continues to write and screencast and hosts a BigData channel
on YouTube (http://www.youtube.com/SoCalDevGal) with over 150 different technical videos
on Cloud and BigData topics. Lynn is also a committer on several open source projects
(http://github.com/lynnlangit).
About Mark Tabladillo
Mark Tabladillo is a Senior Data Scientist at midtown Atlanta's Predictix/LogicBlox. He has used
and promoted Microsoft Azure Machine Learning, Microsoft SQL Server Data Mining, Microsoft
BI Stack, Power BI, SAS, SPSS, R, and Julia. He is a SQL Server MVP and has a research
doctorate (PhD) from Georgia Tech. He is chapter leader for PASS Data Science Virtual Chapter,
which has periodic live meetings and its own YouTube channel.

Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Practical Machine Learning

  • 1. Practical Machine Learning: discerning differences and selecting the best approach. Lynn Langit. Reviewed by Mark Tabladillo.
  • 2. TABLE OF CONTENTS
Executive summary, 3
Introduction, 3
Concepts, 6
Process and Practicalities, 15
Accessible to Data Scientists & Business Users, 20
Accessible to Developers & BI/DW Professionals, 24
Key Takeaways, 30
References and Resources, 32
Table of Abbreviations, 33
About Lynn Langit, 34
About Mark Tabladillo, 34
  • 3. Executive summary
The formal definition of Machine Learning is this: the ability of computing systems to gain knowledge from experience. Practical ML enables your organization to answer business questions more effectively because of that experience. Machine Learning solutions consist of your input data built into models, which combine that data with statistical and data mining algorithms. Until relatively recently, applied ML (as contrasted with ML for research) was simply too specialized, difficult and expensive to see broad adoption outside of the academic community and a few commercial domains (finance, ad serving). However, improvements in languages and libraries, as well as new commercial offerings (including cloud-only products), have greatly increased the practicality of implementing ML applications. Demand has also been fueled by Big Data: more data encourages more powerful methods of processing to gain understanding from that data. This report discusses technologies and implementation approaches for creating enterprise data solutions that include one or more machine learning components. It also details the tradeoffs of each solution and considers which approach best fits organizational needs.
Introduction
The term ‘Predictive Analytics’ is used somewhat interchangeably with Machine Learning. The central idea is that Machine Learning enables the creation of important business insights by analyzing some set of input data with one or more data mining or statistical algorithms.
Where Machine Learning is used
In some sectors, particularly academic research, statistical analysis and data mining have been standard analytical techniques for years. These sectors tend to use open source languages, tools and libraries.
Academics commonly use specialty coding languages such as R, or Python libraries (SciPy/NumPy/Pandas), rather than enterprise languages such as Java, for their ML research projects. Also, researchers tend to work with wide (many attributes) and shallow (relatively small sample sizes) datasets. This academic dataset size is significant because many of the commonly
  • 4. used tools, such as R Studio or even Weka, are designed for small (albeit rich) datasets. They are limited to datasets that fit in the memory of an analyst’s desktop computer and do not scale to datasets that require server or cloud-scale processing power. In a few commercial sectors, such as finance (for example, credit scoring) and security (for example, email spam detection), use of ML (via data mining) is not a new approach. In these areas, highly specialized tools and specially trained professionals have supported these types of solutions. These vertical-specific ML solution development cycles cost hundreds of thousands or even millions of dollars to implement. The costs include software licenses, powerful hardware, proprietary development and management tools and consulting fees. These types of projects have also commonly taken months or even years to implement. However, the ML market landscape is rapidly changing with the availability of Big Data/cloud storage, processing and data pipelines. These new services enable faster and cheaper data collection, storage and processing. The growth of IoT (mostly sensor) data is also increasing the volume of data available for analysis. These market changes are making the overall ‘entry point’ for ML projects less risky, i.e. cheaper and faster. Another driver of adoption is the effort that commercial vendors are putting into creating usable ML tooling, most of which runs on that particular vendor’s cloud infrastructure (such as IBM Watson on Bluemix, Microsoft Azure ML on Azure or Amazon ML). ML projects are increasingly seen as a realistic possibility given the larger market landscape. Simply put, more data means a need for more powerful methods of deriving meaning from increasingly large and complex datasets. Enter the democratization of Machine Learning.
Challenges to Adoption
Although tools are reducing the complexity of applying the power of statistical and data mining techniques to increasingly large data sets, the enterprise market is in the early stages of ML adoption. One of the key blockers is complexity: creating useful predictive analytics or ML differs substantially from more traditional business analytics. Because the market for technical professionals skilled in applied statistics and data mining has traditionally been small, we are faced with a lack of trained, working
  • 5. professionals who can produce useful results in this area. Specifically, we lack people with experience in the tasks of the enterprise ML solution lifecycle: cleaning and grooming the input data, selecting appropriate techniques and algorithms, building and evaluating models, and supporting the move of their work to production. Vendors are stepping in to reduce this gap. Several major commercial vendors have launched general-purpose machine learning suites this year. As mentioned, the majority of these new offerings are cloud-based. Some solutions let you train, test and deploy either in the cloud or on premises, while other solutions, such as BigML, are cloud-only.
  • 6. Concepts
Taxonomies and terms for Machine Learning solutions have important and nuanced differences in meaning; understanding them properly is key to differentiating the products and solutions available in the ML space. We’ll start by defining the associated technologies.
What is the difference between business analytics and predictive analytics?
Business Analytics is defined as finding answers to business questions by querying data and producing a definite result or result set. For example: “What are the top five items found in the shopping basket of a 38-year-old man from California who is shopping on a Saturday at 5pm at a major grocery chain?” The answer to this question (via a query to source data) is a deterministic result set, usually shown as a report or a dashboard. For many organizations, this is the only type of analytics they have available. Stated differently, business analytics are used to analyze “what has happened” for past events. Predictive Analytics is defined as finding answers to business questions by applying one or more probabilistic algorithms to some set of input data and producing one or more probabilistic results. For example: “Consider the items which appear together in the shopping baskets of all 38-year-old men from California who are shopping on a Saturday at 5pm at any of the major grocery chain stores for which we have data, and predict how many of a given item from this set the stores should have on hand to ensure proper supply for this type of customer.” In this case, the type of algorithm is regression, because it is used to predict a future value or set of values. To get a result, one or more regression algorithms (for example, linear regression) are applied to the source data. Because the results are probabilistic, i.e.
a percentage or score of the likelihood of a result, it is common to use more than one algorithm and then to evaluate the quality of each result. This process is called ‘evaluating the model.’ The best result from the models is selected and is presented either via statistical output (probability) or via a customized visualization. Stated differently, predictive analytics are used to analyze “what will happen” for potential or future events. The graphic below illustrates and
  • 7. contrasts sample results in business and predictive analytics.
Figure 1 - Two Types of Analytics
What is the difference between data mining and predictive analytics?
Data Mining encompasses a broader set of tasks than predictive analytics does. In addition to regression algorithms, data mining also includes other types of predictive analysis. Specifically, finding groupings in the source data by matching new data to existing labeled (or categorized) data is called classification. Classification algorithms are characterized as ‘supervised’ because an authoritative set of labeled data is used, in addition to the algorithm, to process the input data. For example: “In a set of data there are examples of pictures or drawings of objects that we’ve identified and labeled as particular animals, i.e. ‘this is a picture of a dog and that is a picture of a cat.’” A classification task is to evaluate the likelihood of a new picture being a dog or a cat based on pattern matching to the set of known states. An example of a classification algorithm is decision trees. Of note is that regression is also ‘supervised’ because a data set with ‘known values’ is used in conjunction
  • 8. with the application of the regression algorithm when evaluating the probability of a result using new input data. Discovering natural groupings in source data for which there are no known states or labels is called clustering. Since there are no known states when clustering algorithms are used, this type of machine learning is called ‘unsupervised’. An example of this technique is ‘here are some pictures, group them into subsets based on characteristics (or labels) that are discovered during the process of running the algorithm.’ As with the other types of ML, when implementing clustering it is common to use multiple clustering algorithms, such as k-means, then to evaluate the model results and finally to select the top-performing algorithm and model for the particular business problem.
What is the difference between predictive analytics and machine learning?
Machine Learning is evolving to support the increasing volumes, varieties and velocities of Big Data projects, rather than the smaller, simpler datasets that typified data mining projects, particularly in academia. Another way to understand ML is as the next generation of data mining. Machine learning is a superset of predictive analytics because it involves more than the application of one or more predictive analytic techniques (and associated algorithms) to sets of input data. Another consideration is the current push toward commercial ‘productization’ of machine learning applications. Although data mining and statistical analysis have been widely used in particular domains, the broadest application, academic research, is implemented quite differently from commercial applications. Specifically, there are many steps in data preparation for predictive analytics (or ML) projects that differ from the data preparation common in business analytics projects.
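The ‘unsupervised’ clustering idea described above can be sketched in a few lines. This is a minimal illustration of k-means written in plain Python, not the implementation used by any product named in this report; the sample points, the starting centroids and the function name `kmeans` are assumptions made for the example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical, unlabeled 2-D observations (no 'known states' are supplied).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Assumption: k = 2, seeded with two of the observations.
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

No labels are passed in: the two groups emerge from the distances alone, which is the difference from the supervised classification and regression cases. In practice the number of clusters and the starting centroids are themselves parameters to iterate on.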
Steps to prepare input data for predictive analytics include such tasks as the following:
• Evaluating data types and detecting or creating labels (for classification)
• Evaluating the number / ratio of null values
  • 9.
• Evaluating the quality / usefulness of input data based on statistical analysis (mean, mode, etc.)
• Removing outlier values (exceptions)
• Creating groupings (called ‘bucketing’)
Commercial tools provide data visualizers, which assist with data quality assessment at this stage and also facilitate easy modification of the input data. After the data preparation tasks have been completed, there is a 3-step process to implement a machine learning solution or model. It is quite common for the modeling process to be iterative (because the outputs are probabilistic) during the model creation phase. Iterations often include returning to the data preparation phase, because adjusting the quality of the input data impacts the outputs. The need for iteration over increasingly large data sets marries nicely with the scalability of cloud-based ML solutions. These steps include the following:
• Input Data
o Ingest – in this step you ingest source data. Common ingest methods are file-based and database-based. Increasingly, accepting streaming input is a requirement.
o Evaluate & Clean – in this step you review the input data (often using statistical analysis) and tune that data so that it is prepared for inclusion in one or more ML models
• Model
o Select ML Algorithm and Initialize Model(s) – in this step you match the business question and input data to an ML technique (regression, classification or clustering) and to one or more algorithms from within that technique (such as linear regression, decision trees or k-means clustering) to evaluate the possibility of building a useful model with this information
  • 10.
o Train Model(s) – in this step you create the model and load it with data; you then process the model and view the output
o Score Model(s) – in this step you evaluate the effectiveness of model results vs. the ‘random guess’ line to understand the potential use of the model(s) for future prediction, classification and clustering tasks
• Predict
o Perform Prediction – in this step you evaluate new data against the model in order to predict the likelihood of selected results.
These steps are often performed iteratively, as model scoring differentiates between multiple models. You may decide to repeat some or all of the cycle with slightly different input data, different algorithms, different algorithm parameters, etc., in order to produce one or more ‘useful’ models. Wizards and visualization tools found in ML products speed up these iterative cycles. Shown below is an open source project for RStudio called Shiny. Shiny is used by many R developers because it allows them to quickly and easily visualize (and query) models they created in the R programming language. Note the use of input parameters via slider bars and text boxes. These controls allow the ML developer to ‘try out’ different values in evaluating the usefulness of their model. Lightweight visualization tools for rapid iteration are particularly valuable for ML scenarios.
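The Input Data, Model and Predict steps listed above can be walked through end to end on a toy dataset. The following is a minimal sketch in plain Python under stated assumptions: the records are invented, the outlier rule (keep rows whose y/x ratio is within 50% of the median ratio) is an arbitrary choice for this data, and a naive ‘always predict the training mean’ baseline stands in for the ‘random guess’ comparison. It is not the workflow of any specific product.

```python
from statistics import mean, median

# Ingest: hypothetical (store_visits, units_sold) records; None marks a missing value.
raw = [(10, 21), (20, 39), (30, 61), (40, None),
       (50, 99), (60, 121), (70, 500), (80, 161)]

# Evaluate & clean: measure the null ratio, drop incomplete rows, then drop outliers.
null_ratio = sum(1 for _, y in raw if y is None) / len(raw)
rows = [(x, y) for x, y in raw if y is not None]
typical = median(y / x for x, y in rows)
# Assumed rule for this toy data: keep rows whose y/x ratio is within 50% of the median.
rows = [(x, y) for x, y in rows if abs(y / x - typical) <= 0.5 * typical]

# Hold out the last two rows so the model is scored on data it has not seen.
train, test = rows[:-2], rows[-2:]

def fit_line(data):
    """Train: ordinary least squares for y = slope * x + intercept."""
    x_bar = mean(x for x, _ in data)
    y_bar = mean(y for _, y in data)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in data)
             / sum((x - x_bar) ** 2 for x, _ in data))
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(train)

def predict(x):
    """Apply the trained linear model to one input value."""
    return slope * x + intercept

def mean_abs_error(predictor, data):
    """Average absolute difference between predicted and actual values."""
    return mean(abs(predictor(x) - y) for x, y in data)

# Score: compare the model against a baseline that always predicts the training mean.
baseline = mean(y for _, y in train)
model_mae = mean_abs_error(predict, test)
baseline_mae = mean_abs_error(lambda _x: baseline, test)

# Predict: evaluate new, unseen input against the trained model.
forecast = predict(100)
print(f"null ratio={null_ratio:.3f} model MAE={model_mae:.2f} baseline MAE={baseline_mae:.2f}")
print(f"forecast for x=100: {forecast:.1f}")
```

If the model did not beat the baseline, the next iteration would typically revisit the cleaning rule, the held-out split or the choice of algorithm, which is the loop the text describes.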
  • 11. Figure 2 - Visualization of R results using Shiny
Is data science the same thing as machine learning?
Data science is a superset of Machine Learning: in addition to all of the tasks described in the previous section, data science also includes hypothesis formation, or more simply, ‘asking the right question(s).’ Data science, as shown in the graphic, involves domain expertise, healthy curiosity, scientific thinking, and an understanding of math, statistics, algorithms, input data sets and visualization. Increasingly, a team of people in the enterprise is responsible for data science projects, because the skill sets needed are simply not found in any one or two people. These teams also benefit from using enterprise-grade tools, which facilitate communication and other
  • 12. enterprise needs, such as security, source control and others.
Figure 3 - Skills needed for Data Science
What is Artificial Intelligence and how does it relate to machine learning?
An AI (Artificial Intelligence) solution contains one or more intelligent agents. AI intelligent agents automate tasks that would normally require a highly trained person. Examples of this type of task are speech recognition and translation. An AI system is one that responds to complex problems in a human-like way. A well-known recent AI success is the celebrated win of the IBM Watson AI system against two top human players in the TV trivia game show Jeopardy!.
  • 13. In some ways, AI has more to do with process automation than learning, because AI systems ingest vast amounts of source data and perform iterative ML processes, often over a period of years. In practice, AI includes a number of ML components, so that the system and its processes can be increasingly optimized, or can learn, over time. You can see commercial applications of AI systems in domains as disparate as medical diagnostics, self-driving cars, face and speech recognition and bank fraud detection.
What is Deep Learning and how does it relate to machine learning?
Deep Learning is a relatively new aspect of Machine Learning. It is a set of ML algorithms that attempt to model high-level abstractions in data by using multiple non-linear transformations. Deep Learning focuses on improving the efficiency of unsupervised or semi-supervised feature learning algorithms. It is based on research in human neuroscience, such as human neural coding. The algorithms are deep neural networks, and problem sets include computer vision, natural language processing and speech recognition. Deep Learning has also been called the new definition of the ‘neural networks’ data-mining algorithm. Advances in hardware, particularly in GPU computational capabilities, have facilitated the use of Deep Learning, as they have enabled model-processing times to shrink from weeks or days to a more practical level, i.e. minutes. However, given the computational intensity, processing-time requirements still limit the widespread application of Deep Learning algorithms. Deep Learning has also been called ‘strong AI’ because of its potential to disrupt a large number of processes.
Major software companies are investing millions of dollars in research on improving the usability of Deep Learning in their own core products (such as their voice recognition systems Google Now, Microsoft Cortana and Apple Siri, and other products). Although the potential of Deep Learning is exciting, the reality is that, because of the time, cost, complexity and skills needed, broad application of its results is still limited to experimental and (mostly) research projects at a small subset of companies, such as Google, IBM and Microsoft.
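The phrase ‘multiple non-linear transformations’ in the Deep Learning discussion above can be made concrete with a small forward pass. This is only a sketch in plain Python: the layer sizes and weights are arbitrary, untrained values chosen for illustration, and no learning takes place. It shows the shape of the computation, not how a production deep network is built or trained.

```python
import math

def relu(z):
    """Non-linear activation: negative values are clipped to zero."""
    return max(0.0, z)

def sigmoid(z):
    """Non-linear activation: squashes any value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One layer: a weighted sum of the inputs plus a bias, then a non-linearity."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Arbitrary, untrained weights (assumptions for illustration only).
LAYERS = [
    ([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]], [0.0, 0.1, -0.1], relu),  # 2 inputs -> 3 units
    ([[0.7, -0.5, 0.2], [-0.4, 0.9, 0.3]], [0.05, -0.05], relu),       # 3 units -> 2 units
    ([[1.2, -0.7]], [0.0], sigmoid),                                   # 2 units -> 1 score
]

def forward(features):
    """Pass the input through each layer in turn; 'deep' refers to this stacking."""
    activations = features
    for weights, biases, activation in LAYERS:
        activations = dense(activations, weights, biases, activation)
    return activations

output = forward([1.0, 2.0])
print(output)
```

Training such a network means adjusting every weight over many passes through the data, which is where the GPU-driven reductions in processing time mentioned above matter.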
  • 14. What is the importance of real-time analytics?
Broader adoption of technologies such as in-memory databases and streaming Hadoop (Spark Streaming, Storm and Samza), along with new types of data providers, e.g. IoT data input devices, is increasing the demand for real-time analytics as a category. In addition, the creation of cloud-based data pipeline libraries and products enables more complex conduits for incoming data, including routes through multiple processing pipelines. Along with these advances in real-time Big Data technologies comes demand for products that enable rapid creation of solutions that also include real-time predictive analytics. Major software vendors are creating consumer products and services, such as adaptive voice input (Google Now, Microsoft Cortana and Apple Siri), that use real-time predictive analytics. These types of applications are igniting consumer imagination and fueling demand in general.
  • 15. Process and Practicalities
Let’s take a deeper look at the processes involved in creating commercial machine learning solutions. We do so because, as mentioned, the process for creating useful commercial predictive analytics is quite different from that of creating business analytics. Digging into the detailed processes will help us understand the usability of the libraries, tools and products currently available. Business data projects are driven by the need to gain more or better business insights. Given that, what types of use cases can machine learning solutions address? Remembering the core functionality of ML, i.e. predicting one or more discrete future values, classifying or labeling new data into known groups and/or detecting natural groups in new data, here is a short list of some common types of use cases:
• Facilities & Manufacturing -- Smart Buildings, Predictive Maintenance
• Sales & Marketing -- Demand Forecasting, Churn Analysis, Target Advertising
• Biomedical -- Life Science Research, Healthcare outcomes (patient re-admission rates)
• Security -- Fraud Detection, Network Intrusion Detection
• Logistics -- Routing
As mentioned, the steps involved in creating an end-to-end machine learning solution include a number of considerations. Before the advent of cloud-based data storage, pipelines and machine learning model tooling, the costs involved in creating what were then called data mining solutions blocked many enterprises. These costs included high hardware and software license fees (often well over $100k; up to $1 million simply to start what was often a multi-year project was not unheard of).
Additionally, the costs of re-training staff or hiring specialty consultants to implement the data mining projects added to project costs and complexity. Prior to cloud-based data storage and cloud-based data pipeline products, the costs associated with unearthing enterprise data from the various (and often proprietary) on-premises data silos added to the adoption blockers. Yet another blocker to implementing traditional data mining was that the domain of
  • 16. business analysts (or, in some cases, statisticians) was wholly separated from that of the developers who would be charged with creating application interfaces for the results of the data mining work produced by the business analysts. Cloud storage combined with new types of Big Data storage has driven overall enterprise data volumes up dramatically. Increasingly large and complex data sets are becoming progressively more difficult for the enterprise to analyze in a meaningful way. Driven by particular sectors, such as the ML analysis of massive amounts of behavioral data collected in social gaming (Angry Birds, Halo, etc.), the enterprise appetite for getting started with ML projects has increased sharply over the last 12 months. Although the landscape is improving due to the release of improved open source libraries and tools as well as new commercial tools, for most enterprises ML projects are a new type of analytics. Given that, for traditional enterprises, the newly released set of cloud-based ML tools and services, such as Azure ML, IBM Watson, Predixion Software, AWS ML, BigML and others, is a welcome complement to the existing (mostly open source) languages, libraries and tools. Another new item in the emerging ecosystem of tools and products designed to support enterprise ML projects is the commercial data market. IBM, Microsoft and Predixion Software all include the ability to directly ‘publish’ the results of one or more useful ML experiments into their cloud-based repository or marketplace. Technically, most enable the ML experiment to be published as a REST-based web service endpoint. Interestingly, cloud vendors are leveraging integration with their own cloud services. For example, Amazon ML includes the ability to enable real-time ML via a one-button click, as shown in the screenshot below.
This real-time capability is integrated with AWS S3 storage. AWS ML integrates with S3, RDS or Redshift at this time.
  • 17. Figure 4 - Amazon ML Model Usage Options
This functionality not only facilitates quick and easy deployment of commercial ML services to production, but also has the interesting implication of giving the enterprise a commercial platform from which it can monetize the results of its ML experiments by making those results available as a commercial offering.
  • 18. Shown below is a chart that lists many of the major offerings, either commercial or open source.
Phase | Azure | AWS | Google | Commercial | Open Source
Ingest | Stream Insight | Kinesis | Big Query | Data Torrent | Flume
Pipeline | Data Pipeline | Data Pipeline | Data Pipeline | Data Torrent | Kafka
Storage | BLOB, Document DB, SQLAzure, HDInsight | S3, Dynamo DB, RDS – SQL, Redshift, EMR | BLOB, H/R Datastore, MySQL, Hadoop on GCE | SAS | NoSQL, Hadoop
Create Predictive Models | Azure ML, Revolution Analytics for R Language | AWS ML | Prediction API | SAS, IBM Watson, Predixion Software, BigML, Matlab, Mathematica, PredictionIO… | R, Mahout, Python, Pandas, Weka
Predictive Results Publication and/or Visualization | Excel, Power BI, Gateway, PowerView, Azure Data Market | AWS Lambdas, Partners | Google Charts | BigML, Dato, Predixion Marketplace, Tableau, Wolfram Language | D3
In some verticals, such as biomedical, it is common to have some form of academic data mining or statistics work (data sets and/or data mining models) to use as a basis for creating commercial machine learning solutions. One example is turning academic research into commercial biomedical products. Given that, we’ll list data mining languages, libraries and tools that are commonly used in academic research. Traditional statistical tools and languages, e.g. Matlab and Mathematica, also have high adoption in the research sector.
ML Academic Languages, Tools and Libraries

Some of these are open source, and most have free versions for academic research. The chart below summarizes many of them. We have included a Communities category because academic data science communities are at the leading edge of work on improving open source tools and libraries, and they bear watching when you assess the state of ML tools and products.

Category | Objects | Notes
Languages | R Language | Stats Language
Languages | SciPy/NumPy/Pandas | Python Libraries for ML
Languages | Matlab | Stats Language
Languages | Mathematica | Stats Language
Languages | Julia | Scalable Stats Language
Languages | Mahout | ML for Hadoop
Languages | Weka | Research Stats Language
Tools | R Studio | IDE for R
Tools | Shiny for R | Visualization for R
Tools | Weka Studio | IDE for Weka
Tools | PyCharm | IDE for Python
Tools | Sublime | IDE for Python and more
Communities | KDNuggets | Website
Communities | Kaggle | Competition
Communities | DataKind | Community
Communities | Open Gov/Open Data | Community
Communities | Code for America | Community
Accessible to Data Scientists & Business Users

A key question around the practicality of ML solutions for the enterprise is this: who exactly will develop the ML solutions in the enterprise? Given the diverse set of skills needed to successfully implement any type of data science solution, much less the smaller and even more complex subset around ML, the first part of the answer is the most critical: ML projects are best implemented by a team of skilled professionals. Our answer to the common question “Do I just need to hire a statistician to implement an ML project?” is an unqualified “No!”

Commercial ML differs substantially from ML for academic research. While the image of the lone scientist, toiling away in his or her lab and carefully analyzing the results via complex statistical calculations, is the heritage of ML, this image bears little relationship to the practicalities of implementing ML in the enterprise. While there is definitely a place for a dedicated statistician on an enterprise ML team, this is no longer a requirement for all ML projects. That being said, ML tools complement (but do not substitute for) statistical and data mining domain expertise.

What has changed with the advent of these tools is the ability for your key team members to work with others (business analysts, decision makers, developers, DevOps, etc.) because the tools use common interfaces and well-designed dataflow visualizations. Also, most tools are cloud-based, which means no installation or configuration and quick environment start-up time. Additionally, commercial tools are designed to scale storage and processing via cloud capacity, enabling faster movement from small dataset experiments to full-scale production deployments. Cloud-based tools are particularly well suited for building quick proof-of-concept projects for the enterprise.
Given the democratization of tooling, you may be wondering whether this new tooling is sophisticated enough for classically trained data scientists and academics to make full use of their complete skill sets. The answer is a conditional yes: some, but not all, commercial products, such as Azure ML, integrate with commonly used statistical languages (R Language and Python libraries) and allow re-use of scripts created in these languages.
Additionally, it’s important for researchers to have visibility into algorithms and algorithm parameters. This is important for reproducibility of published experiment results. Shown below is an Azure ML model that uses a two-class support vector machine to perform classification (of Tweets in this sample). Also of note is the ability to use R Language scripts in an ML workflow:

Figure 5 - Azure ML Experiment
Model evaluation is a key component of an ML experiment. Here is sample output from the Azure ML model evaluation visualization. You’ll note that both score information (table) and graphical output are included in the visualization:

Figure 6 - Azure ML Model Evaluation Output
For comparison, shown below is output from a sample Amazon ML model evaluation:

Figure 7 - Amazon ML Model Evaluation Visualization
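The score tables in evaluation views like the ones above are typically built from a handful of standard metrics. As a minimal sketch (not the implementation used by Azure ML or Amazon ML), the following Python computes confusion-matrix counts, accuracy, precision, recall and F1 for a two-class model. The labels and predictions are made-up illustration data, and the function name `evaluate_binary` is hypothetical.

```python
from typing import Dict, Sequence


def evaluate_binary(actual: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Compute common two-class evaluation metrics (1 = positive, 0 = negative)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    pairs = list(zip(actual, predicted))
    # Confusion-matrix counts
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": accuracy, "precision": precision,
        "recall": recall, "f1": f1,
    }


# Made-up labels and predictions, for illustration only
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = evaluate_binary(actual, predicted)
print(metrics)
```

Which metric matters most depends on the business question: precision is the one to watch when false positives are costly, recall when missed positives are costly.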
Accessible to Developers & BI/DW Professionals

An interesting and somewhat unexpected aspect of ML enterprise projects is that having one or more Big Data repositories is in no way a requirement for undertaking this type of project. Due to the origins of ML, i.e. academic research using statistics and data mining, some of the most useful ML projects are, in fact, based on application of these techniques to LOB data. You can think of it as being able to ask different kinds of questions of your current data.

Understanding when to use ML (and when not to) relates directly to the definitions of business and predictive analytics. Simply put, use ML when you want to ask business questions that will result in probabilistic answers.

The ability to ask predictive questions of LOB data often yields useful results. For example, it has been quite common to begin ML projects in sales and marketing departments, using CRM data as the source for ML experiments that involve answering business questions like ‘what are the characteristics of the customers who produce the most revenue?’ (Clustering) and ‘what type of cross-sell opportunities can we introduce on our website based on known customer purchase patterns?’ (Classification).

Another common ‘entry point’ for ML solutions in the enterprise is IT (log) data. Regulatory (access auditing) and compliance requirements, as well as general security concerns, drive ML experiments such as ‘at what day / time can I expect that network bandwidth usage will spike to a particular level (value) for a particular segment of my corporate users?’ (Regression).
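To make the regression question above concrete, here is a minimal sketch, assuming a simple straight-line trend in morning bandwidth usage. It fits the line by ordinary least squares and solves for the hour at which usage is expected to reach a chosen level. The hours, usage figures, threshold and function names are all invented for illustration; real log data is rarely this linear, and a commercial ML tool would typically help select and tune a more suitable algorithm.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def hour_reaching(threshold: float, slope: float, intercept: float) -> float:
    """Solve threshold = intercept + slope * hour for hour."""
    if slope == 0:
        raise ValueError("a flat trend never reaches the threshold")
    return (threshold - intercept) / slope


# Made-up observations: hour of day vs. bandwidth usage in Mbps for one user segment
hours = [6, 7, 8, 9, 10]
usage_mbps = [110, 140, 210, 240, 300]

slope, intercept = fit_line(hours, usage_mbps)
spike_hour = hour_reaching(280, slope, intercept)
print(f"Usage grows about {slope:.0f} Mbps per hour; 280 Mbps is expected near hour {spike_hour:.2f}")
```

The answer is an estimate rather than a certainty, which is the sense in which ML questions produce probabilistic answers.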
In general, the enterprise can find value in appropriately applying predictive analytics via ML solutions to a broad spectrum of domains. In addition to sales and marketing or DevOps, enterprises can apply ML to other scenarios for which probabilistic analysis would yield useful results. For example, questions such as these can now be addressed:

• Which employee attributes correlate most closely with the highest revenue production of that employee’s team?
• At what future point in time (value) do our customers in a certain segment (e.g. demographic, geographic…) tend to make a subsequent purchase?
• What groups (trial or free items) of our public resources (website, GitHub, YouTube…) tend to be used by browsers who become our customers?

As mentioned, integrated tooling provided by commercial vendors enables simpler deployment and embedding of ML model results into enterprise applications via ‘publish as a web service’ functionality. Given that relatively few enterprise application developers have familiarity with, much less expertise in, ML languages, tools and libraries, using commercial ML tools that include ‘click to publish’ functionality significantly speeds up time to market.

Another advantage of using commercial ML tools for the enterprise is the built-in connectors to disparate incoming data sources. Given that it is increasingly common to use a broad variety of data sources as ML ingest sources, the availability of pre-built connectors once again speeds development cycles. It is common to include connectors for LOB data, i.e. RDBMS systems (both on-premises and cloud-based), as well as for some of the newer NoSQL databases, Hadoop, and one or more types of incoming data stream.
Also useful are the quick statistical snapshots that most commercial ML tools provide of datasets in your ML project. For example, the AWS ML dataset console view includes the visualization shown below:

Figure 8 - AWS ML Datasources Attribute Information

The AWS viewer allows the ML team to ‘see’ not only the attribute names, but also the correlations, uniqueness of data and most/least frequent categories. It also includes an inline ‘Preview’ visualization of the uniqueness of the data.

As mentioned, integrated commercial ML tooling that includes ‘one-click’ deployment capabilities increases usability for developers and BI professionals. Additionally, capabilities that essentially advertise published ML web services, such as Microsoft Azure Data Market, provide additional discoverability. Usability and commerce opportunities for published services are also emerging. An example is shown below.
Figure 9 - Azure Machine Learning Test Harness
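What a test harness like the one above does interactively, an enterprise application does in code: it sends input features to the published service over HTTP and reads back a prediction. The following is a minimal sketch using only the Python standard library. The endpoint URL, the API key, the feature names and the JSON payload shape are hypothetical placeholders, not any vendor’s actual API; consult the vendor documentation for the real request contract.

```python
import json
import urllib.request

# Hypothetical endpoint and key: placeholders, not a real vendor URL or credential
SCORING_URL = "https://example.com/ml/score"
API_KEY = "replace-with-your-key"


def build_scoring_request(features: dict) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST carrying one record to score."""
    # The {"inputs": [...]} envelope is an assumed payload shape for illustration
    body = json.dumps({"inputs": [features]}).encode("utf-8")
    return urllib.request.Request(
        SCORING_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
        method="POST",
    )


# Made-up feature values for a single customer record
request = build_scoring_request({"customer_segment": "retail", "visits_last_30d": 4})
print(request.get_method(), request.full_url)

# Sending the request would look like this; it is left commented out
# because the endpoint above is only a placeholder:
# with urllib.request.urlopen(request, timeout=10) as response:
#     prediction = json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the integration easy to unit test without network access.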
Visualization of results is another element of ML solution usability. To that end, we’ve included a sample from IBM Watson Analytics. This service includes flexible visualizations at all phases of the ML process (e.g. data discovery, modeling, etc.). An example is shown below.

Figure 10 - IBM Watson ML Visualization
Our last example of model visualization is from the commercial cloud-based vendor BigML and is shown below. Also interesting is how vendors such as BigML foster community by providing a platform for their users to get more value from their ML models. You’ll note that BigML allows users to upload, share, rate and also sell models for use by others in their own ML scenarios.

Figure 11 - BigML Model Visualization
Key Takeaways

Incorporating the results of machine learning experiments into production data solutions adds significant complexity to the overall project. Given this, a solid understanding of technology choices around machine learning solutions is essential for designing and delivering solutions that provide business value to the organization.

• Use commercial machine learning products when team members new to machine learning processes are creating your solution. Due to fundamental differences at every stage in the data pipeline (data preparation, hypothesis formation, algorithm selection, model training and evaluation), ML projects introduce a set of complex processes into the enterprise. If your data paradigm consists of an OLTP store alone, you would be best served by leveraging commercial ML development suites, rather than attempting to cobble together solutions based on tools and libraries that were built primarily for statisticians.
• Select tools or coding libraries that perform at the speed and scale of data ingest and processing required by the types of machine learning methods that your business problems call for. Enterprises will benefit from leveraging cloud storage and processing of Big Data workloads as sources for ML solutions because their data volumes are generally significantly larger than those of academic research. Also, in-memory streams are increasingly relevant, particularly with the advent of more and more IoT scenarios.
• Teams that have already implemented pure open source data solutions are most capable of adding pure open source machine learning solutions. Domains where data mining and/or statistics may have already been in use, such as academic research, will have more success using open source tools and libraries, so long as their input data does not overrun the capabilities of those tools.
• Plan for and test your model deployment topology to ensure ML experiments deliver production business value. Given the common challenges around deployment of ML models, commercial vendors are incorporating one-click-to-deploy functionality in their ML studio environments; such functionality enables faster time to market for production solutions. Also consider the vendor path to implementing streaming or near-real-time ML solutions if that is part of your requirements.
• Select tools or plan for coding appropriate types of visualization solutions. ML outputs are unfamiliar to many business users. Standard reports and dashboards have not been designed to display ML results in a meaningful way. Selecting ML vendors that integrate results easily into other commercial solutions or common libraries results in broader usability for ML solutions.
References and Resources

This section lists the references and resources referred to in this article.

Data Science graphic: http://civicscience.com/data-science-a-visual-guide/
Shiny for R-Studio: http://shiny.rstudio.com/gallery/movie-explorer.html
Deep Learning and the Hololens: https://technoptimist.wordpress.com/2015/01/25/deep-learning-and-the-hololens
Collection of papers on how IBM Watson works: http://www.andrew.cmu.edu/user/ooo/watson/
What is AI?: http://www.techopedia.com/definition/190/artificial-intelligence-ai
How Google is Teaching Computers to See: https://gigaom.com/2012/06/25/how-google-is-teaching-computers-to-see/
Need Deep Learning? Here are 4 Lessons from Google: https://gigaom.com/2015/01/29/new-to-deep-learning-here-are-4-easy-lessons-from-google/
Getting started with AWS ML: http://docs.aws.amazon.com/machine-learning/latest/dg/tutorial.html
AzureML on Windows Azure DataMarket / Binary Classifier Sample: https://datamarket.azure.com/dataset/aml_labs/log_regression
BigML Sample Model: https://bigml.com/user/ashikiar/gallery/model/53b2f21ec8db635905000d33
Kaggle Community: https://www.kaggle.com/
DataKind Community: http://www.datakind.org/
Table of Abbreviations

Abbreviation | Full Term
AI | Artificial Intelligence
AWS | Amazon Web Services
BI | Business Intelligence
CRM | Customer Relationship Management
DW | Data Warehouse
GPU | Graphics Processing Unit
IoT | Internet of Things
LOB | Line of Business
ML | Machine Learning
NoSQL | No SQL
OLAP | Online Analytical Processing
OLTP | Online Transaction Processing
POC | Proof-of-concept
RDBMS | Relational Database Management System
About Lynn Langit

Lynn Langit is a Big Data and Cloud Architect who has been working with database solutions for more than 15 years. Over the past 4 years, she’s been working as an independent architect using these technologies, mostly in the biotech, education, manufacturing and facilities verticals. Lynn has done POCs and has helped teams build solutions on the AWS, Azure, Google and Rackspace Clouds. She has done work with SQL Server, MySQL, AWS Redshift, AWS MapReduce, Cloudera Hadoop, MongoDB, Neo4j, Aerospike and many other database systems.

In addition to building solutions, Lynn also partners with all major cloud vendors, providing early technical feedback into their Big Data and Cloud offerings. She is an AWS Community Hero, Google Developer Expert (Cloud), Microsoft MVP (SQL Server) and a MongoDB Master. Lynn is also a Cloudera certified instructor (for MapReduce Programming).

Prior to re-entering the consulting world 3 years ago, Lynn’s background included over 10 years as a Microsoft Certified instructor and Microsoft vendor, followed by 4 years as a Microsoft employee. She’s published 3 books on SQL Server Business Intelligence and has most recently worked with the SQL Azure team at Microsoft. She continues to write and screencast and hosts a BigData channel on YouTube (http://www.youtube.com/SoCalDevGal) with over 150 different technical videos on Cloud and BigData topics. Lynn is also a committer on several open source projects (http://github.com/lynnlangit).

About Mark Tabladillo

Mark Tabladillo is a Senior Data Scientist at midtown Atlanta's Predictix/LogicBlox. He has used and promoted Microsoft Azure Machine Learning, Microsoft SQL Server Data Mining, Microsoft BI Stack, Power BI, SAS, SPSS, R, and Julia. He is a SQL Server MVP and has a research doctorate (PhD) from Georgia Tech. He is chapter leader for the PASS Data Science Virtual Chapter, which has periodic live meetings and its own YouTube channel.