A Case Study in using IBM Watson Studio Machine Learning Services - IBM Developer Recipes

6/29/2020 A Case Study in using IBM Watson Studio Machine Learning Services - IBM Developer Recipes
https://developer.ibm.com/recipes/tutorials/a-case-study-in-using-ibm-watson-studio-machine-learning-services/
Overview
Skill Level: Beginner
This recipe shows various ways of predicting customer churn using IBM Watson Studio ranging from a semi-automated
approach using the Model Builder, a diagrammatic approach using SPSS Modeler Flows to a fully programmed style using
Jupyter notebooks.
Ingredients
Software Requirements
IBM Cloud Account
IBM Watson Studio
IBM Watson Machine Learning Service
To obtain an IBM Cloud Account and get access to the IBM Cloud and to IBM Watson Studio, please follow the instructions outlined
here:
Get Started with IBM Cloud
Sign up for IBM Watson Studio
Step-by-step
Introduction
The recipe has been replaced by an official IBM Developer tutorial. Please use the tutorial instead:
Learning path: Getting started with Watson Studio

This recipe demonstrates various ways of using IBM Watson Studio to predict customer churn ranging from a semi-
automated approach using the Model Builder, a diagrammatic approach using SPSS Modeler Flows to a fully programmed
style using Jupyter notebooks for Python.

The recipe will follow the main steps of methods for data science (and data mining) such as CRISP-DM (Cross Industry
Standard Process for Data Mining) and the IBM Data Science Methodology and will focus on tasks for data understanding,
data preparation, modeling, evaluation and deployment of a machine learning model for predictive analytics. It takes as its basis
a data set and notebook for customer churn available on Kaggle, and then demonstrates alternative ways of solving the
same problem using the Model Builder, the SPSS Modeler and the IBM Watson Machine Learning service provided by
IBM Watson Studio. At the same time the recipe will also dive into the use of the profiling tool and the dashboards of IBM
Watson Studio to support data understanding as well as the Refine tool to solve straightforward data preparation and
transformation tasks.
The recipe provides the following sections:
Section 2 provides a short overview of the methodology and tools used as well as an introduction to the notebook on
Kaggle thus setting the scene for the recipe.
Section 3 provides the steps needed to create and configure a project, import the artifacts and get the notebook from
Kaggle running inside IBM Watson Studio.
Section 4 focuses on getting insights into the data set used by using the profile tool and the dashboard capabilities of
IBM Watson Studio.
Section 5 briefly introduces the Refine component for defining transformations. This step is optional.
Section 6 gets you to create and evaluate a Watson Machine Learning model with a few user interactions using the Model
Builder.
Section 7 will continue with deployment and test of the model using the IBM Watson Machine Learning service.
Section 8 will repeat the steps for creating a model but using SPSS Modeler Flows and will demonstrate the capabilities
of this tool for data understanding, preparation, model creation and evaluation.
Section 9 will let you test the SPSS model using a Jupyter Notebook for Python and the IBM Watson Machine Learning
services REST API.
Setting the Scene
IBM has defined a Data Science Methodology that consists of 10 stages that form an iterative process for using data to
uncover insights. Each stage plays a vital role in the context of the overall methodology. At a certain level of abstraction it
can be seen as a refinement of the workflow outlined by the CRISP-DM (Cross Industry Standard Process for Data Mining)
method for data mining.

According to both methodologies every project starts with Business Understanding where the problem and objectives are
defined. This is followed in the IBM Data Science Method by the Analytical Approach phase where the data scientist can
define the approach to solving the problem. The IBM Data Science Method then continues with three phases called Data
Requirements, Data Collection and Data Understanding, which in CRISP-DM are represented by a single Data Understanding
phase. Once the data scientist has an understanding of their data and has sufficient data to get started, they move on to
the Data Preparation phase. This phase is usually very time consuming. A data scientist is often said to spend about 80% of their time
here, performing tasks such as data cleaning and feature engineering. The term “data wrangling” is often used in this
context. During and after cleaning the data, the data scientist generally performs exploration, such as descriptive statistics
to get an overall feel for the data and clustering to look at the relationships and latent structure of the data. This process is
often iterated several times until the data scientist is satisfied with their data set. The model training stage is where machine
learning is used in building a predictive model. This model is trained and then evaluated by statistical measures such as
prediction accuracy, sensitivity, specificity etc. Once the model is deemed sufficient, the model is deployed and used for
scoring on unseen data. The IBM Data Science Methodology adds an additional Feedback stage for obtaining feedback from
using the model which will then be used to improve the model. Both methods are highly iterative by nature.
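The evaluation measures named above (prediction accuracy, sensitivity, specificity) can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch in plain Python; the class name, function names and the counts are made up for illustration and are not taken from the Kaggle notebook or its results.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Cells of a binary confusion matrix, with 'churn' as the positive class."""
    tp: int  # churners predicted as churners
    fp: int  # non-churners predicted as churners
    tn: int  # non-churners predicted as non-churners
    fn: int  # churners predicted as non-churners


def accuracy(c: ConfusionCounts) -> float:
    """Share of all predictions that were correct."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def sensitivity(c: ConfusionCounts) -> float:
    """Share of actual churners that the model found (also called recall)."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Share of actual non-churners that the model left alone."""
    return c.tn / (c.tn + c.fp)


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
counts = ConfusionCounts(tp=30, fp=10, tn=150, fn=10)

print("accuracy   ", accuracy(counts))
print("sensitivity", sensitivity(counts))
print("specificity", specificity(counts))
```

With these illustrative counts the model looks accurate overall while still missing a quarter of the churners, which is why more than one measure is usually reported.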
In this recipe we will focus on the phases starting with data understanding and then continue from there preparing the data,
building a model, evaluating the model and then deploying and testing the model. The purpose will be to develop models to
predict customer churn. Aspects related to analyzing the causes of churn in order to improve the business are, on the
other hand, out of the scope of this recipe. This means that we will be working with various kinds of classification models
that can, given an observation of a customer defined by a set of features, give a prediction whether this specific client is at
risk of churning or not.
For all tasks we will use IBM Watson Studio. IBM Watson Studio provides users with an environment and tools to solve
business problems by collaboratively working with data. Users can choose the tools needed to analyze and visualize data, to
cleanse and shape data, to ingest streaming data, or to create, train, and deploy machine learning models.
The main functionality offered lets you:
Create Projects to organize the resources (such as data connections, data assets, collaborators, notebooks) to achieve
an analytics goal.
Access data from Connections to your cloud or on-premises data sources. Upload files to the project’s object storage.
Create and maintain Data Catalogs to discover, index, and share data.
Refine data by cleansing and shaping the data to prepare it for analysis.

Perform Data Science tasks by creating Jupyter notebooks for Python or Scala to run code that processes data and
then view the results inline. Alternatively use RStudio for R.
Ingest and Analyze Streams data with the Streams Designer tool.
Create, test and deploy Machine Learning and Deep Learning models.
Classify images by training deep learning models to recognize image content.
Create and share Dashboards of data visualizations without coding.

IBM Watson Studio is technically based on a variety of Open Source technology and IBM products as depicted in the
following diagram:
In the context of data science, IBM Watson Studio can be viewed as an integrated, multi-role collaboration platform that supports
the developer, data engineer, business analyst and last but not least the data scientist in the process of solving a data
science problem. For the developer role other components of the IBM Cloud platform may be relevant as well in building
applications that utilize machine learning services. The data scientist however can build the model using a variety of
tools ranging from RStudio and Jupyter Notebooks using a programmatic style, SPSS Modeler Flows adopting a
diagrammatic style or the Model Builder component for creating IBM Watson Machine Learning models, which supports a
semi-automated style of generating machine learning models. Beyond those 3 main components you will also get to use IBM
Cloud Object Storage for storing the data set used to train and test the model, Data Refinery for transforming the data set
and IBM Watson Studio dashboards for generating visualizations. A key component is of course the IBM Watson Machine
Learning service and its set of REST APIs that can be called from any programming language to interact with a machine
learning model. The focus of the IBM Watson Machine Learning service is deployment, but you can use IBM SPSS Modeler
or IBM Watson Studio to author and work with models and pipelines. Both SPSS Modeler and IBM Watson Studio use Spark
MLlib and Python scikit-learn and offer various modeling methods that are taken from machine learning, artificial intelligence,
and statistics.

In the recipe we will start out with a dataset for Customer Churn available on Kaggle. The dataset is accompanied by a
corresponding Customer Churn Analysis Jupyter Notebook from Sandip Datta that shows the archetypical steps in
developing a machine learning model by going through the following essential steps:
Import the dataset.
Analyze the data by creating visualizations and inspecting basic statistical parameters (mean, standard deviation etc.).
Prepare the data for machine learning model building, e.g. by transforming categorical features into numeric features and by
normalizing the data.
Split the data into train and test data to be used for model training and model validation respectively.
Train models using various machine learning algorithms for binary classification.
Evaluate the various models for accuracy and precision using a confusion matrix.
Select the model that best fits the given data set and analyze which features have low and which have significant impact on the
outcome of the prediction.

The notebook is defined in terms of 25 Python cells and requires familiarity with the main libraries used: Python scikit-learn
for machine learning, Python numpy for scientific computing, Python pandas for managing and analyzing data
structures and last but not least matplotlib and seaborn for visualization of the data. An outline of the notebook is given by
the screenshots in the table below (to be read row by row). More details of the notebook will be briefly covered in the next
section where you will download and run the notebook once you have created a project to manage the relevant assets:


One objective of this recipe is to show how IBM Watson Studio offers – in addition to Jupyter Notebooks for Python, Scala or
R – alternative ways of going through a similar process that may be faster and can be achieved without programming skills.
These mechanisms are in essence SPSS Modeler Flow which allows a data scientist to create a model purely graphically by
defining a flow and the IBM Model Builder inside IBM Watson Studio which goes one step beyond SPSS by providing a
semi-automatic approach to creation, evaluation, deployment and testing of a machine learning model. At the same time we
shall demonstrate how IBM Watson Studio provides capabilities out-of-the-box for profiling, visualizing and transforming the
data – again without any programming required.
Following the recipe you will create a project that contains the artifacts shown in the following screenshot.
The artifacts will be created as follows:
Section 3 of the recipe will get you started by creating the project and importing the assets from Kaggle so that you can
run the imported notebook named ‘Class – Customer Churn – Kaggle’.
Section 4 will let you perform tasks related to the Data Understanding phase, which includes profiling the imported data
set to view the distribution and statistical measures like minimum, maximum, mean and standard deviation for numerical
features. Moreover you will create a ‘Customer Churn Dashboard’ and a couple of visualizations.
Section 5 will cover the Data Preparation phase and will briefly introduce the Refine component where you will create a
Data Refinery Flow to transform the input data set. This step is optional.
Section 6 will continue with the Modeling and Evaluation phase and will get you to create and evaluate a Watson
Machine Learning model with a few user interactions using the Model Builder.
Section 7 will continue with Deployment and Test. You will deploy the Machine Learning model as a web service and
then test it using test data presented in form of JSON objects.
Section 8 will repeat the steps but using SPSS Modeler Flows.

Section 9 will let you deploy the SPSS model and then create a Jupyter Notebook for Python that uses the IBM Watson
Machine Learning services REST API to request predictions for specific observations.
Getting Started
We will assume that you have already gained access to IBM Cloud and IBM Watson Studio (see the “Ingredients” section
at the beginning of the recipe for the links needed for registering). If in doubt about how to gain access to IBM Watson Studio
you can also follow the instructions in section 3 of the recipe “Analyze archived IoT device data using IBM Cloud Object
Storage and IBM Watson Studio”.
In this section of the recipe you will get started by doing the following:
Create a project.
Provision the IBM Machine Learning, Apache Spark and IBM Cognos Dashboard Embedded services for later use.
Download the dataset from Kaggle and import it to the project.
Download, modify and run the Jupyter notebook for Python that sets the scene for this recipe.

Create IBM Watson Studio Project
To create the project do the following:
Sign into IBM Watson Studio.
Click Create a project.
In the next page, select the Standard Project template and click Create Project.
In the New Project dialog, give a name to the project such as “Watson Machine Learning” and click Create.
Wait until the project has been created.

Provision IBM Cloud Services
To provision the Machine Learning Service and associate it as a service to the current project do the following:
Select the Settings tab for the project at the top of the page.

Scroll down to the Associated Services section.
Click the Add Service button.
Select the Watson Menu item.
On the next page, select the Watson Machine Learning Service and click Add.
On the next page, select the New tab to create a new service.
Keep the Lite plan for now (you can change it later if necessary).
Scroll down and click Create to create the service.
Next the Confirm Creation dialog will appear that will let you specify the details of the service such as the region, the
plan, the resource group and the service name.
Enter a proper name for the service instance e.g. by prefixing the generated name with “Watson Machine Learning”.
Click Confirm.

You may choose to use the default resource group for the services but a better choice would be to use a dedicated one that
you have created in IBM Cloud. You can find the command for creating new resource groups in IBM Cloud using the
menu Manage > Account, and then navigate to Account Resources > Resource Groups in the toolbar to the left.
The Create button can be found in the top right corner of the page.
Continue in a similar way to create an instance of the Apache Spark service and the IBM Cognos Dashboard
Embedded service. Use the Lite plan whenever possible and give the auto-generated service name the same prefix
as above.

Upload Data Set
Next download the data set from Kaggle and upload it to IBM Watson Studio:
Go to the URL for the data set on Kaggle (https://www.kaggle.com/sandipdatta/customer-churn-analysis) and download
the file to your local desktop.
Rename the file to something more meaningful, e.g. ‘Customer Churn – Kaggle.csv’.
In IBM Watson Studio, select the Assets tab.
Drag and drop the file onto the area for uploading data to IBM Watson Studio in the upper right corner of the page.
Wait until the file has been uploaded.

Import and Test Jupyter Notebook
Finally create a Jupyter notebook for predicting customer churn and change it to use the data set that you have uploaded to
the project.

In the Assets tab, click the command Add to Project.
Select the Notebook asset type.
In the New Notebook dialog, configure the notebook as follows:
Select the “From URL” tab and enter ‘https://github.com/EinarKarlsen/ibm-watson-machine-learning/blob/master/Class%20-%20Customer%20Churn%20-%20Kaggle.ipynb’ as the URL for the notebook.
Enter the name for the notebook, e.g. “Class – Customer Churn – Kaggle”.
Select the runtime system (e.g. the default Python runtime system, which is free).
Optionally, enter a short description for the notebook.
Click Create Notebook.

Scroll down to the third cell and select the empty line in the middle of the cell.
In the right part of the window, select the Customer Churn data set. Click Insert to code and select Insert pandas
DataFrame. This will add code to the cell for reading the data set into a pandas DataFrame.
Change the generated variable name df_data_1 for the data frame to df which is used in the rest of the notebook as
shown above.
Save the notebook by invoking File > Save.

Run the cells of the notebook one by one and observe the effect and how the notebook is defined.
Data Understanding and Visualization
During the data understanding phase, the initial set of data is collected. The phase then proceeds with activities that enable
you to become familiar with the data, identify data quality problems and discover first insights into the data. In the Jupyter
notebook these activities are done using pandas and the plotting functions of pandas, which are built on matplotlib. The describe function of
pandas is used to generate descriptive statistics for the features and the plot function is used to generate diagrams showing
the distribution of the data:
We can achieve the same in IBM Watson Studio by simple user interactions without a single line of code by using out-of-the-
box functionality. To view the data set in IBM Watson Studio, simply locate the data asset and then click the name of the data
set to open it:
IBM Watson Studio will show you a preview of the data in the Preview tab. The Profile tab on the other hand provides you
with profiling information that shows the distribution of the values and for numerical features also the maximum, minimum,
mean and standard deviation for the feature:

Notice that although the numerical columns are identified to be of type varchar, the profiler is sufficiently smart to recognize
these as numerical columns, convert them implicitly and compute the mean and the standard deviation.
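To make the idea of implicit conversion concrete, the following minimal sketch treats a column of strings as numeric whenever every value parses as a number, and only then computes the statistics. It uses the Python standard library so that it runs anywhere; the function names and sample values are hypothetical, and the use of the sample standard deviation is an assumption, since the recipe does not say which variant the profiler reports.

```python
import statistics


def try_numeric(values):
    """Return the values as floats if every one parses as a number, else None."""
    try:
        return [float(v) for v in values]
    except ValueError:
        return None


def profile_column(values):
    """Summarise one column of string values, similar in spirit to a profile view."""
    numbers = try_numeric(values)
    if numbers is None:
        # Not numeric: report only how many distinct values occur.
        return {"type": "categorical", "distinct": len(set(values))}
    return {
        "type": "numeric",
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.mean(numbers),
        # Sample standard deviation (assumption, see the text above).
        "std": statistics.stdev(numbers),
    }


# Hypothetical column contents stored as text, as a varchar column would be.
print(profile_column(["10.0", "20.0", "30.0"]))
print(profile_column(["yes", "no", "no"]))
```

In the notebook itself the equivalent summary comes from the pandas describe function once the columns have a numeric type.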
To generate the profile the first time simply do the following:
Select the Profile tab,
Then invoke the command Create Profile.
Wait a short while and then refresh the page.

Notice that the churn column is not balanced: churn and no-churn observations occur with different frequencies, as already
observed in the notebook on Kaggle. This calls for cross validation strategies (such as stratified sampling) to be adopted during the model
building and evaluation phase.
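Why an unbalanced target matters can be shown with a small calculation: a model that always predicts the majority class already reaches an accuracy equal to the share of that class. The sketch below is a minimal illustration in plain Python; the function names are hypothetical and the label counts are made up, not taken from the Kaggle data set.

```python
from collections import Counter


def class_distribution(labels):
    """Return the relative frequency of each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy of a trivial model that always predicts the most common class."""
    return max(class_distribution(labels).values())


# Made-up labels for illustration: 17 customers stayed, 3 churned.
labels = ["no"] * 17 + ["yes"] * 3

print(class_distribution(labels))
print(majority_baseline_accuracy(labels))
```

Any accuracy figure reported for a churn model should therefore be compared against this baseline rather than against 50%.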
We can look further into the dataset by creating a dashboard with associated visualizations. This basically requires 3 steps:
1) create an empty dashboard, 2) add a data source to be used for visualizations and 3) add appropriate visualizations to the
dashboard.
To create the dashboard do the following:
Click the Add to project button at the top of the page.

In the next dialog, click Dashboard to create a new dashboard.
On the next page titled New Dashboard do the following:
Enter a Name for the dashboard, e.g. ‘Customer Churn Dashboard’
Provide a Description for the dashboard (optional).
As Cognos Dashboard Embedded Service, select the dashboard service that you created in the previous section.
Click Save to save the dashboard.
On the next page select the Freeform template.
Keep the default setting that will create a Tabbed dashboard.
Click OK to create an empty freeform dashboard with a single Tab.

To add a data connection, go through the following steps:

Click the “Add a source” button in the upper left part of the page:
On the next page select the data source named ‘Customer Churn – Kaggle.csv’.
You can (optionally) preview the data source now by clicking the eye icon to the right of the data source name.
Click Select to select the data source.
Back in the dashboard, select the newly imported data source.
Expand the data source by clicking > so that you can view the columns.

Notice that you can view and change the properties of the columns. Simply click the 3 dots to the right of the column name,
then select Properties in the popup menu. This will display a dialog as shown above, and allow you to alter the default
setting for Usage (Identifier, Attribute, Measure) and Aggregate Function (Count, Count Distinct, Maximum, Minimum etc).
For now we should be fine with the default settings.
To create a visualization that shows the distribution of churns and no-churns as a pie chart do the following:
Select the Visualizations icon in the toolbar to the left.
Select a Pie chart.

This will create a form for specifying the properties of the pie chart using e.g. columns of the data set.
Select the Sources icon in the toolbar to the left (it is the one located above the Visualizations icon).
Drag and drop the churn column onto the Segments property of the pie chart.
Drag and drop the churn column onto the Size property of the pie chart.
Click the Collapse arrow in the top right of the form as shown above. This will minimize the pie chart and render it on the
dashboard.

Select the Tab to the top left, then click the Edit the title button.
Provide a title for the tab (e.g. ‘Customer Churn’).

Continue this way creating two more visualizations:
A Stacked Column Chart showing State (visualization property Bars) and Churn (Length, Color) on the X and Y axis
respectively.
A Pie Chart showing the distribution of International Plan (Segments, Size).

This should result in a dashboard looking like below. Notice that you can move visualizations on the dashboard using
the Move widget command located on the top of each visualization:
The dashboards are dynamic by nature and support exploration of the data using e.g. filters. In the visualization showing
‘International Plan’ click the slice associated with the value ‘yes’. This will create a filter which will apply to all other
(connected) visualizations on the current dashboard as well:
Notice that the slice for churn in the visualization to the left has increased significantly. This tells us that clients on an
international plan are more likely to churn than clients that are not. To remove the filter, simply click the filter icon for the
visualization in the top right corner, then select the delete filter button that pops up as a result (the icon is a cross in a circle).
Simply clicking the slice again will achieve the same effect.
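What the dashboard filter does interactively corresponds to computing the churn rate per group. The following minimal sketch shows that computation in plain Python; the function name is hypothetical and the records are made up purely to illustrate the mechanics, so the resulting rates say nothing about the real data set.

```python
from collections import defaultdict


def churn_rate_by(records, key):
    """Return the share of churned customers for each distinct value of `key`."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        if record["churn"]:
            churned[group] += 1
    return {group: churned[group] / totals[group] for group in totals}


# Made-up records for illustration only.
records = [
    {"international plan": "yes", "churn": True},
    {"international plan": "yes", "churn": False},
    {"international plan": "no", "churn": False},
    {"international plan": "no", "churn": False},
    {"international plan": "no", "churn": True},
    {"international plan": "no", "churn": False},
]

# Equivalent of clicking the 'yes' slice: keep only the matching records.
filtered = [r for r in records if r["international plan"] == "yes"]

print(churn_rate_by(records, "international plan"))
print(len(filtered), "records remain after filtering on 'yes'")
```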
Data Preparation and Transformation using Refine
The data preparation phase covers all activities needed to construct the final dataset that will be fed into the machine
learning service. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks
include table, record, and attribute selection, as well as transformation and cleaning of data for the modeling tools. In the
original notebook on Kaggle this involved turning categorical features into numerical ones, normalizing the features and
removing columns not relevant for prediction (such as the phone number of the client). A subset of the operations is
shown below:
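As a rough illustration of two of these preparation steps (this is not the notebook's own code), the sketch below encodes a yes/no feature as 0/1 and rescales a numeric feature. It is written in plain Python with hypothetical function names and sample values. Min-max scaling is used here as one common form of normalization; that choice is an assumption, and the notebook may use a different scaler.

```python
def encode_yes_no(values):
    """Turn a categorical yes/no feature into the numbers 1 and 0."""
    mapping = {"no": 0, "yes": 1}
    return [mapping[v] for v in values]


def min_max_scale(values):
    """Rescale numeric values linearly so that they fall into the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical feature values for three customers.
international_plan = ["yes", "no", "no"]
day_minutes = [0.0, 5.0, 10.0]

print(encode_yes_no(international_plan))
print(min_max_scale(day_minutes))
```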

If we would just like to create a model semi-automatically or fully automatically using the IBM Watson Model Builder and
Machine Learning service, no more activity would actually be needed during data preparation (for the current data set) since
the Model Builder service will take care of such operations under the hood. We will show how this is done in the next section.
However, IBM Watson Studio offers a service called Data Refinery that allows us to clean up and transform data without any
programming required. We will briefly introduce the service so that you can get a feeling for how it works. However, this step
is not strictly necessary for the process:
Click Add to project in the top bar of the page.
In the Choose asset type dialog, select Data Refinery Flow to create a new flow.
On the next page, select the Customer Churn data set and click Add.

This will open up the data source for you so that you can transform and view it.

Notice the tabs to the top left, which provide you with capabilities for viewing the data in tabular form, for profiling it (as in the
previous section) and for creating custom visualizations of the data.
To transform the data do the following:
Select the 3 dots in the “phone number” column and invoke the Remove command in the pull-down menu. This will
delete the column.
Select the “total day minutes” feature column. This is really a String type but should be numeric.

Click the Operations button in the upper left corner. This will show you some of the available transformations:

You could for example convert the column to another type (say float or integer). However we will not do this for now since the
Machine Learning service will do it for us automatically behind the scenes, but in principle you could decide e.g. to turn the
“total day minutes” column into an integer column and round it to show zero decimals. Alternatively you could convert it into
a floating point type. For now let’s just continue executing the flow just defined and view the result:
Click the Run Data Refinery flow button in the toolbar. Its icon is an arrow head.
On the next page you can give a name to the flow as well as the resulting output file. However, leave the default names
for now.
Click Save and Run flow.
In the next dialog named “What’s Next?” select the View Flow command.

The resulting window shows the input file, the output file and the runs. Notice that there is also a tab where you can schedule
the flow so that it is executed automatically. Go back to your project and check that the output file and the flow are now part
of your project assets.
Data Refinery Flows allow a user to perform quick transformations of data without need for programming. It is of course in
no way a replacement for e.g. Jupyter notebooks and the powerful capabilities of e.g. numpy and pandas but for a quick
cleanup process it comes in quite handy. For more complex transformations and computations one should resort to
other means such as Jupyter notebooks or SPSS Modeler flows (which we will cover in a later section).
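For comparison, the two transformations discussed in this section (removing the phone number column and converting “total day minutes” to a rounded integer) can be sketched programmatically as follows. This is a minimal example using only the Python standard library; the function name and the sample rows are hypothetical and do not come from the data set, and in a notebook the same would normally be done with pandas.

```python
import csv
import io


def refine(csv_text, drop_column, int_column):
    """Drop one column and convert another from text to a rounded integer."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row.pop(drop_column, None)                    # remove the unwanted column
        row[int_column] = round(float(row[int_column]))  # text -> float -> int
        rows.append(row)
    return rows


# Hypothetical sample with the column names used in this section.
sample = (
    "phone number,total day minutes,churn\n"
    "000-0000,265.1,False\n"
    "000-0001,161.6,False\n"
)

for refined_row in refine(sample, "phone number", "total day minutes"):
    print(refined_row)
```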
Modeling and Evaluation using the IBM Watson Studio Model Builder
In the modeling phase, various modeling techniques are selected and applied, and their parameters are calibrated to
achieve an optimal prediction. Typically, there are several techniques that can be applied and some techniques have specific
requirements on the form of data. Therefore, going back to the data preparation phase is often necessary. In the model
evaluation phase however, the goal is to build a model that has high quality from a data analysis perspective. Before
proceeding to final deployment of the model, it is important to thoroughly evaluate it and review the steps executed to create
it, to be certain the model properly achieves the business objectives.
In the Jupyter notebook on Kaggle this boiled down to e.g. splitting the data set into training and testing data sets (using
stratified cross validation) and then training several models using distinct classification algorithms such as Gradient Boosting
Classifier, Support Vector Machines, Random Forest and K-Nearest Neighbors:
Following this step, model evaluation continued by printing out the confusion matrix for each algorithm to get a more in-depth
view of the accuracy and precision offered by the models:
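The idea behind stratified cross validation is that every fold keeps roughly the same proportion of churn and no-churn observations as the full data set. The sketch below illustrates this in plain Python; it is not the notebook's code, the function names and labels are hypothetical, and it skips the shuffling that a real implementation such as scikit-learn's StratifiedKFold can perform.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, class by class, round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_test_indices(folds, test_fold):
    """Use one fold for testing and all remaining folds for training."""
    test = sorted(folds[test_fold])
    train = sorted(
        index
        for number, fold in enumerate(folds)
        if number != test_fold
        for index in fold
    )
    return train, test


# Made-up, unbalanced labels: 8 customers stayed, 4 churned.
labels = ["no"] * 8 + ["yes"] * 4
folds = stratified_folds(labels, k=4)

for number in range(len(folds)):
    train, test = train_test_indices(folds, number)
    print(number, "test labels:", [labels[i] for i in test])
```

Each model would then be trained on the train indices and scored on the test indices of every fold, and the scores averaged.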

Using the Model Builder of IBM Watson Studio we can get to a model and an evaluation of its accuracy a bit faster and
without any programming required. The Model Builder in IBM Watson Studio is an interactive tool that guides you, step by
step, through building a machine learning model by uploading training data, choosing a machine learning technique and
algorithms and finally training and evaluating the model.
To create a new model using IBM Watson Studio do the following:
Select the Assets tab for your IBM Watson Studio project.

Locate the Models section and invoke the command New Watson Machine Learning model.
In the New Model dialog:
Enter the Name of the machine learning model (e.g. ‘Customer Churn – Manual’).
Select the Watson Machine Learning service that you created in section 3 as the Machine Learning Service.
For the Runtime, select the Apache Spark service that you created in section 3.
Specify Manual as the approach for training the models.
Click Create.
On the next page titled “Select data asset”, simply select the data set that you imported in section 3 (you do not need to
use the file that was preprocessed using Refine in the previous section).
Click Next which will take you to the next page where you can select the Machine Learning algorithms to be used for the
classification.

On the page titled Select a technique do the following:
Select ‘churn’ as the column value to predict.
Leave the default of using all feature columns for the prediction.
Select Binary Classification.
Keep the default settings for the test-validation-hold-out split of the data set.
On the top right of the page select Add Estimators.
Select Random Forest Classifier and click Add.
Repeat the same step for Gradient Boosted Tree Classifier.
Click Next and wait until the models have been trained.

Evaluate the Model Performance and the area under the ROC and PR curves for each estimator. The figures may differ slightly from the
figures shown above but the performance of the two estimators should fall in the same range (from Excellent to Good).
Keep Random Forest Classifier as the selected approach and click Save to save the model.
Should IBM Watson Studio ask you for confirmation, e.g. whether to save the model or not, click Save.
The resulting page will provide you with information about the model and its evaluation results.
The model evaluation report does not provide exactly the same set of classification approaches and evaluation metrics as the
Jupyter notebook did, but it arrived at a result significantly faster.
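The area under the ROC curve reported for each estimator has a simple interpretation: it is the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The following minimal sketch computes it that way in plain Python. The function name, labels and scores are made up for illustration, and the sketch makes no claim about how the Model Builder computes the metric or maps it to a rating.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = churn, 0 = no churn, with hypothetical model scores.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]

print(roc_auc(labels, scores))
```

A value of 1.0 means the scores separate the two classes perfectly, while 0.5 is what random scores would achieve.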
I find this Model Builder component of IBM Watson Studio extremely useful in creating an initial machine learning model that
can be evaluated with respect to prediction performance and tested as well without time consuming programming efforts.
The single rating delivered by the service (Excellent, Good, Fair, Poor) is also helpful in initially getting an idea whether
the data set at hand is at all useful for the purpose that we intend to use it for. Another advantage which can be observed
from the page above is that it is possible to configure performance monitoring of the model. This will provide you with the
ability to monitor the execution of the model as it is used and retrain the model on the fly as feedback data are
gathered. For an example of how to do this, see the tutorial “Build, deploy, test, and retrain a predictive machine
learning model” or the video “Build a Continuous Learning Model” that is part of the IBM Watson Machine Learning course
on developerWorks.
You can try out this way of using the Model Builder by creating a model using a data set for customer churn that is available
in IBM Watson Studio community. Do the following to get this data set into your project:
Select the Community tab in the toolbar of IBM Watson Studio.
Enter ‘Telco’ as search term.
Select the filter icon titled All filters.
Enable ‘Data Sets’ only so that you only see the data sets.

Select the ‘Customers of a telco including services used’ dataset.
Click the + button in the right bottom corner to import the dataset into your project.
Select your project in the Add to project menu.
Click Add and wait for the import to finish.
Select the View Project button to get back to your project.
Select the Assets tab to get back to the page that shows your assets and locate the imported data asset.

You can now continue very fast with data understanding and model building. Open the imported data set to view the
attributes. Then repeat the steps to build a model from this data set using a binary classification estimator and ‘churned’ as
target attribute. Wait a few minutes and you will get the feedback for the performance of the estimators. It is likely to be Poor
for the given data set.
Deployment and Test using the IBM Watson Machine Learning Service
According to the IBM process for Data Science, once a satisfactory model has been developed and is approved by the
business sponsors, it is deployed into the production environment or a comparable test environment. Usually it is deployed in
a limited way until its performance has been fully evaluated.
With the Model Builder and Machine Learning service of IBM Watson Studio, we can deploy a model in 3 different ways: as a
web service, as a batch program or as real time streaming prediction. In this recipe we shall simply deploy it as a web
service and then continue immediately by testing it interactively.
To deploy the model do the following within the resulting model evaluation page from the previous step. Alternatively, locate
the model in the Model section of the Assets tab for the project and click the name of the model to open it:
Select the Deployments tab.
Click Add Deployments in the upper right part of the page.

On the Create Deployment page do the following:
Enter a Name for the deployment, e.g. ‘Customer Churn – Manual – Web Deployment’.
Keep the default Web Service Deployment type setting.
Enter an optional Description.
Click Save to save the deployment.
Wait until IBM Watson Studio sets the STATUS field to DEPLOYMENT_SUCCES.

The model is now deployed and can be used for prediction. However, before using it in a production environment it may be
worthwhile to test it using real data. This can be done interactively or programmatically using the API for the IBM Machine
Learning Service. We shall look into using the API in an upcoming section of the recipe and will continue in this section
testing it interactively.
The Model Builder provides you with two options for testing the prediction: by entering the values one by one in distinct fields
(one for each feature), or to specify the feature values using a JSON object. We shall use the second option since it is the
most convenient one when tests are performed more than once (which is usually the case) and when a large set of feature
values are needed. To get hold of a predefined test data set, do the following:
Download the test data from GitHub in the file ibm-watson-machine-learning/Customer Churn Test Data.txt.
Open the file and copy the value.

Notice that the JSON object defines the names of the fields first, followed by a sequence of observations to be predicted –
each in the form of a sequence:

{"fields": ["state", "account length", "area code", "phone number", "international plan",
"voice mail plan", "number vmail messages", "total day minutes", "total day calls",
"total day charge", "total eve minutes", "total eve calls", "total eve charge",
"total night minutes", "total night calls", "total night charge", "total intl minutes",
"total intl calls", "total intl charge", "customer service calls"],
"values": [["NY", 161, 415, "351-7269", "no", "no", 0, 332.9, 67, 56.59, 317.8, 97, 27.01,
160.6, 128, 7.23, 5.4, 9, 1.46, 4]]}
Be aware that some of the features such as state (and phone number) are expected to be in the form of strings (which
should be no surprise), whereas the true numerical features can be provided as integers or floats as appropriate for the
given feature.
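The same payload can also be assembled programmatically, which helps when several observations are to be scored. The following is a minimal Python sketch and not part of the original recipe: the helper name build_payload is hypothetical, while the field names and the sample observation are copied from the JSON object above.

```python
import json

# Field names copied from the test data shown above.
FIELDS = [
    "state", "account length", "area code", "phone number",
    "international plan", "voice mail plan", "number vmail messages",
    "total day minutes", "total day calls", "total day charge",
    "total eve minutes", "total eve calls", "total eve charge",
    "total night minutes", "total night calls", "total night charge",
    "total intl minutes", "total intl calls", "total intl charge",
    "customer service calls",
]


def build_payload(rows):
    """Hypothetical helper: wrap observations in the fields/values layout.

    Every row must supply exactly one value per field, in field order.
    """
    for row in rows:
        if len(row) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} values, got {len(row)}")
    return {"fields": FIELDS, "values": [list(row) for row in rows]}


# The single observation from the test data file: strings for state, phone
# number and the two plan flags, integers or floats for everything else.
observation = ["NY", 161, 415, "351-7269", "no", "no", 0, 332.9, 67, 56.59,
               317.8, 97, 27.01, 160.6, 128, 7.23, 5.4, 9, 1.46, 4]

payload = build_payload([observation])
payload_json = json.dumps(payload)
print(payload_json)
```

The printed string can be pasted into the input field of the Test tab in the same way as the content of the downloaded file.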
To test the model at runtime do the following:
Select the deployment that you just created by clicking the link named by the deployment (e.g. ‘Customer Churn –
Manual – Web Deployment’).
This will open a new page providing you with an overview of the properties of the deployment (e.g. name, creation date,
status).
Select the Test tab.
Select the icon above that allows you to enter the values using JSON.
Paste the JSON object in the downloaded ‘Customer Churn Test Data.txt’ file into the Enter input data field.
Click the Predict button.

The result of the prediction is given in terms of the probability that the customer will churn (True) or not (False). You can try it
with other values, e.g. by substituting the values with values taken from the ‘Customer Churn – Kaggle.csv’ file. Another test
would be to change the phone number to e.g. “XYZ” and then run the prediction again. The result of the prediction should be
the same.
Modeling and Evaluation using the SPSS Modeler Flows
IBM Watson Studio Modeler flows provide an interactive environment for quickly building machine learning pipelines that flow
data from ingestion to transformations and model building and evaluation – without needing any code.
We shall briefly introduce the component in this section of the recipe by going through the following steps:
Create a new model flow from an existing model flow on GitHub.
Change the model flow's input file and then run it.
Get into the main details of the flow to understand how it works and what kind of features the modeler flow provides for
defining machine learning pipelines and models.
Deploy the flow to the IBM Watson Machine Learning service.

Once the model has been deployed we will test it in the next section using a Jupyter notebook for Python.
To create an initial machine learning flow, do the following:
From the Assets page, click Add to project.
In the Choose asset type dialog, select Modeler Flow.
On the next page titled Modeler, select the ‘From File’ tab.
Download the model flow named ‘Customer Churn Flow.str’ from
https://github.com/EinarKarlsen/ibm-watson-machine-learning.
Drag and drop the downloaded modeler flow file onto the upload area. This will also set the name for the flow (see above
screenshot).
Change the name and provide a description for the machine learning flow if you like (optional).

Click Create. This opens the Flow Editor that can be used to create a machine learning flow.

You have now imported an initial flow that we will explore in the remainder of this section.
You can get an overview of the various supported modeling techniques from the Palette to the right of the page. The first
one is Auto Classifier that will try several techniques and then present you with the results of the best one.
The main flow itself defines a pipeline consisting of several steps:
A Data Asset node for importing the data set.
A Type node for defining meta data for the features, including a selection of the target attribute for the classification.
An Auto Data Prep node for preparing the data for modeling.
A Partition node for partitioning the data into a training set and a testing set.
An Auto Classifier node called ‘churn’ for creating and evaluating the model.

Additional nodes have been associated with the main pipeline for viewing the input and output respectively. These are:
A Table output node called ‘Input Table’ for previewing the input data.
A Data Audit node called ’21 fields’ (default name) for auditing the quality of the input data set (min, max, standard
deviation etc.).
An Evaluation node for evaluating the generated model.
A Table output node called ‘Result Table’ for previewing the results of the test prediction.

We will go through the details one by one in the remainder of this section before we finally deploy the model to the IBM
Watson Machine Learning Service. But first you will need to run the flow and before doing this you must connect the flow
with the appropriate set of test data available in your project. Consequently do the following:
Select the 3 dots of the Data Asset node to the left of the flow (the input node).

Invoke the Open command from the menu. This will show the attributes of the node in the right part of the page.
Click the Change data asset button to change the input file.
On the next page, select your CSV file containing customer churn and click OK.
Click Save.
Click the Run button (the arrow head) in the toolbar to run the flow.
Running the flow will create a number of outputs or results that can be inspected in more detail.
Data Understanding
If we follow the flow in the original Jupyter notebook on Kaggle, then the first step following data import is to view the data.
To achieve this do the following:
Select the Input Table node.
Select the 3 dots in the upper right corner and invoke the Preview command from the popup menu.


The last interaction may run part of the flow again but has the advantage that the page provides a Profile tab for profiling the
data and a Visualization tab for creating dashboards:
The Jupyter notebook then continues providing a description for each of the columns listing their minimum, maximum, mean
and standard deviation – amongst others. To achieve a similar task with the current flow do the following:
Select the command View outputs and versions from the top right of the toolbar.
Select the Output tab.
Double click the output for the node named “21 Fields”. Alternatively select the 3 dots associated with the output and
invoke Open from the popup menu.

This will provide you with the following overview:

For each feature it shows the distribution in graphical form and whether the feature is categorical or continuous. For
numerical features the computed min, max, mean, standard deviation and skewness are shown as well. From the column
named Valid we observe that there are 3333 valid values meaning that no values are missing for the listed features and we
do not need to bother further with this aspect of preprocessing to filter or transform columns with lacking values.
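To make the audit figures more concrete, the following Python sketch computes the same kind of summary for a single numeric column. It is an illustration only: the function name audit and the sample values are hypothetical, and the skewness uses the plain moment-based definition, which may differ from the formula applied by the Data Audit node.

```python
import statistics


def audit(values):
    """Summarise one numeric column; None marks a missing value."""
    valid = [v for v in values if v is not None]
    mean = statistics.mean(valid)
    # Second and third central moments (population form) for the skewness.
    m2 = sum((v - mean) ** 2 for v in valid) / len(valid)
    m3 = sum((v - mean) ** 3 for v in valid) / len(valid)
    return {
        "valid": len(valid),
        "missing": len(values) - len(valid),
        "min": min(valid),
        "max": max(valid),
        "mean": mean,
        "std_dev": statistics.stdev(valid),  # sample standard deviation
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
    }


# Hypothetical sample of a 'customer service calls' column with one gap.
calls = [1, 1, 0, 2, 3, 0, 3, None, 1, 4]
summary = audit(calls)
for name, value in summary.items():
    print(name, value)
```

A column without gaps reports a missing count of zero, which corresponds to the situation described above where the Valid column equals the number of records.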

Data Preparation
You can actually change the initial assessment of the features made by the import using the Type node which happens to be
the next node in the pipeline. To achieve this do the following:
Go back to the Flow Editor by selecting ‘Customer Churn Flow’ in the toolbar.
Select the Type node.
Invoke the Open command from the popup menu.

This will provide a table showing the features (i.e. fields), their kind (continuous, flag etc) and role – amongst others:

The Measure can be changed if needed using this node and it is also possible to specify the role of a feature. In this case
the role of the churn feature (which is a Flag with True and False values) has been changed to Target. The Check column
may give you more insight into the values of the field.
The Jupyter notebook continued by transforming categorical fields into numerical ones using label encoders and by
normalizing the fields. The same can be achieved with very little work required using the Auto Data Prep node. To continue
simply:
Click Cancel to close the property editor for the Type node.
Select the Auto Data Prep node in the flow editor.
Invoke Open from the popup menu.

This node offers a multitude of settings, e.g. for defining the objective of the transformation (optimize for speed or for
accuracy).

The screenshot above shows that the transformation has been configured to exclude fields with too many missing values
(threshold being 50) and to exclude fields with too many unique categories. I assume that the latter applies to the phone
numbers and have therefore decided not to worry more about them.
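For comparison with the Auto Data Prep node, the two transformations mentioned for the Jupyter notebook (label encoding and normalization) can be sketched in a few lines of plain Python. This is a simplified illustration and not the notebook's code: the function names and sample values are hypothetical, and standardization to zero mean and unit variance is assumed here as the form of normalization.

```python
import statistics


def label_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    mapping = {cat: code for code, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation.

    Assumes the column is not constant, otherwise the deviation is zero.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical samples of a categorical and a continuous column.
plans = ["no", "yes", "no", "no"]
codes, mapping = label_encode(plans)
print(mapping, codes)

minutes = [100.0, 200.0, 300.0]
scaled = standardize(minutes)
print(scaled)
```

The Auto Data Prep node hides these steps behind its settings, which is convenient but gives less control over the exact encoding used.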
The next node in the pipeline is the Partition node, which splits the data set into a training set and a testing set. For the
current Partition node an 80-20 split has been used.
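An 80-20 split of this kind can be sketched in Python as follows. The function name partition and the seed are hypothetical, and the sketch makes no claim about how the Partition node samples internally; it only shows the idea of shuffling the records once and cutting them at the 80% mark. The record count of 3333 is taken from the data audit above.

```python
import random


def partition(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows reproducibly and cut them into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the 3333 customer records.
train, test = partition(range(3333))
print(len(train), len(test))
```

Fixing the seed keeps the split stable between runs, so that evaluation figures stay comparable when the model settings are changed.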

Modeling
Having transformed and partitioned the data the Jupyter notebook continued by training the model. In the SPSS Modeler Flow
this is achieved by the Auto Classifier node which – amongst others – provides various settings e.g. for ranking and
discarding (using threshold accuracy) the models generated.

Notice that the property Default number of models to use is set to 3 which is the default value. Please feel free to change
it to 5 and then click Save to save the changes.

Model Evaluation
To get more details about the generated model do the following:
Select the yellow model icon.
Invoke the View Model command from the menu.

This overview section will provide you with a list of 3 selected classifier models and their accuracy.

The estimator with the lowest accuracy is the C&R Tree Model. To dive into the details do the following:
Select the name C&RT (it is a link).
On the next page select the Tree Diagram link to the left to get the tree diagram for the estimator.

You can now hover over either one of the nodes or one of the branches in the tree to get more detailed information about
the decision made at a given point:
Go back by clicking the left arrow in the top left corner. Then select the Random Tree estimator to get the details for
that estimator:
You may wonder why the number (89%) is lower than the one shown in the Auto-Classifier overview (94%) for the Random
Forest estimator. The reason is that the numbers in the confusion matrix are based on results for the out-of-bag (OOB)
instances of each tree in the ensemble, which is a standard method used for random trees/forests models to estimate how
well the models will work on new data. The number shown in the overview page for the Auto-Classifier node is, on the other
hand, based on scoring the full training data set using the Random Trees ensemble, which tends to give a more optimistic
value but is more directly comparable to the values from the other algorithms shown in that table. A more detailed
discussion can be found in the documentation for Random Trees.
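The out-of-bag idea can be illustrated with a small Python sketch. In a random forest each tree is trained on a bootstrap sample, i.e. rows drawn with replacement, and the rows that were never drawn for a given tree are unseen by it and can serve as test data for that tree. The function name bootstrap_split and the seed are hypothetical, and the sketch only shows the sampling step, not the Random Trees implementation in SPSS.

```python
import random


def bootstrap_split(n_rows, rng):
    """Draw n_rows indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)
in_bag, out_of_bag = bootstrap_split(3333, rng)
oob_fraction = len(out_of_bag) / 3333
print(f"out-of-bag fraction for this tree: {oob_fraction:.2f}")
```

Roughly a third of the rows end up out-of-bag for each tree, which is why the OOB figures behave like a test on new data and come out lower than an accuracy computed on the training data itself.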
If we would like to get the confusion matrix for the complete data set, which would provide a better basis for comparing the
results with the Python Notebook, it can be achieved by adding a Matrix Output node to the canvas:
Go back to the flow.
Add a Matrix node from the Outputs menu.
Attach the matrix node to the model output node.
Open the Matrix node.
Put the target attribute ‘churn’ in the Rows and the binary prediction ‘$XF-churn’ in the Columns.
For Cell contents select Cross-tabulations.

Click on Appearance and select Counts, Percentage of Row, Percentage of Column, and Include row and column totals.
Click Save.
Run the Matrix node.
Select View Output and Versions in the upper right corner.
Open the output for the Matrix node (named ‘churn x $XF-churn’) by double clicking it.
The main diagonal cell percentages contain the recall values as the row percentages (i.e. 100 times the proportions
generally used for these metrics) and the precision values as the column percentages. The F1 statistics and weighted versions of precision
and recall over both categories would have to be manually calculated. The results shown are the combined results applying
all 3 algorithms. If you want to see the results just for the Random Forest go back to the Auto Classifier node. Open it and
un-check the boxes for all other models than Random Forest. Then rerun the flow.
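The manual calculation mentioned above can be sketched in Python. The matrix layout follows the configuration of the Matrix node (actual ‘churn’ values in the rows, predictions in the columns), so recall is the diagonal cell divided by its row total and precision is the diagonal cell divided by its column total. The counts are hypothetical and are not the results of the flow; the function names are illustrative as well.

```python
def per_class_metrics(matrix, labels):
    """matrix[i][j] counts rows with actual labels[i] and predicted labels[j]."""
    results = {}
    for k, label in enumerate(labels):
        row_total = sum(matrix[k])                  # all actual members of the class
        col_total = sum(row[k] for row in matrix)   # all predictions of the class
        recall = matrix[k][k] / row_total
        precision = matrix[k][k] / col_total
        f1 = 2 * precision * recall / (precision + recall)
        results[label] = {"precision": precision, "recall": recall,
                          "f1": f1, "support": row_total}
    return results


def weighted_average(results, metric):
    """Average a metric over the classes, weighted by class size."""
    total = sum(r["support"] for r in results.values())
    return sum(r[metric] * r["support"] for r in results.values()) / total


# Hypothetical counts: rows are actual False/True, columns predicted False/True.
counts = [[860, 20],
          [40, 80]]
results = per_class_metrics(counts, ["False", "True"])
for label, values in results.items():
    print(label, values)
print("weighted F1:", weighted_average(results, "f1"))
```

Note that the weighted recall over both categories equals the overall accuracy, which gives a quick sanity check on the numbers.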
If you just want the confusion matrix, open the Matrix Output node and deselect the ‘Percentage of Row’ and ‘Percentage
of Column’ appearance options. Then repeat steps 8-11 above:

A more graphical way of showing the confusion matrix can be achieved by using SPSS visualizations. For that purpose you
will need to select the Result Table output node, invoke Preview and then create a Treemap Visualization with the Columns
and Summary settings as shown below:
Notice that the current pipeline performs a simple split of test and training data using the Partition node. It is also possible to
use cross validation and stratified cross validation to achieve slightly better model performance but at the cost of
complicating the pipeline. We refer to the article ‘k-fold Cross-validation in IBM SPSS Modeler‘ by Kenneth Jensen for details
on how this can be achieved.
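To illustrate what stratification means, the following Python sketch assigns row indices to k folds so that every fold keeps the class proportions of the full data set. It is a simplified illustration under stated assumptions: the function name stratified_folds and the labels are hypothetical, the rows are not shuffled, and it is not the approach described in the referenced article.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each row index to one of k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical target column: 10 churners among 50 customers.
labels = [True] * 10 + [False] * 40
folds = stratified_folds(labels, k=5)

for held_out, test_idx in enumerate(folds):
    # All other folds together form the training set for this round.
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    churners = sum(labels[i] for i in test_idx)
    print(f"fold {held_out}: {len(train_idx)} train, {len(test_idx)} test, "
          f"{churners} churners in test")
```

With an imbalanced target such as churn, a plain random split can leave a fold with very few churners, which is the situation stratification avoids.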
Showing predictor importance was the last step in the original notebook on Kaggle. To get that information for the Random
Tree classifier, select the Random Forest predictor in the output for the Auto Classifier, then select the Importance tab to
the left:

There are two more ways of viewing the results of the evaluation.
Go back to the flow editor for the Customer Churn Flow.
Select View outputs and versions from the top toolbar.
Select the output named ‘Evaluation of [$XF-churn] : Gains’ by double clicking it.

You will see the generated outputs for the model. Moreover, select the output node named Evaluation, then double click it to
get the Gain information:
Model Deployment
After you create, train, and evaluate a model, you can deploy it.

To deploy the SPSS model do the following:
Go back to the flow editor for the model flow.
Select the output node shown above (or one of the other output nodes).
Invoke the command ‘Save branch as model’ from the popup menu.
A new window opens.
Type a model name, e.g. ‘Customer Churn – SPSS Model’
Click Save.
The model is saved to the current project.

If interested in seeing other examples for using the SPSS Modeler to predict customer churn please see the tutorial ‘Predict
Customer Churn by Building and Deploying Models Using Watson Studio Flows‘

Scoring Machine Learning Models using the API
In section 7 we tested the Machine Learning service interactively. In this section we shall see how the service can be used
for predicting customer churn using the Machine Learning Service API and a Jupyter notebook for Python. The notebook is
quite simple and consists of 4 code cells:
The first code cell imports the libraries needed for submitting REST requests. The second defines the credentials for the IBM
Watson Machine Learning service. The third cell defines the payload for the scoring – basically the same payload that you
used in section 7 to test the model generated by the Model Builder. The fourth cell constructs an HTTP POST request and
sends it to the server to get the scoring for the payload. The request needs the credentials for the IBM Watson Machine
Learning service and the API scoring endpoint for the created model.
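The structure of such a notebook can be sketched as follows. This is not the notebook from GitHub: it uses only the Python standard library instead of the libraries imported there, the endpoint URL and the token are placeholders, the bearer token header is an assumption about how the credentials are passed, and the payload is abbreviated to two fields for readability. Use the template code generated for your own deployment for the real values.

```python
import json
import urllib.request

# Credentials and endpoint: placeholders only, to be replaced with the values
# for your own service and deployment.
SCORING_URL = "https://example.invalid/scoring-endpoint"   # hypothetical
ACCESS_TOKEN = "<token obtained with your service credentials>"

# Payload: abbreviated here; a real request needs all fields of the model.
payload = {"fields": ["state", "account length"], "values": [["NY", 161]]}


def build_scoring_request(url, token, payload):
    """Build (but do not send) an HTTP POST request carrying the payload."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + token,   # assumed authorization scheme
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_scoring_request(SCORING_URL, ACCESS_TOKEN, payload)
print(request.get_method(), request.full_url)

# Sending the request requires a real endpoint and valid credentials:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keeping the request construction in a function of its own makes it easy to score further payloads by changing only the third cell.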
To get the notebook to run in your environment you will need to do the following:
Deploy the machine learning model and get the code template for calling the API endpoint for scoring using Python.
Obtain the credentials for your IBM Watson Machine Learning service.
Create a new Jupyter notebook for Python from the basis of a notebook on GitHub.
Modify the notebook to use the endpoint of your machine learning model and IBM Watson Machine Learning service.
Run the notebook.

To deploy the model and get the template code for scoring the model do the following:

Locate the Watson Machine Learning Models that you have created and open the one named ‘Customer Churn – SPSS
Model’.
Select the Deployments tab.
Create a new Web service deployment named ‘Customer Churn – SPSS Model – Web Service’.
Wait until the deployment has been created, then open the deployment by clicking on the name.
Select the Implementation tab.
Select the Python tab to render the Python template code for using the API to get a prediction.
Save the code for later use.

The code defines the API endpoint, the payload for scoring as well as the header to be passed to the POST request to get
the prediction. This header will need the credentials for the IBM Watson Machine Learning service.
Go back to your Watson Studio Project.
From the toolbar select Services > Watson Services. This will provide you with a list of all IBM Cloud Watson services
that you have used.
Select the Watson Machine Learning Service that you are using in this project. This will open the dashboard for the
service.
Select the Service credentials tab to the left of the dashboard
Click the New Credential button to the right to create the credentials
Copy the credentials (including username, password and API key) to a local file.

If you are in doubt which IBM Watson Machine Learning service you are using in the project, simply select Settings from the
IBM Watson Studio toolbar and you will get a list of all services associated with the project.
Next import a notebook from GitHub and modify the notebook to use the credentials and endpoint for your model:

In the Asset tab of your IBM Watson Studio project, select the command New Notebook.
Select the From URL tab.
Click the following hyperlink ‘Test SPSS Customer Churn Machine Learning Model‘ and copy the URL. Then paste the
URL to the URL field.
Select the Free Python runtime system.
Click Create Notebook.
Copy your Machine Learning service credentials into the second code cell as shown in the first screenshot in this
section.
Replace the content of the 4th cell with the similar code fragments for your deployment (the important part of the code to
replace is the API endpoint)
Invoke File > Save.

Having modified the code you can run the cells one by one and finally get the score. Feel free to test the prediction with
other values.
Conclusion
In this recipe we have briefly presented 3 approaches for creating machine learning models in IBM Watson Studio: Jupyter
notebooks with Python, SPSS Modeler Flows and last but not least the Model Builder.
The Model Builder provides the highest degree of automation and makes it possible to generate a machine learning model
that can be evaluated, deployed and tested within a few minutes by simple user interactions with IBM Watson Studio. It does
not however give much insight into what is going on behind the scenes with regard to data preparation and transformation,
the training process or the detailed evaluation metrics. It is however very useful in generating models very fast that can be
used right away in a business context or to get an assertion whether the data set at hand can at all be used as a basis for
training models (in its raw form). This component is backed up with capabilities of IBM Watson Studio such as dashboards
and Refine that come in handy during the Data Understanding and Data Transformation phase when the transformations
needed are of limited complexity.
The SPSS Modeler Flow provides a graph editor for composing machine learning pipelines with an extensive palette of
operations for data transformation (cleansing, filtering, normalization etc) as well as a large set of data science estimators to
choose from. One of these is the Auto Classifier that will automatically train several models at once enabling the user to pick
the most suitable one at the end. This is backed up with an extensive set of capabilities supporting the Data Understanding
and Model Evaluation phase – all using a graphical notation and without the need to get deeply involved in any kind of
programming. Straightforward pipelines can therefore be built in a short time, and the approach provides significantly more
transparency and control compared to e.g. the Model Builder.
In context of a more intensive need for data transformations during the Data Preparation phase or specific approaches for
e.g. model training and model evaluation during the Modeling phase (e.g. using stratified cross validation) Jupyter notebooks
and Python numpy, pandas and scikit-learn are probably still the place to be. However this does not necessarily imply that
everything needs to be done in Python as in the original notebook. Tasks such as Data Understanding can more easily be
undertaken using e.g. the Profiler and Dashboard capabilities of IBM Watson Studio. Final deployment of machine learning
models can also be achieved using e.g. IBM Watson Machine Learning – although this capability has been out of scope for
the current recipe. Last but not least, once deployed the models can be monitored and retrained using the capabilities of the
IBM Machine Learning service.


Acknowledgement
This recipe started out with a dataset and a corresponding Jupyter Notebook for predicting customer churn from Sandip
Datta available on Kaggle. I would like to thank Sandip Datta for making both assets – of very good quality – available for
use by others. I would also like to thank David P. Nichols from the Watson Machine Learning team for providing me with
information on how to interpret the accuracy and generate the confusion matrix for the Random Forest predictor using
SPSS.
