Getting Started with Text Mining:
 STM™, CART® and TreeNet®

            Dan Steinberg
          Mykhaylo Golovnya
           Ilya Polosukhin
              May, 2011
Text Mining and Data Mining

Text mining is an important and fascinating area of modern analytics
On the one hand text mining can be thought of as just another application
area for powerful learning machines
On the other hand, text mining is a distinct field with its own dedicated
concepts, vocabulary, tools, and techniques
In this tutorial we aim to illustrate some important analytical methods and
strategies from both perspectives on data mining
    • introducing tools specific to the analysis of text, and
    • deploying general machine learning technology

The Salford Text Mining utility (STM) is a powerful text processing system
that prepares data for advanced machine learning analytics
Our machine learning tools are the Salford Systems flagship CART® decision
tree and stochastic gradient boosting TreeNet®
Evaluation copies of the proprietary technology in CART and TreeNet as
well as the STM are available from
                           Salford Systems © Copyright 2011                   2
For Readers of this Tutorial

To follow along with this tutorial we recommend that you have the analytical tools we use
installed on your computer. Everything you need may already be on a CD
containing this tutorial and analytical software
Create an empty folder named “stmtutor”; this is the root folder where all of the work
files related to this tutorial will reside
You may also use the following link to download Salford Systems Predictive Modeler
After downloading the package, unzip its contents into “stmtutor” which will create a
new folder named “SPM680_Mulitple_Installs_2011_06_07”. Follow installation steps
described on the next slide.
For the original DMC2006 competition website visit
We recommend that you visit the above site for information only; data and tools for
preparing that data are available at the URL next below
For the STM package, prepared data files, and other utilities developed for this tutorial
please visit
After downloading the archive, unzip its contents into “stmtutor”
Important! Installing the SPM Software

The Salford Systems software you've just downloaded needs to be both
installed and licensed. No-cost license codes for a 30 day period are
available on request to visitors of this tutorial*
Double click on the “Install_a_Transform_SPM.exe” file located in the
“SPM680_Mulitple_Installs_2011_06_07” folder (see the previous slide) to
install the specific version of SPM used in this tutorial
    • Following the above procedure will ensure that all of the currently installed
      versions of SPM, if any, will remain intact!

Follow simple installation steps on your screen
* Salford Systems reserves the right to decline to offer a no-cost license at its sole discretion
Important! Licensing the SPM Software

When you launch the Salford Systems Predictive Modeler (SPM) you will be
greeted with a License dialog containing information needed to secure a
license via email
Please send the necessary information to Salford Systems; you then secure your
license by entering the “Unlock Code” that will be e-mailed back to you
The software will operate for 3 days without any licensing; however, you can
secure a 30-day license on request

Installing the Salford Text Miner (STM)

In addition to the Salford Predictive Modeler (SPM) you will also work with the
Salford Text Miner (STM) software
No installation is needed and you should already have the “stm.exe”
executable in the “stmtutor\STM\bin” folder as the result of unzipping the
“” package earlier
STM builds upon the Python 2.6 distribution and the NLTK (Natural Language
Toolkit) but makes text data processing for analytics very easy to conduct
and manage
    • You do not need to add any other support software to use STM

Expect to see several folders and a large number of files located under the
“stmtutor\STM” folder. It is important to leave these files in the location to
which you have installed them.
    • Please do not MOVE or alter any of the installed files other than those explicitly
      listed as user-modifiable!

“stm.exe” will expire in the middle of 2012; contact Salford Systems to get an
updated version beyond that
The Example Project

The best examples are drawn from real world data sets and we were
fortunate to locate data publicly released by eBay.
Good teaching examples also need to be simple.
    • Unfortunately, real world text mining could easily involve hundreds of thousands if
      not millions of features characterizing billions of records. Professionals need to be
      able to tackle such problems but to learn we need to start with simpler situations.
    • Fortunately, there are many applications in which text is important but the
      dimensions of the data set are radically smaller, either because the data available
      is limited or because a decision has been made to work with a reduced problem.

We use our simpler example to illustrate many useful ideas for beginning text
miners while pointing the way to working on larger problems.

The DMC2006 Text Mining Challenge

In 2006 the DMC data mining competition (restricted to student competitors
only) introduced a predictive modeling problem for which much of the
predictive information was in the form of unstructured text.
The datasets for the DMC 2006 data mining competition can be downloaded
    • For your convenience we have re-packaged this data and made it somewhat
      easier to work with. This re-packaged data is included in the STMU package
      described near the beginning of this tutorial.

The data summarizes 16,000 iPod auctions held at eBay from May 2005
through May 2006 in Germany
Each auction item is represented by a text description written by the seller (in
German) as well as a number of flags and features available to the seller at
the time of the auction
Auction items were grouped into 15 mutually exclusive categories based on
distinct iPod features: storage size, type (regular, mini, nano), and color
The competition goal was to predict whether the closing price would be above
or below the category average
Comments on the Challenge

One might think that a challenge with text in German might not be of general
interest outside of Germany
However, working with a language essentially unfamiliar to any member of
the analysis team helps to illustrate one important point
    • Text mining via tools that have no “understanding” of the language can be
      strikingly effective

We have no doubt that dedicated tools which embed knowledge of the
language being analyzed can yield predictive benefits
    • We also believe we could have gained further valuable insight into the data if any
      of the authors spoke German! But our performance without this knowledge is still

In contexts where simple methods can yield more than satisfactory results, or
in contexts where the same methods must be applied uniformly across
multiple languages, the methods described in this tutorial will be an excellent

Configuring Work Location in SPM

The original datasets from the DMC 2006 challenge reside in the
“stmtutor\STM\dmc2006” folder
To facilitate further modeling steps, we will configure SPM to use this location
as the default location:
    • Start SPM
    • Go to the Edit – Options menu
    • Switch to the Directories tab
    • Enter the “stmtutor\STM\dmc2006” folder location in all text entry boxes
      except the last one
    • Press the [Save as Defaults] button so that the configuration is restored
      the next time you start SPM

Configuring TreeNet Engine

Now switch to the TreeNet tab
    • Configure the Plot Creation section as shown on the screen
    • Press the [Save as Defaults] button
    • Press the [OK] button to exit

Steps in the Analysis: Data Overview

1.   Describe the data: (Data Dictionary and Dimensions of Data)
         a.   What is the unit of observation? What does each record of data describe?
         b.   What is the dependent or target variable?
         c.   What other variables (data base fields) are available?
         d.   How many records are available?

2.   Statistical Summary
         a.   Basic summary including means, quantiles, frequency tables
         b.   Dimensions of categorical predictors
         c.   Number of distinct values of continuous variables

3.   Outlier and Anomaly Assessment
         a.   Detection of gross data errors such as extreme values
         b.   Assessment of usability of levels of categorical predictors (rare levels)

Data Fundamentals

The original dataset is called “dmc2006.csv” and resides in the
“stmtutor\STM\dmc2006” folder
16,000 records divided into two equal-sized partitions
    • Part 1: Complete data including target, available for training during the competition
    • Part 2: Data to be scored; during the competition the target was not available

25 database fields, two of which were unstructured text written by the seller
Each line of data describes an auction of an iPod including the final winning
bid price
An eBay seller must construct a headline and a description of the product
being sold. Sellers can also pay for selling assistance
    • E.g. a seller can pay to list the item title in BOLD

The Data: Available Fields

  The following variables describe general features of each auction event

Variable                      Description
AUCT_ID                       ID number of auction
ITEM_LEAF_CATEGORY_NAME       product category
LISTING_START_DATE            start date of auction
LISTING_END_DATE              end date of auction
LISTING_DURTN_DAYS            duration of auction
LISTING_TYPE_CODE             type of auction (normal auction, multi auction, etc)
QTY_AVAILABLE_PER_LISTING     number of items offered in a multi auction
FEEDBACK_SCORE_AT_LISTIN      feedback-rating of the seller of this auction listing
START_PRICE                   start price in EUR
BUY_IT_NOW_PRICE              buy it now price in EUR
BUY_IT_NOW_LISTING_FLAG       option for buy it now on this auction listing

Available Data Fields

  In addition, there are binary indicators of various “value added” features that
  can be turned on for each auction

Variable                         Description
BOLD_FEE_FLAG                    option for bold font on this auction listing
FEATUERD_FEE_FLAG                show this auction listing on top of homepage
CATEGORY_FEATURED_FEE_FLAG       show this auction listing on top of category
GALLERY_FEE_FLAG                 auction listing with picture gallery
GALLERY_FEATURED_FEE_FLAG        auction listing with gallery (in gallery view)
IPIX_FEATURED_FEE_FLAG           auction listing with IPIX (additional xxl, picture
                                 show, pack)
RESERVE_FEE_FLAG                 auction listing with reserve-price
HIGHLIGHT_FEE_FLAG               auction listing with background color
SCHEDULE_FEE_FLAG                auction listing, including the definition of the
                                 starting time
BORDER_FEE_FLAG                  auction listing with frame

Target Variable

  Finally, the target variable is defined based on the winning bid revenue
  relative to the category average

Variable              Description
GMS                   scored sales revenue in EUR
CATEGORY_AVG_GMS      Average sales revenue for the product category
GMS_GREATER_AVG       zero when the revenue is less than or equal to the
                      category average sales and one otherwise
   The values were only disclosed on a randomly selected set of 8,000 auctions
   which we use to train a model
       4199 auctions with the revenue below the category average
       3801 auctions with the revenue above the category average
   During the competition the results for the remaining 8,000 auctions
   were kept secret and used to score competitive entries
       We will only use these records at the very end of this tutorial to validate
       the performance of various models that will be built
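
The target definition in the table above can be sketched in a few lines of Python. This is only an illustration, assuming each auction is held as a dict keyed by the field names from the table; the helper name gms_greater_avg and the three sample records are hypothetical and are not taken from dmc2006.csv.

```python
def gms_greater_avg(gms, category_avg_gms):
    """Return 1 when the revenue exceeds the category average, else 0.

    Mirrors the table: zero when revenue is less than or equal to the
    category average, one otherwise.
    """
    return 1 if gms > category_avg_gms else 0


# Hypothetical records, for illustration only (not real competition data).
auctions = [
    {"AUCT_ID": 1, "GMS": 210.0, "CATEGORY_AVG_GMS": 180.0},
    {"AUCT_ID": 2, "GMS": 180.0, "CATEGORY_AVG_GMS": 180.0},
    {"AUCT_ID": 3, "GMS": 95.5, "CATEGORY_AVG_GMS": 120.0},
]

for row in auctions:
    row["GMS_GREATER_AVG"] = gms_greater_avg(row["GMS"], row["CATEGORY_AVG_GMS"])

targets = [row["GMS_GREATER_AVG"] for row in auctions]
print(targets)
```

Note that an auction whose revenue exactly equals the category average is coded as zero, as the table states.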
Comments on Methodology

Predictive modeling and general analytics competitions are increasingly being
launched both by private companies and by professional organizations and
provide both public data sets and a wealth of illustrative examples using
different analytic techniques
When reviewing results from a competition, and especially when comparing
results generated by analysts running models after the competition, it is
important to keep in mind that there is an ocean of difference between being
a competitor during the actual competition and an after-the-fact commentator
Regardless of what is reported, the after-the-fact analyst does have access to
“what really happened” and it is nearly impossible to simulate the competitive
environment once the results have been published
    • We all learn in both direct and indirect ways from many sources including the
      outcomes of public competitions. This can affect anything that comes later in time.
In spite of this, we have tried to mimic the circumstances of the competitors
by presenting analyses based only on the original training data, and using
well-established guidelines we have been promoting for more than a decade to
arrive at a final model
We urge you to never take at face value an analyst's report on what would
have happened if they had hypothetically participated
First Round Modeling: Ignoring the TEXT Data

Even before doing any type of data preparation it is always valuable to run a
few preliminary CART models
    • CART automatically handles missing values and is immune to outliers
    • CART is flexible enough to adapt to any type of nonlinearity and interaction effects
      among predictors. The analyst does not need to do any data preparation to assist
      CART in this regard
    • CART performs well enough out of the box that we are guaranteed to learn
      something of value without conducting any of the common data preparation

The only requirement for useful results is that we exclude any possible
perfect or near perfect illegitimate predictors
    • Common examples of illegitimate predictors include repackaged versions of the
      dependent variable, ID variables, and data drawn from the future relative to the
      data to be predicted

We start with a quick model using 20 of the 25 available predictors. None of
these involve any of the text data we will focus on later.

Quick Modeling Round with CART

We start by building a quick CART model using original raw variables and all
8,000 complete auction records
Assuming that you already have SPM launched
    • Go to the File – Open – Data File menu
    • Note that we have already configured the default working folder for SPM
    • Make sure that the Files of Type is set to ASCII
    • Highlight the dmc2006.csv dataset
    • Press the [Open] button

Dataset Summary Window

The resulting window summarizes basic facts about the dataset
Note that even though the dataset has 16,000 records, only the top 8,000 will be
used for modeling, as was already pointed out

The View Data Window

Press the [View Data…] button to have a quick impression of physical
contents of the dataset
Our goal is to eventually use the unstructured information contained in the
text fields right next to the auction ID

Requesting Basic Descriptive Stats

We next produce some basic stats for all available variables:
    • Go to the View – Data Info… menu
    • Set the Sort mode to File Order
    • Highlight the Include column
    • Check the Select box
    • Press the [OK] button

Data Information Window

All basic descriptive statistics for all requested variables are now summarized in one
Note that the target variable GMS_GREATER_AVG is not defined for one half of
the dataset (N Missing 8,000); all those records will be automatically discarded during
model building
Press the [Full] button to see more details

Setting Up CART Model

We are now ready to set up a basic CART run:
    • Make the Classic Output window active
    • Go to the Model – Construct Model… menu (alternatively, you could press one of
      the buttons located on the bar right below the menu bar)
    • In the resulting Model Setup window make sure that the Analysis Method is set
      to CART
    • In the Model tab make sure that the Sort is set to File Order and the Tree Type is
      set to Classification
    • Check GMS_GREATER_AVG as the Target
    • Check all of the remaining variables except AUCT_ID, LISTING_TITLE$,
    • You should see something similar to what is shown on the next slide

Model Setup Window: Model Tab

Model Setup Window: Testing Tab

Switch to the Testing tab and confirm that the 10-fold cross-validation is used
as the optimal model selection method

Model Setup Window: Advanced Tab

Switch to the Advanced tab and set the minimum required number of records
for the parent nodes and the child nodes at 15 and 5
These limits were chosen to avoid extremely small nodes in the resulting tree

Building CART Model
Press the [Start] button; a building progress window will appear for a while and then the Navigator
window containing model results will be displayed
Press the little button right above the [+][-] pair of buttons, along the left border of the Navigator
window; note that all trees within one standard error (SE) of the optimal tree are now marked in green
Use the arrow keys to select the 64-node tree from the tree sequence, which is the smallest 1SE tree

CART model observations

The selected CART model contains 64 terminal nodes and it is the smallest
model with the relative error still within one standard error of the optimal
model (the model with the smallest relative error) indicated by the green bar
    • This approach to model selection is usually employed for easy comprehension
    • We might also want to require terminal nodes to contain more than the 6 record
      minimum we observe in this out of the box tree
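
The one standard error (1SE) selection rule described above can be sketched as follows. This is a minimal illustration, assuming the pruning sequence is available as a list of (terminal nodes, cross-validated relative error, standard error) tuples; the function name and the numbers in the example sequence are invented for demonstration and are not output from SPM.

```python
def smallest_1se_tree(trees):
    """Pick the smallest tree whose error is within one SE of the best tree.

    trees: list of (n_terminal_nodes, relative_error, standard_error).
    """
    # The optimal tree is the one with the smallest relative error.
    best = min(trees, key=lambda t: t[1])
    cutoff = best[1] + best[2]
    # Every tree at or below the cutoff is statistically close to the optimum.
    within_one_se = [t for t in trees if t[1] <= cutoff]
    # Among those, prefer the tree with the fewest terminal nodes.
    return min(within_one_se, key=lambda t: t[0])


# Hypothetical pruning sequence (made-up numbers, for illustration only).
sequence = [
    (2, 0.800, 0.012),
    (9, 0.700, 0.012),
    (31, 0.655, 0.011),
    (58, 0.650, 0.011),
    (97, 0.660, 0.011),
]

chosen = smallest_1se_tree(sequence)
print(chosen)
```

In this invented sequence the 58-node tree has the lowest error, but the 31-node tree lies within one standard error of it and is therefore the one selected.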

All 20 predictor variables play a role in the tree construction
    • but there is more to observe about this when we look at the variable importance

Area under the ROC curve is a respectable 0.748

CART Model Performance

Press the [Summary Reports…] button in the Navigator, select the Prediction
Success tab, and press the [Test] button to display the cross-validated test
performance of 68.66% classification accuracy
Now select the Variable Importance tab to review which variables entered into
the model
Interestingly enough, none of the “added value” paid options are important;
they exhibit practically no direct influence on the sales revenue
A detailed look at the nodes might also be instructive for understanding the
model

Experimenting with TreeNet

We almost always follow initial CART models with similar TreeNet models
We start with CART because some glaring errors such as perfect predictors
are more quickly found and obviously displayed in CART
    • A perfect predictor often yields a single split tree (two terminal nodes) for
      classification trees

TreeNet models have strengths similar to CART regarding flexibility and
robustness and have advantages and disadvantages relative to CART
    • TreeNet is an ensemble of small CART trees that have been linked together in
      special ways. Thus TreeNet shares many desirable features of CART
    • TreeNet is superior to CART in the context of errors in the dependent variable (not
      relevant in this data)
    • TreeNet yields much more complex models but generally offers substantially better
      predictive accuracy. TreeNet may easily generate thousands of trees to arrive at
      an optimal model
    • TreeNet yields more reliable variable importance rankings

A few words about TreeNet

TreeNet builds predictive models in stages. It first starts with a deliberately
very small first round tree (essentially a CART tree).
Then TreeNet calculates the prediction error made by this simple model and
builds a second tree to try to model that prediction error. The second tree
serves as a tool to update, refine, and improve the first stage model.
A TreeNet model produces a “score” which is simply the sum of all the
predictions made by each tree in the model
Typically the TreeNet score becomes progressively more accurate as the
number of trees is increased up to an optimal number of trees
Rarely the optimal number of trees is just one! Occasionally, a handful of
trees are optimal. More typically, hundreds or thousands of trees are optimal.
TreeNet models are very useful for the analysis of data with large numbers of
predictors as the models are built up in layers each of which makes use of
just a few predictors
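
The stagewise idea described above (fit a small tree, model its prediction error with the next tree, and sum the trees into a score) can be sketched in plain Python. This is not the TreeNet implementation: it is a simplified illustration that assumes a single numeric predictor, two-node trees (stumps), and squared-error residuals, whereas TreeNet's Logistic Binary mode works with a different loss. The function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single split on xs that best fits the residuals (least squares)."""
    best = None
    # Skip the largest value so the right-hand node is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def boost(xs, ys, n_trees=50, learnrate=0.1):
    """Build the score in stages: each stump models the current prediction error."""
    base = sum(ys) / len(ys)
    scores = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        residuals = [y - s for y, s in zip(ys, scores)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        trees.append((threshold, left_mean, right_mean))
        # The score is the running sum of the (shrunken) tree predictions.
        scores = [s + learnrate * (left_mean if x <= threshold else right_mean)
                  for x, s in zip(xs, scores)]
    return base, trees, scores


def mean_squared_error(ys, scores):
    return sum((y - s) ** 2 for y, s in zip(ys, scores)) / len(ys)


# Toy data, invented for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

base, trees, scores = boost(xs, ys)
initial_mse = mean_squared_error(ys, [base] * len(ys))
final_mse = mean_squared_error(ys, scores)
print(initial_mse, final_mse)
```

The learnrate plays the same role as the Learnrate setting: a smaller value makes each tree contribute less, so more trees are needed to reach the same fit.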
More detail on TreeNet can be found at

Setting Up TN Model

Switch to the Classic Output window and go to the Model – Construct
Model… menu
Choose TreeNet as the Analysis Method
In the Model tab make sure that the Tree Type is set to Logistic Binary

Setting Up TN Parameters

Switch to the TreeNet tab and do the following:
    • Set the Learnrate to 0.05
    • Set the Number of trees to use: to 800 trees
    • Leave all of the remaining options at their default values

TN Results Window

Press the [Start] button to initiate the TN modeling run; the TreeNet Results
window will appear at the end

Checking TN Performance

Press on the [Summary] button and switch to the Prediction Success tab
Press the [Test] button to view cross-validation results
Lower the Threshold: to 0.45 to roughly equalize classification accuracy in
both classes (this makes it easier to compare the TN performance with the
earlier reported CART performance)
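
The effect of moving the classification threshold can be sketched as below. This is only an illustration, assuming we have a list of predicted probabilities and the true 0/1 labels; the function name and the eight probabilities are invented and do not come from the TreeNet run.

```python
def class_accuracies(probabilities, labels, threshold):
    """Return the fraction of correctly classified records within each class."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    accuracy = {}
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        correct = sum(1 for i in members if predicted[i] == cls)
        accuracy[cls] = correct / len(members)
    return accuracy


# Hypothetical model outputs, for illustration only.
probabilities = [0.10, 0.30, 0.40, 0.55, 0.47, 0.48, 0.60, 0.42]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

at_050 = class_accuracies(probabilities, labels, 0.50)
at_045 = class_accuracies(probabilities, labels, 0.45)
print(at_050, at_045)
```

With these made-up numbers, lowering the threshold from 0.50 to 0.45 moves two class 1 records above the cut-off and brings the two class accuracies level.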

The Performance Has Improved!

The overall classification accuracy goes up to about 71%
Press the [ROC] button to see that the area under ROC is now a solid 0.800
This comes at the cost of added model complexity – 796 trees each with about 6
terminal nodes
Variable importance remains similar to CART

Understanding the TreeNet Model

TreeNet produces partial dependency plots for every predictor that
appears in the model; the plots can be viewed by pressing the [Display
Plots…] button
Such plots are generally 2D illustrations of how the predictor in question
affects an outcome
    • For example, in the graph below the Y axis represents the probability that an iPod
      will sell at an above category average price

      We see that for a BUY_IT_NOW price between 200 and 300 the probability of
      above average winning bid rises sharply with the BUY_IT_NOW_PRICE
      For prices above 300 or below 200 the curve is essentially flat meaning that
      changes in the predictor do not result in changes in the probable outcome
Understanding the Partial Dependency Plot (PD Plot)

The PD Plot is not a simple description of the data. If you plotted the raw data
as, say, the fraction of above average winning bids against price intervals you
might see a somewhat different curve
The PD Plot is a plot that is extracted from the TreeNet model and it is
generated by examining TreeNet predictions (and not input data)
The PD Plot appears to relate two variables but in fact other variables may
well play a role in the graph construction
Essentially the PD Plot shows the relationship between a predictor and the
target variable taking all other predictors into account
The important points to understand are that
    • the graph is extracted from the model and not directly from raw data
    • the graph provides an honest estimate of the typical effect of a predictor
    • the graph displays not absolute outcomes but typical expected changes from some
      baseline as the predictor varies. The graph can be thought of as floating up or
      down depending on the values of other predictors
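
One common way to extract such a curve from a fitted model is to force the predictor to each value on a grid for every record, and average the model's predictions. The sketch below illustrates that idea; it is not necessarily how TreeNet computes its plots. The function toy_model is a made-up stand-in for a fitted model, and the two records are hypothetical.

```python
def partial_dependence(model, rows, feature, grid):
    """Average the model's predictions with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        # Overwrite only the chosen predictor; all other fields keep their values.
        predictions = [model({**row, feature: value}) for row in rows]
        curve.append(sum(predictions) / len(predictions))
    return curve


def toy_model(row):
    """Hypothetical stand-in for a fitted model (not the TreeNet model)."""
    ramp = min(max((row["BUY_IT_NOW_PRICE"] - 200.0) / 100.0, 0.0), 1.0)
    return 0.2 + 0.5 * ramp + 0.1 * row["BOLD_FEE_FLAG"]


# Hypothetical records, for illustration only.
rows = [
    {"BUY_IT_NOW_PRICE": 180.0, "BOLD_FEE_FLAG": 0},
    {"BUY_IT_NOW_PRICE": 320.0, "BOLD_FEE_FLAG": 1},
]

grid = [150, 200, 250, 300, 350]
curve = partial_dependence(toy_model, rows, "BUY_IT_NOW_PRICE", grid)
print(curve)
```

Because the curve is built from model predictions averaged over the other predictors, it reflects the model rather than the raw data, which is the point made in the list above.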

More TN Partial Dependency Plots

Introducing the Text Mining Dimension

  To this point, we have been working only with the set of traditional structured
  data fields: continuous and categorical variables
  Further substantial performance improvement can be achieved only if we
  utilize the text descriptions supplied by the seller in the following fields

Variable              Description
LISTING_TITLE         title of auction
LISTING_SUBTITLE      subtitle of auction

   Unfortunately, these two variables cannot be used “as is”. Sellers were free to
   enter free form text including misspellings, acronyms, slang, etc.
   So we must address the challenge of converting the unstructured text strings
   of the type shown here into a well structured representation

The Bag of Words Approach of Text Mining

The most straightforward strategy for dealing with free form text is to
represent each “word” that appears in the complete data set as a dummy
(0/1) indicator variable
For iPods on eBay we could imagine sellers wanting to use words like “new”,
“slightly scratched”, “pink”, etc. to describe their iPod. Of course the
descriptions may well be complete phrases like “autographed by Angela
Merkel” rather than just single term adjectives
Nevertheless in the simplest Bag of Words (BOW) approach we just create
dummy indicators for every word
Even though the headlines and descriptions are space limited the number of
distinct words that can appear in collections of free text can be huge
In text mining applications involving complete documents, e.g. newspaper
articles, the number of distinct words can easily reach several hundred
thousand or even millions

The End Goal of the Bag of Words

     Record_ID      RED        USED        SCRATCHED       CASE
     1001           0          1           0               1
     1002           0          0           0               0
     1003           1          0           0               0
     1004           0          0           0               0
     1005           1          1           1               0
     1006           0          0           0               0

•   Above we see an example of a database intended to describe each auction
    item by indicating which words appeared in the auction announcement
•   Observe that Record_ID 1005 contains the three words “RED”, “USED” and
    “SCRATCHED”
•   Data in the above format looks just like the kind of numeric data used in
    traditional data mining and statistical modeling
•   We can use data in this form, as is, feeding it into CART, TreeNet, or
    regression tools such as Generalized Path Seeker (GPS) or everyday regression
•   Observe that we have transformed the unstructured text into structured
    numerical data
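
The transformation from free text to a 0/1 indicator table can be sketched as follows. This is a minimal illustration that assumes words are separated by whitespace and ignores case; it is not the STM implementation, and the three sample descriptions are invented.

```python
def bag_of_words(documents):
    """Return (vocabulary, matrix) with one 0/1 indicator column per distinct word."""
    tokenized = [set(doc.lower().split()) for doc in documents]
    # One column per distinct word seen anywhere in the collection.
    vocabulary = sorted(set().union(*tokenized)) if tokenized else []
    matrix = [[1 if word in words else 0 for word in vocabulary]
              for words in tokenized]
    return vocabulary, matrix


# Hypothetical auction descriptions, for illustration only.
documents = ["RED used scratched", "used case", "new"]

vocabulary, matrix = bag_of_words(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row of the matrix corresponds to one record and each column to one word, which is the same layout as the table above.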
Coding the Term Vector and TF weighting

In the sample data matrix on the previous slide we coded all of our indicators
as 0 or 1 to indicate presence or absence of a term
An alternative coding scheme is based on the FREQUENCY COUNT of the
terms with these variations:
    0 or 1 coding for presence/absence
    Actual term count (0,1,2,3,…)
    Three level indicator for absent, one occurrence, and more than one (0,1,2)

The text mining literature has established some useful weighted coding
schemes. We start with term frequency weighting (tf)
    Text mining can involve blocks of text of considerably different lengths
    It is thus desirable to normalize counts based on relative frequency. Two text fields
     might each contain the term “RED” twice, but one of the fields contains 10 words
     while the other contains 40 words. We might want our coding to reflect the fact that
     2/10 is more frequent than 2/40.
    This is nothing more than making counts relative to the total length of the unit of
     text (or document) and such coding yields the term frequency weighting
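
The 2/10 versus 2/40 comparison above can be checked with a minimal sketch. The function name and the filler tokens are assumptions made for the example; only the counts come from the slide.

```python
def term_frequency(tokens, term):
    """Occurrences of term divided by the total number of tokens."""
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)


short_doc = ["red"] * 2 + ["filler"] * 8    # 10 words, "red" appears twice
long_doc = ["red"] * 2 + ["filler"] * 38    # 40 words, "red" appears twice

tf_short = term_frequency(short_doc, "red")  # 2/10
tf_long = term_frequency(long_doc, "red")    # 2/40

print(tf_short, tf_long)
```

The raw count is the same in both documents, but the relative frequency is four times higher in the shorter one.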
Inverse Document Frequency (IDF) Weighting

IDF weighting is drawn from the information retrieval literature and is intended
to reflect the value of a term in narrowing the search for a specific document
within a larger corpus of documents
If a given term occurs very rarely in a collection of documents then that term
is very valuable as a tag to target those documents accurately
By contrast, if a term is very common, then knowing that such a term occurs
within the document you are looking for is not helpful in narrowing the search
While text mining has somewhat different goals than information retrieval the
concept of IDF weighting has caught on. IDF weighting serves to upweight
terms that occur relatively rarely.
IDF(term) =
   log { (Number of documents) / (Number of documents containing term) }
The IDF increases with the rarity of a term and is maximum for words that
occur in only one document
A common coding of the term vector uses the product: tf * idf
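
A small sketch of the IDF formula and the tf * idf product follows. The toy corpus is invented, and the natural logarithm is an assumption, since the slide does not state the base of the log.

```python
import math


def idf(term, documents):
    """log(N / n_t): N documents in total, n_t of them containing term."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError("term does not occur in the corpus")
    return math.log(len(documents) / containing)


def tf_idf(term, doc, documents):
    """Relative frequency of term in doc, multiplied by its IDF."""
    tf = doc.count(term) / len(doc)
    return tf * idf(term, documents)


# Invented corpus of four tokenized documents.
corpus = [
    ["red", "ipod", "used"],
    ["ipod", "new"],
    ["ipod", "mini", "scratched"],
    ["ipod", "used", "case"],
]

print(idf("ipod", corpus))   # occurs in every document
print(idf("used", corpus))   # occurs in two documents
print(idf("red", corpus))    # occurs in one document only
print(tf_idf("red", corpus[0], corpus))
```

A term that appears in every document gets an IDF of zero, so it contributes nothing to the tf * idf coding, while a term found in a single document gets the largest weight.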

Coding the DMC2006 Text Data

The DMC2006 text data is unusual principally because of the limit on the amount of
text a seller was allowed to upload
This has the effect of making the lengths of all the documents very similar
It also sharply limits the possibility that a term in a document would occur with a high
frequency
These factors contribute to making the TF-IDF weighting irrelevant to this challenge. In
fact, for this prediction task other coding schemes allow more accurate prediction.
STM offers these options for term vector coding
    0 – no/yes
    1 – no/yes/many – this one will be used in the remainder of this tutorial
    2 – 0/1
    3 – 0/1/2
    4 – term frequency (relative to document)
    5 – inverse document frequency (relative to corpus)
    6 – TF-IDF (traditional IR coding)
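
To make the first five coding options concrete, here is a hedged sketch of how a single term count could be mapped under each scheme number. The function `code_term` is hypothetical and is not part of STM; only the scheme numbers and their meanings are taken from the list above. Options 5 and 6 need corpus-level document counts and are left out.

```python
def code_term(count, doc_length, scheme):
    """Encode one term's occurrence count under a coding scheme number."""
    if scheme == 0:    # no/yes
        return "yes" if count > 0 else "no"
    if scheme == 1:    # no/yes/many
        if count == 0:
            return "no"
        return "yes" if count == 1 else "many"
    if scheme == 2:    # 0/1
        return 1 if count > 0 else 0
    if scheme == 3:    # 0/1/2
        return min(count, 2)
    if scheme == 4:    # term frequency relative to the document length
        return count / doc_length if doc_length else 0.0
    raise ValueError("schemes 5 and 6 need corpus-level counts")


# A term seen three times in a ten-word description, under schemes 0 to 4.
for scheme in range(5):
    print(scheme, code_term(3, 10, scheme))
```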

Text Mining Data Preparation

The heavy lifting in text mining technology is devoted to moving us from raw
unstructured text to structured numerical data
Once we have structured data we are free to use any of a large number of
traditional data mining and statistical tools to move forward
Typical analytical tools include logistic and multiple regression, predictive
modeling, and clustering tools
But before diving into the analysis stage we need to move through the text
transformation stage in detail
The first step is to extract and identify the words or “terms” which can be
thought of as creating the list of all words recognized in the training data set
This stage is essentially one of defining the “dictionary”, the list of officially
recognized terms. Any new term encountered in the future will be
unrecognizable by the dictionary and will represent an unknown item
It is therefore very important to ensure that the training data set contains
almost all terms of interest that would be relevant for future prediction

Automatic Dictionary Building

The following steps will build an active dictionary for a collection of
documents (in our case, auction item description strings)
    Read all text values into one character string
    Tokenize this string into an array of words (tokens)
    Remove words without any letters or digits
    Remove “stop words” (words like “the”, “a”, “in”, “und”, “mit”, etc.) for both English
     and German languages
    Remove words that have fewer than 2 letters and are encountered fewer than 10 times
     across the entire collection of documents (rare small words)
          At this point the too-common, too-rare, weird, obscure, and useless
           combinations of characters should have been eliminated
    Lemmatize words using WordNet lexical database
          This step combines words present in different grammatical forms (“go”, “went”,
           “going”, etc.) into the corresponding stem word (“go”)
    Remove all resulting words that appear fewer than MIN times (5 in the remainder of
     this tutorial)
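
The steps above can be sketched in plain Python. This is a simplified stand-in, not the STM code: tokenization is a whitespace split, `STOP_WORDS` is a tiny placeholder for the English and German stop word lists, and the `LEMMAS` lookup table is a placeholder for the WordNet lemmatizer. The "rare small words" rule is read here as "short AND rare", which is one interpretation of the slide.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "in", "und", "mit"}          # tiny stand-in list
LEMMAS = {"went": "go", "going": "go", "goes": "go"}   # stand-in for WordNet


def build_dictionary(texts, min_count=5, short_len=2, short_min_count=10):
    """Return the sorted list of terms that survive all filtering steps."""
    # Read all text values into one string and tokenize on whitespace.
    tokens = " ".join(texts).lower().split()
    # Remove tokens that contain no letter or digit.
    tokens = [t for t in tokens if re.search(r"[^\W_]", t)]
    # Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Remove rare small words (short AND infrequent).
    counts = Counter(tokens)
    tokens = [t for t in tokens
              if not (len(t) < short_len and counts[t] < short_min_count)]
    # Map grammatical variants onto one base form.
    tokens = [LEMMAS.get(t, t) for t in tokens]
    # Keep only terms that appear at least min_count times.
    counts = Counter(tokens)
    return sorted(word for word, n in counts.items() if n >= min_count)


texts = ["going to the shop", "went in a shop --", "go go shop x"]
print(build_dictionary(texts, min_count=3))
```

With real data the whitespace split would leave punctuation attached to words, which is one reason a proper tokenizer is used in practice.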
Build the Dictionary (or Term Vector)

For the purpose of automatic dictionary building and data preprocessing we developed the
Salford Text Mining (STM) software - a stand-alone collection of tools that perform all
the essential steps in preparing text documents for text mining
STM builds on the Python “Natural Language Toolkit” (NLTK)
From NLTK we use the following tools
    Tokenizer         (extract items most likely to be “words”)
    Porter Stemmer    (recognize different simple forms of same word – e.g. plural)
    WordNet lemmatizer (more complex recognition of same word variations)
    stop word list     (words that contribute little to no value such as “the”, “a”)

Future versions of STM might use other tools to accomplish these essential tasks
“stm.exe” is a command line utility that must be run from a Command Prompt window
(assuming you are running Windows, go to the Start – All Programs – Accessories –
Command Prompt menu)
The version provided here resides in the stmtutor\STM\bin folder

STM Commands and Options
Open a Command Prompt window in Windows, then CD to the
“stmtutor\STM” folder location, for example, on our system you would type in
cd c:\stmtutor\STM

To obtain help type the following at the prompt:
 bin\stm --help

This command will return very concise information about STM:
 stm [-h] [-data DATAFILE] [-dict DICTFILE] [-source-dict SRCDICTFILE]
         [-score SCOREFILE] [-spm SPMAPP] [-t TARGET] [-ex EXCLUDE]

The details for each command line option are contained in the software
manual appearing in the appendix
You will also notice the “stm.cfg” configuration file – this file controls the default
behavior of the STM module and relieves you of specifying a large number of
configuration options each time “stm.exe” is launched
    Note the line which specifies the names of the text variables to be processed

Create Dictionary Options

For the purposes of this tutorial, we have prepackaged all of the text processing
steps into individual command files (extension *.bat). You can either double-
click on the referenced command file or alternatively type its contents into the
Command Prompt window opened in the directory that contains the files
The most important arguments for our purposes in this tutorial now are:
    --dataset DATAFILE      name and location of your input CSV format data set
    --dictionary DICTFILE name and location of the dictionary to be created

These two arguments are all you need to create your dictionary. By default,
STM will process every text field in your input data set to create a single
omnibus dictionary
Simply double click on the “stm_create_dictionary.bat” to create the dictionary
file for the DMC 2006 dataset, which will be saved in the “dmc2006_ynm.dict”
file in the “stmtutor\STM\dmc2006” folder
In typical text mining practice the process of generating the final dictionary will
be iterative. A review of the first dictionary might reveal further words you wish
to exclude (“stop” words)

Internal Dictionary Format

         The dictionary file is a simple text file with extension
         The file contents can be viewed and edited in a
         standard text editor
         The name of the text mining variable that will be
         created later on appears on the left of the “=“ sign on
         each un-indented line
         The default value that will be assigned to this
         variable appears on the right side of the “=“ sign of
         the un-indented lines and it usually means the
         absence of the word(s) of interest
         Each indented line represents the value (left of the
         “=“) which will be entered for a single occurrence in a
         document for any of the word(s) appearing on the
         right of the “=“
             More than one occurrence will be recorded as
              “many” when requested (always the case in this tutorial)

Hand Made Dictionary

To use a multi-level coding you need to create a “hand made dictionary”, which is already
supplied to you as “hand.dict” in the “stmtutor\STM\dmc2006” folder
Here is an example of an entry in this file
The un-indented line of an entry starts with the name we wish to give to the term
(HAND_MODEL) and also indicates that a BLANK or missing value is to be coded with
the default value of “standard”
The remaining indented entries are listed one-per-line and are an exhaustive list of the
acceptable values which the term HAND_MODEL can receive in the term vector
Another coding option is, for example:
which sets “no” as the default value but substitutes “yes” if one of the two values listed
above is encountered
You may study additional examples in our stmtutor\STM\dmc2006\hand.dict file on your
own, all of them were created manually based on common sense logic
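
The logic of a multi-level entry can be illustrated with a short sketch. The Python dictionary below is a hypothetical in-memory stand-in for one entry and does not reproduce the actual “hand.dict” file syntax; the variant lists are assumptions. Only the name HAND_MODEL, the default “standard”, and the values “mini” and “nano” come from the slides.

```python
# Hypothetical representation of one hand-made entry (not the file syntax).
HAND_MODEL = {
    "default": "standard",     # used when no listed word is found
    "values": {
        "mini": ["mini"],      # value -> words that trigger it (assumed)
        "nano": ["nano"],
    },
}


def apply_entry(entry, text):
    """Return the first value whose trigger word occurs in text, else the default."""
    words = text.lower().split()
    for value, variants in entry["values"].items():
        if any(variant in words for variant in variants):
            return value
    return entry["default"]


print(apply_entry(HAND_MODEL, "Apple iPod nano 4GB"))
print(apply_entry(HAND_MODEL, "iPod mini silber"))
print(apply_entry(HAND_MODEL, ""))
```

This sketch matches single words only; multi-word variants would need phrase matching on the full text.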
Why Create Hand Made Dictionary Entries

Let's revisit the variable HAND_MODEL which brings together the terms
    Standard, mini, nano

Without a hand made dictionary entry we would have three terms created,
one for each model type, with “yes” and “no” values, and possibly “many”
By creating the hand made entry we
    Ensure that every auction is assigned a model (default=“standard”)
    All three models are brought together into one categorical variable with three
     possible values “standard”, “mini”, and “nano”

This representation of the information is helpful when using tree-based
learning machines but not helpful for regression-based learning machines
    The best choice of representation may vary from project to project
    Salford regression-based learning machines automatically repackage categorical
     predictors into 0/1 indicators meaning that you work with one representation
    But if you need to use other tools you may not have this flexibility
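
If a tool does not repackage categorical predictors for you, the conversion into 0/1 indicators is easy to do by hand. The following is a minimal sketch; the function name and the four sample values are assumptions made for the example.

```python
def to_indicators(values, levels):
    """Expand one categorical column into one 0/1 column per level."""
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Invented column of model values for four auctions.
hand_model = ["standard", "mini", "nano", "standard"]
indicators = to_indicators(hand_model, ["standard", "mini", "nano"])

for level, column in indicators.items():
    print(level, column)
```

Each auction has exactly one indicator set to 1, which is the regression-friendly form of the single three-valued variable.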

Further Dictionary Customization

  The following table summarizes some of the important fields introduced in the
  custom dictionary for this tutorial

Variable         Values                  Combines word variants
CAPACITY         20                      20gb,20 gb,20 gigabyte
                 30                      30gb,30 gb,30 gigabyte
                 40                      40gb,40 gb,40 gigabyte
                 80                      80gb,80 gb,80 gigabyte
STATUS           Wieneu                  Wie neu,super gepflegt,top gepflegt,top zustand,neuwertig
                 Neu                     neu,new,brandneu,brandneues
                 Unbenutzt               Unbenu
                 defekt                  defekt.,--defekt--,defekt,-defekt-,-defekt,defekter,defektes
MODEL            Mini, nano, standard    Captures presence of the corresponding word in the auction description
COLOR            Black, white, Green,    Captures presence of the corresponding words or variants in the auction
                 etc.                    description
IPOD_GENERATION  First, second, etc.     Identified iPod generation from the information available in the text
                                         description

Final Stage Dictionary Extraction

To generate a final version of the dictionary in most real world applications
you would also need to prepare an expanded list of stopwords
The NLTK provides a ready-made list of stopwords for English and another
14 major languages spanning Europe, Russia, Turkey, and Scandinavia
    These appear in the directory named stmtutor\STM\data\corpora\stopwords
     and should be left as they are

Additional stopwords, which might well vary from project to project, can be
entered into the file named “stopwords.dat” in the “stmtutor\STM\data” folder
    In the package distributed with this tutorial the “stopwords.dat” file is empty
    You can freely add words to this file, with one stopword per line

Once the custom “stopwords.dat” and “hand.dict” files have been prepared
you just run the dictionary extraction again but with the “--source-dictionary”
argument added (see the command files introduced in the later slides)
The resulting dictionary will now include all the introduced customizations

Creating Structured Text Mining Variables

The resulting dictionary file “dmc2006_ynm.dict” contains about 600 individual stems
In the final step of text processing the data dictionary is applied to each document entry
Each stem from the dictionary is represented by a categorical variable (usually binary)
with the corresponding name
The preparation process checks whether any of the known word variants associated
with each stem from the dictionary are present in the current auction description, and if
“yes”, the corresponding value is set to “yes”, otherwise, it is set to “no”
    When the “--code YNM” option is set, multiple instances of “yes” will be coded as “many”
    You can also request integer codes 0, 1, 2 in place of the character “yes/no/many”
    We have experimented with alternative variants of coding (see the “--code” help entry in the
     STM manual) and came to the conclusion that the “YNM” approach works best in this tutorial
    Feel free to experiment with alternative coding schemas on your own
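
The “yes/no/many” coding of a single description can be sketched as follows. This is an illustration of the idea rather than the STM code: the two-stem dictionary is a hypothetical fragment, with variant spellings taken from the customization table shown earlier, and the sample description is invented.

```python
# Hypothetical dictionary fragment: stem -> known word variants.
dictionary = {
    "DEFEKT": ["defekt", "defekter", "defektes"],
    "NEU": ["neu", "new", "brandneu"],
}


def code_ynm(description, dictionary):
    """Return one 'no'/'yes'/'many' value per stem for a single description."""
    words = description.lower().split()
    row = {}
    for stem, variants in dictionary.items():
        hits = sum(1 for word in words if word in variants)
        if hits == 0:
            row[stem] = "no"
        elif hits == 1:
            row[stem] = "yes"
        else:
            row[stem] = "many"
    return row


print(code_ynm("iPod neu new display defekt", dictionary))
```

Replacing the three strings with 0, 1 and 2 gives the integer variant of the same coding.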

The resulting large collection of variables will be used as additional predictors in our
modeling efforts
Even though other more computationally intense text processing methods exist, further
investigation failed to demonstrate their utility on the current data, which is most likely
related to the extremely terse nature of the auction descriptions

Creating Additional Variables

Finally, we spent additional effort reorganizing the original raw variables
into more useful measures
    MONTH_OF_START – based on the recorded start date of auction
    MONTH_OF_SALE – based on the recorded closing date of auction
    HIGH_BUY_IT_NOW – set to “yes” if BUY_IT_NOW_PRICE exceeds the
     CATEGORY_AVG_GMS as suggested by common sense and the nature of the
     classification problem
    In the original raw data, BUY_IT_NOW_PRICE was set to 0 on all items where that
     option was not available – we reset all such 0s to missing

All of these operations are encoded in the “” Python file
located in the “stmtutor\STM\dmc2006” folder
    This component of the STM is under active development
    The file is automatically called by the main STM utility
    You may add/modify the contents of this file to allow alternative transformations of
     the original predictors
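
A hedged sketch of these transformations for a single record is shown below. It is not the file shipped with STM. The field names START_DATE and END_DATE are assumptions (the slides do not name the raw date columns), and coding HIGH_BUY_IT_NOW as “no” when the price is missing is also an assumption.

```python
from datetime import date


def derive(record):
    """Return a copy of record with the derived fields added."""
    out = dict(record)
    # A price of 0 meant "option not available", so treat it as missing.
    if out["BUY_IT_NOW_PRICE"] == 0:
        out["BUY_IT_NOW_PRICE"] = None
    out["MONTH_OF_START"] = out["START_DATE"].month   # assumed column name
    out["MONTH_OF_SALE"] = out["END_DATE"].month      # assumed column name
    price = out["BUY_IT_NOW_PRICE"]
    # Assumption: a missing price is coded as "no".
    out["HIGH_BUY_IT_NOW"] = (
        "yes" if price is not None and price > out["CATEGORY_AVG_GMS"] else "no"
    )
    return out


# Invented example records.
no_option = derive({
    "START_DATE": date(2005, 11, 28), "END_DATE": date(2005, 12, 5),
    "BUY_IT_NOW_PRICE": 0, "CATEGORY_AVG_GMS": 120.0,
})
high_price = derive({
    "START_DATE": date(2005, 6, 1), "END_DATE": date(2005, 6, 8),
    "BUY_IT_NOW_PRICE": 150.0, "CATEGORY_AVG_GMS": 120.0,
})
print(no_option)
print(high_price)
```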

Generation of the Analysis Data Set
At this point we are ready to move on to the next step which is data creation
This is nothing more than appending the relevant columns of data to the
original data set. Remember that the dictionary may contain tens of
thousands if not hundreds of thousands of terms
For the DMC2006 dataset the dictionary is quite small by text mining
standards containing just a little over 600 words
To generate the processed dataset simply double-click on the stm_ynm.bat
command file or explicitly type in its contents in the Command Prompt
    The “--dataset” option specifies the input dataset to be processed
    The “--code YNM” option requests “yes/no/many” style of coding
    The “--source-dictionary” option specifies the hand dictionary
    The “--process” option specifies the output dataset
    Of course you may add other options as you prefer

This creates a processed dataset with the name dmc2006_res_ynm.csv
which resides in the stmtutor\STM\dmc2006 folder

Analysis Data Set Observations

At this point we have a new modeling dataset with the text information
represented by the extra variables
    Note that the raw input data set is just shy of 3 MB in size in a plain text format
     while the prepared analysis data set is about 40 MB in size, 13 times larger

Process only training data or all data?
    For prediction purposes all data needs to be processed, both the data that will be
     used to train the predictive models and the holdout or future data that will receive
     predictions later
    In the DMC2006 data we happen to have access to both training and holdout data
     and thus have the option of processing all the text data at the same time
    Generating the term vector based only on the training data would generally be the
     norm because future data flows have not yet arrived
    In this project we elected to process all the data together for convenience knowing
     that the train and holdout partitions were created by random division of the data
    It is worth pointing out, though, that the final dictionary generated from training
     data only might be slightly different due to the infrequent word elimination
     component of the text processor
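
The effect of infrequent word elimination on a training-only dictionary can be seen in a toy example. The documents and the threshold of 2 are invented for illustration; the tutorial itself uses a minimum count of 5.

```python
from collections import Counter


def frequent_terms(docs, min_count):
    """Terms that appear at least min_count times across docs."""
    counts = Counter(word for doc in docs for word in doc.split())
    return {word for word, n in counts.items() if n >= min_count}


# Invented training and holdout descriptions.
train = ["ipod nano neu", "ipod mini", "ipod nano"]
holdout = ["ipod mini defekt", "ipod neu"]

dict_train = frequent_terms(train, min_count=2)
dict_all = frequent_terms(train + holdout, min_count=2)

print(sorted(dict_train))
print(sorted(dict_all))
print(sorted(dict_all - dict_train))
```

Terms that are just below the frequency cutoff in the training data can cross it once the holdout text is added, which is why the two dictionaries may differ slightly.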

Quick Modeling Round with CART

We are now ready to proceed with another CART run this time using all of the
newly created text fields as additional predictors
Assuming that you already have SPM launched
    Go to the File – Open – Data File menu
    Make sure that the Files of Type is set to ASCII
    Highlight the
    Press the [Open] button

Dataset Summary Window
Again, the resulting window summarizes basic facts about the dataset
Note the dramatic increase in the number of available variables

The View Data Window

Press the [View Data…] button to have a quick look at the physical contents
of the dataset
Note how the individual dictionary word entries are now coded with the “yes”,
“no”, or “many” values for each document row

Setting Up CART Model

Proceed with setting up a CART modeling run as before:
    Make the Classic Output window active
    Go to the Model – Construct Model… menu (alternatively, you could use one of
     the buttons located on the bar right below the menu)
    In the resulting Model Setup window make sure that the Analysis Method is set
     to CART
    In the Model tab make sure that the Sort is set to File Order and the Tree Type is
     set to Classification
    Check GMS_GREATER_AVG as the Target
    Check all of the remaining variables except AUCT_ID, LISTING_TITLE$,
    You should see something similar to what is shown on the next slide

Model Setup Window: Model Tab

Model Setup Window: Testing Tab

Switch to the Testing tab and confirm that the 10-fold cross-validation is used
as the optimal model selection method

Model Setup Window: Advanced Tab

Switch to the Advanced tab and set the minimum required number of records
for the parent nodes and the child nodes at 15 and 5
These limits were chosen to avoid extremely small nodes in the resulting tree

Building CART Model
Press the [Start] button; a building progress window will appear for a while and then the Navigator
window containing model results will be displayed (this time, the process takes a few minutes!)
Press the little button right above the [+][-] pair of buttons, along the left border of the Navigator
window; note that all trees within one standard error (SE) of the optimal tree are now marked in green
Use the arrow keys to select the 102-node tree from the tree sequence, which is the smallest 1SE tree
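
The one standard error rule itself is simple to state in code: find the tree with the lowest cross-validated error, then take the smallest tree whose error is within one SE of that minimum. The sketch below uses an invented tree sequence; the error and SE values are not from the actual run, and only the 102-node size echoes the slide.

```python
def smallest_1se_tree(trees):
    """trees: list of (terminal_nodes, cv_error, cv_error_se) tuples."""
    best = min(trees, key=lambda t: t[1])       # lowest cross-validated error
    cutoff = best[1] + best[2]                  # minimum error plus one SE
    eligible = [t for t in trees if t[1] <= cutoff]
    return min(eligible, key=lambda t: t[0])    # fewest terminal nodes


# Invented tree sequence for illustration only.
sequence = [
    (250, 0.470, 0.010),
    (180, 0.465, 0.010),
    (102, 0.472, 0.010),
    (40, 0.500, 0.011),
]

print(smallest_1se_tree(sequence))
```

Here the 180-node tree has the lowest error, but the 102-node tree is within one SE of it and is therefore preferred as the simpler model.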

CART Model Performance
The selected CART model contains 102 terminal nodes where nearly all available
predictor variables play a role in the tree construction
Area under the ROC curve (Test) is now an impressive 0.830, especially when
compared to the one reported earlier at 0.748 for the basic CART run or the 0.800 for
the basic TN run
Press on the [Summary Reports] button in the Navigator window, select the
Prediction Success tab, and finally press the [Test] button to see cross-validated test
performance at 76.58% classification accuracy – a significant improvement!
Also note the presence of the original and derived variables on the list shown in the
Variable Importance tab

Setting Up TN Model

Now switch to the Classic Output window and go to the Model – Construct
Model… menu
Choose TreeNet as the Analysis Method
In the Model tab make sure that the Tree Type is set to Logistic Binary

Setting Up TN Parameters

Switch to the TreeNet tab and do the following:
    Set the Learnrate: to 0.05
    Set the Number of trees to use: to 800
    Leave all of the remaining options at their default values

TN Results Window

Press the [Start] button to initiate the TN modeling run; the TreeNet Results
window will appear at the end, although you might want to take a coffee
break until the modeling run completes

Checking TN Performance
Press on the [Summary] button and switch to the Prediction Success tab
Press the [Test] button to view cross-validation results
Lower the Threshold: to 0.47 to roughly equalize classification accuracy in both classes
(this makes it easier to compare the TN performance with the earlier reported CART
and TN model performance)
You can clearly see the improvement!

Requesting TN Graphs

Here we present a sample collection of all 2-D contribution plots produced by
TN for the resulting model
The plots are available by pressing on the [Display Plots…] button in the
TreeNet Results window
The list is arranged according to the variable importance table

More Graphs

Insights Suggested by the Model

Here is a list of insights we arrived at by looking into the selection of plots
    There is a distinct effect of the iPod category once all the other factors have been
     accounted for
    A larger start price means an above-average sale (most likely relates to the quality
     of an item)
    A “new” and “unpacked” item should fetch a better price, while any “defect” brings
     the price down
    End of the year means better sales
    Having a good feedback score is important
    It is best to wait 10 days or more before closing the deal
    Interestingly, 1st and 3rd generations of iPod show poorer sales than the 2nd and 4th
    2G started to fall out of favor in 2005-2006
    Black is much more popular in Germany than other colors
    Mentioning “photo”, “video”, “color display”, etc. helps get a better price
    The paid advertising features are of little or marginal importance
Final Validation of Models

At this point we are ready to check the performance of all our models using
the remaining 8,000 auctions originally not available for training
This way each model can be positioned with respect to all of the official 173
entries originally submitted to the DMC 2006 competition
However, in order to proceed with the evaluation, we must first score the
input data using all of the models we have generated up until now
The following slides explain how to score the most recently constructed
CART and TN models; the earlier models can be scored using similar steps
You may choose to skip the scoring steps as we have already included the
results of scoring in the “stmtutor\STM\scored” folder:
    Score_cart_raw.csv – simple CART model predictions
    Score_tn_raw.csv – simple TN model predictions
    Score_cart_txt.csv – text mining enhanced CART model predictions
    Score_tn_txt.csv – text mining enhanced TN model predictions

Scoring a CART Model

Select the Navigator window for the model you wish to score
Select the tree from the tree sequence (in our runs we pick the 1SE trees as
more robust)
Press the [Score] button to open the “Score Data” window
Make sure that the “Data file” is set to “dmc2006_res_ynm.csv”, if not press
the [Select…] button on the right and select the dataset to be scored
Place a checkmark in the “Save results to a file” box, then press the [Select]
button right next to it, this will open the “Save As” window
Navigate to the “stmtutor\STM\scored” folder under “Save in:” selection box,
enter “Scored_cart_txt.csv” in the “File name:” text entry box, and press the
[Save] button
You should now see something similar to what's shown on the next slide
Press the [OK] button to initiate the scoring process
You should now have the Scored_cart_txt.csv file in the stmtutor\STM\scored folder
Scoring CART

Scoring a TN Model

Select the “TreeNet Results” window for the model you wish to score
Go to the “Model – Score Data…“ menu to open the “Score Data” window
Make sure that the “Data file” is set to “dmc2006_res_ynm.csv”, if not press
the [Select…] button on the right and select the dataset to be scored
Place a checkmark in the “Save results to a file” box, then press the
[Select] button right next to it, this will open the “Save As” window
Navigate to the “stmtutor\STM\scored” folder under “Save in:” selection
box, enter “Scored_tn_txt.csv” in the “File name:” text entry box, and press
the [Save] button
You should now see something similar to what's shown on the next slide
Press the [OK] button to initiate the scoring process
You should now have the Scored_tn_txt.csv file in the
stmtutor\STM\scored folder

Scoring TN

Using STM to Validate Performance

We can now use the STM machinery to do final model validation
Simply double-click the “stm_validate.bat” command file to proceed
Note the use of the following options inside of the command file:
    “-score” – specifies the output dataset where the model predictions will be written
    “--score-column” – specifies the name of the variable containing the actual model
     predictions (these variables are produced by CART or TN during the scoring process)
    “--check” – specifies the name of the dataset that contains the originally withheld
     values of the target
         this dataset was used by the organizers of the DMC 2006 competition to
          select the actual winners
    STM is currently configured to validate only the bottom 8,000 of the 16,000
     predictions generated by the model; the top 8,000 records (used for learning) are
     simply ignored

The results will be saved into text files with extensions “*.result” appended to
the original score file names in the “stmtutor\STM\scored” folder
Validation Results Format

The following window shows the validation results of the final TN model we

 8000 validation records were scored, of which:
     719 ones were misclassified as zeroes
     807 zeroes were misclassified as ones
     Thus 1,526 documents were misclassified
     This gives the final score of 8,000 – (1,526 * 2) = 4,948

Final Validation of Models

Based on the predicted class assignments, the final performance score is
calculated as 8,000 minus twice the total number of auction items misclassified
The following table summarizes how these virtually out-of-the-box elementary
modelings perform on the holdout data (the values are extracted from the four
*.result files produced by the STM validator)

Model                ROC Area       Missed 0s            Missed 1s   Score
CART raw data        75%            1123                 1387        2980
TN raw data          80%            1308                 926         3532
CART text data       83%            981                  848         4342
TN text data         89%            807                  719         4948
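
The Score column can be reproduced from the two misclassification counts with a few lines of Python. The function name is an assumption; the counts are copied from the table above.

```python
def dmc_score(n_records, missed_zeroes, missed_ones):
    """Number of records minus twice the number of misclassified records."""
    return n_records - 2 * (missed_zeroes + missed_ones)


# (Missed 0s, Missed 1s) per model, copied from the table above.
results = {
    "CART raw data": (1123, 1387),
    "TN raw data": (1308, 926),
    "CART text data": (981, 848),
    "TN text data": (807, 719),
}

scores = {name: dmc_score(8000, zeroes, ones)
          for name, (zeroes, ones) in results.items()}

for name, score in scores.items():
    print(name, score)
```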

Visual Validation of the Results

The following graph summarizes the positioning of the four basic models with
respect to the 173 official competition entries
The TN model with text mining processing is among the top 10 winners!

[Graph: positions of the CART raw, TN raw, CART text, and TN text models among the competition entries]

Observations on the Results

We used the most basic form of text mining, the Bag of Words, with minor
    None of the authors speaks German although we did look up some of the words in
     an on-line dictionary. If there are any subtleties to be picked from seller wording
     choices we would have missed them.

We chose the coding scheme that performed best on the training data. We
have six coding options and one stands out as clearly best
We used common settings for the controls for CART and TreeNet
We did not use any of the modeling refinement techniques we teach in our
CART and TreeNet tutorials
We thus invite you to see if you can tweak the performances of these models
even higher

Command Line Automation in SPM
SPM has a powerful command line processing component which allows you to completely
reproduce any modeling activity by creating and later submitting a command file
We have packaged the command files for the four modeling and scoring runs you have conducted
in the course of this tutorial
    SPM command files must have the extension *.cmd
    The four command files are stored in the “stmtutor\STM\dmc2006” folder
You can create, open, or edit a command file using a simple text editor such as Notepad
SPM also has a built-in editor; just go to the File – New Notepad… menu
You may also access the command line directly from inside of the SPM GUI; just make sure that the
File – Command Prompt menu item is checked
Type “help” in the Command Prompt part (starts with the “>” mark) of the Classic Output
window to get the listing of all available commands
You can then request more detailed help for any specific command of interest; for example, “help
battery” will produce a long list of various batteries of automated runs available in SPM
Furthermore, you may view all of the commands issued during the current session by going to the
View – Open Command Log… menu; this way you can quickly learn which commands correspond
to your recent GUI activity

Basic CART Model Command File

You may now restart SPM to emulate a fresh run
Go to the File – Open – Command File… menu
Select the “cart_raw.cmd” command file and press the [Open] button
The file is now opened in the built-in Notepad window

CART Command File Contents
OUT – saves the classic output into a text file
USE – points to the modeling dataset
GROVE – saves the model as a binary grove file
MODEL – specifies the target variable
CATEGORY – indicates which variables are categorical, including the target
KEEP – specifies the list of predictors
LIMIT – sets the node limits
ERROR – requests cross-validation
BUILD – builds a CART model
SAVE – names the file where the CART model predictions will be saved
HARVEST – specifies which tree is to be used in scoring
IDVAR – requests saving of the additional variables into the output
SCORE – scores the CART model
OUTPUT * – closes the current text output file
Note the use of relative paths in the GROVE and SAVE commands
Also note the use of the forward slash “/” to separate folder names
Submitting Command File

With the Notepad window active, go to the File – Submit Window menu to
submit the command file into SPM
In the end you will see the Navigator and the Score windows opened, which
should be identical to the ones you have already seen at the beginning of this tutorial
Furthermore, you should now have
    “cart_raw.dat” text file created in the “stmtutor\STM\dmc2006” folder; the file
     contains the classic output you normally see in the “Classic Output” window
    “cart_raw.grv” binary grove file created in the “stmtutor\STM\models” folder; the
     file contains the CART model itself and can be opened in the GUI using the File –
     Open – Open Grove… menu, which reopens the Navigator window; this file is
     also needed for future scoring or translation
    “Score_cart_raw.csv” data file created in the “stmtutor\STM\scored” folder; the
     file contains the selected CART model predictions on your data

You may proceed now with opening up the “tn_raw.cmd” file using the File –
Open – Command File… menu

TN Command File Contents
OUT, USE, GROVE, MODEL, CATEGORY, KEEP, ERROR, SAVE, IDVAR, SCORE, OUTPUT – same as in the CART command file introduced earlier
MART TREES – sets the TN model size in trees
MART NODES – sets the tree size in terminal nodes
MART MINCHILD – sets the minimum individual node size in records
MART OPTIMAL – sets the evaluation criterion that will be used for optimal model selection
MART BINARY – requests logistic regression processing in our case
MART LEARNRATE – sets the learn rate
MART SUBSAMPLE – sets the sampling rate
MART INFLUENCE – sets the influence trimming value
The rest of the MART commands request automatic saving of the 2-D and 3-D plots into the grove; type in “help mart” to get full descriptions
Submitting the Rest of the Command Files

Again, with the current Notepad window active, use the File – Submit Window menu
to launch the basic TN modeling run, which is automatically followed by scoring
This will create the output, grove, and scored data files in the corresponding locations
for the chosen TN model; also note the use of the EXCLUDE command in place of the
KEEP command inside the command file, which saves a lot of typing
Now go back to the Classic Output window and notice that the File menu has
Go to the File – Submit Command File… menu, select the “cart_txt.cmd” command
file, and press the [Open] button
Notice the modeling activity in the Classic Output window, but no Results window is
produced; this is how the Submit Command File… menu item differs from the
Submit Window menu item used previously. Nonetheless, the output, grove, and score
files are still created in the specified locations
Use the File – Open – Open Grove… menu to open the “tn_raw.grv” file located in
the “stmtutor\STM\models” folder; you will need to navigate into this folder using the
Look in: selection box in the Open Grove File window
You may now proceed with the final TN run by submitting the “tn_txt.cmd” command
file using either the File – Open – Command File… / File – Submit Window or the File –
Submit Command File… menu route; don't forget that it takes a long time to run!
Final Remarks

This completes the Salford Systems Data Mining and Text Mining tutorial
In the process of going through the tutorial you have learned how to use both the
GUI and command line facilities of SPM as well as the command line text
mining facility STM
You managed to build two CART models and two TN models, and you enriched
the original dataset with a variety of text mining fields
The final model puts you among the top winners in a major text mining
competition, a proud achievement
Even though we have barely scratched the surface, you are now ready to
proceed with exploring the remainder of the vast data mining activities offered
within SPM and STM on your own
We wish you the best of luck on the exciting and never-ending road of modern
data analysis and exploration
And don't forget that you can always reach us at
should you have further modeling questions and needs


Breiman, L., J. Friedman, R. Olshen and C. Stone (1984), Classification and
Regression Trees, Pacific Grove: Wadsworth

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Hastie, T., Tibshirani, R., and Friedman, J.H (2000). The Elements of
Statistical Learning. Springer.
Freund, Y. & Schapire, R. E. (1996). Experiments with a new boosting
algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth
National Conference, Morgan Kaufmann, pp. 148-156.
Friedman, J.H. (1999). Stochastic gradient boosting. Stanford: Statistics
Department, Stanford University.
Friedman, J.H. (1999). Greedy function approximation: a gradient boosting
machine. Stanford: Statistics Department, Stanford University.
Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, Fred J. Damerau (2004).
Text Mining. Predictive Methods for Analyzing Unstructured Information.

STM Command Reference

Salford Text Miner (STM) is a simple utility intended to make the text mining
process much easier. The application described in this reference accepts a
number of parameters and can run Salford Predictive Miner as its data
mining backend
STM Workflow:
    Automatically generate a dictionary from the dataset
    Process the dataset and generate a new one with additional columns based on the dictionary
    Generate a model folder with the dataset, command file, and dictionary
    Run Salford Predictive Miner with the generated command file
    Run the checking process, which compares the scoring results with the real classes

All of these steps can be done in separate STM calls or in a single call
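
The first workflow step can be pictured with a short Python sketch. This is not STM's own code: the tokenizing rule, the tiny stop-word list, the sample titles, and the choice to count each word once per row are all assumptions made for illustration. It only shows the general idea of building a dictionary while dropping stop words and rare words.

```python
import re
from collections import Counter

# Hypothetical stop-word list; STM's own English and German lists are not shown here.
STOP_WORDS = {"the", "and", "for", "mit", "und"}


def extract_dictionary(texts, stop_words=STOP_WORDS, min_count=2):
    """Return (kept, removed) word lists built from a list of text values."""
    counts = Counter()
    for text in texts:
        # Count each word once per row, so counts are row (document) frequencies.
        counts.update(set(re.findall(r"[a-zäöüß0-9]+", text.lower())))
    kept = sorted(w for w, n in counts.items()
                  if w not in stop_words and n >= min_count)
    removed = sorted(w for w in counts if w not in kept)
    return kept, removed


# Made-up listing titles, for illustration only.
titles = [
    "Apple iPod nano 4GB mit OVP",
    "iPod nano 2GB schwarz",
    "Apple iPod shuffle und Kabel",
]
kept, removed = extract_dictionary(titles)
print(kept)
print(removed)
```

With these three titles and a threshold of 2, only the words shared by at least two titles survive; everything else lands in the removed list, much like the removed-words file described in the configuration section.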

STM Command Reference

Short Option          Long Option                   Description
-data DATAFILE        --dataset DATAFILE            Specify the dataset to work with
-dict DICTFILE        --dictionary                  Specify the dictionary to work with
-source-dict SDFILE   --source-dictionary SDFILE    Dictionary used as the source for the automatic dictionary retrieval process
-score SFILE          --scoreresult SFILE           File with the score result for the checking process. Default – 'score.csv'
-spm SPMAPP           --spmapplication SPMAPP       Path to the SPM application. Default – 'spm.exe'
-t TARGET             --target TARGET               Target variable used to generate the command file. Default – 'GMS_GREATER_AVG'
-ex EXCLUDE           --exclude EXCLUDE             List of variables to exclude from the keep list when generating the command file
-cat CATEGORY         --category CATEGORY           List of variables to treat as categorical when generating the command file
STM Command Reference

Short Option        Long Option                 Description
-templ CMDTEMPL     --cmdtemplate CMDTEMPL      Command file template used for generation. Default –
-md MODEL_DIR       --modeldir MODEL_DIR        Directory where the model folders will be created. Default – 'models'
-trees TREES        --trees TREES               Parameter for TreeNet command files: the number of trees to build. Default – 500
-maxnodes MAXNODES  --maxnodes MAXNODES         Parameter for TreeNet command files: the number of nodes in each tree. Default – 6
-fixwords           --fixwords                  Enables heuristics that try to fix words (nearest-word search by different metrics, spell checking, etc.)
-textvars VARLIST   --text-variables VARLIST    Comma-separated list of variables used in dictionary retrieval

STM Command Reference

Short Option        Long Option                     Description
-outrmwords         --output-removed-words          Enables writing the removed stop words to the file 'data/removed.dat'
-code CODE          --column-coding CODE            Specify how to code the absence/presence of a word in a row:
                                                        YN or 0 – no/yes
                                                        YNM or 1 – no/yes/many
                                                        01 or 2 – 0/1
                                                        012 or 3 – 0/1/2
                                                        TF or 4 – term frequency
                                                        IDF or 5 – inverse document frequency
                                                        TF-IDF or 6 – TF-IDF
                                                        TC or 7 – term count (0,1,2,…)
                                                        Default – YN
-mp MODELPATH       --model-path MODELPATH          Path where the model files will be created
-cmd-path CMDPATH   --command-file-path CMDPATH     Path to the command file that will be executed by Salford Predictive Miner
-ppfile PPFILE      --preprocess-file PPFILE        Path to Python code that will be executed during the process step to manipulate the data
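
To make the coding options above concrete, here is a minimal Python sketch that codes tokenized rows against a dictionary. The code names come from the table; everything else is an assumption, since the slides do not give the formulas: "many" is taken to mean two or more occurrences, TF is the count divided by the row length, and IDF is log(rows / rows containing the word), reported only where the word is present. STM's actual definitions may differ.

```python
import math


def code_rows(rows, dictionary, coding="YN"):
    """Code each tokenized row (a list of words) against the dictionary words."""
    n_rows = len(rows)
    # Number of rows containing each dictionary word (document frequency).
    doc_freq = {w: sum(1 for r in rows if w in r) for w in dictionary}
    coded = []
    for row in rows:
        out = []
        for word in dictionary:
            count = row.count(word)
            # Assumed definitions: TF = count / row length, IDF = log(N / df).
            tf = count / len(row) if row else 0.0
            idf = math.log(n_rows / doc_freq[word]) if doc_freq[word] else 0.0
            if coding == "YN":
                out.append("yes" if count else "no")
            elif coding == "YNM":
                out.append("no" if count == 0 else "yes" if count == 1 else "many")
            elif coding == "01":
                out.append(min(count, 1))
            elif coding == "012":
                out.append(min(count, 2))
            elif coding == "TF":
                out.append(tf)
            elif coding == "IDF":
                out.append(idf if count else 0.0)
            elif coding == "TF-IDF":
                out.append(tf * idf)
            elif coding == "TC":
                out.append(count)
            else:
                raise ValueError(f"unknown coding: {coding}")
        coded.append(out)
    return coded


# Two made-up rows, already split into lower-case words.
rows = [["ipod", "nano", "ipod"], ["ipod", "shuffle"]]
dictionary = ["ipod", "nano"]
for name in ("YN", "YNM", "01", "012", "TC"):
    print(name, code_rows(rows, dictionary, name))
```

Each dictionary word becomes one new column, and the coding only changes what is stored in that column for each row.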

STM Command Reference

Short Option   Long Option                 Description
-rc NAME       --realclass-column-name     Column name in the real class dataset for the check step. Default – GMS_GREATER_AVG
-e             --extract                   Run the first step: automatic extraction of the dictionary from the dataset. Requires --dataset
-p OUTFILE     --process OUTFILE           Run the second step: process the dataset and create a new dataset named OUTFILE in which new columns are created from the dictionary. Requires --dataset and --dictionary
-g             --generate                  Run the third step: generate the model folder with the command file. Requires --dataset and --dictionary
-m             --model                     Run the fourth step: run Salford Predictive Miner with the generated command file. Works only with --generate
-c DATASET     --check DATASET             Run the fifth step: check the score file against the real classes (from the specified REALCLASSFILE) and output a misclassification table. Requires --scoreresult
-h             --help                      Show help
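
As a rough illustration of running the steps as separate STM calls, the sketch below only assembles argument lists from the long options in the tables above; it does not launch anything. The executable name `stm`, all file names except the documented default score.csv, and the choice to pass the processed dataset to the generate step are assumptions, not documented behavior.

```python
def stm_calls(dataset, dictionary, processed, score_file, real_classes, exe="stm"):
    """Return one argument list per documented STM step (exe name is hypothetical)."""
    return [
        # Step 1: extract the dictionary from the dataset.
        [exe, "--dataset", dataset, "--extract"],
        # Step 2: create a new dataset with dictionary-based columns.
        [exe, "--dataset", dataset, "--dictionary", dictionary,
         "--process", processed],
        # Steps 3 and 4: generate the model folder, then run the modeling backend.
        [exe, "--dataset", processed, "--dictionary", dictionary,
         "--generate", "--model"],
        # Step 5: compare the score file with the real classes.
        [exe, "--scoreresult", score_file, "--check", real_classes],
    ]


# Hypothetical file names, for illustration only.
calls = stm_calls("train.csv", "dict.dat", "train_txt.csv", "score.csv", "real.csv")
for call in calls:
    print(" ".join(call))
```

Keeping each call as a list makes it easy to hand to a process launcher later, and it mirrors the rule in the table that --model only works together with --generate.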

STM Configuration File

Name                        Description                                                                       Default
SPM_APPLICATION             Path to Salford Predictive Miner                                                  spm.exe
CMD_TREES                   Number of trees to build in TN models                                             500
CMD_NODES                   Tree size for TN models                                                           6
CMD_TEMPLATE                Command file template                                                             data/template.cmd
MODELS_DIR                  Directory where the model folders will be created                                 models
LANGUAGES                   Languages whose stop words will be used                                           English, German
SPELLCHECKER_DICT           Additional spell checker dictionary of allowed words (like “ipod”)                data/spellchecker_dict.dat
SPELLCHECKER_LANGUAGE       Language for the spell checker                                                    de_DE
ADDITIONAL_STOPWORDS        File with additional stop words, which the user can edit                          data/stopwords.dat
REMOVED_WORDS_FILE          File where removed words are written during the “extract” step                    data/removed.dat
WORD_FREQUENCY_THRESHOLD    Word frequency threshold below which words are deleted during the “extract” step  5
PREPROCESS_FILE             Script to include for additional processing                                       dmc2006/

STM Configuration File

Name                     Description                                                                       Default
CHECK_RESULTS_FILE                                                                                         data/score_results.csv
LOGFILE                  Path to the log file. Can be a mask (%s for the date)                             log/stm%s.log
TARGET                   Default value of the target argument, used to fill the command file template      GMS_GREATER_AVG
EXCLUDE                  Default value of the keep argument, used to fill the command file template        AUCT_ID, LISTING_TITLE$,
CATEGORY                 Default value of the category argument, used to fill the command file template    GMS_GREATER_AVG
SCORE_FILE               Name of the score file to be checked                                              Score.csv
TEXT_VARIABLES           Comma-separated list of text variables in the dataset                             ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE,
DEFAULT_CODING           Default coding for the extract and preprocess steps                               YN
REALCLASS_COLUMN_NAME    Name of the column in the real class file used in the check step                  GMS_GREATER_AVG
SCORE_COLUMN_NAME        Name of the column in the score file used in the check step                       PREDICTION
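
The check step described above compares a score file with the real classes and reports a misclassification table. The Python sketch below shows one way such a table could be computed; it is not STM's implementation. The column names are the documented defaults (GMS_GREATER_AVG and PREDICTION), while the sample rows and the assumption that the two files list records in the same order are made up for illustration.

```python
import csv
import io
from collections import Counter


def misclassification_table(real_rows, score_rows,
                            real_col="GMS_GREATER_AVG", score_col="PREDICTION"):
    """Count (actual, predicted) pairs, matching rows by position."""
    return Counter((r[real_col], s[score_col])
                   for r, s in zip(real_rows, score_rows))


# Made-up file contents standing in for the real class file and the score file.
real_csv = "AUCT_ID,GMS_GREATER_AVG\n1,1\n2,0\n3,1\n4,0\n"
score_csv = "AUCT_ID,PREDICTION\n1,1\n2,1\n3,1\n4,0\n"
real = list(csv.DictReader(io.StringIO(real_csv)))
scored = list(csv.DictReader(io.StringIO(score_csv)))

table = misclassification_table(real, scored)
for (actual, predicted), n in sorted(table.items()):
    print(f"actual={actual} predicted={predicted} count={n}")
```

Pairs where the actual and predicted values differ are the misclassified records; here one record with actual class 0 is predicted as 1.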

TreeNet Overview - Updated October 2012Salford Systems

Más de Salford Systems (20)

Improve Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForestsImprove Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForests
Churn Modeling-For-Mobile-Telecommunications
Churn Modeling-For-Mobile-Telecommunications Churn Modeling-For-Mobile-Telecommunications
Churn Modeling-For-Mobile-Telecommunications
The Do's and Don'ts of Data Mining
The Do's and Don'ts of Data MiningThe Do's and Don'ts of Data Mining
The Do's and Don'ts of Data Mining
Introduction to Random Forests by Dr. Adele Cutler
Introduction to Random Forests by Dr. Adele CutlerIntroduction to Random Forests by Dr. Adele Cutler
Introduction to Random Forests by Dr. Adele Cutler
9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You
Statistically Significant Quotes To Remember
Statistically Significant Quotes To RememberStatistically Significant Quotes To Remember
Statistically Significant Quotes To Remember
Using CART For Beginners with A Teclo Example Dataset
Using CART For Beginners with A Teclo Example DatasetUsing CART For Beginners with A Teclo Example Dataset
Using CART For Beginners with A Teclo Example Dataset
CART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User GuideCART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User Guide
Evolution of regression ols to gps to mars
Evolution of regression   ols to gps to marsEvolution of regression   ols to gps to mars
Evolution of regression ols to gps to mars
Data Mining for Higher Education
Data Mining for Higher EducationData Mining for Higher Education
Data Mining for Higher Education
Comparison of statistical methods commonly used in predictive modeling
Comparison of statistical methods commonly used in predictive modelingComparison of statistical methods commonly used in predictive modeling
Comparison of statistical methods commonly used in predictive modeling
Molecular data mining tool advances in hiv
Molecular data mining tool  advances in hivMolecular data mining tool  advances in hiv
Molecular data mining tool advances in hiv
TreeNet Tree Ensembles & CART Decision Trees: A Winning Combination
TreeNet Tree Ensembles & CART Decision Trees:  A Winning CombinationTreeNet Tree Ensembles & CART Decision Trees:  A Winning Combination
TreeNet Tree Ensembles & CART Decision Trees: A Winning Combination
SPM v7.0 Feature Matrix
SPM v7.0 Feature MatrixSPM v7.0 Feature Matrix
SPM v7.0 Feature Matrix
SPM User's Guide: Introducing MARS
SPM User's Guide: Introducing MARSSPM User's Guide: Introducing MARS
SPM User's Guide: Introducing MARS
Hybrid cart logit model 1998
Hybrid cart logit model 1998Hybrid cart logit model 1998
Hybrid cart logit model 1998
Session Logs Tutorial for SPM
Session Logs Tutorial for SPMSession Logs Tutorial for SPM
Session Logs Tutorial for SPM
Some of the new features in SPM 7
Some of the new features in SPM 7Some of the new features in SPM 7
Some of the new features in SPM 7
TreeNet Overview - Updated October 2012
TreeNet Overview  - Updated October 2012TreeNet Overview  - Updated October 2012
TreeNet Overview - Updated October 2012


Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxAna-Maria Mihalceanu
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple

Último (20)

Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...

Text mining tutorial

  • 1. Getting Started with Text Mining: STM™, CART® and TreeNet® Dan Steinberg Mykhaylo Golovnya Ilya Polosukhin May, 2011
  • 2. Text Mining and Data Mining
    Text mining is an important and fascinating area of modern analytics
    On the one hand, text mining can be thought of as just another application area for powerful learning machines
    On the other hand, text mining is a distinct field with its own dedicated concepts, vocabulary, tools, and techniques
    In this tutorial we aim to illustrate some important analytical methods and strategies from both perspectives on data mining
      • introducing tools specific to the analysis of text, and
      • deploying general machine learning technology
    The Salford Text Mining utility (STM) is a powerful text processing system that prepares data for advanced machine learning analytics
    Our machine learning tools are the Salford Systems flagship CART® decision tree and stochastic gradient boosting TreeNet®
    Evaluation copies of the proprietary technology in CART and TreeNet as well as the STM are available from
    Salford Systems © Copyright 2011
  • 3. For Readers of this Tutorial
    To follow along with this tutorial we recommend that you have the analytical tools we use installed on your computer. Everything you need may already be on a CD containing this tutorial and the analytical software
    Create an empty folder named “stmtutor”; this is the root folder where all of the work files related to this tutorial will reside
    You may also use the following link to download Salford Systems Predictive Modeler (SPM)
    After downloading the package, unzip its contents into “stmtutor”, which will create a new folder named “SPM680_Mulitple_Installs_2011_06_07”. Follow the installation steps described on the next slide.
    For the original DMC2006 competition website visit
    We recommend that you visit the above site for information only; data and tools for preparing that data are available at the URL next below
    For the STM package, prepared data files, and other utilities developed for this tutorial please visit
    After downloading the archive, unzip its contents into “stmtutor”
  • 4. Important! Installing the SPM Software
    The Salford Systems software you've just downloaded needs to be both installed and licensed. No-cost license codes for a 30-day period are available on request to visitors of this tutorial*
    Double-click the “Install_a_Transform_SPM.exe” file located in the “SPM680_Mulitple_Installs_2011_06_07” folder (see the previous slide) to install the specific version of SPM used in this tutorial
      • Following the above procedure will ensure that all currently installed versions of SPM, if any, remain intact!
    Follow the simple installation steps on your screen
    * Salford Systems reserves the right to decline to offer a no-cost license at its sole discretion
  • 5. Important! Licensing the SPM Software
    When you launch the Salford Systems Predictive Modeler (SPM) you will be greeted with a License dialog containing the information needed to secure a license via email
    Please send the necessary information to Salford Systems; you secure your license by entering the “Unlock Code”, which will be e-mailed back to you
    The software will operate for 3 days without any licensing; however, you can secure a 30-day license on request
  • 6. Installing the Salford Text Miner (STM)
    In addition to the Salford Predictive Modeler (SPM) you will also work with the Salford Text Miner (STM) software
    No installation is needed and you should already have the “stm.exe” executable in the “stmtutor\STM\bin” folder as the result of unzipping the “” package earlier
    STM builds upon the Python 2.6 distribution and the NLTK (Natural Language Toolkit) but makes text data processing for analytics very easy to conduct and manage
      • You do not need to add any other support software to use STM
    Expect to see several folders and a large number of files located under the “stmtutor\STM” folder. It is important to leave these files in the location to which you have installed them.
      • Please do not MOVE or alter any of the installed files other than those explicitly listed as user-modifiable!
    “stm.exe” will expire in the middle of 2012; contact Salford Systems to get an updated version beyond that
  • 7. The Example Project
    The best examples are drawn from real world data sets and we were fortunate to locate data publicly released by eBay. Good teaching examples also need to be simple.
      • Unfortunately, real world text mining could easily involve hundreds of thousands if not millions of features characterizing billions of records. Professionals need to be able to tackle such problems, but to learn we need to start with simpler situations.
      • Fortunately, there are many applications in which text is important but the dimensions of the data set are radically smaller, either because the data available is limited or because a decision has been made to work with a reduced problem.
    We use our simpler example to illustrate many useful ideas for beginning text miners while pointing the way to working on larger problems.
  • 8. The DMC2006 Text Mining Challenge
    In 2006 the DMC data mining competition (restricted to student competitors only) introduced a predictive modeling problem for which much of the predictive information was in the form of unstructured text.
    The datasets for the DMC 2006 data mining competition can be downloaded from
      • For your convenience we have re-packaged this data and made it somewhat easier to work with. This re-packaged data is included in the STMU package described near the beginning of this tutorial.
    The data summarizes 16,000 iPod auctions held at eBay from May 2005 through May 2006 in Germany
    Each auction item is represented by a text description written by the seller (in German) as well as a number of flags and features available to the seller at the time of the auction
    Auction items were grouped into 15 mutually exclusive categories based on distinct iPod features: storage size, type (regular, mini, nano), and color
    The competition goal was to predict whether the closing price would be above or below the category average
  • 9. Comments on the Challenge
    One might think that a challenge with text in German might not be of general interest outside of Germany
    However, working with a language essentially unfamiliar to any member of the analysis team helps to illustrate one important point
      • Text mining via tools that have no “understanding” of the language can be strikingly effective
    We have no doubt that dedicated tools which embed knowledge of the language being analyzed can yield predictive benefits
      • We also believe we could have gained further valuable insight into the data if any of the authors spoke German! But our performance without this knowledge is still impressive.
    In contexts where simple methods can yield more than satisfactory results, or in contexts where the same methods must be applied uniformly across multiple languages, the methods described in this tutorial will be an excellent guide.
  • 10. Configuring Work Location in SPM
    The original datasets from the DMC 2006 challenge reside in the “stmtutor\STM\dmc2006” folder
    To facilitate further modeling steps, we will configure SPM to use this location as the default location:
      • Start SPM
      • Go to the Edit – Options menu
      • Switch to the Directories tab
      • Enter the “stmtutor\STM\dmc2006” folder location in all text entry boxes except the last one
      • Press the [Save as Defaults] button so that the configuration is restored the next time you start SPM
  • 11. Configuring TreeNet Engine
    Now switch to the TreeNet tab
      • Configure the Plot Creation section as shown on the screen shot
      • Press the [Save as Defaults] button
      • Press the [OK] button to exit
  • 12. Steps in the Analysis: Data Overview
    1. Describe the data (Data Dictionary and Dimensions of Data)
       a. What is the unit of observation? What does each record of data describe?
       b. What is the dependent or target variable?
       c. What other variables (database fields) are available?
       d. How many records are available?
    2. Statistical Summary
       a. Basic summary including means, quantiles, frequency tables
       b. Dimensions of categorical predictors
       c. Number of distinct values of continuous variables
    3. Outlier and Anomaly Assessment
       a. Detection of gross data errors such as extreme values
       b. Assessment of usability of levels of categorical predictors (rare levels)
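The overview steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not what SPM does internally: the four-row sample is made up, the helper name `summarize` is hypothetical, and treating an empty string as a missing value is an assumption about how the CSV encodes missing data.

```python
import csv
import io
import statistics

# Tiny made-up sample standing in for the real file; values are illustrative only.
SAMPLE = """AUCT_ID,START_PRICE,LISTING_TYPE_CODE,GMS_GREATER_AVG
1,1.00,1,0
2,49.99,1,1
3,1.00,2,
4,9999.00,1,1
"""

def summarize(rows, column):
    """Return record count, missing count, distinct count and, if numeric, mean/median/max."""
    values = [r[column] for r in rows]
    present = [v for v in values if v != ""]  # assumption: empty string means missing
    summary = {
        "n": len(values),
        "n_missing": len(values) - len(present),
        "n_distinct": len(set(present)),
    }
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        return summary  # non-numeric column: counts only
    if numbers:
        summary["mean"] = statistics.mean(numbers)
        summary["median"] = statistics.median(numbers)
        summary["max"] = max(numbers)  # a quick flag for gross extreme values
    return summary

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for name in rows[0]:
    print(name, summarize(rows, name))
```

In this sample the large gap between the median and the maximum of START_PRICE is the kind of signal step 3 looks for, and the missing count on the target shows how many records cannot be used for training.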
  • 13. Data Fundamentals
    The original dataset is called “dmc2006.csv” and resides in the “stmtutor\STM\dmc2006” folder
    16,000 records divided into two equal-sized partitions
      • Part 1: Complete data including target, available for training during the competition
      • Part 2: Data to be scored; during the competition the target was not available
    25 database fields, two of which were unstructured text written by the seller
    Each line of data describes an auction of an iPod including the final winning bid price
    An eBay seller must construct a headline and a description of the product being sold. Sellers can also pay for selling assistance
      • E.g. a seller can pay to list the item title in BOLD
  • 14. The Data: Available Fields
    The following variables describe general features of each auction event
    Variable                      Description
    AUCT_ID                       ID number of auction
    ITEM_LEAF_CATEGORY_NAME       products category
    LISTING_START_DATE            start date of auction
    LISTING_END_DATE              end date of auction
    LISTING_DURTN_DAYS            duration of auction
    LISTING_TYPE_CODE             type of auction (normal auction, multi auction, etc)
    QTY_AVAILABLE_PER_LISTING     amount of offered items for multi auction
    FEEDBACK_SCORE_AT_LISTIN      feedback-rating of the seller of this auction listing
    START_PRICE                   start price in EUR
    BUY_IT_NOW_PRICE              buy it now price in EUR
    BUY_IT_NOW_LISTING_FLAG       option for buy it now on this auction listing
  • 15. Available Data Fields
    In addition, there are binary indicators of various “value added” features that can be turned on for each auction
    Variable                      Description
    BOLD_FEE_FLAG                 option for bold font on this auction listing
    FEATUERD_FEE_FLAG             show this auction listing on top of homepage
    CATEGORY_FEATURED_FEE_FLAG    show this auction listing on top of category
    GALLERY_FEE_FLAG              auction listing with picture gallery
    GALLERY_FEATURED_FEE_FLAG     auction listing with gallery (in gallery view)
    IPIX_FEATURED_FEE_FLAG        auction listing with IPIX (additional xxl, picture show, pack)
    RESERVE_FEE_FLAG              auction listing with reserve-price
    HIGHLIGHT_FEE_FLAG            auction listing with background color
    SCHEDULE_FEE_FLAG             auction listing, including the definition of the starting time
    BORDER_FEE_FLAG               auction listing with frame
  • 16. Target Variable
    Finally, the target variable is defined based on the winning bid price revenue relative to the category average
    Variable                      Description
    GMS                           scored sales revenue in EUR
    CATEGORY_AVG_GMS              average sales revenue for the product category
    GMS_GREATER_AVG               zero when the revenue is less than or equal to the category average sales and one otherwise
    The values were only disclosed on a randomly selected set of 8,000 auctions which we use to train a model
      • 4199 auctions with the revenue below the category average
      • 3801 auctions with the revenue above the category average
    During the competition the results for the remaining 8,000 auctions were kept secret and used to score competitive entries
    We will only use these records at the very end of this tutorial to validate the performance of various models that will be built
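As a small worked illustration of the target definition, the sketch below recomputes GMS_GREATER_AVG from a revenue and a category average, and derives the majority-class share from the two class counts quoted above. The function name and the three example auctions are made up for illustration.

```python
def gms_greater_avg(gms, category_avg_gms):
    """Return 1 when revenue exceeds the category average, 0 when it is less than or equal."""
    return 1 if gms > category_avg_gms else 0

# Made-up (revenue, category average) pairs in EUR, including the boundary case.
examples = [(120.0, 150.0), (150.0, 150.0), (180.0, 150.0)]
labels = [gms_greater_avg(g, a) for g, a in examples]
print(labels)  # [0, 0, 1]

# Class counts quoted for the 8,000 labeled auctions.
n_below, n_above = 4199, 3801
majority_share = max(n_below, n_above) / (n_below + n_above)
print(round(majority_share, 4))  # 0.5249
```

The majority share is a useful yardstick: a model that always predicts "below average" would already be right on about 52.5% of the labeled auctions.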
  • 17. Comments on Methodology
    Predictive modeling and general analytics competitions are increasingly being launched by both private companies and professional organizations, and they provide both public data sets and a wealth of illustrative examples using different analytic techniques
    When reviewing results from a competition, and especially when comparing results generated by analysts running models after the competition, it is important to keep in mind that there is an ocean of difference between being a competitor during the actual competition and an after-the-fact commentator
    Regardless of what is reported, the after-the-fact analyst does have access to “what really happened”, and it is nearly impossible to simulate the competitive environment once the results have been published
      • We all learn in both direct and indirect ways from many sources including the outcomes of public competitions. This can affect anything that comes later in time.
    In spite of this, we have tried to mimic the circumstances of the competitors by presenting analyses based only on the original training data, and using well-established guidelines we have been promoting for more than a decade to arrive at a final model
    We urge you never to take at face value an analyst's report on what would have happened if they had hypothetically participated
  • 18. First Round Modeling: Ignoring the TEXT Data
    Even before doing any type of data preparation it is always valuable to run a few preliminary CART models
      • CART automatically handles missing values and is immune to outliers
      • CART is flexible enough to adapt to any type of nonlinearity and interaction effects among predictors. The analyst does not need to do any data preparation to assist CART in this regard
      • CART performs well enough out of the box that we are guaranteed to learn something of value without conducting any of the common data preparation operations
    The only requirement for useful results is that we exclude any possible perfect or near perfect illegitimate predictors
      • Common examples of illegitimate predictors include repackaged versions of the dependent variable, ID variables, and data drawn from the future relative to the data to be predicted
    We start with a quick model using 20 of the 25 available predictors. None of these involve any of the text data we will focus on later.
  • 19. Quick Modeling Round with CART
    We start by building a quick CART model using the original raw variables and all 8,000 complete auction records
    Assuming that you already have SPM launched
      • Go to the File – Open – Data File menu
      • Note that we have already configured the default working folder for SPM
      • Make sure that the Files of Type is set to ASCII
      • Highlight the dmc2006.csv dataset
      • Press the [Open] button
  • 20. Dataset Summary Window
    The resulting window summarizes basic facts about the dataset
    Note that even though the dataset has 16,000 records, only the top 8,000 will be used for modeling, as was already pointed out
  • 21. The View Data Window
    Press the [View Data…] button to get a quick impression of the physical contents of the dataset
    Our goal is to eventually use the unstructured information contained in the text fields right next to the auction ID
  • 22. Requesting Basic Descriptive Stats
    We next produce some basic stats for all available variables:
      • Go to the View – Data Info… menu
      • Set the Sort mode to File Order
      • Highlight the Include column
      • Check the Select box
      • Press the [OK] button
  • 23. Data Information Window
    All basic descriptive statistics for all requested variables are now summarized in one place
    Note that the target variable GMS_GREATER_AVG is not defined for one half of the dataset (N Missing 8,000); all those records will be automatically discarded during model building
    Press the [Full] button to see more details
  • 24. Setting Up CART Model
    We are now ready to set up a basic CART run:
      • Make the Classic Output window active
      • Go to the Model – Construct Model… menu (alternatively, you could press one of the buttons located on the bar right below the menu bar)
      • In the resulting Model Setup window make sure that the Analysis Method is set to CART
      • In the Model tab make sure that the Sort is set to File Order and the Tree Type is set to Classification
      • Check GMS_GREATER_AVG as the Target
      • Check all of the remaining variables except AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, and CATEGORY_AVG_GMS as predictors
      • You should see something similar to what is shown on the next slide
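The predictor selection just described can be written down as a short, checkable list operation. The sketch below is not SPM code; the field list is assembled from the variable tables in this tutorial plus the two text fields named above, so it is an assumption that it matches the file header exactly (names are copied as printed, including apparent truncations).

```python
# Field names as printed in this tutorial (assumed to match the CSV header).
ALL_FIELDS = [
    "AUCT_ID", "LISTING_TITLE$", "LISTING_SUBTITLE$",
    "ITEM_LEAF_CATEGORY_NAME", "LISTING_START_DATE", "LISTING_END_DATE",
    "LISTING_DURTN_DAYS", "LISTING_TYPE_CODE", "QTY_AVAILABLE_PER_LISTING",
    "FEEDBACK_SCORE_AT_LISTIN", "START_PRICE", "BUY_IT_NOW_PRICE",
    "BUY_IT_NOW_LISTING_FLAG", "BOLD_FEE_FLAG", "FEATUERD_FEE_FLAG",
    "CATEGORY_FEATURED_FEE_FLAG", "GALLERY_FEE_FLAG",
    "GALLERY_FEATURED_FEE_FLAG", "IPIX_FEATURED_FEE_FLAG",
    "RESERVE_FEE_FLAG", "HIGHLIGHT_FEE_FLAG", "SCHEDULE_FEE_FLAG",
    "BORDER_FEE_FLAG", "GMS", "CATEGORY_AVG_GMS", "GMS_GREATER_AVG",
]

TARGET = "GMS_GREATER_AVG"

# Why each field is left out of the first-round model.
EXCLUDED = {
    "AUCT_ID": "ID variable",
    "LISTING_TITLE$": "raw text, handled later",
    "LISTING_SUBTITLE$": "raw text, handled later",
    "GMS": "the target is derived from it",
    "CATEGORY_AVG_GMS": "the target is derived from it",
}

predictors = [f for f in ALL_FIELDS if f != TARGET and f not in EXCLUDED]
print(len(predictors))  # 20
```

Keeping the reason next to each excluded field makes it easy to revisit the decision later, for example when the text fields are brought back in as engineered features.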
  • 25. Model Setup Window: Model Tab
  • 26. Model Setup Window: Testing Tab Switch to the Testing tab and confirm that the 10-fold cross-validation is used as the optimal model selection method Salford Systems © Copyright 2011 26
  • 27. Model Setup Window: Advanced Tab Switch to the Advanced tab and set the minimum required number of records for the parent nodes and the child nodes at 15 and 5 These limits were chosen to avoid extremely small nodes in the resulting tree Salford Systems © Copyright 2011 27
  • 28. Building CART Model Press the [Start] button. A building progress window will appear for a while, and then the Navigator window containing the model results will be displayed. Press the little button right above the [+][-] pair of buttons, along the left border of the Navigator window, and note that all trees within one standard error (SE) of the optimal tree are now marked in green. Use the arrow keys to select the 64-node tree from the tree sequence, which is the smallest 1SE tree.
  • 29. CART model observations The selected CART model contains 64 terminal nodes. It is the smallest model whose relative error is still within one standard error of the optimal model (the model with the smallest relative error, pointed to by the green bar). This approach to model selection is usually employed for easy comprehension. We might also want to require terminal nodes to contain more than the 6-record minimum we observe in this out-of-the-box tree. All 20 predictor variables play a role in the tree construction, but there is more to observe about this when we look at the variable importance details. The area under the ROC curve is a respectable 0.748.
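The 1SE selection rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not SPM's internal code: the function name `smallest_1se_tree` and the pruning-sequence numbers are hypothetical and chosen only so that a 64-node tree wins.

```python
def smallest_1se_tree(trees):
    """Pick the smallest tree within one standard error of the best tree.

    trees: list of (terminal_nodes, relative_error, standard_error) tuples.
    """
    best = min(trees, key=lambda t: t[1])           # optimal tree: lowest relative error
    cutoff = best[1] + best[2]                      # one standard error above the minimum
    eligible = [t for t in trees if t[1] <= cutoff]
    return min(eligible, key=lambda t: t[0])        # fewest terminal nodes among eligible

# Hypothetical pruning sequence (made-up numbers, for illustration only)
sequence = [
    (2, 0.800, 0.02),
    (10, 0.660, 0.02),
    (64, 0.615, 0.02),
    (120, 0.600, 0.02),
    (200, 0.610, 0.02),
]
chosen = smallest_1se_tree(sequence)
```

In this made-up sequence the 120-node tree has the lowest error, but the 64-node tree lies within one standard error of it and is therefore the one selected.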
  • 30. CART Model Performance Press the [Summary Reports…] button in the Navigator, select Prediction Success tab, and press the [Test] button to display cross- validated test performance of 68.66% classification accuracy Now select the Variable Importance tab to review which variables entered into the model Interestingly enough, none of the “added value” paid options are important and exhibit practically no direct influence on the sales revenue A detailed look at the nodes might also be instructive for understanding the model Salford Systems © Copyright 2011 30
  • 31. Experimenting with TreeNet We almost always follow initial CART models with similar TreeNet models. We start with CART because some glaring errors, such as perfect predictors, are found more quickly and displayed more obviously in CART. A perfect predictor often yields a single-split tree (two terminal nodes) for classification trees. TreeNet models have strengths similar to CART regarding flexibility and robustness, and have both advantages and disadvantages relative to CART. TreeNet is an ensemble of small CART trees that have been linked together in special ways, so TreeNet shares many desirable features of CART. TreeNet is superior to CART in the context of errors in the dependent variable (not relevant in this data). TreeNet yields much more complex models but generally offers substantially better predictive accuracy. TreeNet may easily generate thousands of trees to arrive at an optimal model. TreeNet yields more reliable variable importance rankings.
  • 32. A few words about TreeNet TreeNet builds predictive models in stages. It starts with a deliberately very small first-round tree (essentially a CART tree). Then TreeNet calculates the prediction error made by this simple model and builds a second tree to try to model that prediction error. The second tree serves as a tool to update, refine, and improve the first-stage model. A TreeNet model produces a “score”, which is a simple sum of all the predictions made by each tree in the model. Typically the TreeNet score becomes progressively more accurate as the number of trees is increased, up to an optimal number of trees. Rarely, the optimal number of trees is just one! Occasionally, a handful of trees are optimal. More typically, hundreds or thousands of trees are optimal. TreeNet models are very useful for the analysis of data with large numbers of predictors, as the models are built up in layers, each of which makes use of just a few predictors. More detail on TreeNet can be found at
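The stage-wise idea can be illustrated with a deliberately simplified Python sketch. It is not TreeNet itself: we assume a single numeric predictor, two-node trees (stumps), and a plain least-squares error, whereas the tutorial runs TreeNet in Logistic Binary mode. The names `fit_stump` and `boost` are hypothetical.

```python
def fit_stump(x, residual):
    """Fit the best single-split tree (two terminal nodes) by squared error."""
    best = None
    for cut in sorted(set(x))[:-1]:                 # every cut leaves both sides non-empty
        left = [r for xi, r in zip(x, residual) if xi <= cut]
        right = [r for xi, r in zip(x, residual) if xi > cut]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, cut, lm, rm)
    _, cut, lm, rm = best
    return lambda xi: lm if xi <= cut else rm

def boost(x, y, n_trees=100, learn_rate=0.1):
    """Each new tree models the error left by the sum of the previous trees."""
    score = [0.0] * len(y)
    trees = []
    for _ in range(n_trees):
        residual = [yi - si for yi, si in zip(y, score)]   # current prediction error
        tree = fit_stump(x, residual)
        trees.append(tree)
        score = [si + learn_rate * tree(xi) for xi, si in zip(x, score)]
    # The final score is the (shrunken) sum of the predictions of all trees
    return lambda xi: sum(learn_rate * t(xi) for t in trees)

x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
model = boost(x, y)
```

With one tree the score for `x = 5` is only 0.1 (the learn rate times the stump prediction); as trees are added it moves progressively closer to 1, which mirrors the description above.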
  • 33. Setting Up TN Model Switch to the Classic Output window and go to the Model – Construct Model… menu Choose TreeNet as the Analysis Method In the Model tab make sure that the Tree Type is set to Logistic Binary Salford Systems © Copyright 2011 33
  • 34. Setting Up TN Parameters Switch to the TreeNet tab and do the following:  Set the Learnrate to 0.05  Set the Number of trees to use: to 800 trees  Leave all of the remaining options at their default values Salford Systems © Copyright 2011 34
  • 35. TN Results Window Press the [Start] button to initiate TN modeling run, the TreeNet Results window will appear in the end Salford Systems © Copyright 2011 35
  • 36. Checking TN Performance Press on the [Summary] button and switch to the Prediction Success tab Press the [Test] button to view cross-validation results Lower the Threshold: to 0.45 to roughly equalize classification accuracy in both classes (this makes it easier to compare the TN performance with the earlier reported CART performance) Salford Systems © Copyright 2011 36
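The effect of moving the threshold can be reproduced outside the GUI. The sketch below assumes you have a list of model scores (probabilities) and the matching 0/1 labels; the helper names and the tiny data set are hypothetical and are not output from the tutorial models.

```python
def class_accuracies(scores, labels, threshold):
    """Return (accuracy on class 1, accuracy on class 0) at a given threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    acc_pos = sum(s >= threshold for s in pos) / len(pos)
    acc_neg = sum(s < threshold for s in neg) / len(neg)
    return acc_pos, acc_neg

def balancing_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the smallest gap between class accuracies."""
    def gap(t):
        acc_pos, acc_neg = class_accuracies(scores, labels, t)
        return abs(acc_pos - acc_neg)
    return min(candidates, key=gap)

# Made-up scores for illustration only
scores = [0.10, 0.20, 0.30, 0.46, 0.44, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
best_threshold = balancing_threshold(scores, labels, [0.40, 0.45, 0.50])
```

On this toy data the default-like 0.50 threshold classifies class 0 perfectly but misses a quarter of class 1, while 0.45 gives the same accuracy in both classes.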
  • 37. The Performance Has Improved! The overall classification accuracy goes up to about 71% Press the [ROC] button to see that the area under ROC is now a solid 0.800 This comes at the cost of added model complexity – 796 trees each with about 6 terminal nodes Variable importance remains similar to CART Salford Systems © Copyright 2011 37
  • 38. Understanding the TreeNet Model TreeNet produces partial dependency plots for every predictor that appears in the model. The plots can be viewed by pressing the [Display Plots…] button. Such plots are generally 2D illustrations of how the predictor in question affects the outcome. For example, in the graph below the Y axis represents the probability that an iPod will sell at an above-category-average price. We see that for a BUY_IT_NOW_PRICE between 200 and 300 the probability of an above-average winning bid rises sharply with the BUY_IT_NOW_PRICE. For prices above 300 or below 200 the curve is essentially flat, meaning that changes in the predictor do not result in changes in the probable outcome.
  • 39. Understanding the Partial Dependency Plot (PD Plot) The PD Plot is not a simple description of the data. If you plotted the raw data as, say, the fraction of above-average winning bids against price intervals, you might see a somewhat different curve. The PD Plot is extracted from the TreeNet model and is generated by examining TreeNet predictions (and not input data). The PD Plot appears to relate two variables, but in fact other variables may well play a role in the graph construction. Essentially, the PD Plot shows the relationship between a predictor and the target variable, taking all other predictors into account. The important points to understand are that: the graph is extracted from the model and not directly from raw data; the graph provides an honest estimate of the typical effect of a predictor; the graph displays not absolute outcomes but typical expected changes from some baseline as the predictor varies. The graph can be thought of as floating up or down depending on the values of other predictors.
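One common way to extract such a curve from any fitted model is to force the predictor of interest to each value on a grid, leave the other predictors at their observed values, and average the model predictions. The sketch below shows that textbook construction; we do not claim it is exactly how TreeNet computes its plots, and `toy_model` is a hypothetical stand-in for a fitted model.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average model prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        # Overwrite only the feature of interest; keep the other predictors as observed
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve

def toy_model(row):
    """Hypothetical model: price matters only between 200 and 300."""
    price_effect = min(max(row["price"], 200), 300) / 100.0
    return price_effect + (0.5 if row["new"] else 0.0)

rows = [{"price": 150, "new": True}, {"price": 250, "new": False}]
curve = partial_dependence(toy_model, rows, "price", [100, 200, 250, 300, 400])
```

The resulting curve is flat below 200 and above 300 and rises in between. Its overall level depends on the mix of the other predictor (`new`), which is the sense in which the graph "floats" up or down.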
  • 40. More TN Partial Dependency Plots
  • 41. Introducing the Text Mining Dimension To this point, we have been working only with the set of traditional structured data fields: continuous and categorical variables. Further substantial performance improvement can be achieved only if we utilize the text descriptions supplied by the seller in the following fields:
      Variable            Description
      LISTING_TITLE       title of auction
      LISTING_SUBTITLE    subtitle of auction
    Unfortunately, these two variables cannot be used “as is”. Sellers were free to enter free-form text including misspellings, acronyms, slang, etc. So we must address the challenge of converting the unstructured text strings of the type shown here into a well-structured representation.
  • 42. The Bag of Words Approach of Text Mining The most straightforward strategy for dealing with free-form text is to represent each “word” that appears in the complete data set as a dummy (0/1) indicator variable. For iPods on eBay we could imagine sellers wanting to use words like “new”, “slightly scratched”, “pink”, etc. to describe their iPod. Of course the descriptions may well be complete phrases like “autographed by Angela Merkel” rather than just single-term adjectives. Nevertheless, in the simplest Bag of Words (BOW) approach we just create dummy indicators for every word. Even though the headlines and descriptions are space limited, the number of distinct words that can appear in collections of free text can be huge. In text mining applications involving complete documents, e.g. newspaper articles, the number of distinct words can easily reach several hundred thousand or even millions.
  • 43. The End Goal of the Bag of Words
      Record_ID   RED   USED   SCRATCHED   CASE
      1001        0     1      0           1
      1002        0     0      0           0
      1003        1     0      0           0
      1004        0     0      0           0
      1005        1     1      1           0
      1006        0     0      0           0
    • Above we see an example of a database intended to describe each auction item by indicating which words appeared in the auction announcement
    • Observe that Record_ID 1005 contains the three words “RED”, “USED” and “SCRATCHED”
    • Data in the above format looks just like the kind of numeric data used in traditional data mining and statistical modeling
    • We can use data in this form, as is, feeding it into CART, TreeNet, or regression tools such as Generalized Path Seeker (GPS) or everyday regression
    • Observe that we have transformed the unstructured text into structured numerical data
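A minimal sketch of how such an indicator table could be produced is shown below. The function `bag_of_words` and the two auction texts are hypothetical; the texts were invented so that the output matches rows 1001 and 1005 of the table above, and splitting on whitespace is a simplifying assumption.

```python
def bag_of_words(documents, vocabulary):
    """Return one {term: 0/1} row per document for the given vocabulary."""
    table = {}
    for record_id, text in documents.items():
        tokens = set(text.upper().split())          # naive whitespace tokenizer
        table[record_id] = {term: int(term in tokens) for term in vocabulary}
    return table

vocabulary = ["RED", "USED", "SCRATCHED", "CASE"]
documents = {
    1001: "used ipod with case",          # invented text for illustration
    1005: "red ipod used scratched",      # invented text for illustration
}
table = bag_of_words(documents, vocabulary)
```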
  • 44. Coding the Term Vector and TF weighting In the sample data matrix on the previous slide we coded all of our indicators as 0 or 1 to indicate presence or absence of a term An alternative coding scheme is based on the FREQUENCY COUNT of the terms with these variations:  0 or 1 coding for presence/absence  Actual term count (0,1,2,3,…)  Three level indicator for absent, one occurrence, and more than one (0,1,2) The text mining literature has established some useful weighted coding schemes. We start with term frequency weighting (tf)  Text mining can involve blocks of text of considerably different lengths  It is thus desirable to normalize counts based on relative frequency. Two text fields might each contain the term “RED” twice, but one of the fields contains 10 words while the other contains 40 words. We might want our coding to reflect the fact that 2/10 is more frequent than 2/40.  This is nothing more than making counts relative to the total length of the unit of text (or document) and such coding yields the term frequency weighting Salford Systems © Copyright 2011 44
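The coding variations listed above can be written down compactly. The following sketch is ours, not STM code; the function name `code_term_vector` and the scheme labels are hypothetical. Terms absent from a document are simply omitted from the result, which corresponds to a code of 0.

```python
from collections import Counter

def code_term_vector(tokens, scheme="tf"):
    """Code the terms of one document under one of the schemes described above."""
    counts = Counter(tokens)
    if scheme == "binary":          # 0/1 presence or absence
        return {t: 1 for t in counts}
    if scheme == "count":           # actual term count (1, 2, 3, ...)
        return dict(counts)
    if scheme == "three_level":     # 1 = one occurrence, 2 = more than one
        return {t: min(c, 2) for t, c in counts.items()}
    if scheme == "tf":              # count relative to the document length
        return {t: c / len(tokens) for t, c in counts.items()}
    raise ValueError(f"unknown scheme: {scheme}")

short_doc = ["red"] * 2 + ["filler"] * 8     # 10 words, "red" appears twice
long_doc = ["red"] * 2 + ["filler"] * 38     # 40 words, "red" appears twice
tf_short = code_term_vector(short_doc)["red"]
tf_long = code_term_vector(long_doc)["red"]
```

Here `tf_short` is 2/10 and `tf_long` is 2/40, reproducing the example in the text: the same raw count yields a higher term frequency in the shorter field.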
  • 45. Inverse Document Frequency (IDF) Weighting IDF weighting is drawn from the information retrieval literature and is intended to reflect the value of a term in narrowing the search for a specific document within a larger corpus of documents. If a given term occurs very rarely in a collection of documents then that term is very valuable as a tag to target those documents accurately. By contrast, if a term is very common, then knowing that such a term occurs within the document you are looking for is not helpful in narrowing the search. While text mining has somewhat different goals than information retrieval, the concept of IDF weighting has caught on. IDF weighting serves to upweight terms that occur relatively rarely. IDF(term) = log( (number of documents) / (number of documents containing the term) ). The IDF increases with the rarity of a term and is at its maximum for words that occur in only one document. A common coding of the term vector uses the product: tf * idf.
  • 46. Coding the DMC2006 Text Data The DMC2006 text data is unusual principally because of the limit on the amount of text a seller was allowed to upload. This has the effect of making the lengths of all the documents very similar. It also sharply limits the possibility that a term in a document would occur with a high frequency. These factors contribute to making the TF-IDF weighting irrelevant to this challenge. In fact, for this prediction task other coding schemes allow more accurate prediction. STM offers these options for term vector coding: 0 – no/yes; 1 – no/yes/many (this one will be used in the remainder of this tutorial); 2 – 0/1; 3 – 0/1/2; 4 – term frequency (relative to document); 5 – inverse document frequency (relative to corpus); 6 – TF-IDF (traditional IR coding).
  • 47. Text Mining Data Preparation The heavy lifting in text mining technology is devoted to moving us from raw unstructured text to structured numerical data. Once we have structured data we are free to use any of a large number of traditional data mining and statistical tools to move forward. Typical analytical tools include logistic and multiple regression, predictive modeling, and clustering tools. But before diving into the analysis stage we need to move through the text transformation stage in detail. The first step is to extract and identify the words or “terms”, which can be thought of as creating the list of all words recognized in the training data set. This stage is essentially one of defining the “dictionary”, the list of officially recognized terms. Any new term encountered in the future will be unrecognizable by the dictionary and will represent an unknown item. It is therefore very important to ensure that the training data set contains almost all terms of interest that would be relevant for future prediction.
  • 48. Automatic Dictionary Building The following steps will build an active dictionary for a collection of documents (in our case, auction item description strings). Read all text values into one character string. Tokenize this string into an array of words (tokens). Remove words without any letters or digits. Remove “stop words” (words like “the”, “a”, “in”, “und”, “mit”, etc.) for both English and German. Remove words that have fewer than 2 letters and are encountered fewer than 10 times across the entire collection of documents (rare small words). At this point the too-common, too-rare, weird, obscure, and useless combinations of characters should have been eliminated. Lemmatize words using the WordNet lexical database. This step combines words present in different grammatical forms (“go”, “went”, “going”, etc.) into the corresponding stem word (“go”). Remove all resulting words that appear fewer than MIN times (5 in the remainder of this tutorial).
  • 49. Build the Dictionary (or Term Vector) For the purpose of automatic dictionary building and data preprocessing we developed the Salford Text Mining (STM) software, a stand-alone collection of tools that perform all the essential steps in preparing text documents for text mining. STM builds on the Python “Natural Language Toolkit” (NLTK). From NLTK we use the following tools: Tokenizer (extracts items most likely to be “words”); Porter Stemmer (recognizes different simple forms of the same word, e.g. plurals); WordNet lemmatizer (more complex recognition of variations of the same word); stop word list (words that contribute little to no value, such as “the”, “a”). Future versions of STM might use other tools to accomplish these essential tasks. “stm.exe” is a command line utility that must be run from a Command Prompt window (assuming you are running Windows, go to the Start – All Programs – Accessories – Command Prompt menu). The version provided here resides in the stmtutor\STM\bin folder.
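As a worked illustration of the IDF formula and the tf * idf product, consider the sketch below. It assumes natural logarithms (the slide does not specify a base) and a tiny invented corpus of three token lists; the function names are ours.

```python
import math
from collections import Counter

def inverse_document_frequency(documents):
    """IDF(term) = log(number of documents / number of documents containing term)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))        # count each term once per document
    return {t: math.log(n_docs / df) for t, df in doc_freq.items()}

def tf_idf(tokens, idf):
    """Term frequency within the document multiplied by the corpus-level IDF."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items()}

corpus = [["red", "ipod"], ["ipod", "mini"], ["ipod"]]   # invented example corpus
idf = inverse_document_frequency(corpus)
weights = tf_idf(corpus[0], idf)
```

The term "ipod" occurs in every document, so its IDF is log(3/3) = 0 and it receives no weight, while "red" occurs in only one document and gets the maximum IDF of log(3).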
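The steps above can be sketched in plain Python as follows. This is an approximation for illustration, not the STM implementation: the stop word set is a tiny sample, the `LEMMAS` lookup table is a hypothetical stand-in for the WordNet lemmatizer, and tokenization is a simple whitespace split.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "in", "und", "mit"}             # tiny illustrative list
LEMMAS = {"went": "go", "going": "go", "ipods": "ipod"}   # stand-in for WordNet lemmatization

def build_dictionary(texts, min_count=5, short_len=2, short_min=10):
    """Return {stem: count} following the dictionary-building steps above."""
    # Read all text into one string and tokenize it
    tokens = " ".join(texts).lower().split()
    tokens = [t.strip(".,;:!?()\"'") for t in tokens]
    # Remove tokens without any letters or digits
    tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]
    # Remove stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Remove rare small words
    raw_counts = Counter(tokens)
    tokens = [t for t in tokens
              if not (len(t) < short_len and raw_counts[t] < short_min)]
    # Map grammatical variants onto a common stem, then drop infrequent stems
    stems = Counter(LEMMAS.get(t, t) for t in tokens)
    return {stem: n for stem, n in stems.items() if n >= min_count}

texts = ["the iPod mini", "iPods mit case --", "ipod x"] * 3   # invented sample
dictionary = build_dictionary(texts, min_count=3)
```

In this invented sample "ipod" and "ipods" are merged into one stem, while "the", "mit", "--" and the rare one-letter token "x" are dropped.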
  • 50. STM Commands and Options Open a Command Prompt window in Windows, then CD to the “stmtutor\STM” folder location. For example, on our system you would type in cd c:\stmtutor\STM To obtain help, type the following at the prompt: bin\stm --help This command will return very concise information about STM: stm [-h] [-data DATAFILE] [-dict DICTFILE] [-source-dict SRCDICTFILE] [-score SCOREFILE] [-spm SPMAPP] [-t TARGET] [-ex EXCLUDE] etc. The details for each command line option are contained in the software manual appearing in the appendix. You will also notice the “stm.cfg” configuration file. This file controls the default behavior of the STM module and relieves you of specifying a large number of configuration options each time “stm.exe” is launched. Note the TEXT_VARIABLES : 'ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE' line, which specifies the names of the text variables to be processed.
  • 51. Create Dictionary Options For the purposes of this tutorial, we have prepackaged all of the text processing steps into individual command files (extension *.bat). You can either double-click on the referenced command file or alternatively type its contents into the Command Prompt window opened in the directory that contains the files. The most important arguments for our purposes in this tutorial are: --dataset DATAFILE, the name and location of your input CSV format data set; --dictionary DICTFILE, the name and location of the dictionary to be created. These two arguments are all you need to create your dictionary. By default, STM will process every text field in your input data set to create a single omnibus dictionary. Simply double-click on “stm_create_dictionary.bat” to create the dictionary file for the DMC 2006 dataset, which will be saved in the “dmc2006_ynm.dict” file in the “stmtutor\STM\dmc2006” folder. In typical text mining practice the process of generating the final dictionary will be iterative. A review of the first dictionary might reveal further words you wish to exclude (“stop” words).
  • 52. Internal Dictionary Format The dictionary file is a simple text file with extension *.dict. The file contents can be viewed and edited in a standard text editor. The name of the text mining variable that will be created later on appears on the left of the “=” sign on each un-indented line. The default value that will be assigned to this variable appears on the right side of the “=” sign of the un-indented lines, and it usually means the absence of the word(s) of interest. Each indented line represents the value (left of the “=”) which will be entered for a single occurrence in a document of any of the word(s) appearing on the right of the “=”. More than one occurrence will be recorded as “many” when requested (always the case in this tutorial).
  • 53. Hand Made Dictionary To use a multi-level coding you need to create a “hand made dictionary”, which is already supplied to you as “hand.dict” in the “stmtutor\STM\dmc2006” folder. Here is an example of an entry in this file:
      hand_model=standard
          mini
          nano
          standard
    The un-indented line of an entry starts with the name we wish to give to the term (HAND_MODEL) and also indicates that a BLANK or missing value is to be coded with the default value of “standard”. The remaining indented entries are listed one per line and are an exhaustive list of the acceptable values which the term HAND_MODEL can receive in the term vector. Another coding option is, for example:
      hand_unused=no
          yes=unbenutzt,ungeoffnet
    which sets “no” as the default value but substitutes “yes” if one of the two words listed on the indented line is encountered. You may study additional examples in our stmtutor\STM\dmc2006\hand.dict file on your own; all of them were created manually based on common sense logic.
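To make the file layout concrete, here is a small reader for entries of the kind shown above. It reflects only our reading of the format as described on these slides (un-indented `name=default` lines, indented `value` or `value=word,word` lines); it is not the parser used by STM, and the name `parse_dictionary` is hypothetical.

```python
def parse_dictionary(text):
    """Parse dictionary text into {name: {"default": ..., "values": {value: [words]}}}."""
    entries = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue                                  # skip blank lines
        if not line[0].isspace():
            # Un-indented line: variable name and its default value
            name, _, default = line.partition("=")
            current = {"default": default.strip(), "values": {}}
            entries[name.strip()] = current
        else:
            # Indented line: a value and the word variants that trigger it
            value, sep, words = line.strip().partition("=")
            variants = [w.strip() for w in words.split(",")] if sep else [value]
            current["values"][value] = variants
    return entries

SAMPLE = """hand_model=standard
    mini
    nano
    standard
hand_unused=no
    yes=unbenutzt,ungeoffnet
"""
entries = parse_dictionary(SAMPLE)
```

A bare indented value such as `mini` is treated here as both the value and the word that triggers it, which is an assumption on our part.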
  • 54. Why Create Hand Made Dictionary Entries Let's revisit the variable HAND_MODEL, which brings together the terms standard, mini, and nano. Without a hand made dictionary entry we would have three terms created, one for each model type, with “yes” and “no” values, and possibly “many”. By creating the hand made entry we ensure that every auction is assigned a model (default=“standard”) and that all three models are brought together into one categorical variable with three possible values: “standard”, “mini”, and “nano”. This representation of the information is helpful when using tree-based learning machines but not helpful for regression-based learning machines. The best choice of representation may vary from project to project. Salford regression-based learning machines automatically repackage categorical predictors into 0/1 indicators, meaning that you work with one representation. But if you need to use other tools you may not have this flexibility.
  • 55. Further Dictionary Customization The following table summarizes some of the important fields introduced in the custom dictionary for this tutorial
      Variable          Values                      Combines word variants
      CAPACITY          20                          20gb,20 gb,20 gigabyte
                        30                          30gb,30 gb,30 gigabyte
                        40                          40gb,40 gb,40 gigabyte
                        80                          80gb,80 gb,80 gigabyte
                        …                           …
      STATUS            Wieneu                      Wie neu,super gepflegt,top gepflegt,top zustand,neuwertig
                        Neu                         neu,new,brandneu,brandneues
                        Unbenutzt                   Unbenu
                        defekt                      defekt.,--defekt--,defekt,-defekt-,-defekt,defekter,defektes
      MODEL             Mini, nano, standard        Captures presence of the corresponding word in the auction description
      COLOR             Black, white, Green, etc.   Captures presence of the corresponding words or variants in the auction description
      IPOD_GENERATION   First, second, etc.         Identified iPod generation from the information available in the text description
  • 56. Final Stage Dictionary Extraction To generate a final version of the dictionary in most real world applications you would also need to prepare an expanded list of stopwords. The NLTK provides a ready-made list of stopwords for English and another 14 major languages spanning Europe, Russia, Turkey, and Scandinavia. These appear in the directory named stmtutor\STM\data\corpora\stopwords and should be left as they are. Additional stopwords, which might well vary from project to project, can be entered into the file named “stopwords.dat” in the “stmtutor\STM\data” folder. In the package distributed with this tutorial the “stopwords.dat” file is empty. You can freely add words to this file, with one stopword per line. Once the custom “stopwords.dat” and “hand.dict” files have been prepared, you just run the dictionary extraction again but with the “--source-dictionary” argument added (see the command files introduced in the later slides). The resulting dictionary will now include all the introduced customizations.
  • 57. Creating Structured Text Mining Variables The resulting dictionary file “dmc2006_ynm.dict” contains about 600 individual stems. In the final step of text processing the data dictionary is applied to each document entry. Each stem from the dictionary is represented by a categorical variable (usually binary) with the corresponding name. The preparation process checks whether any of the known word variants associated with each stem from the dictionary are present in the current auction description; if so, the corresponding value is set to “yes”, otherwise it is set to “no”. When the “--code YNM” option is set, multiple occurrences will be coded as “many”. You can also request integer codes 0, 1, 2 in place of the character “yes/no/many”. We have experimented with alternative variants of coding (see the “--code” help entry in the STM manual) and came to the conclusion that the “YNM” approach works best in this tutorial. Feel free to experiment with alternative coding schemes on your own. The resulting large collection of variables will be used as additional predictors in our modeling efforts. Even though other more computationally intense text processing methods exist, further investigation failed to demonstrate their utility on the current data, which is most likely related to the extremely terse nature of the auction descriptions.
  • 58. Creating Additional Variables Finally, we put additional effort into reorganizing the original raw variables into more useful measures: MONTH_OF_START, based on the recorded start date of the auction; MONTH_OF_SALE, based on the recorded closing date of the auction; HIGH_BUY_IT_NOW, set to “yes” if BUY_IT_NOW_PRICE exceeds the CATEGORY_AVG_GMS, as suggested by common sense and the nature of the classification problem. In the original raw data, BUY_IT_NOW_PRICE was set to 0 on all items where that option was not available; we reset all such 0s to missing. All of these operations are encoded in the “” Python file located in the “stmtutor\STM\dmc2006” folder. This component of the STM is under active development. The file is automatically called by the main STM utility. You may add to or modify the contents of this file to allow alternative transformations of the original predictors.
  • 59. Generation of the Analysis Data Set At this point we are ready to move on to the next step, which is data creation. This is nothing more than appending the relevant columns of data to the original data set. Remember that the dictionary may contain tens of thousands if not hundreds of thousands of terms. For the DMC2006 dataset the dictionary is quite small by text mining standards, containing just a little over 600 words. To generate the processed dataset, simply double-click on the stm_ynm.bat command file or explicitly type in its contents in the Command Prompt. The “--dataset” option specifies the input dataset to be processed. The “--code YNM” option requests the “yes/no/many” style of coding. The “--source-dictionary” option specifies the hand dictionary. The “--process” option specifies the output dataset. Of course you may add other options as you prefer. This creates a processed dataset with the name dmc2006_res_ynm.csv, which resides in the stmtutor\STM\dmc2006 folder.
  • 60. Analysis Data Set Observations At this point we have a new modeling dataset with the text information represented by the extra variables. Note that the raw input data set is just shy of 3 MB in size in a plain text format, while the prepared analysis data set is about 40 MB in size, 13 times larger. Process only training data or all data? For prediction purposes all data needs to be processed, both the data that will be used to train the predictive models and the holdout or future data that will receive predictions later. In the DMC2006 data we happen to have access to both training and holdout data and thus have the option of processing all the text data at the same time. Generating the term vector based only on the training data would generally be the norm because future data flows have not yet arrived. In this project we elected to process all the data together for convenience, knowing that the train and holdout partitions were created by random division of the data. It is worth pointing out, though, that the final dictionary generated from training data only might be slightly different due to the infrequent word elimination component of the text processor.
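The “yes/no/many” coding of a single stem can be sketched as follows. This is an illustration of the idea rather than STM code; the function name `code_ynm` is hypothetical, matching is done on whitespace-separated tokens only, and multi-word variants such as “wie neu” would need extra handling.

```python
def code_ynm(text, variants):
    """Code one stem as "no", "yes", or "many" for one auction description."""
    tokens = text.lower().split()
    # Count occurrences of any known word variant of the stem
    hits = sum(tokens.count(variant) for variant in variants)
    if hits == 0:
        return "no"
    return "yes" if hits == 1 else "many"

neu_variants = ["neu", "new", "brandneu", "brandneues"]
example = code_ynm("iPod mini NEU new", neu_variants)
```

Here the description contains two variants of the same stem, so the value is "many"; a description with a single variant would be coded "yes".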
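The transformations listed above are simple enough to sketch directly. The code below is not the file shipped with STM: the input field names `START_DATE` and `END_DATE` are hypothetical, and coding HIGH_BUY_IT_NOW as “no” when the price is missing is our assumption.

```python
from datetime import date

def derive_variables(record):
    """Derive the additional variables described above from one raw auction record."""
    price = record["BUY_IT_NOW_PRICE"]
    if price == 0:
        price = None                      # 0 means the option was not offered: treat as missing
    high = price is not None and price > record["CATEGORY_AVG_GMS"]
    return {
        "MONTH_OF_START": record["START_DATE"].month,   # hypothetical input field name
        "MONTH_OF_SALE": record["END_DATE"].month,      # hypothetical input field name
        "BUY_IT_NOW_PRICE": price,
        "HIGH_BUY_IT_NOW": "yes" if high else "no",     # "no" for missing price is an assumption
    }

raw = {
    "START_DATE": date(2005, 11, 28),
    "END_DATE": date(2005, 12, 5),
    "BUY_IT_NOW_PRICE": 0,
    "CATEGORY_AVG_GMS": 120.0,
}
derived = derive_variables(raw)
```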
  • 61. Quick Modeling Round with CART We are now ready to proceed with another CART run this time using all of the newly created text fields as additional predictors Assuming that you already have SPM launched  Go to the File – Open – Data File menu  Make sure that the Files of Type is set to ASCII  Highlight the dmc2006_res_ynm.csv dataset  Press the [Open] button Salford Systems © Copyright 2011 61
  • 62. Dataset Summary Window Again, the resulting window summarizes basic facts about the dataset Note the dramatic increase in the number of available variables Salford Systems © Copyright 2011 62
  • 63. The View Data Window Press the [View Data…] button to have a quick look at the physical contents of the dataset Note how the individual dictionary word entries are now coded with the “yes”, “no”, or “many” values for each document row Salford Systems © Copyright 2011 63
  • 64. Setting Up CART Model Proceed with setting up a CART modeling run as before:  Make the Classic Output window active  Go to the Model – Construct Model… menu (alternatively, you could use one of the buttons located on the bar right below the menu)  In the resulting Model Setup window make sure that the Analysis Method is set to CART  In the Model tab make sure that the Sort is set to File Order and the Tree Type is set to Classification  Check GMS_GREATER_AVG as the Target  Check all of the remaining variables except AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, and CATEGORY_AVG_GMS as predictors  You should see something similar to what is shown on the next slide Salford Systems © Copyright 2011 64
  • 65. Model Setup Window: Model Tab
  • 66. Model Setup Window: Testing Tab Switch to the Testing tab and confirm that the 10-fold cross-validation is used as the optimal model selection method Salford Systems © Copyright 2011 66
  • 67. Model Setup Window: Advanced Tab Switch to the Advanced tab and set the minimum required number of records for the parent nodes and the child nodes at 15 and 5 These limits were chosen to avoid extremely small nodes in the resulting tree Salford Systems © Copyright 2011 67
  • 68. Building CART Model Press the [Start] button, building progress window will appear for a while and then the Navigator window containing model results will be displayed (this time, the process takes a few minutes!) Press on the little button right above the [+][-] pair of buttons, along the left border of the Navigator window, note that all trees within one standard error (SE) of the optimal tree are now marked in green Use the arrow keys to select the 102-node tree from the tree sequence, which is the smallest 1SE tree Salford Systems © Copyright 2011 68
  • 69. CART Model Performance The selected CART model contains 102 terminal nodes where nearly all available predictor variables play a role in the tree construction Area under the ROC curve (Test) is now an impressive 0.830, especially when compared to the one reported earlier at 0.748 for the basic CART run or the 0.800 for the basic TN run Press on the [Summary Reports] button in the Navigator window, select the Prediction Success tab, and finally press the [Test] button to see cross-validated test performance at 76.58% classification accuracy – a significant improvement! Also note the presence of the original and derived variables on the list shown in the Variable Importance tab Salford Systems © Copyright 2011 69
  • 70. Setting Up TN Model Now switch to the Classic Output window and go to the Model – Construct Model… menu Choose TreeNet as the Analysis Method In the Model tab make sure that the Tree Type is set to Logistic Binary Salford Systems © Copyright 2011 70
  • 71. Setting Up TN Parameters Switch to the TreeNet tab and do the following:  Set the Learnrate: to 0.05  Set the Number of trees to use: to 800  Leave all of the remaining options at their default values Salford Systems © Copyright 2011 71
  • 72. TN Results Window Press the [Start] button to initiate TN modeling run, the TreeNet Results window will appear in the end, even though you might want to take a coffee break until the modeling run completes Salford Systems © Copyright 2011 72
  • 73. Checking TN Performance Press on the [Summary] button and switch to the Prediction Success tab Press the [Test] button to view cross-validation results Lower the Threshold: to 0.47 to roughly equalize classification accuracy in both classes (this makes it easier to compare the TN performance with the earlier reported CART and TN model performance) You can clearly see the improvement! Salford Systems © Copyright 2011 73
  • 74. Requesting TN Graphs
    Here we present a sample of the 2-D contribution plots produced by TN for the resulting model
    The plots are available by pressing the [Display Plots…] button in the TreeNet Results window
    The list is arranged according to the variable importance table
  • 75. More Graphs
  • 76. Insights Suggested by the Model
    Here is a list of insights we arrived at by looking at the selection of plots
      ▪ There is a distinct effect of the iPod category once all the other factors have been accounted for
      ▪ A larger start price is associated with an above-average sale (most likely this relates to the quality of the item)
      ▪ A “new” and “unpacked” item should fetch a better price, while any “defect” brings the price down
      ▪ End of the year means better sales
      ▪ Having a good feedback score is important
      ▪ It is best to wait 10 days or more before closing the deal
      ▪ Interestingly, the 1st and 3rd generations of iPod show poorer sales than the 2nd and 4th
      ▪ 2G started to fall out of favor in 2005-2006
      ▪ Black is much more popular in Germany than other colors
      ▪ Mentioning “photo”, “video”, “color display”, etc. helps get a better price
      ▪ The paid advertising features are of little or marginal importance
  • 77. Final Validation of Models
    At this point we are ready to check the performance of all our models using the remaining 8,000 auctions that were originally not available for training
    This way each model can be positioned with respect to all 173 official entries originally submitted to the DMC 2006 competition
    However, in order to proceed with the evaluation, we must first score the input data using all of the models we have generated so far
    The following slides explain how to score the most recently constructed CART and TN models; the earlier models can be scored using similar steps
    You may choose to skip the scoring steps, as we have already included the results of scoring in the “stmtutor\STM\scored” folder:
      ▪ Score_cart_raw.csv – simple CART model predictions
      ▪ Score_tn_raw.csv – simple TN model predictions
      ▪ Score_cart_txt.csv – text mining enhanced CART model predictions
      ▪ Score_tn_txt.csv – text mining enhanced TN model predictions
  • 78. Scoring a CART Model
    Select the Navigator window for the model you wish to score
    Select the tree from the tree sequence (in our runs we pick the 1SE trees as more robust)
    Press the [Score] button to open the “Score Data” window
    Make sure that the “Data file” is set to “dmc2006_res_ynm.csv”; if not, press the [Select…] button on the right and select the dataset to be scored
    Place a checkmark in the “Save results to a file” box, then press the [Select] button right next to it; this will open the “Save As” window
    Navigate to the “stmtutor\STM\scored” folder under the “Save in:” selection box, enter “Scored_cart_txt.csv” in the “File name:” text entry box, and press the [Save] button
    You should now see something similar to what's shown on the next slide
    Press the [OK] button to initiate the scoring process
    You should now have the Scored_cart_txt.csv file in the stmtutor\STM\scored folder
  • 79. Scoring CART
  • 80. Scoring a TN Model
    Select the “TreeNet Results” window for the model you wish to score
    Go to the “Model – Score Data…” menu to open the “Score Data” window
    Make sure that the “Data file” is set to “dmc2006_res_ynm.csv”; if not, press the [Select…] button on the right and select the dataset to be scored
    Place a checkmark in the “Save results to a file” box, then press the [Select] button right next to it; this will open the “Save As” window
    Navigate to the “stmtutor\STM\scored” folder under the “Save in:” selection box, enter “Scored_tn_txt.csv” in the “File name:” text entry box, and press the [Save] button
    You should now see something similar to what's shown on the next slide
    Press the [OK] button to initiate the scoring process
    You should now have the Scored_tn_txt.csv file in the stmtutor\STM\scored folder
  • 81. Scoring TN
  • 82. Using STM to Validate Performance
    We can now use the STM machinery to do final model validation
    Simply double-click the “stm_validate.bat” command file to proceed
    Note the use of the following options inside of the command file:
      ▪ “-score” – specifies the output dataset where the model predictions will be written
      ▪ “--score-column” – specifies the name of the variable containing the actual model predictions (these variables are produced by CART or TN during the scoring process)
      ▪ “--check” – specifies the name of the dataset that contains the originally withheld values of the target
          ▪ this dataset was used by the organizers of the DMC 2006 competition to select the actual winners
          ▪ STM is currently configured to validate only the bottom 8,000 of the 16,000 predictions generated by the model; the top 8,000 records (used for learning) are simply ignored
    The results will be saved into text files with the extension “*.result” appended to the original score file names in the “stmtutor\STM\scored” folder
  • 83. Validation Results Format
    The following window shows the validation results of the final TN model we built
    8,000 validation records were scored, of which:
      ▪ 719 ones were misclassified as zeroes
      ▪ 807 zeroes were misclassified as ones
    Thus 1,526 documents were misclassified
    This gives the final score of 8,000 – (1,526 * 2) = 4,948
  • 84. Final Validation of Models
    Based on the predicted class assignments, the final performance score is calculated as 8,000 minus twice the total number of auction items misclassified
    The following table summarizes how these virtually out-of-the-box elementary models perform on the holdout data (the values are extracted from the four *.result files produced by the STM validator)
      Model | ROC Area | Missed 0s | Missed 1s | Score
      CART raw data | 75% | 1123 | 1387 | 2980
      TN raw data | 80% | 1308 | 926 | 3532
      CART text data | 83% | 981 | 848 | 4342
      TN text data | 89% | 807 | 719 | 4948
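    The scoring rule is simple enough to verify by hand, and the Python sketch below recomputes the Score column from the two "missed" counts reported for each model. The function and variable names are ours; only the counts and the formula come from the slides.

```python
HOLDOUT_SIZE = 8000  # number of validation auctions


def dmc_score(missed_zeros, missed_ones, total=HOLDOUT_SIZE):
    """Competition score: total records minus twice the number misclassified."""
    misclassified = missed_zeros + missed_ones
    return total - 2 * misclassified


# (missed 0s, missed 1s) as reported in the *.result files
results = {
    "CART raw data": (1123, 1387),
    "TN raw data": (1308, 926),
    "CART text data": (981, 848),
    "TN text data": (807, 719),
}

scores = {model: dmc_score(*counts) for model, counts in results.items()}
for model, score in scores.items():
    print(f"{model}: {score}")
```

    Because every error costs two points, a model that misclassifies half of the 8,000 records scores zero, and only a model with fewer than 4,000 errors earns a positive score.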
  • 85. Visual Validation of the Results
    The following graph summarizes the positioning of the four basic models with respect to the 173 official competition entries
    The TN model with text mining processing is among the top 10 winners!
    (Graph markers: TN text, CART text, TN raw, CART raw)
  • 86. Observations on the Results
    We used the most basic form of text mining, the Bag of Words, with minor emendations
      ▪ None of the authors speaks German, although we did look up some of the words in an online dictionary. If there are any subtleties to be picked up from the sellers' choice of wording, we would have missed them.
    We chose the coding scheme that performed best on the training data. We have six coding options, and one stands out as clearly best
    We used common settings for the CART and TreeNet controls
    We did not use any of the modeling refinement techniques we teach in our CART and TreeNet tutorials
    We thus invite you to see if you can push the performance of these models even higher
  • 87. Command Line Automation in SPM
    SPM has a powerful command line processing component which allows you to completely reproduce any modeling activity by creating and later submitting a command file
    We have packaged the command files for the four modeling and scoring runs you have conducted in the course of this tutorial
      ▪ SPM command files must have the extension *.cmd
      ▪ The four command files are stored in the “stmtutor\STM\dmc2006” folder
    You can create, open, or edit a command file using a simple text editor such as Notepad
    SPM has a built-in editor; just go to the File – New Notepad… menu
    You may also access the command line directly from inside the SPM GUI; just make sure that the File – Command Prompt menu item is checked
    Type “help” in the Command Prompt part (starts with the “>” mark) of the Classic Output window to get a listing of all available commands
    You can then request more detailed help for any specific command of interest; for example, “help battery” will produce a long list of the batteries of automated runs available in SPM
    Furthermore, you may view all of the commands issued during the current session by going to the View – Open Command Log… menu; this way you can quickly learn which commands correspond to your recent GUI activity
  • 88. Basic CART Model Command File
    You may now restart SPM to emulate a fresh run
    Go to the File – Open – Command File… menu
    Select the “cart_raw.cmd” command file and press the [Open] button
    The file is now opened in the built-in Notepad window
  • 89. CART Command File Contents
      ▪ OUT – saves the classic output into a text file
      ▪ USE – points to the modeling dataset
      ▪ GROVE – saves the model as a binary grove file
      ▪ MODEL – specifies the target variable
      ▪ CATEGORY – indicates which variables are categorical, including the target
      ▪ KEEP – specifies the list of predictors
      ▪ LIMIT – sets the node limits
      ▪ ERROR – requests cross-validation
      ▪ BUILD – builds a CART model
      ▪ SAVE – names the file where the CART model predictions will be saved
      ▪ HARVEST – specifies which tree is to be used in scoring
      ▪ IDVAR – requests saving of the additional variables into the output dataset
      ▪ SCORE – scores the CART model
      ▪ OUTPUT * – closes the current text output file
    Note the use of the relative paths in the GROVE and SAVE commands
    Also note the use of the forward slash “/” to separate folder names
  • 90. Submitting Command File
    With the Notepad window active, go to the File – Submit Window menu to submit the command file into SPM
    In the end you will see the Navigator and the Score windows opened, which should be identical to the ones you have already seen in the beginning of this tutorial
    Furthermore, you should now have
      ▪ the “cart_raw.dat” text file created in the “stmtutor\STM\dmc2006” folder; the file contains the classic output you normally see in the “Classic Output” window
      ▪ the “cart_raw.grv” binary grove file created in the “stmtutor\STM\models” folder; the file contains the CART model itself and can be opened in the GUI using the File – Open – Open Grove… menu, which reopens the Navigator window; this file will also be needed for future scoring or translation
      ▪ the “Score_cart_raw.csv” data file created in the “stmtutor\STM\scored” folder; the file contains the selected CART model predictions on your data
    You may now proceed with opening the “tn_raw.cmd” file using the File – Open – Command File… menu
  • 91. TN Command File Contents
      ▪ OUT, USE, GROVE, MODEL, CATEGORY, KEEP, ERROR, SAVE, IDVAR, SCORE, OUTPUT – same as in the CART command file introduced earlier
      ▪ MART TREES – sets the TN model size in trees
      ▪ MART NODES – sets the tree size in terminal nodes
      ▪ MART MINCHILD – sets the minimum individual node size in records
      ▪ MART OPTIMAL – sets the evaluation criterion that will be used for optimal model selection
      ▪ MART BINARY – requests logistic regression processing in our case
      ▪ MART LEARNRATE – sets the learnrate parameter
      ▪ MART SUBSAMPLE – sets the sampling rate
      ▪ MART INFLUENCE – sets the influence trimming value
    The rest of the MART commands request automatic saving of the 2-D and 3-D plots into the grove; type “help mart” to get full descriptions
  • 92. Submitting the Rest of the Command Files
    Again, with the current Notepad window active, use the File – Submit Window menu to launch the basic TN modeling run, automatically followed by scoring
    This will create the output, grove, and scored data files in the corresponding locations for the chosen TN model; also note the use of the EXCLUDE command in place of the KEEP command inside of the command file – this saves a lot of typing
    Now go back to the Classic Output window and notice that the File menu has changed
    Go to the File – Submit Command File… menu, select the “cart_txt.cmd” command file, and press the [Open] button
    Notice the modeling activity in the Classic Output window, but no Results window is produced – this is how the Submit Command File… menu item differs from the Submit Window menu item used previously; nonetheless, the output, grove, and score files are still created in the specified locations
    Use the File – Open – Open Grove… menu to open the “tn_raw.grv” file located in the “stmtutor\STM\models” folder; you will need to navigate into this folder using the Look in: selection box in the Open Grove File window
    You may now proceed with the final TN run by submitting the “tn_txt.cmd” command file using either the File – Open – Command File… / File – Submit Window route or the File – Submit Command File… route – don't forget that it takes a long time to run!
  • 93. Final Remarks
    This completes the Salford Systems Data Mining and Text Mining tutorial
    In the process of going through the tutorial you have learned how to use both the GUI and the command line facilities of SPM, as well as the command line text mining facility STM
    You have built two CART models and two TN models, and enriched the original dataset with a variety of text mining fields
    The final model puts you among the top winners in a major text mining competition – a proud achievement
    Even though we have barely scratched the surface, you are now ready to explore the remainder of the vast data mining activities offered within SPM and STM on your own
    We wish you the best of luck on the exciting and never-ending road of modern data analysis and exploration
    And don't forget that you can always reach us should you have further modeling questions and needs
  • 95. STM Command Reference
    Salford Text Miner (STM) is a simple utility intended to make the text mining process much easier. To that end, the application described in this manual accepts a number of parameters and can execute Salford Predictive Modeler (SPM) as its data mining backend
    STM Workflow:
      ▪ Automatically generate a dictionary based on the dataset
      ▪ Process the dataset and generate a new one with additional columns based on the dictionary
      ▪ Generate a model folder with the dataset, command file, and dictionary
      ▪ Run SPM with the generated command file
      ▪ Run the checking process, comparing the scoring results with the real classes
    All of these steps can be done in separate STM calls or in one call
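    To make the first workflow step concrete, here is a minimal Python sketch of bag-of-words dictionary extraction with stop-word removal and a frequency threshold. It is not STM's implementation: the tokenizer, the function name, the sample titles, and the choice to count total occurrences rather than rows are all assumptions made for illustration.

```python
import re
from collections import Counter


def extract_dictionary(texts, stop_words, min_frequency):
    """Collect lower-cased words that are not stop words and occur often enough."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    # Keep words seen at least `min_frequency` times (total occurrences).
    return sorted(word for word, n in counts.items() if n >= min_frequency)


# Invented listing titles in the spirit of the DMC 2006 data.
titles = [
    "Apple iPod mini 4GB neu",
    "iPod nano 2GB schwarz neu",
    "Apple iPod photo 30GB defekt",
]
dictionary = extract_dictionary(titles, stop_words={"und", "mit"}, min_frequency=2)
print(dictionary)
```

    Each word that survives the threshold would then become one additional column in the processed dataset.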
  • 96. STM Command Reference
      Short Option | Long Option | Description
      -data DATAFILE | --dataset DATAFILE | Specify the dataset to work with
      -dict DICTFILE | --dictionary DICTFILE | Specify the dictionary to work with
      -source-dict SDFILE | --source-dictionary SDFILE | Dictionary that is used as the source for the automatic dictionary retrieval process
      -score SFILE | --scoreresult SFILE | Specify the file with the score result for the checking process. Default – 'score.csv'
      -spm SPMAPP | --spmapplication SPMAPP | Path to the spm application. Default – 'spm.exe'
      -t TARGET | --target TARGET | Target variable used to generate the command file. Default – 'GMS_GREATER_AVG'
      -ex EXCLUDE | --exclude EXCLUDE | List of variables to exclude from the keep list when generating the command file
      -cat CATEGORY | --category CATEGORY | List of variables to select as category variables when generating the command file
  • 97. STM Command Reference
      Short Option | Long Option | Description
      -templ CMDTEMPL | --cmdtemplate CMDTEMPL | Specify the command file template that will be used for generation. Default – 'data/template.cmd'
      -md MODEL_DIR | --modeldir MODEL_DIR | Directory where the model folders will be created. Default – 'models'
      -trees TREES | --trees TREES | Parameter for TreeNet command files: the number of trees to build. Default – 500
      -maxnodes MAXNODES | --maxnodes MAXNODES | Parameter for TreeNet command files: the number of nodes in each tree. Default – 6
      -fixwords | --fixwords | Enables heuristics that try to fix words (finding the nearest word by different metrics, spell checking, etc.)
      -textvars VARLIST | --text-variables VARLIST | Comma-separated list of variables that will be used in the dictionary retrieval process
  • 98. STM Command Reference
      Short Option | Long Option | Description
      -outrmwords | --output-removed-words | Enables writing the removed stop words to the file 'data/removed.dat'
      -code CODE | --column-coding CODE | Specify how to code the absence/presence of a word in a row. Default – YN
          ▪ YN or 0 – no/yes
          ▪ YNM or 1 – no/yes/many
          ▪ 01 or 2 – 0/1
          ▪ 012 or 3 – 0/1/2
          ▪ TF or 4 – term frequency
          ▪ IDF or 5 – inverse document frequency
          ▪ TF-IDF or 6 – TF-IDF
          ▪ TC or 7 – term count (0,1,2,…)
      -mp MODELPATH | --model-path MODELPATH | Specify the path where the model files will be created
      -cmd-path CMDPATH | --command-file-path CMDPATH | Specify the path to the command file that will be executed by Salford Predictive Modeler
      -ppfile PPFILE | --preprocess-file PPFILE | Path to Python code that will be executed during the process step to manipulate the data
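    The coding options differ only in how a word's count within a row is turned into a column value. The Python sketch below illustrates the count-based codes and a textbook TF-IDF; it is not STM's code. We assume that "many" means two or more occurrences, that YN/YNM values are stored as the strings shown, and that TF-IDF uses the common count/length times log(N/df) form, none of which is confirmed by the option table.

```python
import math


def code_count(count, scheme):
    """Turn a word's occurrence count in one row into a coded column value."""
    if scheme == "YN":
        return "yes" if count > 0 else "no"
    if scheme == "YNM":
        # Assumption: "many" means the word occurs at least twice in the row.
        return "no" if count == 0 else ("yes" if count == 1 else "many")
    if scheme == "01":
        return 1 if count > 0 else 0
    if scheme == "012":
        return min(count, 2)
    if scheme == "TC":
        return count
    raise ValueError(f"unsupported coding scheme: {scheme}")


def tf_idf(count, words_in_row, rows_with_word, total_rows):
    """Textbook TF-IDF (assumed form): relative frequency times log inverse document frequency."""
    if words_in_row == 0 or rows_with_word == 0:
        return 0.0
    tf = count / words_in_row
    idf = math.log(total_rows / rows_with_word)
    return tf * idf


coded = {scheme: code_count(3, scheme) for scheme in ("YN", "YNM", "01", "012", "TC")}
print(coded)
print(tf_idf(count=1, words_in_row=4, rows_with_word=10, total_rows=100))
```

    The categorical codes (YN, YNM) suit tree-based learners well because a tree only needs to split on presence or absence, while the numeric codes preserve more information about how often a word appears.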
  • 99. STM Command Reference
      Short Option | Long Option | Description
      -rc NAME | --realclass-column-name | Specify the column name in the real class dataset for the check step. Default – GMS_GREATER_AVG
      -e | --extract | Run the first step – automatic extraction of the dictionary from the dataset. Requires --dataset
      -p OUTFILE | --process OUTFILE | Run the second step – process the dataset and create a new dataset named OUTFILE in which new columns are created according to the dictionary. Requires --dataset and --dictionary
      -g | --generate | Run the third step – generate the model folder with the command file. Requires --dataset and --dictionary
      -m | --model | Run the fourth step – run Salford Predictive Modeler with the generated command file. Works only with --generate
      -c DATASET | --check DATASET | Run the fifth step – check the score file against the real classes (from the specified DATASET) and output the misclassification table. Requires --scoreresult
      -h | --help | Show help
  • 100. STM Configuration File
      Name | Description | Default
      SPM_APPLICATION | Path to Salford Predictive Modeler | spm.exe
      CMD_TREES | Number of trees to build in TN models | 500
      CMD_NODES | Tree size for TN models | 6
      CMD_TEMPLATE | Command file template | data/template.cmd
      MODELS_DIR | Directory where the model folders will be created | models
      LANGUAGES | Languages whose stop words will be used | English, German
      SPELLCHECKER_DICT | Additional spell checker dictionary with words that are allowed (like “ipod”) | data/spellchecker_dict.dat
      SPELLCHECKER_LANGUAGE | Language for the spell checker | de_DE
      ADDITIONAL_STOPWORDS | File with additional stop words, which the user can edit | data/stopwords.dat
      REMOVED_WORDS_FILE | File where removed words will be written on the “extract” step | data/removed.dat
      WORD_FREQUENCY_THRESHOLD | Lower threshold on word frequency for words deleted on the “extract” step | 5
      PREPROCESS_FILE | Script to include for additional processing | dmc2006/
  • 101. STM Configuration File
      Name | Description | Default
      CHECK_RESULTS_FILE | | data/score_results.csv
      LOGFILE | Path to the log file. Can be a mask (%s for date) | log/stm%s.log
      TARGET | Default variable for the target argument, used to fill the command file template | GMS_GREATER_AVG
      EXCLUDE | Default variables for the exclude argument, used to fill the command file template | AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, GMS_GREATER_AVG
      CATEGORY | Default variable for the category argument, used to fill the command file template | GMS_GREATER_AVG
      SCORE_FILE | Name of the score file that needs to be checked | Score.csv
      TEXT_VARIABLES | Comma-separated list of text variables in the dataset | ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE
      DEFAULT_CODING | Default coding for the extract and preprocess steps | YN
      REALCLASS_COLUMN_NAME | Name of the column in the real class file used in the check step | GMS_GREATER_AVG
      SCORE_COLUMN_NAME | Name of the column in the score file used in the check step | PREDICTION