Getting Started with Text Mining:
 STM™, CART® and TreeNet®




            Dan Steinberg
          Mykhaylo Golovnya
           Ilya Polosukhin
              May, 2011
Text Mining and Data Mining

Text mining is an important and fascinating area of modern analytics
On the one hand text mining can be thought of as just another application
area for powerful learning machines
On the other hand, text mining is a distinct field with its own dedicated
concepts, vocabulary, tools, and techniques
In this tutorial we aim to illustrate some important analytical methods and
strategies from both perspectives on data mining
    introducing tools specific to the analysis of text, and,
    deploying general machine learning technology

The Salford Text Mining utility (STM) is a powerful text processing system
that prepares data for advanced machine learning analytics
Our machine learning tools are the Salford Systems flagship CART® decision
tree and stochastic gradient boosting TreeNet®
Evaluation copies of the proprietary CART and TreeNet technology, as well as
STM, are available from http://www.salford-systems.com
                           Salford Systems © Copyright 2011                   2
For Readers of this Tutorial

To follow along with this tutorial we recommend that you have the analytical tools we use
installed on your computer. Everything you need may already be on a CD
containing this tutorial and the analytical software
Create an empty folder named “stmtutor”; this is the root folder where all of the work
files related to this tutorial will reside
You may also use the following link to download Salford Systems Predictive Modeler
(SPM)
       http://www.salford-systems.com/dist/SPM/SPM680_Mulitple_Installs_2011_06_07.zip
After downloading the package, unzip its contents into “stmtutor” which will create a
new folder named “SPM680_Mulitple_Installs_2011_06_07”. Follow installation steps
described on the next slide.
For the original DMC2006 competition website visit
       http://www.data-mining-cup.de/en/review/dmc-2006/
We recommend that you visit the above site for information only; the data and tools for
preparing it are available at the URL below
For the STM package, prepared data files, and other utilities developed for this tutorial
please visit
         http://www.salford-systems.com/dist/STM.zip
After downloading the archive, unzip its contents into “stmtutor”
Important! Installing the SPM Software

The Salford Systems software you've just downloaded needs to be both
installed and licensed. No-cost license codes for a 30-day period are
available on request to visitors of this tutorial*
Double click on the “Install_a_Transform_SPM.exe” file located in the
“SPM680_Mulitple_Installs_2011_06_07” folder (see the previous slide) to
install the specific version of SPM used in this tutorial
    Following the above procedure will ensure that all of the currently installed
     versions of SPM, if any, will remain intact!




Follow simple installation steps on your screen
* Salford Systems reserves the right to decline to offer a no-cost license at its sole discretion
Important! Licensing the SPM Software

When you launch the Salford Systems Predictive Modeler (SPM) you will be
greeted with a License dialog containing information needed to secure a
license via email
Please send the necessary information to Salford Systems to secure your
license; enter the “Unlock Code” which will be e-mailed back to you
The software will operate for 3 days without any licensing; however, you can
secure a 30-day license on request


Installing the Salford Text Miner (STM)

In addition to the Salford Predictive Modeler (SPM) you will also work with the
Salford Text Miner (STM) software
No installation is needed and you should already have the “stm.exe”
executable in the “stmtutor\STM\bin” folder as the result of unzipping the
“STM.zip” package earlier
STM builds upon the Python 2.6 distribution and the NLTK (Natural Language
Tool Kit) but makes text data processing for analytics very easy to conduct
and manage
    You do not need to add any other support software to use STM

Expect to see several folders and a large number of files located under the
“stmtutor\STM” folder. It is important to leave these files in the location to
which you have installed them.
    Please do not MOVE or alter any of the installed files other than those explicitly
     listed as user-modifiable!

“stm.exe” will expire in the middle of 2012; contact Salford Systems to get an
updated version beyond that date
The Example Project

The best examples are drawn from real world data sets and we were
fortunate to locate data publicly released by eBay.
Good teaching examples also need to be simple.
    Unfortunately, real world text mining could easily involve hundreds of thousands if
     not millions of features characterizing billions of records. Professionals need to be
     able to tackle such problems but to learn we need to start with simpler situations.
    Fortunately, there are many applications in which text is important but the
     dimensions of the data set are radically smaller, either because the data available
     is limited or because a decision has been made to work with a reduced problem.

We use our simpler example to illustrate many useful ideas for beginning text
miners while pointing the way to working on larger problems.




The DMC2006 Text Mining Challenge

In 2006 the DMC data mining competition (restricted to student competitors
only) introduced a predictive modeling problem for which much of the
predictive information was in the form of unstructured text.
The datasets for the DMC 2006 data mining competition can be downloaded
from http://www.data-mining-cup.de/en/review/dmc-2006/
    For your convenience we have re-packaged this data and made it somewhat
     easier to work with. This re-packaged data is included in the STM package
     described near the beginning of this tutorial.

The data summarizes 16,000 iPod auctions held at eBay from May 2005
through May 2006 in Germany
Each auction item is represented by a text description written by the seller (in
German) as well as a number of flags and features available to the seller at
the time of the auction
Auction items were grouped into 15 mutually exclusive categories based on
distinct iPod features: storage size, type (regular, mini, nano), and color
The competition goal was to predict whether the closing price would be above
or below the category average
Comments on the Challenge

One might think that a challenge with text in German might not be of general
interest outside of Germany
However, working with a language essentially unfamiliar to any member of
the analysis team helps to illustrate one important point
    Text mining via tools that have no “understanding” of the language can be
     strikingly effective

We have no doubt that dedicated tools which embed knowledge of the
language being analyzed can yield predictive benefits
    We also believe we could have gained further valuable insight into the data if any
     of the authors spoke German! But our performance without this knowledge is still
     impressive.

In contexts where simple methods can yield more than satisfactory results, or
in contexts where the same methods must be applied uniformly across
multiple languages, the methods described in this tutorial will be an excellent
guide.


Configuring Work Location in SPM

The original datasets from the DMC 2006 challenge reside in the
“stmtutor\STM\dmc2006” folder
To facilitate further modeling steps, we will configure SPM to use this location
as the default location:
    Start SPM
    Go to the Edit – Options menu
    Switch to the Directories tab
    Enter the “stmtutor\STM\dmc2006”
     folder location in all text entry boxes
     except the last one
    Press the [Save as Defaults] button
     so that the configuration is restored
     the next time you start SPM




Configuring TreeNet Engine

Now switch to the TreeNet tab
    Configure the Plot Creation
     section as shown on the screen
     shot
    Press the
     [Save as Defaults]
     button
    Press the [OK] button
     to exit




Steps in the Analysis: Data Overview



1.   Describe the data: (Data Dictionary and Dimensions of Data)
         a.   What is the unit of observation? Each record of data is describing what?
         b.   What is the dependent or target variable?
         c.   What other variables (data base fields) are available?
         d.   How many records are available?

2.   Statistical Summary
         a.   Basic summary including means, quantiles, frequency tables
         b.   Dimensions of categorical predictors
         c.   Number of distinct values of continuous variables

3.   Outlier and Anomaly Assessment
         a.   Detection of gross data errors such as extreme values
         b.   Assessment of usability of levels of categorical predictors (rare levels)
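The checklist above can be sketched in a few lines of plain Python. This is an illustrative toy: the records and values are invented, and only the field names (START_PRICE, LISTING_TYPE_CODE) echo the data dictionary shown later in this tutorial.

```python
from statistics import mean, quantiles
from collections import Counter

# Toy auction records; field names echo the data dictionary, values invented
records = [
    {"START_PRICE": 1.0,  "LISTING_TYPE_CODE": "normal"},
    {"START_PRICE": 49.0, "LISTING_TYPE_CODE": "normal"},
    {"START_PRICE": 99.0, "LISTING_TYPE_CODE": "multi"},
    {"START_PRICE": 1.0,  "LISTING_TYPE_CODE": "normal"},
]

prices = [r["START_PRICE"] for r in records]
summary = {
    "mean": mean(prices),                 # step 2a: basic summary
    "quartiles": quantiles(prices, n=4),  # step 2a: quantiles
    "n_distinct": len(set(prices)),       # step 2c: distinct values
}
# step 2a/2b: frequency table of a categorical predictor
freq = Counter(r["LISTING_TYPE_CODE"] for r in records)

print(summary["mean"], summary["n_distinct"], freq["normal"])  # 37.5 3 3
```

A low distinct-value count for a nominally continuous field, or a level with a tiny frequency count, is exactly the kind of anomaly step 3 is meant to surface.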

Data Fundamentals

The original dataset is called “dmc2006.csv” and resides in the
“stmtutor\STM\dmc2006” folder
16,000 records divided into two equal sized partitions
    Part 1: Complete data including target, available for training during the competition
    Part 2: Data to be scored; during the competition the target was not available

25 database fields, two of which were unstructured text written by the seller
Each line of data describes an auction of an iPod including the final winning
bid price
An eBay seller must construct a headline and a description of the product
being sold. Sellers can also pay for selling assistance
    E.g. Seller can pay to list the item title in BOLD




The Data: Available Fields

  The following variables describe general features of each auction event


Variable                      Description
AUCT_ID                       ID number of auction
ITEM_LEAF_CATEGORY_NAME       products category
LISTING_START_DATE            start date of auction
LISTING_END_DATE              end date of auction
LISTING_DURTN_DAYS            duration of auction
LISTING_TYPE_CODE             type of auction (normal auction, multi auction, etc)
QTY_AVAILABLE_PER_LISTING     number of items offered in a multi auction
FEEDBACK_SCORE_AT_LISTIN      feedback-rating of the seller of this auction listing
START_PRICE                   start price in EUR
BUY_IT_NOW_PRICE              buy it now price in EUR
BUY_IT_NOW_LISTING_FLAG       option for buy it now on this auction listing


Available Data Fields

  In addition, there are binary indicators of various “value added” features that
  can be turned on for each auction

Variable                         Description
BOLD_FEE_FLAG                    option for bold font on this auction listing
FEATUERD_FEE_FLAG                show this auction listing on top of homepage
CATEGORY_FEATURED_FEE_FLAG       show this auction listing on top of category
GALLERY_FEE_FLAG                 auction listing with picture gallery
GALLERY_FEATURED_FEE_FLAG        auction listing with gallery (in gallery view)
IPIX_FEATURED_FEE_FLAG           auction listing with IPIX (additional xxl, picture
                                 show, pack)
RESERVE_FEE_FLAG                 auction listing with reserve-price
HIGHLIGHT_FEE_FLAG               auction listing with background color
SCHEDULE_FEE_FLAG                auction listing, including the definition of the
                                 starting time
BORDER_FEE_FLAG                  auction listing with frame

Target Variable

  Finally, the target variable is defined based on the winning bid price revenue
  relative to the category average

Variable              Description
GMS                   scored sales revenue in EUR
CATEGORY_AVG_GMS      Average sales revenue for the product category
GMS_GREATER_AVG       zero when the revenue is less than or equal to the
                      category average sales and one otherwise
   The values were disclosed only for a randomly selected set of 8,000 auctions
   which we use to train a model
       4199 auctions with the revenue below the category average
       3801 auctions with the revenue above the category average
   During the competition the results for the remaining 8,000 auctions were
   kept secret and used to score competitive entries
       We will only use these records at the very end of this tutorial to validate
       the performance of various models that will be built
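A hedged sketch of how such a target could be derived. The field names GMS, CATEGORY_AVG_GMS, and GMS_GREATER_AVG come from the data dictionary above; the records, category labels, and helper code are purely illustrative.

```python
from collections import defaultdict

# Invented auctions; "category" stands in for ITEM_LEAF_CATEGORY_NAME
auctions = [
    {"category": "nano_4gb", "GMS": 120.0},
    {"category": "nano_4gb", "GMS": 80.0},
    {"category": "mini",     "GMS": 60.0},
    {"category": "mini",     "GMS": 60.0},
]

# CATEGORY_AVG_GMS: average sales revenue per product category
totals = defaultdict(list)
for a in auctions:
    totals[a["category"]].append(a["GMS"])
cat_avg = {c: sum(v) / len(v) for c, v in totals.items()}

# GMS_GREATER_AVG: 1 when revenue exceeds the category average, else 0
for a in auctions:
    a["GMS_GREATER_AVG"] = int(a["GMS"] > cat_avg[a["category"]])

print([a["GMS_GREATER_AVG"] for a in auctions])  # [1, 0, 0, 0]
```

Note that revenue exactly equal to the category average is coded 0, matching the definition in the table above.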
Comments on Methodology

Predictive modeling and general analytics competitions are increasingly being
launched both by private companies and by professional organizations and
provide both public data sets and a wealth of illustrative examples using
different analytic techniques
When reviewing results from a competition, and especially when comparing
results generated by analysts running models after the competition, it is
important to keep in mind that there is an ocean of difference between being
a competitor during the actual competition and an after-the-fact commentator
Regardless of what is reported, the after-the-fact analyst does have access to
“what really happened”, and it is nearly impossible to simulate the competitive
environment once the results have been published
    We all learn in both direct and indirect ways from many sources including the
     outcomes of public competitions. This can affect anything that comes later in time.
In spite of this, we have tried to mimic the circumstances of the competitors
by presenting analyses based only on the original training data, and using
well-established guidelines we have been promoting for more than a decade to
arrive at a final model
We urge you never to take at face value an analyst's report on what would
have happened had they hypothetically participated
First Round Modeling: Ignoring the TEXT Data

Even before doing any type of data preparation it is always valuable to run a
few preliminary CART models
    CART automatically handles missing values and is immune to outliers
    CART is flexible enough to adapt to any type of nonlinearity and interaction effects
     among predictors. The analyst does not need to do any data preparation to assist
     CART in this regard
    CART performs well enough out of the box that we are guaranteed to learn
     something of value without conducting any of the common data preparation
     operations

The only requirement for useful results is that we exclude any possible
perfect or near perfect illegitimate predictors
    Common examples of illegitimate predictors include repackaged versions of the
     dependent variable, ID variables, and data drawn from the future relative to the
     data to be predicted

We start with a quick model using 20 of the 25 available predictors. None of
these involve any of the text data we will focus on later.

Quick Modeling Round with CART

We start by building a quick CART model using original raw variables and all
8,000 complete auction records
Assuming that you already have SPM launched
    Go to the
     File – Open – Data File menu
    Note that we have already
     configured the default working
     folder for SPM
    Make sure that the Files of Type
     is set to ASCII
    Highlight the dmc2006.csv dataset
    Press the [Open] button




Dataset Summary Window

The resulting window summarizes basic facts about the dataset
Note that even though the dataset has 16,000 records, only the first 8,000 will
be used for modeling, as was already pointed out




The View Data Window

Press the [View Data…] button to have a quick impression of physical
contents of the dataset
Our goal is to eventually use the unstructured information contained in the
text fields right next to the auction ID




Requesting Basic Descriptive Stats

We next produce some basic stats for all available variables:
    Go to the View – Data Info… menu
    Set the Sort mode to File Order
    Highlight the Include column
    Check the Select box
    Press the [OK] button




Data Information Window

All basic descriptive statistics for all requested variables are now summarized in one
place
Note that the target variable GMS_GREATER_AVG is not defined for one half of
the dataset (N Missing = 8,000); all those records will be automatically discarded
during model building
Press the [Full] button to see more details




Setting Up CART Model

We are now ready to set up a basic CART run:
    Make the Classic Output window active
    Go to the Model – Construct Model… menu (alternatively, you could press one of
     the buttons located on the bar right below the menu bar)
    In the resulting Model Setup window make sure that the Analysis Method is set
     to CART
    In the Model tab make sure that the Sort is set to File Order and the Tree Type is
     set to Classification
    Check GMS_GREATER_AVG as the Target
    Check all of the remaining variables except AUCT_ID, LISTING_TITLE$,
     LISTING_SUBTITLE$, GMS, and CATEGORY_AVG_GMS as predictors
    You should see something similar to what is shown on the next slide




Model Setup Window: Model Tab




Model Setup Window: Testing Tab

Switch to the Testing tab and confirm that the 10-fold cross-validation is used
as the optimal model selection method




Model Setup Window: Advanced Tab

Switch to the Advanced tab and set the minimum required number of records
for the parent nodes and the child nodes to 15 and 5, respectively
These limits were chosen to avoid extremely small nodes in the resulting tree




Building CART Model
Press the [Start] button; a progress window will appear for a while, and then the Navigator
window containing the model results will be displayed
Press the little button right above the [+][-] pair of buttons along the left border of the Navigator
window; note that all trees within one standard error (SE) of the optimal tree are now marked in green
Use the arrow keys to select the 64-node tree from the tree sequence, which is the smallest 1SE tree




CART model observations

The selected CART model contains 64 terminal nodes; it is the smallest
model with relative error still within one standard error of the optimal
model (the model with the smallest relative error, indicated by the green bar)
    This approach to model selection is usually employed for easy comprehension
    We might also want to require terminal nodes to contain more than the 6 record
     minimum we observe in this out of the box tree
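The 1SE selection rule described above can be expressed compactly: among all trees in the pruning sequence, pick the smallest one whose cross-validated error is within one SE of the best. The sizes, errors, and standard errors below are invented for illustration; SPM computes the real values via cross-validation.

```python
# Invented pruning sequence: (terminal nodes, CV relative error, SE)
trees = [
    (2,   0.95, 0.01),
    (16,  0.80, 0.01),
    (64,  0.73, 0.01),
    (101, 0.72, 0.01),   # optimal tree: lowest relative error
]

# Error (and SE) of the optimal tree
best_err, best_se = min((err, se) for _, err, se in trees)

# Smallest tree whose error is within one SE of the optimal error
threshold = best_err + best_se
one_se_tree = min(n for n, err, _ in trees if err <= threshold)

print(one_se_tree)  # 64
```

With these numbers the 101-node optimal tree and the 64-node tree both fall under the threshold, so the rule selects the 64-node tree, mirroring the selection made in the Navigator.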

All 20 predictor variables play a role in the tree construction
    but there is more to observe about this when we look at the variable importance
     details

Area under the ROC curve is a respectable 0.748




CART Model Performance

Press the [Summary Reports…] button in the Navigator, select the Prediction
Success tab, and press the [Test] button to display the cross-validated test
performance of 68.66% classification accuracy
Now select the Variable Importance tab to review which variables entered into
the model
Interestingly enough, none of the “added value” paid options are important;
they exhibit practically no direct influence on the sales revenue
A detailed look at the nodes might also be instructive for understanding the
model

Experimenting with TreeNet

We almost always follow initial CART models with similar TreeNet models
We start with CART because some glaring errors such as perfect predictors
are more quickly found and obviously displayed in CART
    A perfect predictor often yields a single split tree (two terminal nodes) for
     classification trees

TreeNet models have strengths similar to CART regarding flexibility and
robustness, and have both advantages and disadvantages relative to CART
    TreeNet is an ensemble of small CART trees that have been linked together in
     special ways. Thus TreeNet shares many desirable features of CART
    TreeNet is superior to CART in the context of errors in the dependent variable (not
     relevant in this data)
    TreeNet yields much more complex models but generally offers substantially better
     predictive accuracy. TreeNet may easily generate thousands of trees to arrive at
     an optimal model
    TreeNet yields more reliable variable importance rankings


A few words about TreeNet

TreeNet builds predictive models in stages. It starts with a deliberately
small first-round tree (essentially a CART tree).
Then TreeNet calculates the prediction error made by this simple model and
builds a second tree to try to model that prediction error. The second tree
serves as tool to update, refine, and improve the first stage model.
A TreeNet model produces a “score” which is a simple sum of all the
predictions made by each tree in the model
Typically the TreeNet score becomes progressively more accurate as the
number of trees is increased up to an optimal number of trees
Rarely is the optimal number of trees just one. Occasionally a handful of
trees is optimal; more typically, hundreds or thousands of trees are optimal.
TreeNet models are very useful for the analysis of data with large numbers of
predictors as the models are built up in layers each of which makes use of
just a few predictors
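The staged, residual-fitting idea described above can be illustrated with a toy least-squares boosting loop built from one-split stumps. This is a sketch of the general stochastic gradient boosting recipe, not Salford's actual TreeNet implementation; the data, stump fitter, and learn rate are all invented for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a one-split 'tree': find the split on x minimizing squared error."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 3.0, 3.0]
learnrate = 0.5

scores = [sum(ys) / len(ys)] * len(xs)    # stage 0: constant model
for _ in range(20):                       # each stage fits the current residuals
    residuals = [y - s for y, s in zip(ys, scores)]
    stump = fit_stump(xs, residuals)
    scores = [s + learnrate * stump(x) for x, s in zip(xs, scores)]

print([round(s, 2) for s in scores])  # [1.0, 1.0, 3.0, 3.0]
```

Each stage shrinks the remaining residual by the learn rate, which is why the score converges toward the targets only gradually, over many small trees.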
More detail on TreeNet can be found at http://www.salford-systems.com

Setting Up TN Model

Switch to the Classic Output window and go to the Model – Construct
Model… menu
Choose TreeNet as the Analysis Method
In the Model tab make sure that the Tree Type is set to Logistic Binary




Setting Up TN Parameters

Switch to the TreeNet tab and do the following:
    Set the Learnrate to 0.05
    Set the Number of trees to use: to 800 trees
    Leave all of the remaining options at their default values




TN Results Window

Press the [Start] button to initiate the TN modeling run; the TreeNet Results
window will appear when the run completes




Checking TN Performance

Press on the [Summary] button and switch to the Prediction Success tab
Press the [Test] button to view cross-validation results
Lower the Threshold: to 0.45 to roughly equalize classification accuracy in
both classes (this makes it easier to compare the TN performance with the
earlier reported CART performance)




The Performance Has Improved!

The overall classification accuracy goes up to about 71%
Press the [ROC] button to see that the area under ROC is now a solid 0.800
This comes at the cost of added model complexity – 796 trees each with about 6
terminal nodes
Variable importance remains similar to CART




Understanding the TreeNet Model

TreeNet produces partial dependency plots for every predictor that
appears in the model; the plots can be viewed by pressing the [Display
Plots…] button
Such plots are generally 2D illustrations of how the predictor in question
affects an outcome
    For example, in the graph below the Y axis represents the probability that an iPod
     will sell at an above category average price




      We see that for a BUY_IT_NOW price between 200 and 300 the probability of
      above average winning bid rises sharply with the BUY_IT_NOW_PRICE
      For prices above 300 or below 200 the curve is essentially flat meaning that
      changes in the predictor do not result in changes in the probable outcome
Understanding the Partial Dependency Plot (PD Plot)

The PD Plot is not a simple description of the data. If you plotted the raw data
as say the fraction of above average winning bids against prices intervals you
might see a somewhat different curve
The PD Plot is a plot that is extracted from the TreeNet model and it is
generated by examining TreeNet predictions (and not input data)
The PD Plot appears to relate just two variables, but in fact other variables may
well play a role in the graph construction
Essentially the PD Plot shows the relationship between a predictor and the
target variable taking all other predictors into account
The important points to understand are that
    the graph is extracted from the model and not directly from raw data
    the graph provides an honest estimate of the typical effect of a predictor
    the graph displays not absolute outcomes but typical expected changes from some
     baseline as the predictor varies. The graph can be thought of as floating up or
     down depending on the values of other predictors
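The mechanics behind such a plot can be sketched as follows: for each grid value of the predictor of interest, set that predictor to the grid value in every record, average the model's predictions, and plot the averages against the grid. The model function and records below are invented stand-ins, not the TreeNet model from this tutorial.

```python
def model(buy_it_now, start_price):
    # Toy score: a probability-like response with a step in BUY_IT_NOW_PRICE
    return 0.2 + (0.6 if buy_it_now > 250 else 0.0) + 0.001 * start_price

# Invented records: (BUY_IT_NOW_PRICE, START_PRICE)
data = [(180.0, 1.0), (220.0, 50.0), (280.0, 1.0), (320.0, 99.0)]

def partial_dependence(grid):
    pd = []
    for v in grid:
        # Fix BUY_IT_NOW_PRICE at v in ALL records, keep other fields as-is
        preds = [model(v, start) for _, start in data]
        pd.append(sum(preds) / len(preds))  # average prediction at this v
    return pd

curve = partial_dependence([200.0, 300.0])
print(curve[1] > curve[0])  # the curve rises across the step in the model
```

Because the other predictors keep their observed values during the averaging, the curve reflects the typical effect of the chosen predictor, which is exactly the "floating baseline" interpretation described above.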

More TN Partial Dependency Plots




Introducing the Text Mining Dimension

  To this point, we have been working only with the set of traditional structured
  data fields: continuous and categorical variables
  Further substantial performance improvement can be achieved only if we
  utilize the text descriptions supplied by the seller in the following fields

Variable              Description
LISTING_TITLE         title of auction
LISTING_SUBTITLE      subtitle of auction




   Unfortunately, these two variables cannot be used “as is”. Sellers were free to
   enter free form text including misspellings, acronyms, slang, etc.
   So we must address the challenge of converting the unstructured text strings
   of the type shown here into a well structured representation

The Bag of Words Approach of Text Mining

The most straightforward strategy for dealing with free form text is to
represent each “word” that appears in the complete data set as a dummy
(0/1) indicator variable
For iPods on eBay we could imagine sellers wanting to use words like “new”
“slightly scratched”, “pink” etc. to describe their iPod. Of course the
descriptions may well be complete phrases like “autographed by Angela
Merkel” rather than just single term adjectives
Nevertheless in the simplest Bag of Words (BOW) approach we just create
dummy indicators for every word
Even though the headlines and descriptions are space limited, the number of
distinct words that can appear in collections of free text can be huge
In text mining applications involving complete documents, e.g. newspaper
articles, the number of distinct words can easily reach several hundred
thousand or even millions



The End Goal of the Bag of Words

     Record_ID      RED        USED        SCRATCHED       CASE
     1001           0          1           0               1
     1002           0          0           0               0
     1003           1          0           0               0
     1004           0          0           0               0
     1005           1          1           1               0
     1006           0          0           0               0

•   Above we see an example of a database intended to describe each auction
    item by indicating which words appeared in the auction announcement
•   Observe that Record_ID 1005 contains the three words “RED”, “USED” and
    “SCRATCHED”
•   Data in the above format looks just like the kind of numeric data used in
    traditional data mining and statistical modeling
•   We can use data in this form, as is, feeding it into CART, TreeNet, or
    regression tools such as Generalized Path Seeker (GPS) or everyday regression
•   Observe that we have transformed the unstructured text into structured
    numerical data
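The table above can be reproduced with a minimal bag-of-words routine. The four auction descriptions below are invented so that they match the table's vocabulary; real STM output would be built from the actual listing titles.

```python
# Invented auction descriptions keyed by record ID
docs = {
    1001: "used ipod with case",
    1003: "red ipod",
    1005: "red used ipod slightly scratched",
    1006: "ipod nano",
}
vocab = ["RED", "USED", "SCRATCHED", "CASE"]

# Build the 0/1 indicator matrix: one row per record, one column per term
rows = {}
for rec_id, text in docs.items():
    words = set(text.upper().split())
    rows[rec_id] = [int(term in words) for term in vocab]

print(rows[1005])  # [1, 1, 1, 0] -> RED, USED and SCRATCHED present
```

The resulting rows match the table: record 1005 carries RED, USED, and SCRATCHED, while record 1001 carries only USED and CASE.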
Coding the Term Vector and TF weighting

In the sample data matrix on the previous slide we coded all of our indicators
as 0 or 1 to indicate presence or absence of a term
An alternative coding scheme is based on the FREQUENCY COUNT of the
terms with these variations:
    0 or 1 coding for presence/absence
    Actual term count (0,1,2,3,…)
    Three level indicator for absent, one occurrence, and more than one (0,1,2)

The text mining literature has established some useful weighted coding
schemes. We start with term frequency weighting (tf)
    Text mining can involve blocks of text of considerably different lengths
    It is thus desirable to normalize counts based on relative frequency. Two text fields
     might each contain the term “RED” twice, but one of the fields contains 10 words
     while the other contains 40 words. We might want our coding to reflect the fact that
     2/10 is more frequent than 2/40.
    This is nothing more than making counts relative to the total length of the unit of
     text (or document) and such coding yields the term frequency weighting
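The 2/10 versus 2/40 example above can be sketched directly; the word lists below are made up for illustration:

```python
# Term frequency: counts normalized by document length, so that
# 2 occurrences in a 10-word field outweigh 2 in a 40-word field.
def term_frequency(words, term):
    return words.count(term) / len(words)

short_doc = ["red"] * 2 + ["filler"] * 8     # 10 words, "RED" twice
long_doc  = ["red"] * 2 + ["filler"] * 38    # 40 words, "RED" twice

print(term_frequency(short_doc, "red"))  # 0.2
print(term_frequency(long_doc, "red"))   # 0.05
```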
Inverse Document Frequency (IDF) Weighting

IDF weighting is drawn from the information retrieval literature and is intended
to reflect the value of a term in narrowing the search for a specific document
within a larger corpus of documents
If a given term occurs very rarely in a collection of documents then that term
is very valuable as a tag to target those documents accurately
By contrast, if a term is very common, then knowing that such a term occurs
within the document you are looking for is not helpful in narrowing the search
While text mining has somewhat different goals than information retrieval the
concept of IDF weighting has caught on. IDF weighting serves to upweight
terms that occur relatively rarely.
IDF(term) =
   log { (Number of documents) / (Number of documents containing term) }
The IDF increases with the rarity of a term and is maximum for words that
occur in only one document
A common coding of the term vector uses the product: tf * idf
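The IDF formula above can be sketched in Python; the three toy “documents” are invented for illustration:

```python
import math

# IDF sketch following the slide's formula:
#   idf(term) = log(N / number of documents containing term)
docs = [
    {"red", "used", "case"},
    {"red", "new"},
    {"scratched", "used", "red"},
]

def idf(term, docs):
    df = sum(term in d for d in docs)  # document frequency
    return math.log(len(docs) / df)

def tf_idf(tf, term, docs):
    return tf * idf(term, docs)

print(idf("red", docs))        # log(3/3) = 0.0 -- common term, no value
print(idf("scratched", docs))  # log(3/1)      -- rare term, high weight
```

Note that a term appearing in every document gets weight zero, matching the intuition that a ubiquitous term does nothing to narrow a search.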

Coding the DMC2006 Text Data

The DMC2006 text data is unusual principally because of the limit on the amount of
text a seller was allowed to upload
This has the effect of making the lengths of all the documents very similar
It also limits sharply the possibility that a term in a document would occur with a high
frequency
These factors contribute to making the TF-IDF weighting irrelevant to this challenge. In
fact, for this prediction task other coding schemes allow more accurate prediction.
STM offers these options for term vector coding
    0 – no/yes
    1 – no/yes/many – this one will be used in the remainder of this tutorial
    2 – 0/1
    3 – 0/1/2
    4 – term frequency (relative to document)
    5 – inverse document frequency (relative to corpus)
    6 – TF-IDF (traditional IR coding)




Text Mining Data Preparation

The heavy lifting in text mining technology is devoted to moving us from raw
unstructured text to structured numerical data
Once we have structured data we are free to use any of a large number of
traditional data mining and statistical tools to move forward
Typical analytical tools include logistic and multiple regression, predictive
modeling, and clustering tools
But before diving into the analysis stage we need to move through the text
transformation stage in detail
The first step is to extract and identify the words or “terms” which can be
thought of as creating the list of all words recognized in the training data set
This stage is essentially one of defining the “dictionary”, the list of officially
recognized terms. Any new term encountered in the future will not be found
in the dictionary and will be treated as an unknown item
It is therefore very important to ensure that the training data set contains
almost all terms of interest that would be relevant for future prediction

Automatic Dictionary Building

The following steps will build an active dictionary for a collection of
documents (in our case, auction item description strings)
    Read all text values into one character string
    Tokenize this string into an array of words (token)
    Remove words without any letters or digits
    Remove “stop words” (words like “the”, “a”, “in”, “und”, “mit”, etc.) for both English
     and German languages
    Remove words that have fewer than 2 letters and are encountered fewer than 10
     times across the entire collection of documents (rare small words)
          At this point the too-common, too-rare, weird, obscure, and useless
           combinations of characters should have been eliminated
    Lemmatize words using WordNet lexical database
          This step combines words present in different grammatical forms (“go”, “went”,
           “going”, etc.) into the corresponding stem word (“go”)
    Remove all resulting words that appear less than MIN times (5 in the remainder of
     this tutorial)
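The steps above can be sketched in pure Python. STM itself relies on NLTK’s tokenizer, stop lists, and WordNet lemmatizer; the tiny stop list, lemma map, and MIN_COUNT threshold below are toy stand-ins:

```python
import re
from collections import Counter

# Toy stand-ins for NLTK's stop lists and WordNet lemmatizer
STOP = {"the", "a", "in", "und", "mit"}
LEMMA = {"went": "go", "going": "go"}
MIN_COUNT = 2  # drop terms seen fewer than MIN_COUNT times

def build_dictionary(texts):
    tokens = []
    for text in texts:
        for tok in re.findall(r"\w+", text.lower()):  # tokenize
            if tok in STOP:                           # drop stop words
                continue
            tokens.append(LEMMA.get(tok, tok))        # lemmatize
    counts = Counter(tokens)
    return sorted(t for t, n in counts.items() if n >= MIN_COUNT)

texts = ["the ipod went fast", "going fast und cheap", "a cheap ipod"]
print(build_dictionary(texts))  # ['cheap', 'fast', 'go', 'ipod']
```

Note how “went” and “going” collapse into the single stem “go”, exactly as the lemmatization step intends.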
Build the Dictionary (or Term Vector)

For purpose of automatic dictionary building and preprocessing data we developed the
Salford Text Mining (STM) software - a stand alone collection of tools that perform all
the essential steps in preparing text documents for text mining
STM builds on the Python “Natural Language Toolkit” (NLTK)
From NLTK we use the following tools
    Tokenizer         (extract items most likely to be “words”)
    Porter Stemmer    (recognize different simple forms of same word – e.g. plural)
    WordNet lemmatizer   (more complex recognition of same-word variations)
    stop word list       (words that contribute little to no value, such as “the”, “a”)

Future versions of STM might use other tools to accomplish these essential tasks
“stm.exe” is a command line utility that must be run from a Command Prompt window
(assuming you are running Windows, go to the Start – All Programs – Accessories –
Command Prompt menu)
The version provided here resides in the stmtutor\STM\bin folder




STM Commands and Options
Open a Command Prompt window in Windows, then CD to the
“stmtutor\STM” folder location; for example, on our system you would type in
cd c:\stmtutor\STM

To obtain help type the following at the prompt:
 bin\stm --help

This command will return very concise information about STM:
 stm [-h] [-data DATAFILE] [-dict DICTFILE] [-source-dict SRCDICTFILE]
         [-score SCOREFILE] [-spm SPMAPP] [-t TARGET] [-ex EXCLUDE]
 etc.

The details for each command line option are contained in the software
manual appearing in the appendix
You will also notice the “stm.cfg” configuration file – this file controls the default
behavior of the STM module and relieves you of specifying a large number of
configuration options each time “stm.exe” is launched
    Note the
     TEXT_VARIABLES : 'ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE'
     line which specifies the names of the text variables to be processed

Create Dictionary Options

For the purposes of this tutorial, we have prepackaged all of the text processing
steps into individual command files (extension *.bat). You can either double-
click on the referenced command file or alternatively type its contents into the
Command Prompt window opened in the directory that contains the files
The most important arguments for our purposes in this tutorial now are:
    --dataset DATAFILE      name and location of your input CSV format data set
    --dictionary DICTFILE name and location of the dictionary to be created

These two arguments are all you need to create your dictionary. By default,
STM will process every text field in your input data set to create a single
omnibus dictionary
Simply double click on the “stm_create_dictionary.bat” to create the dictionary
file for the DMC 2006 dataset, which will be saved in the “dmc2006_ynm.dict”
file in the “stmtutor\STM\dmc2006” folder
In typical text mining practice the process of generating the final dictionary will
be iterative. A review of the first dictionary might reveal further words you wish
to exclude (“stop” words)

Internal Dictionary Format

         The dictionary file is a simple text file with extension
         *.dict
         The file contents can be viewed and edited in a
         standard text editor
          The name of the text mining variable that will be
          created later on appears on the left of the “=” sign on
          each un-indented line
          The default value that will be assigned to this
          variable appears on the right side of the “=” sign of
          the un-indented lines; it usually denotes the
          absence of the word(s) of interest
          Each indented line gives the value (left of the
          “=”) which will be entered for a single occurrence in a
          document of any of the word(s) appearing on the
          right of the “=”
              More than one occurrence will be recorded as
               “many” when requested (always the case in this
               tutorial)

Hand Made Dictionary

To use a multi-level coding you need to create a “hand made dictionary”, which is already
supplied to you as “hand.dict” in the “stmtutor\STM\dmc2006” folder
Here is an example of an entry in this file
              hand_model=standard
                               mini
                               nano
                               standard
The un-indented line of an entry starts with the name we wish to give to the term
(HAND_MODEL) and also indicates that a BLANK or missing value is to be coded with
the default value of “standard”
The remaining indented entries are listed one-per-line and are an exhaustive list of the
acceptable values which the term HAND_MODEL can receive in the term vector
Another coding option is, for example:
                     hand_unused=no
                        yes=unbenutzt,ungeoffnet
which sets “no” as the default value but substitutes “yes” if one of the two values listed
above is encountered
You may study additional examples in our stmtutor\STM\dmc2006\hand.dict file on your
own; all of them were created manually based on common-sense logic
Why Create Hand Made Dictionary Entries

Let's revisit the variable HAND_MODEL which brings together the terms
    Standard, mini, nano

Without a hand made dictionary entry we would have three terms created,
one for each model type, with “yes” and “no” values, and possibly “many”
By creating the hand made entry we
    Ensure that every auction is assigned a model (default=“standard”)
    All three models are brought together into one categorical variable with three
     possible values “standard”, “mini”, and “nano”

This representation of the information is helpful when using tree-based
learning machines but not helpful for regression-based learning machines
    The best choice of representation may vary from project to project
    Salford regression-based learning machines automatically repackage categorical
     predictors into 0/1 indicators, meaning that you work with one representation
    But if you need to use other tools you may not have this flexibility


Further Dictionary Customization

  The following table summarizes some of the important fields introduced in the
  custom dictionary for this tutorial

Variable    Values           Combines word variants
CAPACITY    20               20gb, 20 gb, 20 gigabyte
            30               30gb, 30 gb, 30 gigabyte
            40               40gb, 40 gb, 40 gigabyte
            80               80gb, 80 gb, 80 gigabyte
            …                …
STATUS      Wieneu           Wie neu,super gepflegt,top gepflegt,top zustand,neuwertig
            Neu              neu,new,brandneu,brandneues
            Unbenutzt        Unbenu
            defekt           defekt.,--defekt--,defekt,-defekt-,-defekt,defekter,defektes
MODEL       Mini, nano,      Captures presence of the corresponding word in the auction
            standard         description
COLOR       Black, white,    Captures presence of the corresponding words or variants in
            Green, etc.      the auction description
IPOD_GENERATION  First,          iPod generation identified from the information
                 second, etc.    available in the text description

Final Stage Dictionary Extraction

To generate a final version of the dictionary in most real world applications
you would also need to prepare an expanded list of stopwords
The NLTK provides a ready-made list of stopwords for English and another
14 major languages spanning Europe, Russia, Turkey, and Scandinavia
    These appear in the directory named stmtutor\STM\data\corpora\stopwords
     and should be left as they are

Additional stopwords, which might well vary from project to project, can be
entered into the file named “stopwords.dat” in the “stmtutor\STM\data”
folder
    In the package distributed with this tutorial the “stopwords.dat” file is empty
    You can freely add words to this file, with one stopword per line

Once the custom “stopwords.dat” and “hand.dict” files have been prepared
you just run the dictionary extraction again but with the “--source-dictionary”
argument added (see the command files introduced in the later slides)
The resulting dictionary will now include all the introduced customizations

Creating Structured Text Mining Variables

The resulting dictionary file “dmc2006_ynm.dict” contains about 600 individual stems
In the final step of text processing the data dictionary is applied to each document entry
Each stem from the dictionary is represented by a categorical variable (usually binary)
with the corresponding name
The preparation process checks whether any of the known word variants associated
with each stem from the dictionary are present in the current auction description, and if
“yes”, the corresponding value is set to “yes”, otherwise, it is set to “no”
    When the “--code YNM” option is set, multiple instances of “yes” will be coded as “many”
    You can also request integer codes 0, 1, 2 in place of the character “yes/no/many”
    We have experimented with alternative variants of coding (see the “--code” help entry in the
     STM manual) and came to conclusion that the “YNM” approach works best in this tutorial
    Feel free to experiment with alternative coding schemas on your own

The resulting large collection of variables will be used as additional predictors in our
modeling efforts
Even though other, more computationally intense text processing methods exist, further
investigation failed to demonstrate their utility on the current data, which is most likely
related to the extremely terse nature of the auction descriptions

Creating Additional Variables

Finally, we spent additional efforts on reorganizing the original raw variables
into more useful measures
    MONTH_OF_START – based on the recorded start date of auction
    MONTH_OF_SALE – based on the recorded closing date of auction
    HIGH_BUY_IT_NOW – set to “yes” if BUY_IT_NOW_PRICE exceeds the
     CATEGORY_AVG_GMS as suggested by common sense and the nature of the
     classification problem
    In the original raw data, BUY_IT_NOW_PRICE was set to 0 on all items where that
     option was not available – we reset all such 0s to missing

All of these operations are encoded in the “preprocess.py” Python file
located in the “stmtutor\STM\dmc2006” folder
    This component of the STM is under active development
    The file is automatically called by the main STM utility
    You may add/modify the contents of this file to allow alternative transformations of
     the original predictors
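The derived-variable logic above can be sketched as follows; the exact contents of preprocess.py may differ, and the sample record is made up (column names follow the slide):

```python
# Sketch of the derived-variable logic: month extraction, the
# HIGH_BUY_IT_NOW flag, and recoding BUY_IT_NOW_PRICE == 0 as missing.
def derive(record):
    out = dict(record)
    # MONTH_OF_START from a "YYYY-MM-DD" start date (assumed format)
    out["MONTH_OF_START"] = int(record["START_DATE"][5:7])
    # A price of 0 really means "option not offered" -> missing
    price = record["BUY_IT_NOW_PRICE"] or None
    out["BUY_IT_NOW_PRICE"] = price
    # Flag auctions priced above their category average
    out["HIGH_BUY_IT_NOW"] = (
        "yes" if price is not None and price > record["CATEGORY_AVG_GMS"]
        else "no"
    )
    return out

rec = {"START_DATE": "2006-03-15", "BUY_IT_NOW_PRICE": 0,
       "CATEGORY_AVG_GMS": 120.0}
print(derive(rec)["MONTH_OF_START"])   # 3
print(derive(rec)["HIGH_BUY_IT_NOW"])  # no
```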


Generation of the Analysis Data Set
At this point we are ready to move on to the next step which is data creation
This is nothing more than appending the relevant columns of data to the
original data set. Remember that the dictionary may contain tens of
thousands if not hundreds of thousands of terms
For the DMC2006 dataset the dictionary is quite small by text mining
standards containing just a little over 600 words
To generate processed dataset simply double-click on the stm_ynm.bat
command file or explicitly type in its contents in the Command Prompt
    The “--dataset” option specifies the input dataset to be processed
    The “--code YNM” option requests “yes/no/many” style of coding
    The “--source-dictionary” option specifies the hand dictionary
    The “--process” option specifies the output dataset
    Of course you may add other options as you prefer

This creates a processed dataset with the name dmc2006_res_ynm.csv
which resides in the stmtutor\STM\dmc2006 folder

Analysis Data Set Observations

At this point we have a new modeling dataset with the text information
represented by the extra variables
    Note that the raw input data set is just shy of 3 MB in size in a plain text format
     while the prepared analysis data set is about 40 MB in size, 13 times larger

Process only training data or all data?
    For prediction purposes all data needs to be processed, both the data that will be
     used to train the predictive models and the holdout or future data that will receive
     predictions later
    In the DMC2006 data we happen to have access to both training and holdout data
     and thus have the option of processing all the text data at the same time
    Generating the term vector based only on the training data would generally be the
     norm because future data flows have not yet arrived
    In this project we elected to process all the data together for convenience knowing
     that the train and holdout partitions were created by random division of the data
    It is worth pointing out, though, that the final dictionary generated from training
     data only might be slightly different due to the infrequent word elimination
     component of the text processor

Quick Modeling Round with CART

We are now ready to proceed with another CART run this time using all of the
newly created text fields as additional predictors
Assuming that you already have SPM launched
    Go to the
     File – Open – Data File menu
    Make sure that the Files of Type
     is set to ASCII
    Highlight the
     dmc2006_res_ynm.csv
     dataset
    Press the [Open] button




Dataset Summary Window
Again, the resulting window summarizes basic facts about the dataset
Note the dramatic increase in the number of available variables




The View Data Window

Press the [View Data…] button to have a quick look at the physical contents
of the dataset
Note how the individual dictionary word entries are now coded with the “yes”,
“no”, or “many” values for each document row




Setting Up CART Model

Proceed with setting up a CART modeling run as before:
    Make the Classic Output window active
    Go to the Model – Construct Model… menu (alternatively, you could use one of
     the buttons located on the bar right below the menu)
    In the resulting Model Setup window make sure that the Analysis Method is set
     to CART
    In the Model tab make sure that the Sort is set to File Order and the Tree Type is
     set to Classification
    Check GMS_GREATER_AVG as the Target
    Check all of the remaining variables except AUCT_ID, LISTING_TITLE$,
     LISTING_SUBTITLE$, GMS, and CATEGORY_AVG_GMS as predictors
    You should see something similar to what is shown on the next slide




Model Setup Window: Model Tab




Model Setup Window: Testing Tab

Switch to the Testing tab and confirm that the 10-fold cross-validation is used
as the optimal model selection method




Model Setup Window: Advanced Tab

Switch to the Advanced tab and set the minimum required number of records
for the parent nodes and the child nodes at 15 and 5
These limits were chosen to avoid extremely small nodes in the resulting tree




Building CART Model
Press the [Start] button; a progress window will appear for a while and then the Navigator
window containing the model results will be displayed (this time, the process takes a few minutes!)
Press on the little button right above the [+][-] pair of buttons, along the left border of the Navigator
window, note that all trees within one standard error (SE) of the optimal tree are now marked in green
Use the arrow keys to select the 102-node tree from the tree sequence, which is the smallest 1SE tree




CART Model Performance
The selected CART model contains 102 terminal nodes where nearly all available
predictor variables play a role in the tree construction
Area under the ROC curve (Test) is now an impressive 0.830, especially when
compared to the one reported earlier at 0.748 for the basic CART run or the 0.800 for
the basic TN run
Press on the [Summary Reports] button in the Navigator window, select the
Prediction Success tab, and finally press the [Test] button to see cross-validated test
performance at 76.58% classification accuracy – a significant improvement!
Also note the presence of the original and derived variables on the list shown in the
Variable Importance tab




Setting Up TN Model

Now switch to the Classic Output window and go to the Model – Construct
Model… menu
Choose TreeNet as the Analysis Method
In the Model tab make sure that the Tree Type is set to Logistic Binary




Setting Up TN Parameters

Switch to the TreeNet tab and do the following:
    Set the Learnrate: to 0.05
    Set the Number of trees to use: to 800
    Leave all of the remaining options at their default values




TN Results Window

Press the [Start] button to initiate TN modeling run, the TreeNet Results
window will appear in the end, even though you might want to take a coffee
break until the modeling run completes




Checking TN Performance
Press on the [Summary] button and switch to the Prediction Success tab
Press the [Test] button to view cross-validation results
Lower the Threshold: to 0.47 to roughly equalize classification accuracy in both classes
(this makes it easier to compare the TN performance with the earlier reported CART
and TN model performance)
You can clearly see the improvement!




Requesting TN Graphs

Here we present a sample of the 2-D contribution plots produced by
TN for the resulting model
The plots are available by pressing on the [Display Plots…] button in the
TreeNet Results window
The list is arranged according to the variable importance table




More Graphs




Insights Suggested by the Model

Here is a list of insights we arrived at by looking into the selection of plots
    There is a distinct effect of the iPod category once all the other factors have been
     accounted for
    Larger start price means above the average sale (most likely relates to the quality
     of an item)
    A “new” and “unpacked” item should fetch a better price, while any “defect” brings
     the price down
    End of the year means better sales
    Having a good feedback score is important
    It is best to wait 10 days or more before closing the deal
    Interestingly, 1st and 3rd generations of iPod show poorer sales than the 2nd and 4th
    2G started to fall out of favor in 2005-2006
    Black is much more popular in Germany than other colors
    Mentioning “photo”, “video”, “color display”, etc. helps get a better price
    The paid advertising features are of little or marginal importance
Final Validation of Models

At this point we are ready to check the performance of all our models using
the remaining 8,000 auctions originally not available for training
This way each model can be positioned with respect to all of the official 173
entries originally submitted to the DMC 2006 competition
However, in order to proceed with the evaluation, we must first score the
input data using all of the models we have generated up until now
The following slides explain how to score the most recently constructed
CART and TN models, the earlier models can be scored using similar steps
You may choose to skip the scoring steps as we have already included the
results of scoring in the “stmtutor\STM\scored” folder:
    Score_cart_raw.csv – simple CART model predictions
    Score_tn_raw.csv – simple TN model predictions
    Score_cart_txt.csv – text mining enhanced CART model predictions
    Score_tn_txt.csv – text mining enhanced TN model predictions


Scoring a CART Model

Select the Navigator window for the model you wish to score
Select the tree from the tree sequence (in our runs we pick the 1SE trees as
more robust)
Press the [Score] button to open the “Score Data” window
Make sure that the “Data file” is set to “dmc2006_res_ynm.csv”, if not press
the [Select…] button on the right and select the dataset to be scored
Place a checkmark in the “Save results to a file” box, then press the [Select]
button right next to it, this will open the “Save As” window
Navigate to the “stmtutor\STM\scored” folder under the “Save in:” selection box,
enter “Scored_cart_txt.csv” in the “File name:” text entry box, and press the
[Save] button
You should now see something similar to what's shown on the next slide
Press the [OK] button to initiate the scoring process
You should now have the Scored_cart_txt.csv file in the stmtutor\STM\scored
folder
Scoring CART




Scoring a TN Model

Select the “TreeNet Results” window for the model you wish to score
Go to the “Model – Score Data…“ menu to open the “Score Data” window
Make sure that the “Data file” is set to “dmc2006_res_ynm.csv”, if not press
the [Select…] button on the right and select the dataset to be scored
Place a checkmark in the “Save results to a file” box, then press the
[Select] button right next to it, this will open the “Save As” window
Navigate to the “stmtutor\STM\scored” folder under the “Save in:” selection
box, enter “Scored_tn_txt.csv” in the “File name:” text entry box, and press
the [Save] button
You should now see something similar to what's shown on the next slide
Press the [OK] button to initiate the scoring process
You should now have the Scored_tn_txt.csv file in the
stmtutor\STM\scored folder



Scoring TN




Using STM to Validate Performance

We can now use the STM machinery to do final model validation
Simply double-click the “stm_validate.bat” command file to proceed
Note the use of the following options inside of the command file:
    “-score” – specifies the output dataset where the model predictions will be written
    “--score-column” – specifies the name of the variable containing the actual model
     predictions (these variables are produced by CART or TN during the scoring
     process)
    “--check” – specifies the name of the dataset that contains the originally withheld
     values of the target
         this dataset was used by the organizers of the DMC 2006 competition to
          select the actual winners
    STM is currently configured to validate only the bottom 8,000 of the 16,000
     predictions generated by the model; the top 8,000 records (used for learning) are
     simply ignored

The results will be saved into text files with extensions “*.result” appended to
the original score file names in the “stmtutor\STM\scored” folder
Validation Results Format

The following window shows the validation results of the final TN model we
built




 8000 validation records were scored, of which:
     719 ones were misclassified as zeroes
     807 zeroes were misclassified as ones
     Thus 1,526 documents were misclassified
     This gives the final score of 8,000 – (1,526 * 2) = 4,948
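The scoring arithmetic above can be expressed in one line of Python:

```python
# The DMC 2006 scoring rule: start from the number of validation
# records and subtract 2 for every misclassified auction.
def dmc_score(n_records, misclassified):
    return n_records - 2 * misclassified

# Numbers from the final TN model above: 719 + 807 = 1,526 errors
print(dmc_score(8000, 719 + 807))  # 4948
```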

Final Validation of Models

Based on the predicted class assignments, the final performance score is
calculated as 8,000 minus twice the total number of auction items
misclassified
The following table summarizes how these virtually out-of-the-box elementary
modelings perform on the holdout data (the values are extracted from the four
*.result files produced by the STM validator)



Model                ROC Area       Missed 0s            Missed 1s   Score
CART raw data        75%            1123                 1387        2980
TN raw data          80%            1308                 926         3532
CART text data       83%            981                  848         4342
TN text data         89%            807                  719         4948




Visual Validation of the Results

The following graph summarizes the positioning of the four basic models with
respect to the 173 official competition entries
The TN model with text mining processing is among the top 10 winners!

                                                        TN text

                                            CART text

                                   TN raw
                       CART raw




                     Salford Systems © Copyright 2011                      85
Observations on the Results

We used the most basic form of text mining, the Bag of Words, with minor
emendations
    None of the authors speaks German, although we did look up some of the words in
     an on-line dictionary. If there are any subtleties to be picked up from sellers'
     wording choices, we would have missed them.

We chose the coding scheme that performed best on the training data. We
have six coding options and one stands out as clearly best
We used common settings for the CART and TreeNet controls
We did not use any of the modeling refinement techniques we teach in our
CART and TreeNet tutorials
We thus invite you to see if you can push the performance of these models
even higher




                          Salford Systems © Copyright 2011                                 86
Command Line Automation in SPM
SPM has a powerful command line processing component which allows you to completely
reproduce any modeling activity by creating and later submitting a command file
We have packaged the command files for the four modeling and scoring runs you have conducted
in the course of this tutorial
    SPM command files must have the extension *.cmd
    The four command files are stored in the “stmtutor\STM\dmc2006” folder
You can create, open, or edit a command file using a simple text editor, like Notepad, etc.
SPM has a built-in editor, just go to the File – New Notepad… menu
You may also access the command line directly from inside of the SPM GUI, just make sure that the
File – Command Prompt menu item is checked
Just type in “help” in the Command Prompt part (starts with the “>” mark) of the Classic Output
window to get the listing of all available commands
Then you can request a more detailed help for any specific command of interest, for example “help
battery” will produce a long list of various batteries of automated runs available in SPM
Furthermore, you may view all of the commands issued during the current session via the
View – Open Command Log… menu; this way you can quickly learn which commands correspond
to your recent GUI activity




                             Salford Systems © Copyright 2011                                       87
Basic CART Model Command File

You may now restart SPM to emulate a new fresh run
Go to the File – Open – Command File… menu
Select the “cart_raw.cmd” command file and press the [Open] button
The file is now opened in the built-in Notepad window




                      Salford Systems © Copyright 2011               88
CART Command File Contents
OUT – saves the classic output into a text file
USE – points to the modeling dataset
GROVE – saves the model as a binary grove file
MODEL – specifies the target variable
CATEGORY – indicates which variables are categorical, including the target
KEEP – specifies the list of predictors
LIMIT – sets the node limits
ERROR – requests cross-validation
BUILD – builds a CART model
SAVE – names the file where the CART model predictions will be saved
HARVEST – specifies which tree is to be used in scoring
IDVAR – requests saving of additional variables into the output dataset
SCORE – scores the CART model
OUTPUT * – closes the current text output file

Note the use of relative paths in the GROVE and SAVE commands
Also note the use of the forward slash “/” to separate folder names
                                      Salford Systems © Copyright 2011                                          89
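Putting these commands together, a CART run might be sketched as below. The file names are illustrative, the angle-bracketed parts are placeholders for the actual tutorial settings, and the exact argument syntax should be verified with “help” in your SPM version:

```
OUT "cart_raw.dat"
USE "dmc2006_train.csv"
GROVE "models/cart_raw.grv"
MODEL GMS_GREATER_AVG
CATEGORY GMS_GREATER_AVG
KEEP <predictor list>
LIMIT <node limits>
ERROR CROSS
BUILD
SAVE "scored/score_cart_raw.csv"
HARVEST <tree selection>
IDVAR AUCT_ID
SCORE
OUTPUT *
```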
Submitting Command File

With the Notepad window active, go to the File – Submit Window menu to
submit the command file into SPM
In the end you will see the Navigator and the Score windows opened which
should be identical to the ones you have already seen in the beginning of this
tutorial
Furthermore, you should now have
    “cart_raw.dat” text file created in the “stmtutor\STM\dmc2006” folder; the file
     contains the classic output you normally see in the “Classic Output” window
    “cart_raw.grv” binary grove file created in the “stmtutor\STM\models” folder; the
     file contains the CART model itself and can be opened in the GUI using the File –
     Open – Open Grove… menu, which reopens the Navigator window; this file will also be
     needed for future scoring or translation
    “Score_cart_raw.csv” data file created in the “stmtutor\STM\scored” folder; the
     file contains the selected CART model predictions on your data

You may proceed now with opening up the “tn_raw.cmd” file using the File –
Open – Command File… menu

                          Salford Systems © Copyright 2011                               90
TN Command File Contents
OUT, USE, GROVE, MODEL, CATEGORY, KEEP, ERROR, SAVE, IDVAR, SCORE, OUTPUT – same as in the CART command file introduced earlier
MART TREES – sets the TN model size in trees
MART NODES – sets the tree size in terminal nodes
MART MINCHILD – sets the minimum individual node size in records
MART OPTIMAL – sets the evaluation criterion used for optimal model selection
MART BINARY – requests logistic regression processing in our case
MART LEARNRATE – sets the learnrate parameter
MART SUBSAMPLE – sets the sampling rate
MART INFLUENCE – sets the influence trimming value
The rest of the MART commands request automatic saving of the 2-D and 3-D plots into the grove; type in “help mart” to get full descriptions
Salford Systems © Copyright 2011                                         91
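As a rough illustration, the MART settings described above might appear in a command file as follows. The TREES and NODES values are the STM defaults documented later in this tutorial; the remaining values are purely illustrative, and the exact syntax should be checked with “help mart”:

```
MART TREES = 500
MART NODES = 6
MART LEARNRATE = 0.1
MART SUBSAMPLE = 0.5
```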
Submitting the Rest of the Command Files

Again, with the current Notepad window active, use the File – Submit Window menu
to launch the basic TN modeling run automatically followed by scoring
This will create the output, grove, and scored data files in the corresponding locations
for the chosen TN model; also note the use of the EXCLUDE command in place of the
KEEP command inside of the command file – this saves a lot of typing
Now go back to the Classic Output window and notice that the File menu has
changed
Go to the File – Submit Command File… menu, select the “cart_txt.cmd” command
file, and press the [Open] button
Notice the modeling activity in the Classic Output window, but no Results window is
produced – this is how the Submit Command File… menu item is different from the
Submit Window menu item used previously; nonetheless, the output, grove, and score
files are still created in the specified locations
Use the File – Open – Open Grove… menu to open the “tn_raw.grv” file located in
the “stmtutor\STM\models” folder; you will need to navigate into this folder using
the Look in: selection box in the Open Grove File window
You may now proceed with the final TN run by submitting the “tn_txt.cmd” command
file using either the File – Open – Command File… / File – Submit Window or File –
Submit Command File… menu routes – don't forget that it takes a long time to run!
                          Salford Systems © Copyright 2011                                 92
Final Remarks

This completes the Salford Systems Data Mining and Text Mining tutorial
In the process of going through the tutorial you have learned how to use both the
GUI and command line facilities of SPM as well as the command line text
mining facility STM
You built two CART models and two TN models, and enriched
the original dataset with a variety of text mining fields
The final model puts you among the top winners in a major text mining
competition – a proud achievement
Even though we have barely scratched the surface, you are now ready to
proceed with exploring the remainder of the vast data mining capabilities offered
within SPM and STM on your own
We wish you the best of luck on the exciting and never-ending road of modern
data analysis and exploration
And don't forget that you can always reach us at www.salford-systems.com
should you have further modeling questions and needs

                       Salford Systems © Copyright 2011                           93
References

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and
Regression Trees. Pacific Grove: Wadsworth.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Hastie, T., Tibshirani, R., and Friedman, J.H. (2001). The Elements of
Statistical Learning. Springer.
Freund, Y. and Schapire, R.E. (1996). Experiments with a new boosting
algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth
International Conference, Morgan Kaufmann, pp. 148-156.
Friedman, J.H. (1999). Stochastic gradient boosting. Stanford: Statistics
Department, Stanford University.
Friedman, J.H. (1999). Greedy function approximation: a gradient boosting
machine. Stanford: Statistics Department, Stanford University.
Weiss, S.M., Indurkhya, N., Zhang, T., and Damerau, F.J. (2004).
Text Mining: Predictive Methods for Analyzing Unstructured Information.
Springer.

                         Salford Systems © Copyright 2011                       94
STM Command Reference

Salford Text Miner (STM) is a simple utility that makes the text mining process
much easier. The application described in this manual accepts a number of
parameters and can run the Salford Predictive Miner as its data mining backend
STM Workflow:
    Automatically generate a dictionary based on the dataset
    Process the dataset and generate a new one with additional columns based on the dictionary
    Generate a model folder with the dataset, command file, and dictionary
    Run the Salford Predictive Miner with the generated command file
    Run the checking process, comparing the scoring results with the real classes

All of these steps can be done in separate STM calls or in one call




                         Salford Systems © Copyright 2011                           95
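Using the options documented on the following pages, the five workflow steps might be invoked along these lines. The “stm” executable name and all file names here are assumptions, not the actual tutorial files; check your installation for the correct names:

```
stm -data train.csv -e                          # step 1: extract the dictionary
stm -data train.csv -dict dict.dat -p new.csv   # step 2: process into a new dataset
stm -data new.csv -dict dict.dat -g             # step 3: generate the model folder
stm -data new.csv -dict dict.dat -g -m          # steps 3+4: generate, then model
stm -score score.csv -c holdout.csv             # step 5: check against real classes
```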
STM Command Reference

Short Option          Long Option                    Description
-data DATAFILE        --dataset DATAFILE             Specify the dataset to work with
-dict DICTFILE        --dictionary DICTFILE          Specify the dictionary to work with
-source-dict SDFILE   --source-dictionary SDFILE     Dictionary used as the source for the
                                                     automatic dictionary retrieval process
-score SFILE          --scoreresult SFILE            Specify the file with score results for the
                                                     checking process; default “score.csv”
-spm SPMAPP           --spmapplication SPMAPP        Path to the SPM application; default
                                                     “spm.exe”
-t TARGET             --target TARGET                Target variable for command file generation;
                                                     default “GMS_GREATER_AVG”
-ex EXCLUDE           --exclude EXCLUDE              List of variables to exclude from the keep
                                                     list when generating the command file
-cat CATEGORY         --category CATEGORY            List of variables to treat as categorical
                                                     when generating the command file
                           Salford Systems © Copyright 2011                              96
STM Command Reference

Short Option        Long Option              Description
-templ CMDTEMPL     --cmdtemplate            Specify the command file template used for
                    CMDTEMPL                 generation; default “data/template.cmd”
-md MODEL_DIR       --modeldir               Directory where model folders will be created;
                    MODEL_DIR                default “models”
-trees TREES        --trees TREES            Parameter for TreeNet command files; specifies
                                             the number of trees to build; default 500
-maxnodes           --maxnodes               Parameter for TreeNet command files; specifies
MAXNODES            MAXNODES                 the number of nodes per tree; default 6
-fixwords           --fixwords               Enables heuristics that try to fix words
                                             (nearest matches by various metrics, spell
                                             checking, etc.)
-textvars VARLIST   --text-variables         List of variables, separated by commas, used
                    VARLIST                  in the dictionary retrieval process



                         Salford Systems © Copyright 2011                                    97
STM Command Reference

Short Option       Long Option              Description
-outrmwords        --output-removed-        Enables writing removed stop words to the
                   words                    file “data/removed.dat”
-code CODE         --column-coding          Specify how to code the absence/presence of a
                   CODE                     word in a row:
                                            YN or 0 – no/yes
                                            YNM or 1 – no/yes/many
                                            01 or 2 – 0/1
                                            012 or 3 – 0/1/2
                                            TF or 4 – term frequency
                                            IDF or 5 – inverse document frequency
                                            TF-IDF or 6 – TF-IDF
                                            TC or 7 – term count (0,1,2,…)
                                            Default – YN
-mp MODELPATH      --model-path             Specify the path where model files will be
                   MODELPATH                created
-cmd-path CMDPATH --command-file-path       Specify the path to the command file executed
                  CMDPATH                   by the Salford Predictive Miner
-ppfile PPFILE     --preprocess-file        Path to Python code executed on the process
                   PPFILE                   step to manipulate the data


                        Salford Systems © Copyright 2011                                  98
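The numeric codings in the table above can be illustrated with a small Python sketch. The TF and IDF formulas used here are the standard textbook definitions and are an assumption; STM's exact formulas may differ:

```python
import math

# Toy corpus: each document is a list of words (hypothetical example data)
docs = [["ipod", "nano", "ipod"], ["ipod", "case"], ["nano"]]
word = "ipod"
N = len(docs)
df = sum(1 for d in docs if word in d)   # number of documents containing the word

for d in docs:
    count = d.count(word)                # TC:  raw term count (0,1,2,...)
    yn = 1 if count > 0 else 0           # YN / 01:  absence/presence flag
    ynm = min(count, 2)                  # YNM / 012:  none / one / many
    tf = count / len(d)                  # TF:  frequency relative to document length
    idf = math.log(N / df)               # IDF: textbook inverse document frequency
    print(yn, ynm, count, round(tf, 2), round(tf * idf, 2))
```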
STM Command Reference

Short Option   Long Option           Description
-rc NAME       --realclass-          Specify the column name in the real-class dataset
               column-name           for the check step; default GMS_GREATER_AVG
-e             --extract             Run the first step – automatic extraction of the
                                     dictionary from the dataset; requires --dataset
-p OUTFILE     --process             Run the second step – process the dataset and
               OUTFILE               create a new dataset named OUTFILE with new
                                     columns depending on the dictionary; requires
                                     --dataset and --dictionary
-g             --generate            Run the third step – generate the model folder
                                     with the command file; requires --dataset and
                                     --dictionary
-m             --model               Run the fourth step – run the Salford Predictive
                                     Miner with the generated command file; works only
                                     with --generate
-c DATASET     --check DATASET       Run the fifth step – check the score file against
                                     the real classes (from the specified DATASET) and
                                     output a misclassification table; requires
                                     --scoreresult
-h             --help                Show help

                            Salford Systems © Copyright 2011                               99
STM Configuration File

Name                      Description                                                Default

SPM_APPLICATION           Path to the Salford Predictive Miner                       spm.exe

CMD_TREES                 Number of trees to build in TN models                      500

CMD_NODES                 Tree size for TN models                                    6

CMD_TEMPLATE              Command file template                                      data/template.cmd

MODELS_DIR                Directory where model folders will be created              models

LANGUAGES                 Languages whose stop words will be used                    English, German

SPELLCHECKER_DICT         Additional spell checker dictionary with allowed           data/spellchecker_dict.dat
                          words (like “ipod”)
SPELLCHECKER_LANGUAGE     Language for the spell checker                             de_DE

ADDITIONAL_STOPWORDS      File with additional stop words, which the user can edit   data/stopwords.dat

REMOVED_WORDS_FILE        File where removed words will be written on the            data/removed.dat
                          “extract” step
WORD_FREQUENCY_THRESHOLD  Lower word frequency threshold; words below it are         5
                          deleted on the “extract” step
PREPROCESS_FILE           Script included to do additional processing                dmc2006/preprocess.py




                         Salford Systems © Copyright 2011                                                   100
STM Configuration File

Name                  Description                                                          Default

CHECK_RESULTS_FILE                                                                         data/score_results.csv

LOGFILE               Path to the log file; can be a mask (%s for the date)                log/stm%s.log

TARGET                Default value for the target argument, used to fill the              GMS_GREATER_AVG
                      command file template
EXCLUDE               Default value for the exclude argument, used to fill the             AUCT_ID,
                      command file template                                                LISTING_TITLE$,
                                                                                           LISTING_SUBTITLE$,
                                                                                           GMS,
                                                                                           GMS_GREATER_AVG
CATEGORY              Default value for the category argument, used to fill the            GMS_GREATER_AVG
                      command file template
SCORE_FILE            Name of the score file that needs to be checked                      Score.csv

TEXT_VARIABLES        List of text variables in the dataset, separated by commas           ITEM_LEAF_CATEGORY_
                                                                                           NAME, LISTING_TITLE,
                                                                                           LISTING_SUBTITLE
DEFAULT_CODING        Default coding for the extract and preprocess steps                  YN

REALCLASS_COLUMN_     Name of the column in the real-class file, used in the               GMS_GREATER_AVG
NAME                  check step
SCORE_COLUMN_NAME     Name of the column in the score file, used in the check              PREDICTION
                      step


                              Salford Systems © Copyright 2011                                               101

Más contenido relacionado

Destacado

Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...Salford Systems
 
Getting Started With SlideShare
Getting Started With SlideShareGetting Started With SlideShare
Getting Started With SlideShareSlideShare
 
Data Mining: Text and web mining
Data Mining: Text and web miningData Mining: Text and web mining
Data Mining: Text and web miningDatamining Tools
 
Introduction to Text Mining and Semantics
Introduction to Text Mining and SemanticsIntroduction to Text Mining and Semantics
Introduction to Text Mining and SemanticsSeth Grimes
 
Lecture 3: Structuring Unstructured Texts Through Sentiment Analysis
Lecture 3: Structuring Unstructured Texts Through Sentiment AnalysisLecture 3: Structuring Unstructured Texts Through Sentiment Analysis
Lecture 3: Structuring Unstructured Texts Through Sentiment AnalysisMarina Santini
 
Predicting Hospital Readmission Using TreeNet
Predicting Hospital Readmission Using TreeNetPredicting Hospital Readmission Using TreeNet
Predicting Hospital Readmission Using TreeNetSalford Systems
 
TreeNet Tree Ensembles and CART Decision Trees: A Winning Combination
TreeNet Tree Ensembles and CART  Decision Trees:  A Winning CombinationTreeNet Tree Ensembles and CART  Decision Trees:  A Winning Combination
TreeNet Tree Ensembles and CART Decision Trees: A Winning CombinationSalford Systems
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text MiningMichel Bruley
 
R by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlinesR by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlinesJeffrey Breen
 
Opinion Mining Tutorial (Sentiment Analysis)
Opinion Mining Tutorial (Sentiment Analysis)Opinion Mining Tutorial (Sentiment Analysis)
Opinion Mining Tutorial (Sentiment Analysis)Kavita Ganesan
 
Machine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataMachine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataPier Luca Lanzi
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataYanchang Zhao
 
A Guide to SlideShare Analytics - Excerpts from Hubspot's Step by Step Guide ...
A Guide to SlideShare Analytics - Excerpts from Hubspot's Step by Step Guide ...A Guide to SlideShare Analytics - Excerpts from Hubspot's Step by Step Guide ...
A Guide to SlideShare Analytics - Excerpts from Hubspot's Step by Step Guide ...SlideShare
 
2015 Upload Campaigns Calendar - SlideShare
2015 Upload Campaigns Calendar - SlideShare2015 Upload Campaigns Calendar - SlideShare
2015 Upload Campaigns Calendar - SlideShareSlideShare
 
What to Upload to SlideShare
What to Upload to SlideShareWhat to Upload to SlideShare
What to Upload to SlideShareSlideShare
 
How to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & TricksHow to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & TricksSlideShare
 

Destacado (17)

Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
 
Getting Started With SlideShare
Getting Started With SlideShareGetting Started With SlideShare
Getting Started With SlideShare
 
Data Mining: Text and web mining
Data Mining: Text and web miningData Mining: Text and web mining
Data Mining: Text and web mining
 
R code for data manipulation
R code for data manipulationR code for data manipulation
R code for data manipulation
 
Introduction to Text Mining and Semantics
Introduction to Text Mining and SemanticsIntroduction to Text Mining and Semantics
Introduction to Text Mining and Semantics
 
Lecture 3: Structuring Unstructured Texts Through Sentiment Analysis
Lecture 3: Structuring Unstructured Texts Through Sentiment AnalysisLecture 3: Structuring Unstructured Texts Through Sentiment Analysis
Lecture 3: Structuring Unstructured Texts Through Sentiment Analysis
 
Predicting Hospital Readmission Using TreeNet
Predicting Hospital Readmission Using TreeNetPredicting Hospital Readmission Using TreeNet
Predicting Hospital Readmission Using TreeNet
 
TreeNet Tree Ensembles and CART Decision Trees: A Winning Combination
TreeNet Tree Ensembles and CART  Decision Trees:  A Winning CombinationTreeNet Tree Ensembles and CART  Decision Trees:  A Winning Combination
TreeNet Tree Ensembles and CART Decision Trees: A Winning Combination
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text Mining
 
R by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlinesR by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlines
 
Opinion Mining Tutorial (Sentiment Analysis)
Opinion Mining Tutorial (Sentiment Analysis)Opinion Mining Tutorial (Sentiment Analysis)
Opinion Mining Tutorial (Sentiment Analysis)
 
Machine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataMachine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web Data
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter Data
 
A Guide to SlideShare Analytics - Excerpts from Hubspot's Step by Step Guide ...
A Guide to SlideShare Analytics - Excerpts from Hubspot's Step by Step Guide ...A Guide to SlideShare Analytics - Excerpts from Hubspot's Step by Step Guide ...
A Guide to SlideShare Analytics - Excerpts from Hubspot's Step by Step Guide ...
 
2015 Upload Campaigns Calendar - SlideShare
2015 Upload Campaigns Calendar - SlideShare2015 Upload Campaigns Calendar - SlideShare
2015 Upload Campaigns Calendar - SlideShare
 
What to Upload to SlideShare
What to Upload to SlideShareWhat to Upload to SlideShare
What to Upload to SlideShare
 
How to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & TricksHow to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & Tricks
 

Similar a Text mining tutorial

110006_perils_of_aging_emul_wp
110006_perils_of_aging_emul_wp110006_perils_of_aging_emul_wp
110006_perils_of_aging_emul_wpJessica Hirst
 
ADG S1000D Series - Benefits of Using S1000D
ADG S1000D Series - Benefits of Using S1000DADG S1000D Series - Benefits of Using S1000D
ADG S1000D Series - Benefits of Using S1000DAbsolute Data Group
 
M sc in reliable embedded systems
M sc in reliable embedded systemsM sc in reliable embedded systems
M sc in reliable embedded systemsvtsplgroup
 
Frststps
FrststpsFrststps
Frststpserney03
 
Operating System Structure Of A Single Large Executable...
Operating System Structure Of A Single Large Executable...Operating System Structure Of A Single Large Executable...
Operating System Structure Of A Single Large Executable...Jennifer Lopez
 
What frameworks can do for you – and what not (IPC14 SE)
What frameworks can do for you – and what not (IPC14 SE)What frameworks can do for you – and what not (IPC14 SE)
What frameworks can do for you – and what not (IPC14 SE)Robert Lemke
 
A CASE STUDY ON EMBEDDED SYSTEM SOFTWARE STACK LAYERS
A CASE STUDY ON EMBEDDED SYSTEM SOFTWARE STACK LAYERS A CASE STUDY ON EMBEDDED SYSTEM SOFTWARE STACK LAYERS
A CASE STUDY ON EMBEDDED SYSTEM SOFTWARE STACK LAYERS MOHAMMED FURQHAN
 
FDM to FDMEE migration utility
FDM to FDMEE migration utilityFDM to FDMEE migration utility
FDM to FDMEE migration utilityBernard Ash
 
XMetaL Macros for Non-Programmers
XMetaL Macros for Non-ProgrammersXMetaL Macros for Non-Programmers
XMetaL Macros for Non-ProgrammersXMetaL
 
FILE SPLITTER AND JOINER
FILE SPLITTER AND JOINERFILE SPLITTER AND JOINER
FILE SPLITTER AND JOINERRajesh Roky
 
Product! - The road to production deployment
Product! - The road to production deploymentProduct! - The road to production deployment
Product! - The road to production deploymentFilippo Zanella
 
Hp trim vs objective
Hp trim vs objectiveHp trim vs objective
Hp trim vs objectivetraciep
 
BlueData Isilon Validation Brief
BlueData Isilon Validation BriefBlueData Isilon Validation Brief
BlueData Isilon Validation BriefBoni Bruno
 
Hol 1940-01-net pdf-en
Hol 1940-01-net pdf-enHol 1940-01-net pdf-en
Hol 1940-01-net pdf-endborsan
 
List and describe various features of electronic systems.List and .pdf
List and describe various features of electronic systems.List and .pdfList and describe various features of electronic systems.List and .pdf
List and describe various features of electronic systems.List and .pdfinfo824691
 
Why software performance reduces with time?.pdf
Why software performance reduces with time?.pdfWhy software performance reduces with time?.pdf
Why software performance reduces with time?.pdfMike Brown
 


More from Salford Systems
Datascience101presentation4
Improve Your Regression with CART and RandomForests
Churn Modeling-For-Mobile-Telecommunications
The Do's and Don'ts of Data Mining
Introduction to Random Forests by Dr. Adele Cutler
9 Data Mining Challenges From Data Scientists Like You
Statistically Significant Quotes To Remember
Using CART For Beginners with A Teclo Example Dataset
CART Classification and Regression Trees Experienced User Guide
Evolution of regression ols to gps to mars
Data Mining for Higher Education
Comparison of statistical methods commonly used in predictive modeling
Molecular data mining tool advances in hiv
TreeNet Tree Ensembles & CART Decision Trees: A Winning Combination
SPM User's Guide: Introducing MARS
Hybrid cart logit model 1998
Session Logs Tutorial for SPM
Some of the new features in SPM 7
TreeNet Overview - Updated October 2012


Latest
Design pattern talk by Kaya Weers - 2024 (v2) (Kaya Weers)
Testing tools and AI - ideas what to try with some tool examples (Kari Kakkonen)
React JS; all concepts. Contains React Features, JSX, functional & Class comp... (Karmanjay Verma)
Scale your database traffic with Read & Write split using MySQL Router (Mydbops)
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx (LoriGlavin3)
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger (panagenda)
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability (itnewsafrica)
Decarbonising Buildings: Making a net-zero built environment a reality (IES VE)
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio... (Jeffrey Haguewood)
The State of Passkeys with FIDO Alliance.pptx (LoriGlavin3)
Generative Artificial Intelligence: How generative AI works.pdf (Ingrid Airi González)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk) (Mark Simos)
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration (marketing932765)
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes (Manik S Magar)
A Glance At The Java Performance Toolbox (Ana-Maria Mihalceanu)
UiPath Community: Communication Mining from Zero to Hero (UiPathCommunity)
2024 April Patch Tuesday (Ivanti)
A Journey Into the Emotions of Software Developers (Nicole Novielli)
Microservices, Docker deploy and Microservices source code in C# (Karmanjay Verma)
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado... (Nikki Chapple)


Text mining tutorial

  • 1. Getting Started with Text Mining: STM™, CART® and TreeNet® Dan Steinberg Mykhaylo Golovnya Ilya Polosukhin May, 2011
  • 2. Text Mining and Data Mining Text mining is an important and fascinating area of modern analytics. On the one hand, text mining can be thought of as just another application area for powerful learning machines. On the other hand, text mining is a distinct field with its own dedicated concepts, vocabulary, tools, and techniques. In this tutorial we aim to illustrate some important analytical methods and strategies from both perspectives on data mining: introducing tools specific to the analysis of text, and deploying general machine learning technology. The Salford Text Mining utility (STM) is a powerful text processing system that prepares data for advanced machine learning analytics. Our machine learning tools are the Salford Systems flagship CART® decision tree and the stochastic gradient boosting TreeNet®. Evaluation copies of the proprietary technology in CART and TreeNet as well as the STM are available from http://www.salford-systems.com Salford Systems © Copyright 2011 2
  • 3. For Readers of this Tutorial To follow along with this tutorial we recommend that you have the analytical tools we use installed on your computer. Everything you need may already be on a CD containing this tutorial and the analytical software. Create an empty folder named “stmtutor”; this is the root folder where all of the work files related to this tutorial will reside. You may also use the following link to download the Salford Systems Predictive Modeler (SPM): http://www.salford-systems.com/dist/SPM/SPM680_Mulitple_Installs_2011_06_07.zip After downloading the package, unzip its contents into “stmtutor”, which will create a new folder named “SPM680_Mulitple_Installs_2011_06_07”. Follow the installation steps described on the next slide. For the original DMC2006 competition website visit http://www.data-mining-cup.de/en/review/dmc-2006/ We recommend that you visit the above site for information only; data and tools for preparing that data are available at the URL next below. For the STM package, prepared data files, and other utilities developed for this tutorial please visit http://www.salford-systems.com/dist/STM.zip After downloading the archive, unzip its contents into “stmtutor” Salford Systems © Copyright 2011 3
  • 4. Important! Installing the SPM Software The Salford Systems software you've just downloaded needs to be both installed and licensed. No-cost license codes for a 30-day period are available on request to visitors of this tutorial* Double click on the “Install_a_Transform_SPM.exe” file located in the “SPM680_Mulitple_Installs_2011_06_07” folder (see the previous slide) to install the specific version of SPM used in this tutorial. Following the above procedure will ensure that all of the currently installed versions of SPM, if any, will remain intact! Follow the simple installation steps on your screen. * Salford Systems reserves the right to decline to offer a no-cost license at its sole discretion Salford Systems © Copyright 2011 4
  • 5. Important! Licensing the SPM Software When you launch the Salford Systems Predictive Modeler (SPM) you will be greeted with a License dialog containing information needed to secure a license via email. Please send the necessary information to Salford Systems and secure your license by entering the “Unlock Code” which will be e-mailed back to you. The software will operate for 3 days without any licensing; however, you can secure a 30-day license on request. Salford Systems © Copyright 2011 5
  • 6. Installing the Salford Text Miner (STM) In addition to the Salford Predictive Modeler (SPM) you will also work with the Salford Text Miner (STM) software. No installation is needed and you should already have the “stm.exe” executable in the “stmtutor\STM\bin” folder as the result of unzipping the “STM.zip” package earlier. STM builds upon the Python 2.6 distribution and the NLTK (Natural Language Tool Kit) but makes text data processing for analytics very easy to conduct and manage. You do not need to add any other support software to use STM. Expect to see several folders and a large number of files located under the “stmtutor\STM” folder. It is important to leave these files in the location to which you have installed them. Please do not MOVE or alter any of the installed files other than those explicitly listed as user-modifiable! “stm.exe” will expire in the middle of 2012; contact Salford Systems to get an updated version beyond that. Salford Systems © Copyright 2011 6
  • 7. The Example Project The best examples are drawn from real world data sets and we were fortunate to locate data publicly released by eBay. Good teaching examples also need to be simple.  Unfortunately, real world text mining could easily involve hundreds of thousands if not millions of features characterizing billions of records. Professionals need to be able to tackle such problems but to learn we need to start with simpler situations.  Fortunately, there are many applications in which text is important but the dimensions of the data set are radically smaller, either because the data available is limited or because a decision has been made to work with a reduced problem. We use our simpler example to illustrate many useful ideas for beginning text miners while pointing the way to working on larger problems. Salford Systems © Copyright 2011 7
  • 8. The DMC2006 Text Mining Challenge In 2006 the DMC data mining competition (restricted to student competitors only) introduced a predictive modeling problem for which much of the predictive information was in the form of unstructured text. The datasets for the DMC 2006 data mining competition can be downloaded from http://www.data-mining-cup.de/en/review/dmc-2006/ For your convenience we have re-packaged this data and made it somewhat easier to work with. This re-packaged data is included in the STM package described near the beginning of this tutorial. The data summarizes 16,000 iPod auctions held on eBay from May 2005 through May 2006 in Germany. Each auction item is represented by a text description written by the seller (in German) as well as a number of flags and features available to the seller at the time of the auction. Auction items were grouped into 15 mutually exclusive categories based on distinct iPod features: storage size, type (regular, mini, nano), and color. The competition goal was to predict whether the closing price would be above or below the category average. Salford Systems © Copyright 2011 8
  • 9. Comments on the Challenge One might think that a challenge with text in German might not be of general interest outside of Germany However, working with a language essentially unfamiliar to any member of the analysis team helps to illustrate one important point  Text mining via tools that have no “understanding” of the language can be strikingly effective We have no doubt that dedicated tools which embed knowledge of the language being analyzed can yield predictive benefits  We also believe we could have gained further valuable insight into the data if any of the authors spoke German! But our performance without this knowledge is still impressive. In contexts where simple methods can yield more than satisfactory results, or in contexts where the same methods must be applied uniformly across multiple languages, the methods described in this tutorial will be an excellent guide. Salford Systems © Copyright 2011 9
  • 10. Configuring Work Location in SPM The original datasets from the DMC 2006 challenge reside in the “stmtutor\STM\dmc2006” folder. To facilitate further modeling steps, we will configure SPM to use this location as the default location: start SPM; go to the Edit – Options menu; switch to the Directories tab; enter the “stmtutor\STM\dmc2006” folder location in all text entry boxes except the last one; press the [Save as Defaults] button so that the configuration is restored the next time you start SPM. Salford Systems © Copyright 2011 10
  • 11. Configuring TreeNet Engine Now switch to the TreeNet tab  Configure the Plot Creation section as shown on the screen shot  Press the [Save as Defaults] button  Press the [OK] button to exit Salford Systems © Copyright 2011 11
  • 12. Steps in the Analysis: Data Overview 1. Describe the data: (Data Dictionary and Dimensions of Data) a. What is the unit of observation? Each record of data is describing what? b. What is the dependent or target variable? c. What other variables (data base fields) are available? d. How many records are available? 2. Statistical Summary a. Basic summary including means, quantiles, frequency tables b. Dimensions of categorical predictors c. Number of distinct values of continuous variables 3. Outlier and Anomaly Assessment a. Detection of gross data errors such as extreme values b. Assessment of usability of levels of categorical predictors (rare levels) Salford Systems © Copyright 2011 12
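Steps 2 and 3 of the checklist above can be sketched with pandas; the tiny frame below is a made-up stand-in for the auction data (real column names, invented values):

```python
import pandas as pd

# Toy stand-in for the auction data; the last START_PRICE is a
# deliberate gross outlier and "rare_type" a deliberate rare level
df = pd.DataFrame({
    "START_PRICE":       [1.0, 49.99, 99.0, 1.0, 9999.0],
    "LISTING_TYPE_CODE": ["normal", "multi", "normal", "normal", "rare_type"],
})

# 2a/2c: basic summary and the number of distinct continuous values
print(df["START_PRICE"].describe()[["mean", "max"]])
print(df["START_PRICE"].nunique())  # 4 distinct values

# 2b/3b: a frequency table exposes rare levels of a categorical predictor
print(df["LISTING_TYPE_CODE"].value_counts())
```

The `describe()` output immediately flags the 9999.0 outlier (step 3a), and the frequency table shows `rare_type` occurring only once (step 3b).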
  • 13. Data Fundamentals The original dataset is called “dmc2006.csv” and resides in the “stmtutor\STM\dmc2006” folder. 16,000 records divided into two equal sized partitions: Part 1: complete data including the target, available for training during the competition; Part 2: data to be scored; during the competition the target was not available. 25 database fields, two of which were unstructured text written by the seller. Each line of data describes an auction of an iPod including the final winning bid price. An eBay seller must construct a headline and a description of the product being sold. Sellers can also pay for selling assistance, e.g. a seller can pay to list the item title in BOLD. Salford Systems © Copyright 2011 13
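The two-partition layout can be illustrated with pandas. The in-memory CSV below is a made-up stand-in for dmc2006.csv (the real file has 16,000 rows and 25 fields), and the assumption that scored rows simply carry a missing GMS is ours, for illustration only:

```python
import io
import pandas as pd

# Tiny stand-in for dmc2006.csv; row 2 mimics a to-be-scored auction
csv_text = """AUCT_ID,START_PRICE,BOLD_FEE_FLAG,GMS
1,1.00,0,152.50
2,99.00,1,
3,49.99,0,180.00
"""
df = pd.read_csv(io.StringIO(csv_text))

# Rows with a known GMS play the role of the 8,000 training auctions;
# rows with GMS missing play the role of the 8,000 auctions to be scored
train = df[df["GMS"].notna()]
score = df[df["GMS"].isna()]
print(len(train), len(score))  # 2 1
```

A real run would replace `io.StringIO(csv_text)` with the path to dmc2006.csv.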
  • 14. The Data: Available Fields The following variables describe general features of each auction event Variable Description AUCT_ID ID number of auction ITEM_LEAF_CATEGORY_NAME products category LISTING_START_DATE start date of auction LISTING_END_DATE end date of auction LISTING_DURTN_DAYS duration of auction LISTING_TYPE_CODE type of auction (normal auction, multi auction, etc) QTY_AVAILABLE_PER_LISTING amount of offered items for multi auction FEEDBACK_SCORE_AT_LISTIN feedback-rating of the seller of this auction listing START_PRICE start price in EUR BUY_IT_NOW_PRICE buy it now price in EUR BUY_IT_NOW_LISTING_FLAG option for buy it now on this auction listing Salford Systems © Copyright 2011 14
  • 15. Available Data Fields In addition, there are binary indicators of various “value added” features that can be turned on for each auction Variable Description BOLD_FEE_FLAG option for bold font on this auction listing FEATUERD_FEE_FLAG show this auction listing on top of homepage CATEGORY_FEATURED_FEE_FLAG show this auction listing on top of category GALLERY_FEE_FLAG auction listing with picture gallery GALLERY_FEATURED_FEE_FLAG auction listing with gallery (in gallery view) IPIX_FEATURED_FEE_FLAG auction listing with IPIX (additional xxl, picture show, pack) RESERVE_FEE_FLAG auction listing with reserve-price HIGHLIGHT_FEE_FLAG auction listing with background color SCHEDULE_FEE_FLAG auction listing, including the definition of the starting time BORDER_FEE_FLAG auction listing with frame Salford Systems © Copyright 2011 15
  • 16. Target Variable Finally, the target variable is defined based on the winning bid price revenue relative to the category average Variable Description GMS scored sales revenue in EUR CATEGORY_AVG_GMS Average sales revenue for the product category GMS_GREATER_AVG zero when the revenue is less than or equal to the category average sales and one otherwise The values were only disclosed on a randomly selected set of 8,000 auctions which we use to train a model 4199 auctions with the revenue below the category average 3801 auctions with the revenue above the category average During the competition the auction results for the remaining 8,000 auction results were kept secret, and used to score competitive entries We will only use these records at the very end of this tutorial to validate the performance of various models that will be built Salford Systems © Copyright 2011 16
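The target definition on this slide translates directly into one line of pandas (the GMS and CATEGORY_AVG_GMS values below are made up for illustration):

```python
import pandas as pd

# Illustrative auctions following the field definitions above
df = pd.DataFrame({
    "GMS":              [120.0, 210.0, 95.0, 180.0],
    "CATEGORY_AVG_GMS": [150.0, 150.0, 95.0, 150.0],
})

# GMS_GREATER_AVG: zero when revenue is less than or equal to the
# category average, one otherwise
df["GMS_GREATER_AVG"] = (df["GMS"] > df["CATEGORY_AVG_GMS"]).astype(int)
print(df["GMS_GREATER_AVG"].tolist())  # [0, 1, 0, 1]
```

Note the strict inequality: an auction exactly at the category average (row 3) is coded 0, matching the definition in the table.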
• 17. Comments on Methodology Predictive modeling and general analytics competitions are increasingly being launched both by private companies and by professional organizations; they provide public data sets and a wealth of illustrative examples of different analytic techniques When reviewing results from a competition, and especially when comparing results generated by analysts running models after the competition, it is important to keep in mind that there is an ocean of difference between being a competitor during the actual competition and an after-the-fact commentator Regardless of what is reported, the after-the-fact analyst does have access to “what really happened”, and it is nearly impossible to simulate the competitive environment once the results have been published  We all learn in both direct and indirect ways from many sources, including the outcomes of public competitions. This can affect anything that comes later in time. In spite of this, we have tried to mimic the circumstances of the competitors by presenting analyses based only on the original training data, and using well-established guidelines we have been promoting for more than a decade to arrive at a final model We urge you to never take at face value an analyst's report on what would have happened if they had hypothetically participated Salford Systems © Copyright 2011 17
  • 18. First Round Modeling: Ignoring the TEXT Data Even before doing any type of data preparation it is always valuable to run a few preliminary CART models  CART automatically handles missing values and is immune to outliers  CART is flexible enough to adapt to any type of nonlinearity and interaction effects among predictors. The analyst does not need to do any data preparation to assist CART in this regard  CART performs well enough out of the box that we are guaranteed to learn something of value without conducting any of the common data preparation operations The only requirement for useful results is that we exclude any possible perfect or near perfect illegitimate predictors  Common examples of illegitimate predictors include repackaged versions of the dependent variable, ID variables, and data drawn from the future relative to the data to be predicted We start with a quick model using 20 of the 25 available predictors. None of these involve any of the text data we will focus on later. Salford Systems © Copyright 2011 18
  • 19. Quick Modeling Round with CART We start by building a quick CART model using original raw variables and all 8,000 complete auction records Assuming that you already have SPM launched  Go to the File – Open – Data File menu  Note that we have already configured the default working folder for SPM  Make sure that the Files of Type is set to ASCII  Highlight the dmc2006.csv dataset  Press the [Open] button Salford Systems © Copyright 2011 19
• 20. Dataset Summary Window The resulting window summarizes basic facts about the dataset Note that even though the dataset has 16,000 records, only the top 8,000 will be used for modeling, as was already pointed out Salford Systems © Copyright 2011 20
• 21. The View Data Window Press the [View Data…] button to have a quick impression of the physical contents of the dataset Our goal is to eventually use the unstructured information contained in the text fields right next to the auction ID Salford Systems © Copyright 2011 21
• 22. Requesting Basic Descriptive Stats We next produce some basic stats for all available variables:  Go to the View – Data Info… menu  Set the Sort mode to File Order  Highlight the Include column  Check the Select box  Press the [OK] button Salford Systems © Copyright 2011 22
• 23. Data Information Window All basic descriptive statistics for all requested variables are now summarized in one place Note that the target variable GMS_GREATER_AVG is not defined for one half of the dataset (N Missing 8,000); all those records will be automatically discarded during model building Press the [Full] button to see more details Salford Systems © Copyright 2011 23
• 24. Setting Up CART Model We are now ready to set up a basic CART run:  Make the Classic Output window active  Go to the Model – Construct Model… menu (alternatively, you could press one of the buttons located on the bar right below the menu bar)  In the resulting Model Setup window make sure that the Analysis Method is set to CART  In the Model tab make sure that the Sort is set to File Order and the Tree Type is set to Classification  Check GMS_GREATER_AVG as the Target  Check all of the remaining variables except AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, and CATEGORY_AVG_GMS as predictors  You should see something similar to what is shown on the next slide Salford Systems © Copyright 2011 24
  • 25. Model Setup Window: Model Tab Salford Systems © Copyright 2011 25
  • 26. Model Setup Window: Testing Tab Switch to the Testing tab and confirm that the 10-fold cross-validation is used as the optimal model selection method Salford Systems © Copyright 2011 26
  • 27. Model Setup Window: Advanced Tab Switch to the Advanced tab and set the minimum required number of records for the parent nodes and the child nodes at 15 and 5 These limits were chosen to avoid extremely small nodes in the resulting tree Salford Systems © Copyright 2011 27
• 28. Building CART Model Press the [Start] button; a progress window will appear for a while and then the Navigator window containing the model results will be displayed Press the little button right above the [+][-] pair of buttons, along the left border of the Navigator window, and note that all trees within one standard error (SE) of the optimal tree are now marked in green Use the arrow keys to select the 64-node tree from the tree sequence, which is the smallest 1SE tree Salford Systems © Copyright 2011 28
• 29. CART model observations The selected CART model contains 64 terminal nodes and is the smallest model with a relative error still within one standard error of the optimal model (the model with the smallest relative error, indicated by the green bar)  This approach to model selection is usually employed because it favors smaller, easier-to-comprehend trees  We might also want to require terminal nodes to contain more than the 6-record minimum we observe in this out-of-the-box tree All 20 predictor variables play a role in the tree construction  but there is more to observe about this when we look at the variable importance details Area under the ROC curve is a respectable 0.748 Salford Systems © Copyright 2011 29
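The 1SE selection rule just described can be sketched in a few lines of Python. This is a generic illustration, not SPM's internal code, and the tree sizes and error values below are invented for the example:

```python
def pick_1se_tree(trees):
    """Select the smallest tree whose cross-validated error is within
    one standard error of the minimum-error tree.

    `trees` is a list of (n_terminal_nodes, cv_error, std_error) tuples."""
    best = min(trees, key=lambda t: t[1])      # tree with lowest relative error
    threshold = best[1] + best[2]              # its error plus one SE
    within = [t for t in trees if t[1] <= threshold]
    return min(within, key=lambda t: t[0])     # smallest qualifying tree

# Illustrative pruning sequence: (nodes, cv_error, se)
seq = [(120, 0.520, 0.015), (90, 0.515, 0.015),
       (64, 0.525, 0.015), (40, 0.560, 0.016)]
print(pick_1se_tree(seq))  # -> (64, 0.525, 0.015)
```

Here the 90-node tree is optimal, but the 64-node tree is still within one SE of it, so the smaller tree is preferred for comprehensibility.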
• 30. CART Model Performance Press the [Summary Reports…] button in the Navigator, select the Prediction Success tab, and press the [Test] button to display the cross-validated test performance of 68.66% classification accuracy Now select the Variable Importance tab to review which variables entered into the model Interestingly enough, none of the “added value” paid options are important; they exhibit practically no direct influence on the sales revenue A detailed look at the nodes might also be instructive for understanding the model Salford Systems © Copyright 2011 30
• 31. Experimenting with TreeNet We almost always follow initial CART models with similar TreeNet models We start with CART because some glaring errors such as perfect predictors are more quickly found and obviously displayed in CART  A perfect predictor often yields a single-split tree (two terminal nodes) for classification trees TreeNet models have strengths similar to CART regarding flexibility and robustness and have both advantages and disadvantages relative to CART  TreeNet is an ensemble of small CART trees that have been linked together in special ways. Thus TreeNet shares many desirable features of CART  TreeNet is superior to CART in the presence of errors in the dependent variable (not relevant in this data)  TreeNet yields much more complex models but generally offers substantially better predictive accuracy. TreeNet may easily generate thousands of trees to arrive at an optimal model  TreeNet yields more reliable variable importance rankings Salford Systems © Copyright 2011 31
• 32. A few words about TreeNet TreeNet builds predictive models in stages. It starts with a deliberately very small first-round tree (essentially a CART tree). TreeNet then calculates the prediction error made by this simple model and builds a second tree to model that prediction error. The second tree serves as a tool to update, refine, and improve the first-stage model. A TreeNet model produces a “score” which is a simple sum of the predictions made by each tree in the model Typically the TreeNet score becomes progressively more accurate as the number of trees is increased up to an optimal number of trees Rarely is the optimal number of trees just one! Occasionally, a handful of trees is optimal. More typically, hundreds or thousands of trees are optimal. TreeNet models are very useful for the analysis of data with large numbers of predictors as the models are built up in layers, each of which makes use of just a few predictors More detail on TreeNet can be found at http://www.salford-systems.com Salford Systems © Copyright 2011 32
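The staged process described above is the core idea of gradient boosting. A toy regression version, fitting one-split “stumps” to the current residuals and summing their shrunken predictions, can be sketched as follows. The data, learn rate, and search procedure are invented for illustration; this is not TreeNet's actual algorithm:

```python
def boost(x, y, rounds, lr=0.5):
    """Toy boosting for regression on a single predictor: each round fits
    a one-split stump to the residuals and adds a shrunken copy of its
    predictions to the accumulated score."""
    score = [0.0] * len(y)
    stumps = []
    for _ in range(rounds):
        resid = [yi - si for yi, si in zip(y, score)]
        best = None
        for t in set(x):                      # exhaustive threshold search
            left = [r for xi, r in zip(x, resid) if xi <= t]
            right = [r for xi, r in zip(x, resid) if xi > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - (lm if xi <= t else rm)) ** 2
                      for xi, r in zip(x, resid))
            if best is None or sse < best[0]:
                best = (sse, t, lm, rm)
        _, t, lm, rm = best
        stumps.append((t, lm, rm))
        score = [s + lr * (lm if xi <= t else rm) for xi, s in zip(x, score)]
    return score, stumps

x, y = [1, 2, 3, 4], [0, 0, 1, 1]
score, stumps = boost(x, y, rounds=20)
print(score)  # scores approach [0, 0, 1, 1] as trees accumulate
```

The final prediction is the sum over all stumps, mirroring how a TreeNet score is a sum over its component trees.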
  • 33. Setting Up TN Model Switch to the Classic Output window and go to the Model – Construct Model… menu Choose TreeNet as the Analysis Method In the Model tab make sure that the Tree Type is set to Logistic Binary Salford Systems © Copyright 2011 33
  • 34. Setting Up TN Parameters Switch to the TreeNet tab and do the following:  Set the Learnrate to 0.05  Set the Number of trees to use: to 800 trees  Leave all of the remaining options at their default values Salford Systems © Copyright 2011 34
• 35. TN Results Window Press the [Start] button to initiate the TN modeling run; the TreeNet Results window will appear when the run completes Salford Systems © Copyright 2011 35
• 36. Checking TN Performance Press the [Summary] button and switch to the Prediction Success tab Press the [Test] button to view cross-validation results Lower the Threshold: to 0.45 to roughly equalize classification accuracy in both classes (this makes it easier to compare the TN performance with the earlier reported CART performance) Salford Systems © Copyright 2011 36
  • 37. The Performance Has Improved! The overall classification accuracy goes up to about 71% Press the [ROC] button to see that the area under ROC is now a solid 0.800 This comes at the cost of added model complexity – 796 trees each with about 6 terminal nodes Variable importance remains similar to CART Salford Systems © Copyright 2011 37
  • 38. Understanding the TreeNet Model TreeNet produces partial dependency plots for every predictor that appears in the model, the plots can be viewed by pressing on the [Display Plots…] button Such plots are generally 2D illustrations of how the predictor in question affects an outcome  For example, in the graph below the Y axis represents the probability that an iPod will sell at an above category average price We see that for a BUY_IT_NOW price between 200 and 300 the probability of above average winning bid rises sharply with the BUY_IT_NOW_PRICE For prices above 300 or below 200 the curve is essentially flat meaning that changes in the predictor do not result in changes in the probable outcome Salford Systems © Copyright 2011 38
• 39. Understanding the Partial Dependency Plot (PD Plot) The PD Plot is not a simple description of the data. If you plotted the raw data as, say, the fraction of above-average winning bids against price intervals, you might see a somewhat different curve The PD Plot is extracted from the TreeNet model and is generated by examining TreeNet predictions (and not input data) The PD Plot appears to relate just two variables, but in fact other variables may well play a role in the graph construction Essentially the PD Plot shows the relationship between a predictor and the target variable taking all other predictors into account The important points to understand are that  the graph is extracted from the model and not directly from raw data  the graph provides an honest estimate of the typical effect of a predictor  the graph displays not absolute outcomes but typical expected changes from some baseline as the predictor varies. The graph can be thought of as floating up or down depending on the values of other predictors Salford Systems © Copyright 2011 39
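The construction behind a partial dependency curve can be sketched generically: fix the predictor of interest at each grid value in every record, average the model's predictions, and plot the averages against the grid. The model, records, and numbers below are invented for illustration; this is not TreeNet's internal implementation:

```python
def partial_dependence(model, data, feature, grid):
    """For each grid value v, set `feature` to v in every record and
    average the model's predictions over the whole data set."""
    curve = []
    for v in grid:
        preds = [model({**row, feature: v}) for row in data]
        curve.append(sum(preds) / len(preds))
    return curve

# Toy model: sale probability rises with price but also depends on condition,
# so the PD curve averages over both conditions present in the data
model = lambda r: min(1.0, r["price"] / 1000) * (0.5 if r["used"] else 1.0)
data = [{"price": 100, "used": True}, {"price": 100, "used": False}]
print(partial_dependence(model, data, "price", [200, 300]))
# -> approximately [0.15, 0.225]
```

Note that the curve depends on the distribution of the other predictors in the data, which is why a PD plot is best read as typical changes from a floating baseline rather than absolute outcomes.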
  • 40. More TN Partial Dependency Plots Salford Systems © Copyright 2011 40
• 41. Introducing the Text Mining Dimension To this point, we have been working only with the set of traditional structured data fields: continuous and categorical variables Further substantial performance improvement can be achieved only if we utilize the text descriptions supplied by the seller in the following fields: LISTING_TITLE (title of auction) and LISTING_SUBTITLE (subtitle of auction) Unfortunately, these two variables cannot be used “as is”. Sellers were free to enter free-form text including misspellings, acronyms, slang, etc. So we must address the challenge of converting unstructured text strings of the type shown here into a well-structured representation Salford Systems © Copyright 2011 41
• 42. The Bag of Words Approach of Text Mining The most straightforward strategy for dealing with free-form text is to represent each “word” that appears in the complete data set as a dummy (0/1) indicator variable For iPods on eBay we could imagine sellers wanting to use words like “new”, “slightly scratched”, “pink”, etc. to describe their iPod. Of course the descriptions may well be complete phrases like “autographed by Angela Merkel” rather than just single-term adjectives Nevertheless, in the simplest Bag of Words (BOW) approach we just create dummy indicators for every word Even though the headlines and descriptions are space limited, the number of distinct words that can appear in collections of free text can be huge In text mining applications involving complete documents, e.g. newspaper articles, the number of distinct words can easily reach several hundred thousand or even millions Salford Systems © Copyright 2011 42
• 43. The End Goal of the Bag of Words
Record_ID RED USED SCRATCHED CASE
1001       0    1      0       1
1002       0    0      0       0
1003       1    0      0       0
1004       0    0      0       0
1005       1    1      1       0
1006       0    0      0       0
• Above we see an example of a database intended to describe each auction item by indicating which words appeared in the auction announcement • Observe that Record_ID 1005 contains the three words “RED”, “USED” and “SCRATCHED” • Data in the above format looks just like the kind of numeric data used in traditional data mining and statistical modeling • We can use data in this form, as is, feeding it into CART, TreeNet, or regression tools such as Generalized Path Seeker (GPS) or everyday regression • Observe that we have transformed the unstructured text into structured numerical data Salford Systems © Copyright 2011 43
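A minimal sketch of how such a 0/1 term matrix can be built from raw strings. It uses naive whitespace tokenization and invented example records; real text mining adds the cleaning and dictionary steps discussed later:

```python
def bag_of_words(docs):
    """Build a presence/absence (0/1) term matrix from raw strings.
    Returns the sorted vocabulary and one row of 0/1 codes per document."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    rows = [[1 if w in set(d.lower().split()) else 0 for w in vocab]
            for d in docs]
    return vocab, rows

docs = ["used case", "", "red", "", "red used scratched"]
vocab, rows = bag_of_words(docs)
print(vocab)    # ['case', 'red', 'scratched', 'used']
print(rows[4])  # [0, 1, 1, 1]  -- the "red used scratched" record
```

Each row of the result is exactly the kind of structured numeric record shown in the table above, ready to be fed into CART or TreeNet.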
  • 44. Coding the Term Vector and TF weighting In the sample data matrix on the previous slide we coded all of our indicators as 0 or 1 to indicate presence or absence of a term An alternative coding scheme is based on the FREQUENCY COUNT of the terms with these variations:  0 or 1 coding for presence/absence  Actual term count (0,1,2,3,…)  Three level indicator for absent, one occurrence, and more than one (0,1,2) The text mining literature has established some useful weighted coding schemes. We start with term frequency weighting (tf)  Text mining can involve blocks of text of considerably different lengths  It is thus desirable to normalize counts based on relative frequency. Two text fields might each contain the term “RED” twice, but one of the fields contains 10 words while the other contains 40 words. We might want our coding to reflect the fact that 2/10 is more frequent than 2/40.  This is nothing more than making counts relative to the total length of the unit of text (or document) and such coding yields the term frequency weighting Salford Systems © Copyright 2011 44
• 45. Inverse Document Frequency (IDF) Weighting IDF weighting is drawn from the information retrieval literature and is intended to reflect the value of a term in narrowing the search for a specific document within a larger corpus of documents If a given term occurs very rarely in a collection of documents then that term is very valuable as a tag to target those documents accurately By contrast, if a term is very common, then knowing that such a term occurs within the document you are looking for is not helpful in narrowing the search While text mining has somewhat different goals than information retrieval, the concept of IDF weighting has caught on. IDF weighting serves to upweight terms that occur relatively rarely. IDF(term) = log( (Number of documents) / (Number of documents containing term) ) The IDF increases with the rarity of a term and is maximal for words that occur in only one document A common coding of the term vector uses the product: tf * idf Salford Systems © Copyright 2011 45
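The tf and idf quantities defined above are easy to compute directly. A small sketch with an invented three-document corpus:

```python
import math

def tf(term, doc):
    """Term frequency: count of the term relative to document length."""
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing term)."""
    n_containing = sum(1 for d in corpus if term in d.split())
    return math.log(len(corpus) / n_containing)

corpus = ["red used ipod", "new ipod nano", "used ipod mini case"]
# "ipod" appears in every document -> idf = log(3/3) = 0, so tf-idf vanishes
print(idf("ipod", corpus))                        # 0.0
# "red" appears in only one document -> maximal idf upweights it
print(tf("red", corpus[0]) * idf("red", corpus))
```

As the printed values show, a ubiquitous term like “ipod” carries zero tf-idf weight while a rare term like “red” is upweighted, which is exactly the behavior described above.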
• 46. Coding the DMC2006 Text Data The DMC2006 text data is unusual principally because of the limit on the amount of text a seller was allowed to upload This has the effect of making the lengths of all the documents very similar It also sharply limits the possibility that a term in a document would occur with a high frequency These factors make the TF-IDF weighting irrelevant to this challenge. In fact, for this prediction task other coding schemes allow more accurate prediction. STM offers these options for term vector coding  0 – no/yes  1 – no/yes/many – this one will be used in the remainder of this tutorial  2 – 0/1  3 – 0/1/2  4 – term frequency (relative to document)  5 – inverse document frequency (relative to corpus)  6 – TF-IDF (traditional IR coding) Salford Systems © Copyright 2011 46
• 47. Text Mining Data Preparation The heavy lifting in text mining technology is devoted to moving us from raw unstructured text to structured numerical data Once we have structured data we are free to use any of a large number of traditional data mining and statistical tools to move forward Typical analytical tools include logistic and multiple regression, predictive modeling, and clustering tools But before diving into the analysis stage we need to move through the text transformation stage in detail The first step is to extract and identify the words or “terms”, which can be thought of as creating the list of all words recognized in the training data set This stage is essentially one of defining the “dictionary”, the list of officially recognized terms. Any new term encountered in the future will be unrecognizable by the dictionary and will represent an unknown item It is therefore very important to ensure that the training data set contains almost all terms of interest that would be relevant for future prediction Salford Systems © Copyright 2011 47
• 48. Automatic Dictionary Building The following steps will build an active dictionary for a collection of documents (in our case, auction item description strings)  Read all text values into one character string  Tokenize this string into an array of words (tokens)  Remove words without any letters or digits  Remove “stop words” (words like “the”, “a”, “in”, “und”, “mit”, etc.) for both the English and German languages  Remove words that have fewer than 2 letters and are encountered fewer than 10 times across the entire collection of documents (rare small words)  At this point the too-common, too-rare, weird, obscure, and useless combinations of characters should have been eliminated  Lemmatize words using the WordNet lexical database  This step combines words present in different grammatical forms (“go”, “went”, “going”, etc.) into the corresponding stem word (“go”)  Remove all resulting words that appear less than MIN times (5 in the remainder of this tutorial) Salford Systems © Copyright 2011 48
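The steps above can be sketched as a small pipeline. Here a toy stopword list and lemma table stand in for the NLTK and WordNet resources STM actually uses, and the frequency threshold is lowered to 2 so the tiny example produces output:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "in", "und", "mit"}   # tiny illustrative list
LEMMAS = {"went": "go", "going": "go"}         # stand-in for WordNet lookups

def build_dictionary(docs, min_count=2):
    """Sketch of the dictionary-building steps: tokenize, drop tokens
    without letters or digits, drop stop words, lemmatize, then keep
    only sufficiently frequent stems."""
    tokens = []
    for doc in docs:
        for tok in re.findall(r"\w+", doc.lower()):
            if not re.search(r"[a-z0-9]", tok):
                continue                        # no letters or digits
            if tok in STOPWORDS:
                continue                        # stop word
            tokens.append(LEMMAS.get(tok, tok)) # collapse grammatical forms
    counts = Counter(tokens)
    return sorted(w for w, c in counts.items() if c >= min_count)

docs = ["the ipod went fast", "ipod going fast und cheap", "a red case"]
print(build_dictionary(docs))  # ['fast', 'go', 'ipod']
```

Note how “went” and “going” collapse into the single stem “go”, while rare words like “cheap” and “red” fall below the frequency threshold and drop out.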
• 49. Build the Dictionary (or Term Vector) For the purpose of automatic dictionary building and data preprocessing we developed the Salford Text Mining (STM) software - a stand-alone collection of tools that perform all the essential steps in preparing text documents for text mining STM builds on the Python “Natural Language Toolkit” (NLTK) From NLTK we use the following tools  Tokenizer (extract items most likely to be “words”)  Porter Stemmer (recognize different simple forms of the same word – e.g. plural)  WordNet lemmatizer (more complex recognition of same-word variations)  stop word list (words that contribute little to no value such as “the”, “a”) Future versions of STM might use other tools to accomplish these essential tasks “stm.exe” is a command line utility that must be run from a Command Prompt window (assuming you are running Windows, go to the Start – All Programs – Accessories – Command Prompt menu) The version provided here resides in the stmtutor\STM\bin folder Salford Systems © Copyright 2011 49
• 50. STM Commands and Options Open a Command Prompt window in Windows, then CD to the “stmtutor\STM” folder location; for example, on our system you would type in cd c:\stmtutor\STM To obtain help type the following at the prompt: bin\stm --help This command will return very concise information about STM: stm [-h] [-data DATAFILE] [-dict DICTFILE] [-source-dict SRCDICTFILE] [-score SCOREFILE] [-spm SPMAPP] [-t TARGET] [-ex EXCLUDE] etc. The details for each command line option are contained in the software manual appearing in the appendix You will also notice the “stm.cfg” configuration file – this file controls the default behavior of the STM module and relieves you of specifying a large number of configuration options each time “stm.exe” is launched  Note the TEXT_VARIABLES : 'ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE' line which specifies the names of the text variables to be processed 50
• 51. Create Dictionary Options For the purposes of this tutorial, we have prepackaged all of the text processing steps into individual command files (extension *.bat). You can either double-click on the referenced command file or alternatively type its contents into a Command Prompt window opened in the directory that contains the files The most important arguments for our purposes in this tutorial are:  --dataset DATAFILE name and location of your input CSV format data set  --dictionary DICTFILE name and location of the dictionary to be created These two arguments are all you need to create your dictionary. By default, STM will process every text field in your input data set to create a single omnibus dictionary Simply double-click on “stm_create_dictionary.bat” to create the dictionary file for the DMC 2006 dataset, which will be saved as “dmc2006_ynm.dict” in the “stmtutor\STM\dmc2006” folder In typical text mining practice the process of generating the final dictionary will be iterative. A review of the first dictionary might reveal further words you wish to exclude (“stop” words) Salford Systems © Copyright 2011 51
  • 52. Internal Dictionary Format The dictionary file is a simple text file with extension *.dict The file contents can be viewed and edited in a standard text editor The name of the text mining variable that will be created later on appears on the left of the “=“ sign on each un-indented line The default value that will be assigned to this variable appears on the right side of the “=“ sign of the un-indented lines and it usually means the absence of the word(s) of interest Each indented line represents the value (left of the “=“) which will be entered for a single occurrence in a document for any of the word(s) appearing on the right of the “=“  More than one occurrence will be recorded as “many” when requested (always the case in this tutorial) Salford Systems © Copyright 2011 52
• 53. Hand Made Dictionary To use a multi-level coding you need to create a “hand made dictionary”, which is already supplied to you as “hand.dict” in the “stmtutor\STM\dmc2006” folder Here is an example of an entry in this file:
hand_model=standard
    mini
    nano
    standard
The un-indented line of an entry starts with the name we wish to give to the term (HAND_MODEL) and also indicates that a BLANK or missing value is to be coded with the default value of “standard” The remaining indented entries are listed one per line and are an exhaustive list of the acceptable values which the term HAND_MODEL can receive in the term vector Another coding option is, for example:
hand_unused=no
    yes=unbenutzt,ungeoffnet
which sets “no” as the default value but substitutes “yes” if one of the two values listed above is encountered You may study additional examples in our stmtutor\STM\dmc2006\hand.dict file on your own; all of them were created manually based on common-sense logic 53
• 54. Why Create Hand Made Dictionary Entries Let's revisit the variable HAND_MODEL which brings together the terms  standard, mini, nano Without a hand made dictionary entry we would have three terms created, one for each model type, with “yes” and “no” values, and possibly “many” By creating the hand made entry we  Ensure that every auction is assigned a model (default=“standard”)  Bring all three models together into one categorical variable with three possible values: “standard”, “mini”, and “nano” This representation of the information is helpful when using tree-based learning machines but not helpful for regression-based learning machines  The best choice of representation may vary from project to project  Salford regression-based learning machines automatically repackage categorical predictors into 0/1 indicators, meaning that you can work with one representation  But if you need to use other tools you may not have this flexibility Salford Systems © Copyright 2011 54
• 55. Further Dictionary Customization The following table summarizes some of the important fields introduced in the custom dictionary for this tutorial:
CAPACITY: values 20, 30, 40, 80, …, each combining word variants (20gb, 20 gb, 20 gigabyte; 30gb, 30 gb, 30 gigabyte; 40gb, 40 gb, 40 gigabyte; 80gb, 80 gb, 80 gigabyte; …)
STATUS: Wieneu (wie neu, super gepflegt, top gepflegt, top zustand, neuwertig); Neu (neu, new, brandneu, brandneues); Unbenutzt (unbenu); defekt (defekt., --defekt--, defekt, -defekt-, -defekt, defekter, defektes)
MODEL: mini, nano, standard – captures presence of the corresponding word in the auction description
COLOR: black, white, green, etc. – captures presence of the corresponding words or variants in the auction description
IPOD_GENERATION: first, second, etc. – the iPod generation identified from the information available in the text description
Salford Systems © Copyright 2011 55
• 56. Final Stage Dictionary Extraction To generate a final version of the dictionary in most real world applications you would also need to prepare an expanded list of stopwords The NLTK provides a ready-made list of stopwords for English and another 14 major languages spanning Europe, Russia, Turkey, and Scandinavia  These appear in the directory named stmtutor\STM\data\corpora\stopwords and should be left as they are Additional stopwords, which might well vary from project to project, can be entered into the file named “stopwords.dat” in the “stmtutor\STM\data” folder  In the package distributed with this tutorial the “stopwords.dat” file is empty  You can freely add words to this file, with one stopword per line Once the custom “stopwords.dat” and “hand.dict” files have been prepared you just run the dictionary extraction again but with the “--source-dictionary” argument added (see the command files introduced in the later slides) The resulting dictionary will now include all the introduced customizations Salford Systems © Copyright 2011 56
• 57. Creating Structured Text Mining Variables The resulting dictionary file “dmc2006_ynm.dict” contains about 600 individual stems In the final step of text processing the data dictionary is applied to each document entry Each stem from the dictionary is represented by a categorical variable (usually binary) with the corresponding name The preparation process checks whether any of the known word variants associated with each stem from the dictionary are present in the current auction description; if so, the corresponding value is set to “yes”, otherwise it is set to “no”  When the “--code YNM” option is set, multiple instances of “yes” will be coded as “many”  You can also request integer codes 0, 1, 2 in place of the character codes “yes/no/many”  We have experimented with alternative variants of coding (see the “--code” help entry in the STM manual) and came to the conclusion that the “YNM” approach works best in this tutorial  Feel free to experiment with alternative coding schemas on your own The resulting large collection of variables will be used as additional predictors in our modeling efforts Even though other, more computationally intense text processing methods exist, further investigation failed to demonstrate their utility on the current data, which is most likely related to the extremely terse nature of the auction descriptions Salford Systems © Copyright 2011 57
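The yes/no/many coding just described can be sketched as follows. The stem table and auction description are invented, and this is a simplified sketch rather than the actual STM implementation:

```python
def code_ynm(stems, doc):
    """Code each dictionary stem as yes/no/many for one document,
    mirroring the --code YNM behavior described above: count how many
    of the stem's known word variants occur in the text."""
    words = doc.lower().split()
    row = {}
    for stem, variants in stems.items():
        n = sum(words.count(v) for v in variants)
        row[stem] = "no" if n == 0 else ("yes" if n == 1 else "many")
    return row

stems = {"defekt": ["defekt", "defekter", "defektes"],
         "neu": ["neu", "new", "brandneu"]}
print(code_ynm(stems, "ipod neu neu defekter"))
# {'defekt': 'yes', 'neu': 'many'}
```

Applying this row-builder to every auction description yields exactly the kind of categorical predictor columns used in the modeling runs that follow.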
• 58. Creating Additional Variables Finally, we spent additional effort reorganizing the original raw variables into more useful measures  MONTH_OF_START – based on the recorded start date of the auction  MONTH_OF_SALE – based on the recorded closing date of the auction  HIGH_BUY_IT_NOW – set to “yes” if BUY_IT_NOW_PRICE exceeds CATEGORY_AVG_GMS, as suggested by common sense and the nature of the classification problem  In the original raw data, BUY_IT_NOW_PRICE was set to 0 on all items where that option was not available – we reset all such 0s to missing All of these operations are encoded in the “preprocess.py” Python file located in the “stmtutor\STM\dmc2006” folder  This component of the STM is under active development  The file is automatically called by the main STM utility  You may add/modify the contents of this file to allow alternative transformations of the original predictors Salford Systems © Copyright 2011 58
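A sketch of these derived variables in Python. The field names follow the tutorial, but the date format and the exact rules are assumptions based on the descriptions above, not the contents of the actual preprocess.py:

```python
def derive_fields(row):
    """Derive MONTH_OF_START, MONTH_OF_SALE, HIGH_BUY_IT_NOW, and the
    0-to-missing recode of BUY_IT_NOW_PRICE for one auction record.
    Assumes ISO-style YYYY-MM-DD date strings."""
    out = dict(row)
    out["MONTH_OF_START"] = int(row["LISTING_START_DATE"].split("-")[1])
    out["MONTH_OF_SALE"] = int(row["LISTING_END_DATE"].split("-")[1])
    price = row["BUY_IT_NOW_PRICE"]
    out["BUY_IT_NOW_PRICE"] = None if price == 0 else price  # 0 means missing
    out["HIGH_BUY_IT_NOW"] = ("yes" if price and
                              price > row["CATEGORY_AVG_GMS"] else "no")
    return out

row = {"LISTING_START_DATE": "2006-03-01", "LISTING_END_DATE": "2006-03-08",
       "BUY_IT_NOW_PRICE": 250.0, "CATEGORY_AVG_GMS": 180.0}
print(derive_fields(row)["HIGH_BUY_IT_NOW"])  # yes
```

Records where the buy-it-now option was unavailable (price recorded as 0) get a missing price and HIGH_BUY_IT_NOW of “no”, matching the recode described above.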
• 59. Generation of the Analysis Data Set At this point we are ready to move on to the next step, which is data creation This is nothing more than appending the relevant columns of data to the original data set. Remember that the dictionary may contain tens of thousands if not hundreds of thousands of terms For the DMC2006 dataset the dictionary is quite small by text mining standards, containing just a little over 600 words To generate the processed dataset simply double-click on the stm_ynm.bat command file or explicitly type in its contents in the Command Prompt  The “--dataset” option specifies the input dataset to be processed  The “--code YNM” option requests the “yes/no/many” style of coding  The “--source-dictionary” option specifies the hand dictionary  The “--process” option specifies the output dataset  Of course you may add other options as you prefer This creates a processed dataset with the name dmc2006_res_ynm.csv which resides in the stmtutor\STM\dmc2006 folder Salford Systems © Copyright 2011 59
• 60. Analysis Data Set Observations At this point we have a new modeling dataset with the text information represented by the extra variables  Note that the raw input data set is just shy of 3 MB in plain text format while the prepared analysis data set is about 40 MB in size, 13 times larger Process only training data or all data?  For prediction purposes all data needs to be processed, both the data that will be used to train the predictive models and the holdout or future data that will receive predictions later  In the DMC2006 data we happen to have access to both training and holdout data and thus have the option of processing all the text data at the same time  Generating the term vector based only on the training data would generally be the norm because future data flows have not yet arrived  In this project we elected to process all the data together for convenience, knowing that the train and holdout partitions were created by random division of the data  It is worth pointing out, though, that the final dictionary generated from training data only might be slightly different due to the infrequent-word elimination component of the text processor Salford Systems © Copyright 2011 60
• 61. Quick Modeling Round with CART
We are now ready to proceed with another CART run, this time using all of the newly created text fields as additional predictors. Assuming that you already have SPM launched:
- Go to the File – Open – Data File menu
- Make sure that the Files of Type is set to ASCII
- Highlight the dmc2006_res_ynm.csv dataset
- Press the [Open] button
• 62. Dataset Summary Window
Again, the resulting window summarizes basic facts about the dataset. Note the dramatic increase in the number of available variables.
• 63. The View Data Window
Press the [View Data…] button to have a quick look at the physical contents of the dataset. Note how the individual dictionary word entries are now coded with the "yes", "no", or "many" values for each document row.
• 64. Setting Up CART Model
Proceed with setting up a CART modeling run as before:
- Make the Classic Output window active
- Go to the Model – Construct Model… menu (alternatively, you could use one of the buttons located on the bar right below the menu)
- In the resulting Model Setup window make sure that the Analysis Method is set to CART
- In the Model tab make sure that the Sort is set to File Order and the Tree Type is set to Classification
- Check GMS_GREATER_AVG as the Target
- Check all of the remaining variables except AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, and CATEGORY_AVG_GMS as predictors
- You should see something similar to what is shown on the next slide
• 65. Model Setup Window: Model Tab
• 66. Model Setup Window: Testing Tab
Switch to the Testing tab and confirm that 10-fold cross-validation is used as the optimal model selection method.
• 67. Model Setup Window: Advanced Tab
Switch to the Advanced tab and set the minimum required number of records for the parent nodes and the child nodes to 15 and 5, respectively. These limits were chosen to avoid extremely small nodes in the resulting tree.
• 68. Building CART Model
Press the [Start] button; a progress window will appear for a while and then the Navigator window containing the model results will be displayed (this time, the process takes a few minutes!)
Press the little button right above the [+][-] pair of buttons, along the left border of the Navigator window; note that all trees within one standard error (SE) of the optimal tree are now marked in green
Use the arrow keys to select the 102-node tree from the tree sequence, which is the smallest 1SE tree
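The "smallest 1SE tree" selection used above is the classic one-standard-error rule: among all trees whose cross-validated error is within one standard error of the minimum, pick the smallest. A minimal Python sketch (the tuples and their values are made up for illustration, not taken from the actual run):

```python
def one_se_tree(trees):
    """Pick the smallest tree whose CV error is within one standard error
    of the best tree. `trees` holds (n_terminal_nodes, cv_error, se)."""
    best_error, best_se = min((err, se) for _, err, se in trees)
    threshold = best_error + best_se
    eligible = [t for t in trees if t[1] <= threshold]
    return min(eligible, key=lambda t: t[0])  # smallest eligible tree

# Hypothetical tree sequence: (nodes, CV error, SE of the error)
trees = [(201, 0.240, 0.008), (102, 0.245, 0.008), (40, 0.260, 0.009)]
print(one_se_tree(trees))  # → (102, 0.245, 0.008)
```

The 201-node tree has the lowest error, but the 102-node tree is within 0.240 + 0.008 of it, so the simpler tree wins — the same logic behind the green-marked trees in the Navigator.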
• 69. CART Model Performance
The selected CART model contains 102 terminal nodes, and nearly all available predictor variables play a role in the tree construction.
The area under the ROC curve (Test) is now an impressive 0.830, especially when compared to the 0.748 reported earlier for the basic CART run or the 0.800 for the basic TN run.
Press the [Summary Reports] button in the Navigator window, select the Prediction Success tab, and finally press the [Test] button to see cross-validated test performance at 76.58% classification accuracy – a significant improvement!
Also note the presence of both original and derived variables on the list shown in the Variable Importance tab.
• 70. Setting Up TN Model
Now switch to the Classic Output window and go to the Model – Construct Model… menu
Choose TreeNet as the Analysis Method
In the Model tab make sure that the Tree Type is set to Logistic Binary
• 71. Setting Up TN Parameters
Switch to the TreeNet tab and do the following:
- Set the Learnrate: to 0.05
- Set the Number of trees to use: to 800
- Leave all of the remaining options at their default values
• 72. TN Results Window
Press the [Start] button to initiate the TN modeling run. The TreeNet Results window will appear when the run completes – you may want to take a coffee break in the meantime.
• 73. Checking TN Performance
Press the [Summary] button and switch to the Prediction Success tab
Press the [Test] button to view cross-validation results
Lower the Threshold: to 0.47 to roughly equalize classification accuracy in both classes (this makes it easier to compare the TN performance with the earlier reported CART and TN model performance)
You can clearly see the improvement!
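Why 0.47 rather than the default 0.50? Lowering the threshold trades specificity for sensitivity until the per-class accuracies roughly match. A hedged Python sketch of that idea — the function and the sample probabilities are invented for illustration and do not reproduce SPM's internal search:

```python
def balanced_threshold(probs, labels):
    """Scan candidate cutoffs and return the one where sensitivity
    (accuracy on class 1) and specificity (accuracy on class 0)
    are closest to each other."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]

    def gap(t):
        sens = sum(p >= t for p in pos) / len(pos)
        spec = sum(p < t for p in neg) / len(neg)
        return abs(sens - spec)

    return min((i / 100 for i in range(1, 100)), key=gap)

# Hypothetical predicted probabilities and true classes:
probs  = [0.9, 0.8, 0.45, 0.6, 0.3, 0.2, 0.55, 0.1]
labels = [1,   1,   1,    1,   0,   0,   0,    0]
print(balanced_threshold(probs, labels))  # → 0.46
```

On real score distributions the balancing cutoff lands wherever the two error curves cross, which for this model happened to be near 0.47.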
• 74. Requesting TN Graphs
Here we present a sample collection of all 2-D contribution plots produced by TN for the resulting model
The plots are available by pressing the [Display Plots…] button in the TreeNet Results window
The list is arranged according to the variable importance table
• 75. More Graphs
• 76. Insights Suggested by the Model
Here is a list of insights we arrived at by looking into the selection of plots:
- There is a distinct effect of the iPod category once all the other factors have been accounted for
- A larger start price means an above-average sale (most likely related to the quality of an item)
- A "new" and "unpacked" item should fetch a better price, while any "defect" brings the price down
- End of the year means better sales
- Having a good feedback score is important
- It is best to wait 10 days or more before closing the deal
- Interestingly, the 1st and 3rd generations of iPod show poorer sales than the 2nd and 4th
- 2G started to fall out of favor in 2005-2006
- Black is much more popular in Germany than other colors
- Mentioning "photo", "video", "color display", etc. helps get a better price
- The paid advertising features are of little or marginal importance
• 77. Final Validation of Models
At this point we are ready to check the performance of all our models using the remaining 8,000 auctions originally not available for training
This way each model can be positioned with respect to all of the 173 official entries originally submitted to the DMC 2006 competition
However, in order to proceed with the evaluation, we must first score the input data using all of the models we have generated up until now
The following slides explain how to score the most recently constructed CART and TN models; the earlier models can be scored using similar steps
You may choose to skip the scoring steps, as we have already included the results of scoring in the "stmtutor\STM\scored" folder:
- Score_cart_raw.csv – simple CART model predictions
- Score_tn_raw.csv – simple TN model predictions
- Score_cart_txt.csv – text mining enhanced CART model predictions
- Score_tn_txt.csv – text mining enhanced TN model predictions
• 78. Scoring a CART Model
Select the Navigator window for the model you wish to score
Select the tree from the tree sequence (in our runs we pick the 1SE trees as more robust)
Press the [Score] button to open the "Score Data" window
Make sure that the "Data file" is set to "dmc2006_res_ynm.csv"; if not, press the [Select…] button on the right and select the dataset to be scored
Place a checkmark in the "Save results to a file" box, then press the [Select] button right next to it; this will open the "Save As" window
Navigate to the "stmtutor\STM\scored" folder in the "Save in:" selection box, enter "Scored_cart_txt.csv" in the "File name:" text entry box, and press the [Save] button
You should now see something similar to what's shown on the next slide
Press the [OK] button to initiate the scoring process
You should now have the Scored_cart_txt.csv file in the stmtutor\STM\scored folder
• 79. Scoring CART
• 80. Scoring a TN Model
Select the "TreeNet Results" window for the model you wish to score
Go to the "Model – Score Data…" menu to open the "Score Data" window
Make sure that the "Data file" is set to "dmc2006_res_ynm.csv"; if not, press the [Select…] button on the right and select the dataset to be scored
Place a checkmark in the "Save results to a file" box, then press the [Select] button right next to it; this will open the "Save As" window
Navigate to the "stmtutor\STM\scored" folder in the "Save in:" selection box, enter "Scored_tn_txt.csv" in the "File name:" text entry box, and press the [Save] button
You should now see something similar to what's shown on the next slide
Press the [OK] button to initiate the scoring process
You should now have the Scored_tn_txt.csv file in the stmtutor\STM\scored folder
• 81. Scoring TN
• 82. Using STM to Validate Performance
We can now use the STM machinery to do the final model validation
Simply double-click the "stm_validate.bat" command file to proceed
Note the use of the following options inside the command file:
- "-score" – specifies the dataset where the model predictions were written
- "--score-column" – specifies the name of the variable containing the actual model predictions (these variables are produced by CART or TN during the scoring process)
- "--check" – specifies the name of the dataset that contains the originally withheld values of the target; this dataset was used by the organizers of the DMC 2006 competition to select the actual winners
- STM is currently configured to validate only the bottom 8,000 of the 16,000 predictions generated by the model; the top 8,000 records (used for learning) are simply ignored
The results will be saved into text files with the extension "*.result" appended to the original score file names in the "stmtutor\STM\scored" folder
• 83. Validation Results Format
The following window shows the validation results of the final TN model we built
8,000 validation records were scored, of which:
- 719 ones were misclassified as zeroes
- 807 zeroes were misclassified as ones
Thus 1,526 documents were misclassified in total
This gives the final score of 8,000 – (1,526 × 2) = 4,948
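The scoring arithmetic above is simple enough to sketch in a couple of lines of Python (the function name is ours, not part of STM or the competition rules):

```python
def dmc_score(n_records, missed_zeros, missed_ones):
    """DMC 2006 scoring rule as described on this slide: start from the
    number of validation records and subtract two points per misclassified
    auction item."""
    return n_records - 2 * (missed_zeros + missed_ones)

print(dmc_score(8000, 807, 719))  # → 4948
```

Note that a model which misclassifies half the records scores exactly 0, so positive scores indicate better-than-coin-flip performance under this rule.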
• 84. Final Validation of Models
Based on the predicted class assignments, the final performance score is calculated as 8,000 minus twice the total number of auction items misclassified
The following table summarizes how these virtually out-of-the-box elementary models perform on the holdout data (the values are extracted from the four *.result files produced by the STM validator):

Model            ROC Area   Missed 0s   Missed 1s   Score
CART raw data    75%        1123        1387        2980
TN raw data      80%        1308        926         3532
CART text data   83%        981         848         4342
TN text data     89%        807         719         4948
• 85. Visual Validation of the Results
The following graph summarizes the positioning of the four basic models with respect to the 173 official competition entries. The TN model with text mining processing is among the top 10 winners!
(Chart series: TN text, CART text, TN raw, CART raw)
• 86. Observations on the Results
We used the most basic form of text mining, the Bag of Words, with minor emendations
- None of the authors speaks German, although we did look up some of the words in an on-line dictionary. If there are any subtleties to be picked up from seller wording choices we would have missed them.
We chose the coding scheme that performed best on the training data; of the coding options available, one stands out as clearly best
We used common settings for the CART and TreeNet controls
We did not use any of the modeling refinement techniques we teach in our CART and TreeNet tutorials
We thus invite you to see if you can tweak the performance of these models even higher
• 87. Command Line Automation in SPM
SPM has a powerful command line processing component which allows you to completely reproduce any modeling activity by creating and later submitting a command file
We have packaged the command files for the four modeling and scoring runs you have conducted in the course of this tutorial:
- SPM command files must have the extension *.cmd
- The four command files are stored in the "stmtutor\STM\dmc2006" folder
You can create, open, or edit a command file using a simple text editor like Notepad; SPM also has a built-in editor – just go to the File – New Notepad… menu
You may also access the command line directly from inside the SPM GUI; just make sure that the File – Command Prompt menu item is checked
Type "help" at the Command Prompt (the part of the Classic Output window starting with the ">" mark) to get a listing of all available commands
You can then request more detailed help for any specific command of interest; for example, "help battery" will produce a long list of the various batteries of automated runs available in SPM
Furthermore, you may view all of the commands issued during the current session by going to the View – Open Command Log… menu; this way you can quickly learn which commands correspond to the recent GUI activity you were involved with
• 88. Basic CART Model Command File
You may now restart SPM to emulate a fresh run
Go to the File – Open – Command File… menu
Select the "cart_raw.cmd" command file and press the [Open] button
The file is now opened in the built-in Notepad window
• 89. CART Command File Contents
OUT – saves the classic output into a text file
USE – points to the modeling dataset
GROVE – saves the model as a binary grove file
MODEL – specifies the target variable
CATEGORY – indicates which variables are categorical, including the target
KEEP – specifies the list of predictors
LIMIT – sets the node limits
ERROR – requests cross-validation
BUILD – builds a CART model
SAVE – names the file where the CART model predictions will be saved
HARVEST – specifies which tree is to be used in scoring
IDVAR – requests saving of the additional variables into the output dataset
SCORE – scores the CART model
OUTPUT * – closes the current text output file
Note the use of relative paths in the GROVE and SAVE commands, and the use of the forward slash "/" to separate folder names
• 90. Submitting Command File
With the Notepad window active, go to the File – Submit Window menu to submit the command file to SPM
In the end you will see the Navigator and the Score windows opened, which should be identical to the ones you have already seen at the beginning of this tutorial
Furthermore, you should now have:
- The "cart_raw.dat" text file created in the "stmtutor\STM\dmc2006" folder; the file contains the classic output you normally see in the "Classic Output" window
- The "cart_raw.grv" binary grove file created in the "stmtutor\STM\models" folder; the file contains the CART model itself and can be opened in the GUI using the File – Open – Open Grove… menu, which reopens the Navigator window; this file will also be needed for future scoring or translation
- The "Score_cart_raw.csv" data file created in the "stmtutor\STM\scored" folder; the file contains the selected CART model predictions on your data
You may now proceed with opening the "tn_raw.cmd" file using the File – Open – Command File… menu
• 91. TN Command File Contents
OUT, USE, GROVE, MODEL, CATEGORY, KEEP, ERROR, SAVE, IDVAR, SCORE, OUTPUT – same as in the CART command file introduced earlier
MART TREES – sets the TN model size in trees
MART NODES – sets the tree size in terminal nodes
MART MINCHILD – sets the minimum individual node size in records
MART OPTIMAL – sets the evaluation criterion that will be used for optimal model selection
MART BINARY – requests logistic regression processing in our case
MART LEARNRATE – sets the learn rate parameter
MART SUBSAMPLE – sets the sampling rate
MART INFLUENCE – sets the influence trimming value
The rest of the MART commands request automatic saving of the 2-D and 3-D plots into the grove; type "help mart" to get full descriptions
• 92. Submitting the Rest of the Command Files
Again, with the current Notepad window active, use the File – Submit Window menu to launch the basic TN modeling run, automatically followed by scoring
This will create the output, grove, and scored data files in the corresponding locations for the chosen TN model; also note the use of the EXCLUDE command in place of the KEEP command inside the command file – this saves a lot of typing
Now go back to the Classic Output window and notice that the File menu has changed
Go to the File – Submit Command File… menu, select the "cart_txt.cmd" command file, and press the [Open] button
Notice the modeling activity in the Classic Output window, but no Results window is produced – this is how the Submit Command File… menu item differs from the Submit Window menu item used previously; nonetheless, the output, grove, and score files are still created in the specified locations
Use the File – Open – Open Grove… menu to open the "tn_raw.grv" file located in the "stmtutor\STM\models" folder; you will need to navigate into this folder using the Look in: selection box in the Open Grove File window
You may now proceed with the final TN run by submitting the "tn_txt.cmd" command file using either the File – Open – Command File… / File – Submit Window or the File – Submit Command File… menu route – don't forget that it does take a long time to run!
• 93. Final Remarks
This completes the Salford Systems Data Mining and Text Mining tutorial
In the process of going through the tutorial you have learned how to use both the GUI and command line facilities of SPM as well as the command line text mining facility STM
You managed to build two CART models and two TN models, as well as enrich the original dataset with a variety of text mining fields
The final model puts you among the top winners in a major text mining competition – a proud achievement
Even though we have barely scratched the surface, you are now ready to proceed with exploring the remainder of the vast data mining capabilities offered within SPM and STM on your own
We wish you the best of luck on the exciting and never-ending road of modern data analysis and exploration
And don't forget that you can always reach us at www.salford-systems.com should you have further modeling questions and needs
• 94. References
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Pacific Grove: Wadsworth.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Hastie, T., Tibshirani, R., and Friedman, J.H. (2000). The Elements of Statistical Learning. Springer.
Freund, Y. and Schapire, R.E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth National Conference, Morgan Kaufmann, pp. 148-156.
Friedman, J.H. (1999). Stochastic gradient boosting. Stanford: Statistics Department, Stanford University.
Friedman, J.H. (1999). Greedy function approximation: a gradient boosting machine. Stanford: Statistics Department, Stanford University.
Weiss, S.M., Indurkhya, N., Zhang, T., and Damerau, F.J. (2004). Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer.
• 95. STM Command Reference
Salford Text Miner is a simple utility intended to make the text mining process much easier. The application described in this manual accepts a variety of parameters and can run Salford Predictive Miner as its data mining backend.
STM Workflow:
- Automatically generate a dictionary based on the dataset
- Process the dataset and generate a new one with additional columns based on the dictionary
- Generate a model folder with the dataset, command file, and dictionary
- Run Salford Predictive Miner with the generated command file
- Run the checking process, comparing the scoring results with the real classes
All of these steps can be done in separate STM calls or in one call
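The first workflow step — automatic dictionary extraction — can be sketched in Python. This is only an illustration of the idea, not STM's code: the tokenizer regex, the tiny stop word set, and the sample texts are all assumptions, though the frequency cutoff mirrors STM's WORD_FREQUENCY_THRESHOLD configuration default of 5 described later.

```python
import re
from collections import Counter

STOPWORDS = {"der", "die", "das", "und", "the", "and"}  # illustrative subset

def extract_dictionary(texts, min_freq=5):
    """Tokenize the text fields, drop stop words, and keep only terms
    seen at least `min_freq` times across the corpus."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-zäöüß]+", text.lower()):
            if token not in STOPWORDS:
                counts[token] += 1
    return sorted(term for term, n in counts.items() if n >= min_freq)

texts = ["ipod neu ovp"] * 5 + ["defekt"]
print(extract_dictionary(texts))  # → ['ipod', 'neu', 'ovp']
```

The infrequent-word cutoff is why a dictionary built from training data alone can differ slightly from one built on all 16,000 records, as noted on slide 60.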
• 96. STM Command Reference
Short Option / Long Option – Description
-data DATAFILE / --dataset DATAFILE – Specify dataset to work with
-dict DICTFILE / --dictionary DICTFILE – Specify dictionary to work with
-source-dict SDFILE / --source-dictionary SDFILE – Dictionary that is used as the source for the automatic dictionary retrieval process
-score SFILE / --scoreresult SFILE – Specify the file with the score result for the checking process; default – 'score.csv'
-spm SPMAPP / --spmapplication SPMAPP – Path to the SPM application; default – 'spm.exe'
-t TARGET / --target TARGET – Target variable to generate the command file; default – 'GMS_GREATER_AVG'
-ex EXCLUDE / --exclude EXCLUDE – List of variables to exclude from the keep list when generating the command file
-cat CATEGORY / --category CATEGORY – List of variables to select as categorical variables when generating the command file
• 97. STM Command Reference
-templ CMDTEMPL / --cmdtemplate CMDTEMPL – Specify the template of the command file that will be used for generation; default – 'data/template.cmd'
-md MODEL_DIR / --modeldir MODEL_DIR – Directory where model folders will be created; default – 'models'
-trees TREES / --trees TREES – Parameter for TreeNet command files; specifies the number of trees that will be built; default – 500
-maxnodes MAXNODES / --maxnodes MAXNODES – Parameter for TreeNet command files; specifies the number of nodes per tree; default – 6
-fixwords / --fixwords – Enables heuristics that try to fix words (finding the nearest word by different metrics, spell checking, etc.)
-textvars VARLIST / --text-variables VARLIST – List of variables, separated by commas, which will be used in the dictionary retrieval process
• 98. STM Command Reference
-outrmwords / --output-removed-words – Enables outputting removed stop words to the file 'data/removed.dat'
-code CODE / --column-coding CODE – Specify how to code the absence/presence of a word in a row:
  YN or 0 – no/yes
  YNM or 1 – no/yes/many
  01 or 2 – 0/1
  012 or 3 – 0/1/2
  TF or 4 – term frequency
  IDF or 5 – inverse document frequency
  TF-IDF or 6 – TF-IDF
  TC or 7 – term count (0, 1, 2, …)
  Default – YN
-mp MODELPATH / --model-path MODELPATH – Specify the path where the model files will be created
-cmd-path CMDPATH / --command-file-path CMDPATH – Specify the path to the command file which will be executed by Salford Predictive Miner
-ppfile PPFILE / --preprocess-file PPFILE – Path to Python code that will be executed at the process step to manipulate the data
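Of the coding options above, TF-IDF is the least self-explanatory. A Python sketch of the idea follows; STM's exact normalization is not documented here, so this uses the textbook formula tf × log(N / df), which should be taken as an assumption rather than STM's precise implementation:

```python
import math

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents:
    term frequency within a document, damped by how many documents
    contain the term."""
    n_docs = len(docs)
    df = Counter = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)       # term frequency
            idf = math.log(n_docs / df[term])     # inverse document frequency
            w[term] = tf * idf
        weights.append(w)
    return weights

weights = tf_idf([["ipod", "neu"], ["ipod", "defekt"]])
# "ipod" appears in every document, so its weight is 0;
# "neu" and "defekt" are distinctive, so they get positive weights.
```

A term like "ipod" that occurs in every auction title carries no discriminating information under this scheme, while rarer terms such as "defekt" are up-weighted.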
• 99. STM Command Reference
-rc NAME / --realclass-column-name NAME – Specify the column name in the real class dataset for the check step; default – GMS_GREATER_AVG
-e / --extract – Run the first step: automatic extraction of the dictionary from the dataset; need to specify --dataset
-p OUTFILE / --process OUTFILE – Run the second step: process the dataset and create a new dataset named OUTFILE, with new columns created based on the dictionary; need to specify --dataset and --dictionary
-g / --generate – Run the third step: generate the model folder with the command file; need to specify --dataset and --dictionary
-m / --model – Run the fourth step: run Salford Predictive Miner with the generated command file; works only with --generate
-c DATASET / --check DATASET – Run the fifth step: check the score file against the real classes (from the specified REALCLASSFILE) and output the misclassification table; need to specify --scoreresult
-h / --help – Show help
• 100. STM Configuration File
Name – Description – Default
SPM_APPLICATION – Path to Salford Predictive Miner – spm.exe
CMD_TREES – Number of trees to build in TN models – 500
CMD_NODES – Tree size for TN models – 6
CMD_TEMPLATE – Command file template – data/template.cmd
MODELS_DIR – Directory where model folders will be created – models
LANGUAGES – Languages whose stop words will be used – English, German
SPELLCHECKER_DICT – Additional spell checker dictionary with words that are allowed (like "ipod") – data/spellchecker_dict.dat
SPELLCHECKER_LANGUAGE – Language for the spell checker – de_DE
ADDITIONAL_STOPWORDS – File with additional stop words, which the user can edit – data/stopwords.dat
REMOVED_WORDS_FILE – File where removed words will be written at the "extract" step – data/removed.dat
WORD_FREQUENCY_THRESHOLD – Lower word frequency threshold; words below it will be deleted at the "extract" step – 5
PREPROCESS_FILE – Included script for additional processing – dmc2006/preprocess.py
• 101. STM Configuration File
CHECK_RESULTS_FILE – data/score_results.csv
LOGFILE – Path to the log file; can be a mask (%s for date) – log/stm%s.log
TARGET – Default value for the target argument, used to fill the command file template – GMS_GREATER_AVG
EXCLUDE – Default value for the exclude argument, used to fill the command file template – AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, GMS_GREATER_AVG
CATEGORY – Default value for the category argument, used to fill the command file template – GMS_GREATER_AVG
SCORE_FILE – Name of the score file which needs to be checked – Score.csv
TEXT_VARIABLES – List of text variables in the dataset, separated by commas – ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE
DEFAULT_CODING – Default coding for the extract and preprocess steps – YN
REALCLASS_COLUMN_NAME – Name of the column in the real class file, used in the check step – GMS_GREATER_AVG
SCORE_COLUMN_NAME – Name of the column in the score file, used in the check step – PREDICTION