A Management and Visualisation Tool for
Text Mining Applications
Student Peishan Mao
MSc Computing Science Project Report
School of Computing Science and Information System
Birkbeck College, University of London 2005
Status Draft
Last saved 26 Apr. 10
1 TABLE OF CONTENTS
1 TABLE OF CONTENTS 2
2 ACKNOWLEDGEMENT 5
3 ABSTRACT 6
4 INTRODUCTION 7
5 BACKGROUND 8
5.1 Written Text 8
5.2 Natural Language Text Classification 8
5.2.1 Text Classification 8
5.2.2 The Classifier 9
5.3 Text Classifier Experimentations 12
6 HIGH-LEVEL APPLICATION DESCRIPTION 14
6.1 Description and Rationale 14
6.1.1 Build a Classifier 14
6.1.2 Evaluate and Refine the Classifier 15
6.2 Development and Technologies 15
7 DESIGN 17
7.1 Functional Requirements 17
7.2 Non-Functional Requirements 22
7.2.1 Usability 22
7.2.2 Hardware and Software Constraint 22
7.2.3 Documentation 23
7.3 System Framework 23
7.4 Components in Detail 25
7.4.1 The Client - User Interface 25
7.4.2 Display Manager 26
7.4.3 The Classifier 26
7.4.4 Data Manipulation and Cleansing 28
7.4.5 Experimentation 29
7.4.6 Results Manager 30
7.4.7 Error Handling 31
7.5 Class Diagram 32
8 DATABASE 33
8.1 Entities 33
8.1.1 Score Table 33
8.1.2 Source Table 33
8.1.3 Configuration Table 33
8.1.4 Score Functions Table 33
8.1.5 Match Normalisation Functions Table 34
8.1.6 Tree Normalisation Functions Table 34
8.1.7 Classification Condition Table 34
8.1.8 Class Weights Table 34
8.1.9 Temporary Max and Min Score Table 34
8.2 Views 35
8.2.1 Weighted Scores 35
8.2.2 Maximum and Minimum Scores 35
8.2.3 Misclassified Documents 35
8.3 Relation Design for the Main Tables 35
9 IMPLEMENTATION 37
9.1 Main User Interface 37
9.2 Display Manager 39
9.3 Classifier Classes 40
9.4 Results Output Classes 41
9.5 Other Controller Classes 43
9.6 TreeView Controller Class 44
9.7 Error Interface 45
10 IMPLEMENTATION SPECIFICS 46
10.1 Generic Selection Form Class 46
10.2 Visualisation of the Suffix Tree 48
10.3 Dynamic Sub-String Matching 49
10.4 User Interaction Warnings 50
11 USER GUIDE 53
11.1 Getting Started 53
11.1.1 Input Data 53
11.2 Loading a Resource Corpus 54
11.3 Selecting a Sampling Set 57
11.4 Performing Pre-processing 61
11.5 Running N-Fold Cross-Validation 64
11.5.1 Set Up Cross-Validation Set 64
11.5.2 Perform experiments on the data 67
11.5.2.1 Create the Suffix Tree 67
11.5.2.2 Display Suffix Tree 69
11.5.2.3 Delete Suffix Tree 71
11.5.2.4 N-Gram Matching 71
11.5.2.5 Score Documents 73
11.5.2.6 Classify documents 74
11.5.2.7 Add New Document to Classify 76
11.6 Creating a Classifier 79
12 TESTING 81
13 CONCLUSION 83
13.1 Evaluation 83
13.2 Future Work 84
14 BIBLIOGRAPHY 86
15 APPENDIX A DATABASE 88
16 APPENDIX B CLASS DEFINITIONS 90
17 APPENDIX C SOURCE CODE 93
2 ACKNOWLEDGEMENT
I would like to thank the following people for their help over the course of this project:
Rajesh Pampapathi: for his wide-ranging help on the project, from his patience and
advice on the whole area of text classification, and pointing me in the right direction for
information on the topic, to being interviewed as a potential user of the proposed system
as part of the requirements collection.
Timothy Yip: for laboriously proofreading the draft of the report despite not having
much interest in information technology.
3 ABSTRACT
This report describes the design and implementation of a management and visualisation
tool for text classification applications. The system is built as a wrapper for a machine
learning classification tool. It aims to provide a flexible framework that can accommodate
future changes to the system. The system is implemented in C# .NET with a Windows
Forms front end and an Access database as an example, but should be flexible enough
to allow different underlying components to be added.
4 INTRODUCTION
This report describes the project carried out to implement a management and
visualisation tool for text classification. It covers background information about the
project, the design, implementation and conclusion. The report is organised as follows:
Section 4 is this section. It describes the organisation of the report.
Section 5 looks at the background of the project. It discusses natural language
classification and the suffix tree data structure used in the study by Pampapathi et
al.
Section 6 gives a high-level description and rationale of the system.
Section 7 describes the design of the system. It lays out the system requirements and
system framework, and describes the system components and classes.
Section 8 explains the database design and describes the database entities and
table relations.
Section 9 discusses how the system was implemented and goes into class definitions.
Section 10 focuses on specific aspects of the implementation: the generic selection
form class, visualisation of the suffix tree, dynamic sub-string matching on documents,
and user warnings.
Section 11 is the user guide to the system.
Section 12 covers testing.
Section 13 concludes the project. This section discusses whether the system built has
met the requirements laid out at the beginning of the project. It also looks at future work.
Appendix A Database
Appendix B Class Definitions
Appendix C Source Code
5 BACKGROUND
5.1 Written Text
Writing has long been an important means of exchanging information, ideas and
concepts from one individual to another, or to a group. Indeed, it is even thought to be
the single most advantageous evolutionary adaptation for species preservation [2]. The
written text available contains a vast amount of information. The advent of the internet
and on-line documents has contributed to the proliferation of digital textual data readily
available for our perusal. Consequently, it is increasingly important to have a systematic
method of organising this corpus of information.
Tools for textual data mining are proving increasingly important for our growing
mass of text-based data. The discipline of computing science has provided significant
contributions to this area by means of automating the data mining process. To encode
unstructured text data into a more structured form is not a straightforward task. Natural
language is rich and ambiguous. Working with free text is one of the most challenging
areas in computer science.
This project aims to investigate how computer science can help to evaluate some of the
vast amounts of textual information available to us, and how to provide a convenient way
to access this type of unstructured data. In particular, the focus will be on the data
classification aspect of data mining. The next section will explore this topic in more
depth.
5.2 Natural Language Text Classification
5.2.1 Text Classification
F. Sebastiani [3] described automated text categorisation as:
“The task of automatically sorting a set of documents into categories (or classes, or
topics) from a predefined set. The task, that falls at the crossroads of information
retrieval, machine learning, and (statistical) natural language processing, has witnessed
a booming interest in the last ten years from researchers and developers alike.”
Classification maps data into predefined groups or classes. Examples of classification
applications include image and pattern recognition, medical diagnosis, loan approval,
detecting faults in industry applications, and classifying financial trends. Until the late
1980s, knowledge engineering was the dominant paradigm in automated text
categorisation. Knowledge engineering consists of the manual definition, by domain
experts, of a set of rules which form part of a classifier. Although this approach has produced
results with accuracies as high as 90% [3], it is labour intensive and domain specific.
It has since been superseded by a new paradigm based on machine learning, which
addresses many of the limitations of knowledge engineering.
Machine learning encompasses a variety of methods that represent the convergence of
statistics, biological modelling, adaptive control theory, psychology, and artificial
intelligence (AI) [11]. Data classification by machine learning is a two-phase process
(Figure 1). The first phase involves a general inductive process that uses a
classification algorithm to automatically build a model describing a predetermined set
of non-overlapping data classes. This step is referred to as supervised learning
because the classes are determined before the data is examined; the set of data used is
known as the training data set. Data in text classification comes in the form of files, and
each file is usually referred to as a document. Classification algorithms require that the
classes are defined purely on the basis of the content of the documents. They describe these
classes by looking at the characteristics of the documents in the training set already
known to belong to the class. The learned model constitutes the classifier and can be
used to categorise future corpus samples. In the second phase, the classifier
constructed in phase one is used for classification.
The machine learning approach to text classification is less labour intensive, and is domain
independent. Since the attribution of documents to categories is based purely on the
content of the documents, effort is concentrated on constructing an automatic
builder of classifiers (also known as the learner), and not the classifier itself [3]. The
automatic builder is a tool that extracts the characteristics of the training set and
represents them as a classification model. This means that once a learner is built, new
classifiers can be automatically constructed from sets of manually classified documents.
[Figure 1 diagram: a) Training Set -> Classification Algorithm -> Classification Model;
b) Test Set and New Documents -> Classification Model]
Figure 1. a) Step one in text classification b) Step two in text classification
5.2.2 The Classifier
In general, a text classifier comprises a number of basic components. As noted in the
previous section, the text classifier begins with an inductive stage. A classifier requires
some sort of text representation of documents. In order to build an internal model, the
inductive step uses a set of examples to train the classifier. This set of
examples is known as the training set, and each document in the training set is assigned
to a class from the set C = {c1, c2, …, cn}. All the documents used in the training phase are
transformed into internal representations.
Currently, a dominant learning method in text classification is based on a vector space
model [5]. The Naïve Bayesian classifier is one example and is often used as a benchmark in text
classification experiments. Bayesian classifiers are statistical classifiers. Classification
is based on the probability that a given document belongs to a particular class. The
approach is ‘naïve’ because it assumes that the contributions of all attributes to a given
class are independent and that each contributes equally to the classification problem. By
analysing the contribution of each ‘independent’ attribute, a conditional probability is
determined. Attributes in this approach are the words that appear in the documents of
the training set.
Documents are represented by a vector with dimensions equal to the number of different
words within the documents of the training set. The value of each individual entry within
the vector is set to the frequency of the corresponding word. According to this approach,
training data are used to estimate the parameters of a probability distribution, and Bayes’
theorem is used to estimate the probability of a class. A new document is assigned to
the class that yields the highest probability. It is important to perform pre-processing to
remove frequent words such as stop words before a training set is used in the inductive
phase.
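The word-frequency approach described above can be illustrated with a minimal Python sketch. It is not the implementation used in this project or in the cited studies: the function names, the toy class labels and the use of add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """Build word-frequency counts per class.

    labelled_docs: iterable of (class_label, list_of_words) pairs.
    """
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)  # class label -> word -> frequency
    vocabulary = set()
    for label, words in labelled_docs:
        class_doc_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_doc_counts, word_counts, vocabulary


def classify(model, words):
    """Return the class with the highest (log) probability for the words."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_log_prob = None, float("-inf")
    for label, doc_count in class_doc_counts.items():
        log_prob = math.log(doc_count / total_docs)  # class prior
        class_total = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # words never seen in training are simply omitted
            # Add-one smoothing is an assumption of this sketch.
            numerator = word_counts[label][word] + 1
            log_prob += math.log(numerator / (class_total + len(vocabulary)))
        if log_prob > best_log_prob:
            best_label, best_log_prob = label, log_prob
    return best_label


# Hypothetical toy training set with two classes.
model = train_naive_bayes([
    ("spam", ["cheap", "pills", "cheap"]),
    ("ham", ["meeting", "agenda", "notes"]),
])
predicted = classify(model, ["cheap", "pills"])
```

Working in log probabilities avoids numerical underflow when documents contain many words; the class with the largest sum is the same class that would yield the highest probability.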
The Naïve Bayesian approach has several advantages. Firstly, it is easy to use;
secondly, only one scan of the training data is required. It can also easily handle missing
values by simply omitting that probability when calculating the likelihoods of membership
in each class. Although the Naïve Bayesian-based classifier is popular, it has limitations:
documents are represented as a ‘bag-of-words’, where words in the document have no
relationships with each other, whereas words that appear in a document are usually not
independent. Furthermore, the smallest unit of representation is a word.
Research continues to investigate how the design of text classifiers can be further
improved, and Pampapathi et al. [1] at Birkbeck College, London recently proposed a
new approach to the internal modelling of text classifiers. They used a well-known
data structure called a suffix tree [11], which allows the characteristics of
documents to be indexed at a more granular level, with documents represented by substrings. The
suffix tree is a compact trie containing all the suffixes of the strings represented. A trie is a
tree structure in which each node represents one character, and the root represents the
null string. Each path from the root represents a string, described by the characters
labelling the nodes traversed. All strings sharing a common prefix branch off from a
common node. When strings are words over a to z, a node has at most 26 children, one
for each letter (or 27, including a terminator). Suffix trees have traditionally been
used for complex string matching problems (for example in data
compression and DNA sequencing). Pampapathi et al.’s research is the first to apply suffix
trees to natural language text classification.
Pampapathi et al.’s method of constructing the suffix tree varies slightly from the
standard way. Firstly, the tree nodes are labelled instead of the edges, in order to
associate the frequency directly with the characters and substrings. Secondly, a special
terminal character is not used, as the focus is on the substrings and not the suffixes.
Each suffix tree has a depth, which is the maximum number of levels
in the tree. The level of a node is the number of nodes it is away from the root node. For
example, the suffix tree illustrated in Figure 2 has a depth of 4. Pampapathi et al. set a
limit on the tree depth, and each node of the suffix tree stores a character and its
frequency.
For example, constructing a suffix tree for the string S1 = “COOL” produces the tree in Figure
2. The substrings inserted are COOL, OOL, OL and L.
Root
  C (1) - O (1) - O (1) - L (1)
  O (1) +- O (1) - L (1)
        +- L (1)
  L (1)
Figure 2. Suffix Tree for String ‘COOL’
If a second string S2 = “FOOL” is inserted into the suffix tree, it will look like the diagram
illustrated in Figure 3. The substrings for S2 are FOOL, OOL, OL and L. Notice that the
last three substrings in S2 are duplicates of substrings already seen in S1,
and new nodes are not created for these repeated substrings.
Root
  F (1) - O (1) - O (1) - L (1)
  C (1) - O (1) - O (1) - L (1)
  O (2) +- O (2) - L (2)
        +- L (2)
  L (2)
Figure 3. Suffix Tree with String ‘FOOL’ Added
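The construction described above (node-labelled, no terminal character, limited depth) can be sketched in Python as follows. This is an illustration only, not the Birkbeck library: the class and method names are hypothetical, and the frequency of a node is assumed here to be the number of times its substring occurs across all inserted strings, which may differ from the counting convention used in the figures.

```python
class Node:
    """A tree node; the character it represents is its key in the parent."""

    def __init__(self):
        self.count = 0      # frequency of the substring ending at this node
        self.children = {}  # character -> Node


class SuffixTree:
    """Depth-limited, node-labelled suffix tree without a terminal character."""

    def __init__(self, max_depth):
        self.root = Node()
        self.max_depth = max_depth

    def insert(self, text):
        # Insert every suffix of the text, truncated to the depth limit.
        for start in range(len(text)):
            node = self.root
            for ch in text[start:start + self.max_depth]:
                # Reuse an existing node where the substring was seen before.
                node = node.children.setdefault(ch, Node())
                node.count += 1

    def frequency(self, substring):
        # Walk down from the root; a missing node means no match.
        node = self.root
        for ch in substring:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count


tree = SuffixTree(max_depth=4)
tree.insert("COOL")
tree.insert("FOOL")
```

After both insertions, `tree.frequency("OOL")` is 2 and `tree.frequency("COOL")` is 1, and no new nodes were created for the substrings OOL, OL and L when FOOL was added.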
As with the Naïve Bayesian method, a classifier using the suffix tree for its internal
model undergoes supervised learning from a training set which contains documents that
have been pre-classified into classes. Unlike the Naïve Bayesian approach, the suffix
tree, by capturing the characteristics of documents at the character level, does not
require pre-processing of the training set. A suffix tree is built for each class and a new
document is classified by scoring it against each of the trees. The class of the highest
scoring tree is assigned to the document. Pampapathi et al.’s study was based on email
classification, and the results of the experiment showed that a classifier employing a suffix
tree outperformed the Naïve Bayesian method.
In solving a classification problem, the classifier is one of the central
components, but as seen with the Naïve Bayesian method it is also important to perform
pre-processing on the data used for training. The next section looks at the processes
involved in text classification other than the classifier component itself.
5.3 Text Classifier Experimentations
As described in previous sections, classification is a two-step process:
1. Create a specific model by evaluating the training data. This step has as input
the training data (including the category/class labels) and as output a definition of
the model developed. The model created, which is the classifier, should classify the
training data as accurately as possible.
2. Apply the model developed by classifying new sets of documents.
In the research community, or for those interested in evaluating the performance of a
classifier, the second step can be more involved. First, the predictive accuracy of the
classifier is estimated. A simple yet popular technique is the holdout method,
which uses a test set of class-labelled samples. These samples are usually randomly
selected, and it is important that they are independent of the training samples; otherwise
the estimate could be optimistic, since the learned model is based on the training data and
therefore tends to overfit it. The accuracy of a classifier on a given test set is the
percentage of test set samples that are correctly classified by the classifier. For each
test sample the known class label is compared with the classifier’s class prediction for
that sample.
If the accuracy of the classifier model is considered acceptable, the model can be
used to classify new documents.
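As a minimal illustration of the holdout method and the accuracy measure just described, the following Python sketch splits a labelled corpus at random and computes the percentage of correctly classified test samples. The function names, the default split fraction and the fixed random seed are assumptions for the example, not part of the system described in this report.

```python
import random


def holdout_split(samples, test_fraction=0.3, seed=0):
    """Randomly partition samples into independent training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


def accuracy(classifier, test_set):
    """Percentage of (document, known_label) pairs classified correctly.

    classifier: any callable mapping a document to a predicted class label.
    """
    if not test_set:
        raise ValueError("test set must not be empty")
    correct = sum(1 for doc, label in test_set if classifier(doc) == label)
    return 100.0 * correct / len(test_set)


# Hypothetical usage with a trivial classifier that always predicts "a".
samples = [("doc%d" % i, "a") for i in range(10)]
train, test = holdout_split(samples, test_fraction=0.5)
score = accuracy(lambda doc: "a",
                 [("x", "a"), ("y", "b"), ("z", "a"), ("w", "a")])
```

Because the training and test sets returned by `holdout_split` never share a sample, the accuracy estimate is not inflated by testing on data the model has already seen.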
[Figure 4 diagram: Corpus data is split into a Training Set and a Test Set; the Training
Set is used to derive the classifier and the Test Set to estimate its accuracy]
Figure 4. Estimating Classifier Accuracy with the Holdout Method
The estimate using the holdout method is pessimistic since only a portion of the initial
data is used to derive the classifier. Another technique, called N-fold cross-validation, is
often used in research. Cross-validation is a statistical technique which can mitigate
bias caused by a particular partition of training and test set. It is also useful when the
amount of data is limited. The method can be used to evaluate and estimate the
performance of a classifier, and the aim is to obtain as honest an estimate as possible
of the classification accuracy of the system. N-fold cross-validation involves
partitioning the dataset (initial corpus) randomly into N equally sized non-overlapping
blocks/folds. The training-testing process is then run N times, each time with a different
block as the test set. For example, when N=3, we will have the following training and test sets.
The corpus is divided into Block 1, Block 2 and Block 3.
Run     Train (blocks)   Test (block)
Run 1   1, 2             3
Run 2   1, 3             2
Run 3   2, 3             1
Figure 5. 3-Fold Cross-Validation
For each cross-validation run the user will be able to use a training set to build the
classifier.
Stratified N-fold cross-validation is a recommended method for estimating classifier
accuracy due to its low bias and variance [13]. In stratified cross-validation, the folds are
stratified so that the class distribution of the samples in each fold is approximately the
same as that of the initial training set.
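Stratified N-fold cross-validation as described above can be sketched in Python as follows. This is one simple way of building stratified folds (dealing each class's shuffled documents round-robin into the folds); the function names and the fixed seed are assumptions, and the tool's own implementation may differ.

```python
import random
from collections import defaultdict


def stratified_folds(labelled_docs, n_folds, seed=0):
    """Split (document, label) pairs into n_folds non-overlapping folds,
    keeping the class distribution of each fold close to the whole set."""
    by_class = defaultdict(list)
    for doc, label in labelled_docs:
        by_class[label].append((doc, label))
    folds = [[] for _ in range(n_folds)]
    rng = random.Random(seed)
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal this class's documents round-robin across the folds.
        for i, item in enumerate(members):
            folds[i % n_folds].append(item)
    return folds


def cross_validation_runs(folds):
    """Yield (training_set, test_set) once per fold, using that fold as test."""
    for test_index, test_set in enumerate(folds):
        training_set = [item
                        for i, fold in enumerate(folds) if i != test_index
                        for item in fold]
        yield training_set, test_set


# Hypothetical corpus: six documents of class "a" and three of class "b".
corpus = [("a%d" % i, "a") for i in range(6)] + [("b%d" % i, "b") for i in range(3)]
folds = stratified_folds(corpus, n_folds=3)
runs = list(cross_validation_runs(folds))
```

With this corpus each of the three folds holds two "a" documents and one "b" document, so every run trains on six documents and tests on three.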
Preparing the training set data for classification using pre-processing can help improve
the accuracy, efficiency, and scalability of the evaluation of the classification. Methods
include stop word removal, punctuation removal, and stemming.
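Two of the pre-processing methods mentioned, punctuation removal and stop word removal, can be illustrated with a short Python sketch. The stop word list here is a tiny hypothetical sample, the function name is an assumption, and stemming is left out because it normally relies on a dedicated stemming algorithm.

```python
import string

# Illustrative stop word list only; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def preprocess(text, stop_words=STOP_WORDS):
    """Strip ASCII punctuation, lower-case the text and drop stop words."""
    no_punctuation = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punctuation.lower().split()
            if word not in stop_words]


tokens = preprocess("The cat, and the hat!")
```

Here `tokens` is `["cat", "hat"]`: the punctuation is removed first so that "cat," and "hat!" reduce to plain words before the stop word filter is applied.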
Using the above techniques to prepare the data and estimate classifier accuracy
increases the overall computational time, yet is useful for evaluating a classifier and
for selecting among several classifiers.
The current project aims to build a system which is a wrapper for a text classifier and
incorporates, as an example, the suffix tree that was used in the research done by
Pampapathi et al. The following sections describe the project in detail.
6 HIGH-LEVEL APPLICATION DESCRIPTION
6.1 Description and Rationale
The aim of this project is to build a management and visualisation tool that gives
researchers data manipulation support for underlying text classification
algorithms. The tool will provide a software infrastructure for a data mining system
based on machine learning. The goal is to build a flexible framework that allows
changes to the underlying components with relative ease. Functions may be added to
the system in the future, and adding new functionality should have minimal effect on the
current system.
The system will be built as a wrapper for the two-step process involved in classification.
First, a component will be built that automatically builds a classifier given some
training data. Second, the tool will provide the capability to perform classification and
evaluate the performance of a classifier. Additionally, the tool will provide functions to run data
sampling and various kinds of pre-processing on data.
It is incumbent on the researcher to clearly define the training set (this will be known as
the ‘resource corpus’ in this report) used for training the classifier. When the
resource corpus is small the user can choose to use the entire corpus in the study. If the
resource corpus is large, the tool gives the option to select sampling sets to represent it.
A number of sampling methodologies are implemented that allow the user to select a
sample which reflects the characteristics of the resource corpus from which it is
drawn.
Note that a resource corpus is grouped into classes, and this structure needs to be taken
into consideration when the sampling mechanism is developed. Three popular
sampling methods will be developed, although other sampling methods can be added,
such as convenience sampling, judgement sampling, quota sampling, and snowball
sampling.
Note that the user can choose to evaluate the data used to construct the classifier before
actually building the classifier. The tool will be designed to be generic enough to
analyse a corpus of any categorisation type, e.g. automated indexing of scientific articles,
email routing, spam filtering, criminal profiling, and expertise profiling.
6.1.1 Build a Classifier
The tool allows the user to build a classifier. The current framework only implements the
suffix tree-based classifier developed by Birkbeck College, but will
be flexible enough to incorporate other classification models in the future. The research
on suffix trees applied to classification is new, and there is currently no such application.
The learning process of the classifier follows the machine learning approach to
automated text classification, whereby the system automatically builds a classifier for the
categories of interest. From the graphical user interface (GUI), the user can select a
corpus to use as training data. The application provides links to .dll files developed by
Birkbeck College which allow the user to build a suffix tree from the selected corpus. The
internal data representation is constructed by generalising from a training set of pre-
classified documents. Once the classifier is built the user can load new documents into
the system to be classified.
6.1.2 Evaluate and Refine the Classifier
In research once a classifier has been built it is desirable to evaluate its effectiveness.
Even before the construction of the classifier the tool provides a platform for users to
perform a number of experiments and refinements on the source (training) data. Hence,
the second focus of the project is to provide a user-friendly front-end and a base
application for testing classification algorithms.
The user can load in a text-based corpus and perform standard pre-processing functions
to remove noise and prepare the data for experimentation. There is also a choice of
sampling methods that can be used to reduce the size of the initial corpus, making it more
manageable.
Sebastiani [2] notes that any classifier is prone to classification error, whether the
classifier is human or machine. This follows from a central notion in text classification:
the membership of a document in a class, based on the characteristics of the document
and the class, is inherently subjective, since the characteristics of both the documents
and the class cannot be formally specified. As a result, automatic text classifiers are
evaluated using a set of pre-classified documents. The accuracy of a classifier is
measured by comparing its classification decisions with the original categories the
documents were assigned to. For experimentation and evaluation purposes, this set of pre-classified
documents is split into two sets: a training set and a test set, not necessarily of equal
sizes.
The tool implements an extra level of experimentation using n-fold cross-validation.
When employing cross-validation in classification, it must be taken into account that the data
is grouped by classes; this project will therefore implement stratified cross-validation.
Once a classifier has been constructed, it is possible to perform data classification
experiments as well as other tasks such as single document analysis. For example, for
the implementation of a suffix tree-based classifier the user will be able to view the
structure of the suffix tree as well as the documents in the test sets, or load a new
document and obtain a full matrix of output data about it. The output data is persisted in
an information system which is subsequently used to perform analysis and visualisation
tasks.
6.2 Development and Technologies
Development was done in C#, using the .NET framework. The architecture of the system
was designed to be an extensible platform to enable users and developers to leverage
the existing framework for future system upgrades. The tool was built from several
components and aims to be modular. There are a number of controller components that
provide the functionality of the tool. A set of libraries provides the functionality
for the suffix tree. These libraries were provided by Birkbeck College, and the
interface to them was worked out in close collaboration with researchers there.
The suffix tree data structure is built in memory and can become very large. One
solution to better utilise resources is to have the data structure physically stored as one
tree, although it is logically represented as individual trees for each class. Further
discussion can be found in subsequent sections.
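One possible way of storing a single physical tree that still behaves as one logical tree per class is to keep a separate frequency for each class at every node. The Python sketch below illustrates that idea only; it is an assumption about how such a structure could look, not the design of the Birkbeck libraries, and the class names and the "spam"/"ham" labels are hypothetical.

```python
from collections import Counter


class SharedNode:
    """A node shared by all classes, holding one frequency per class."""

    def __init__(self):
        self.class_counts = Counter()  # class label -> frequency
        self.children = {}             # character -> SharedNode


class SharedSuffixTree:
    """One physical depth-limited tree representing a logical tree per class."""

    def __init__(self, max_depth):
        self.root = SharedNode()
        self.max_depth = max_depth

    def insert(self, text, class_label):
        # Insert every suffix, counting it only against the given class.
        for start in range(len(text)):
            node = self.root
            for ch in text[start:start + self.max_depth]:
                node = node.children.setdefault(ch, SharedNode())
                node.class_counts[class_label] += 1

    def frequency(self, substring, class_label):
        # Look up the substring as if only that class's tree existed.
        node = self.root
        for ch in substring:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.class_counts[class_label]


shared = SharedSuffixTree(max_depth=4)
shared.insert("COOL", "spam")
shared.insert("FOOL", "ham")
```

Nodes for substrings common to several classes (here OOL, OL and L) are stored once, which is where the saving in memory would come from, while `frequency` still answers queries per class.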
A Windows application was built as the client. This forms the interface that the user
interacts with to gain access to the functionality of the tool. The output data is cached
in a database.
The main target users of the tool are researchers in the natural language text
classification research community, and other users who want to mine textual data.
7 DESIGN
7.1 Functional Requirements
Requirements for the application were collected from research on natural language text
classification and discussions with target users in the research community.
Requirements are the capabilities and conditions to which the application must conform.
The functional requirements of the system are captured using ‘use cases’. Use cases
are a useful tool for describing how a user interacts with a system. They are written
stories, easy to understand, that describe the interaction between the system and the
user. Requirements can often change over the course of development, and for
this reason there was no attempt to define and freeze all requirements from the outset of
the project. The following use cases were produced. Note that some use cases were added
during the development of the system.
Use Case Name: Load Directory as Source Corpus
Primary Actor: User
Pre-conditions: The application is running
Post-conditions: A source corpus is loaded into the application
Main Success Scenarios:
Actor Action (or Intention):
1. The user selects a valid directory, to which they have at least read access, and loads it as a corpus into the system
System Responsibility:
2. The system checks the directory path for validity and access
3. Builds a tree structure of classes based on the sub-folders in the directory and displays the classes in the GUI
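The system responsibilities in this use case (checking the directory and deriving classes from its sub-folders) can be sketched in Python as follows. The function name and the returned dictionary layout are assumptions for illustration; the actual system is a C# application with its own classes.

```python
import os


def load_corpus(directory):
    """Return {class_name: [document file names]} for a corpus directory.

    Each immediate sub-folder is treated as a class and each file inside it
    as a document belonging to that class.
    """
    # Check path validity and read access before loading anything.
    if not os.path.isdir(directory) or not os.access(directory, os.R_OK):
        raise ValueError("not a readable directory: %s" % directory)
    corpus = {}
    for entry in sorted(os.listdir(directory)):
        class_path = os.path.join(directory, entry)
        if os.path.isdir(class_path):
            corpus[entry] = sorted(
                name for name in os.listdir(class_path)
                if os.path.isfile(os.path.join(class_path, name)))
    return corpus
```

Files placed directly in the top-level directory are ignored in this sketch, since only sub-folders are treated as classes.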
Use Case Name: View a Document in Corpus
Primary Actor: User
Pre-conditions: A corpus is successfully loaded
Post-conditions:
Main Success Scenarios:
Actor Action (or Intention):
1. Select the document to view
System Responsibility:
2. Display content of document in the GUI
Use Case Name: Create Sampling Set
Primary Actor: User
Pre-conditions: A source corpus is successfully loaded
Post-conditions: A sampling set based on the source corpus is created. A new file directory is created for the corpus.
Main Success Scenarios:
Actor Action (or Intention):
1. User selects how they want to select the sampling set
2. User specifies location to store the documents/files created for the sampling set
System Responsibility:
3. Creates a sampling set based on parameters given by the user
4. Creates the directory structure and documents/files in the location specified by the user
5. Displays new corpus created in the GUI
Use Case Name: Run Pre-Processing
Primary Actor: User
Pre-conditions: A training set exists in the system
Post-conditions: A new pre-processed sampling set is created. A new file directory is created for the corpus.
Main Success Scenarios:
Actor Action (or Intention):
1. Select type of pre-processing to perform
2. User specifies location to store the documents/files created for the pre-processing set
3. Run pre-processing
System Responsibility:
4. Performs pre-processing
5. Creates a new pre-processed set
6. Stores the directory structure and documents/files at the location specified by the user
7. Displays the corpus as a directory structure in the GUI
Use Case Name: Run N-Fold Cross-Validation
Primary Actor: User
Pre-conditions: A sampling set is successfully created
Post-conditions: N-fold cross-validation set is created virtually
Main Success Scenarios:
Actor Action (or Intention):
1. User selects sampling set to process and the number of folds
System Responsibility:
2. Builds n-fold cross-validation set based on parameters given by the user, which includes the n runs, each run containing a training set and a test set
3. Displays new cross-validation set created in the GUI
Use Case Name: Create Classifier (Suffix Tree)
Primary Actor: User
Pre-conditions: A cross-validation set or classification set exists
Post-conditions: Classifier created in memory
Main Success Scenarios:
Actor Action (or Intention):
1. User activates an event to build a classifier for a cross-validation set or classification set
2. User chooses any additional conditions to apply
System Responsibility:
3. Builds classifier in memory, based on the corpus set selected
4. Indicates in the GUI that the classifier of the corpus has been created
Use Case Name: Score Documents
Primary Actor: User
Pre-conditions: An n-fold cross-validation set is created. Classifier for the corpus set is created
Post-conditions: Documents in the cross-validation set are scored and the data is stored in the database
Main Success Scenarios:
Actor Action (or Intention):
1. User selects the cross-validation run to score
System Responsibility:
2. Scores all documents under the selected corpus set
3. Inserts score data into database
Use Case Name: Classify Documents
Primary Actor: User
Pre-conditions: An n-fold cross-validation set is created. Classifier for the set is created and the documents have been scored
Post-conditions: Misclassified documents in the cross-validation set are flagged
Main Success Scenarios:
Actor Action (or Intention):
1. User selects the cross-validation run to classify
System Responsibility:
2. Classifies all documents under the selected cross-validation set
3. Flags all misclassified documents in the GUI
Use Case Name: Create Classification Set
Primary Actor: User
Pre-conditions: A source corpus is successfully loaded
Post-conditions: A classification set is created virtually
Main Success Scenarios:
Actor Action (or Intention):
1. User selects the corpus set they want to use to create a classifier
System Responsibility:
2. Displays new corpus created in the GUI as a classification corpus set
Use Case Name: Load New Document to Classify
Primary Actor: User
Preconditions: A cross-validation set or classification set exists
Postconditions: Substring matches and related output data are stored in the database
Main Success Scenarios:
Actor Action (or Intention):
1. User decides which suffix tree to use for classification and loads in a valid textual document as an item to be classified and analysed
System Responsibility:
2. Displays the document name and relevant information in the GUI, ready to be analysed
3. Scores and classifies the document
4. Stores output data in the database
Use Case Name: View a Document
Primary Actor: User
Preconditions: Document loaded into the system
Postconditions:
Main Success Scenarios:
Actor Action (or Intention):
1. User selects the document to view
System Responsibility:
2. Displays the content of the document in the GUI
Use Case Name: View n-Gram Matches in Document
Primary Actor: User
Preconditions: The document concerned is successfully loaded and the suffix tree classifier is created
Postconditions:
Main Success Scenarios:
Actor Action (or Intention):
1. User selects a string/substring in a document to match
System Responsibility:
2. Queries the classifier to retrieve the n-length substring matches
3. Displays to the user the frequency of the string/substring selected
Use Case Name: View Statistics on Matches
Primary Actor: User
Preconditions: Document successfully loaded and scored, and output exists in the database
Postconditions: Information is displayed in the GUI
Main Success Scenarios:
Actor Action (or Intention):
1. User selects to view output
System Responsibility:
2. Queries and retrieves the relevant data in the database
3. Displays the output in table form in the GUI
Use Case Name: Visualise Representation of Classifier (View Suffix Tree)
Primary Actor: User
Preconditions: Classifier was successfully built
Postconditions: Visual representation of the classifier is displayed in the GUI
Main Success Scenarios:
Actor Action (or Intention):
1. User selects the option to display the suffix tree
System Responsibility:
2. Builds a visual representation of the classifier and displays it in the GUI
Use Case Name: Delete Classifier
Primary Actor: User
Preconditions: Classifier was successfully built
Postconditions: Classifier is deleted
Main Success Scenarios:
Actor Action (or Intention):
1. User selects the classifier to delete
System Responsibility:
2. Removes the classifier and clears the displayed tree in the GUI
7.2 Non-Functional Requirements
The non-functional requirements for the use cases are as follows.
7.2.1 Usability
The user should have a single main user interface to interact with the system. The
user interface should be user friendly, and the complexity of computation, e.g. building an
n-fold cross-validation set or scoring documents against a classification model, should be
hidden from the user.
An experimental run of the suffix tree classifier could involve as many as 126 scoring
configurations, all of which could together take some considerable time to calculate. It
therefore makes sense to keep a store of all calculated scores, rather than calculate
them on-the-fly whenever they are requested. The results will be cached in a data store,
which is implemented as a database in this project, thereby improving system
responsiveness.
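The caching idea can be sketched as follows. This is a minimal Python illustration rather than the project's .NET code: ScoreCache, toy_score and the key layout are hypothetical names, and an in-memory dictionary stands in for the database.

```python
from typing import Callable, Dict, Tuple

# Hypothetical cache key: (document path, class name, configuration id).
ScoreKey = Tuple[str, str, int]


class ScoreCache:
    """Compute each score at most once and serve later requests from the store."""

    def __init__(self, compute_score: Callable[[str, str, int], float]) -> None:
        self._compute_score = compute_score
        self._store: Dict[ScoreKey, float] = {}
        self.computations = 0  # how many times the expensive function ran

    def get(self, document: str, class_name: str, config_id: int) -> float:
        key = (document, class_name, config_id)
        if key not in self._store:
            # Cache miss: calculate once and remember the result.
            self._store[key] = self._compute_score(document, class_name, config_id)
            self.computations += 1
        return self._store[key]


def toy_score(document: str, class_name: str, config_id: int) -> float:
    # Stand-in for an expensive scoring call; not a real scoring function.
    return float(len(document) + len(class_name) + config_id)


cache = ScoreCache(toy_score)
first = cache.get("doc1.txt", "spam", 1)
second = cache.get("doc1.txt", "spam", 1)  # served from the cache
```

The second request returns the stored value without calling the scoring function again, which is the behaviour the requirement asks for.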
Some system requests can only be activated once a pre-condition has been satisfied
e.g. the user can only score documents when the suffix tree has been created. The
system should give informative warning messages if the user attempts to perform a task
without pre-conditions being satisfied. Where appropriate, upon a task being performed,
the system may automatically carry out pre-conditions before performing the requested
task.
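A pre-condition check of this kind might look like the sketch below. It is written in Python for illustration only; Session, PreconditionError and the auto_build flag are assumed names, not part of the project.

```python
class PreconditionError(Exception):
    """Raised when a task is requested before its pre-condition holds."""


class Session:
    def __init__(self) -> None:
        self.suffix_tree_built = False
        self.scored = False

    def build_suffix_tree(self) -> None:
        self.suffix_tree_built = True

    def score_documents(self, auto_build: bool = False) -> None:
        if not self.suffix_tree_built:
            if auto_build:
                # Carry out the pre-condition on the user's behalf.
                self.build_suffix_tree()
            else:
                # Otherwise give an informative warning.
                raise PreconditionError(
                    "Create the suffix tree before scoring documents."
                )
        self.scored = True


session = Session()
try:
    session.score_documents()
    warning = ""
except PreconditionError as exc:
    warning = str(exc)  # the text a GUI would show to the user

session.score_documents(auto_build=True)
```

The first call fails with a message the interface can display; the second call satisfies the pre-condition automatically before scoring.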
7.2.2 Hardware and Software Constraint
The application should be easily extensible and scalable. Developers should be able to
add both extra functionality and expand the workload the application can handle with
relative ease.
The design should consider the future enhancement of the system and should be
reasonably easy to maintain and upgrade. Code should also be well documented.
The system should use an RDBMS to manage its data layer, but be independent of the
RDBMS it uses to manage its data.
7.2.3 Documentation
Help menus and tool tips will be available to help users interact with the system. The
application will also come with a user manual, including screen shots. The application
will be available along with written documentation for its installation and configuration.
7.3 System Framework
It was decided to build the system with a number of components. Each component has
a specialised function in the system. Figure 6 illustrates the main components and the
system boundary. The next section will describe the functions of each component in
more detail and section 7.5 contains the class diagram. By isolating system
responsibilities the following main components were identified.
User interface
Display Manager
Classifier (Central Manager, STClassifier Manager, STClassifier)
Sampling Set Generator
Pre-processor
Cross-validation
Results Manager (Database Manager, OLEDB, Database)
Figure 7 shows how the system is divided into a client/server architecture. The
advantage of this setup is its ease of maintenance, as the server implementation can be
abstracted from the client. All the functionalities of the system are accessed through
the graphical user interface (GUI). The implementation is in the server, isolating users
from system complexities that are not relevant to them.
One of the main aims of the design of the system was to create a flexible framework.
The green boxes seen in Figure 8 represent new or alternative components that
can be added to the system in the future with relative ease.
[Diagram showing the System Boundary and the components: Input Data, Graphical User Interface, DisplayManager, Central Manager, Results Manager, Database Manager, OLEDB, Database, STClassifier Manager, STClassifier, Sampling Set Generator, Pre-processor, Cross-Validation, Utility, Random, Stemmer.]
Figure 6. System Components and Boundary
[Diagram showing the same components divided into Client (Graphical User Interface, Input Data) and Server (DisplayManager, Central Manager, Results Manager, Database Manager, OLEDB, Database, STClassifier Manager, STClassifier, Sampling Set Generator, Pre-processor, Cross-Validation, Utility, Random, Stemmer).]
Figure 7. Client Server Division
[Diagram showing the components of Figure 6 with additional "Others..." boxes marking where new or alternative components can be added.]
Figure 8. Additional or Alternative Components
7.4 Components in Detail
7.4.1 The Client - User Interface
The user interacts with the system via a single graphical user interface which is also the
client. In this project the client is implemented as a set of Windows forms and controls in
.NET. There is one main form where users can access all the functionalities of the
system. There are a number of other dialog boxes and forms to help with the navigation
and interaction with the system. For example there is a Select Scoring Method form,
used to request from the user the scoring methodology to use when scoring a new
document. Other more generic forms such as the Select Dialog form are employed for a
number of uses and do not display specific types of information (see section 10
Implementation Specifics for further discussion).
The client is simply an event handler for each of the GUI controls that calls the Central
Manager via the Display Manager for actual data processing. The GUI contains no
implementation, but delegates to the Display Manager, thus decoupling the interface
from the implementation. There is two-way communication between the client and the
Display Manager, whereby a user invokes an event and related messages are passed to
the Display Manager. The Display Manager passes the messages to the Central
Manager, which subsequently either delegates to other more specialised controllers to
handle the task, or resolves the request itself.
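The delegation chain described above can be sketched as follows. The class names follow the component names in this section, but the method names (request, handle, on_preprocess_click), the request strings and the lower-casing stand-in for pre-processing are assumptions made for this Python illustration.

```python
class Preprocessor:
    def run(self, text: str) -> str:
        # Placeholder pre-processing: lower-case the text.
        return text.lower()


class CentralManager:
    def __init__(self) -> None:
        self._preprocessor = Preprocessor()

    def handle(self, request: str, payload: str) -> str:
        if request == "preprocess":
            # Delegate to a more specialised controller.
            return self._preprocessor.run(payload)
        if request == "echo":
            # Resolve the request itself.
            return payload
        raise ValueError(f"Unknown request: {request}")


class DisplayManager:
    def __init__(self, central: CentralManager) -> None:
        self._central = central

    def request(self, name: str, payload: str) -> str:
        # Pass the message down and hand the result back for display.
        return self._central.handle(name, payload)


class MainForm:
    """Event handlers only; no processing logic lives in the client."""

    def __init__(self, display: DisplayManager) -> None:
        self._display = display
        self.shown = ""

    def on_preprocess_click(self, text: str) -> None:
        self.shown = self._display.request("preprocess", text)


form = MainForm(DisplayManager(CentralManager()))
form.on_preprocess_click("Hello World")
```

Because MainForm only knows about the Display Manager, a different client (for example a command line) could reuse the same two lower layers unchanged.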
The screens were designed in consultation with potential users. The user should
be able to perform all the tasks described by the use cases seen earlier in the Functional
Requirements section (the functions will not be reiterated here).
For this project Windows forms were chosen for the implementation because most users
are familiar with the Windows form interface. This creates a familiar interface on initial
interaction with the system and facilitates use of the system. In particular, the .NET
framework provides a wealth of controls and functionalities, which help to build a
user-friendly interface and hide the complexity of the underlying workings from the user. The
different components are built as separate classes, so the user interface or client
can be implemented using a different methodology from Windows forms, such as a
command line, as illustrated in Figure 9.
[Diagram showing the Graphical User Interface with the Select Dialog and Select Scoring Method forms, a Command Line client, Input Data and the Display Manager.]
Figure 9. Client interface and Its Collaborating Components
7.4.2 Display Manager
The Display Manager is a layer between the User Interface and the Central Manager
and the rest of the system. It essentially passes messages between these two
components. The Display Manager is responsible for the information displayed back to the
user and it also manages the input data.
[Diagram showing the DisplayManager with the Graphical User Interface, Others..., Input Data and the Central Manager.]
7.4.3 The Classifier
It was mentioned in the previous section that the Central Manager is part of the
classifier. Figure 10 illustrates the classifier, which is enclosed by the red box, and its
connecting components. The classifier comprises the Central Manager, a controller
that manages the underlying model of the classifier, and the underlying model itself. The
Central Manager is a controller that handles the communication between all the main
components in the system that communicate with the classifier. The Central
Manager should provide the following functionalities:
Select Sampling Set for a corpus
Pre-process all documents in a corpus
Run cross-validation on a corpus
Create a classifier for a given corpus
Score all documents in a corpus
Classify all documents in a corpus
Obtain classification results for a corpus
There are further controller classes called by the Central Manager to provide more
specialised functionalities. These are the Output Manager, Suffix Tree Manager,
Sampling Set Generator, Pre-processor, and Cross-validation.
When a user loads a corpus into the system it is managed by the Central Manager. If
there is a request to create a sampling set, for example, the Central Manager should
know where the corpus is located and delegate to the Sampling Set Generator the task of
creating a sampling set based on parameters set by the user. Similarly, a request from
the user to perform pre-processing on the corpus is delegated by the Central Manager
to the Pre-processor.
The various components are designed to have specialised tasks. They do not need to
know where the data is located, as this information is passed to the components when
the Central Manager invokes a request. The Sampling Set Generator does not need to
know how the Pre-processor carries out its task, nor does it need to know about the
Cross-validation component. The three components receive data and requests from the
Central Manager, perform their tasks and return any information back to the Central
Manager.
The classifier has to be connected to an internal model. In this project the suffix tree
data structure is employed to model the representation of document characteristics. As
seen in Figure 10, the classifier can be implemented with different types of models, such
as a Naïve Bayesian classifier or a neural network. There is two-way communication between
the Central Manager and the STClassifier via the STClassifier Manager. The
STClassifier is a DLL built by Birkbeck research. It provides public interfaces to:
Build the representation of documents using the suffix tree data structure
Train the classifier
Score a document
Return classification results
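To make the underlying data structure concrete, the sketch below builds a depth-limited, character-level suffix trie that records how often each substring occurs in the training text of each class. It is a simplified Python illustration and not the STClassifier DLL: the class and method names are assumptions, and it only counts frequencies; the scoring and normalisation functions are not reproduced.

```python
from typing import Dict


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.freq: Dict[str, int] = {}  # class name -> occurrence count


class SuffixTrie:
    """Character-level suffix trie truncated at a fixed depth."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.root = Node()

    def add_to_class(self, text: str, class_name: str) -> None:
        # Insert every suffix of the text, cut off after `depth` characters.
        for start in range(len(text)):
            node = self.root
            for ch in text[start:start + self.depth]:
                node = node.children.setdefault(ch, Node())
                node.freq[class_name] = node.freq.get(class_name, 0) + 1

    def frequency(self, substring: str, class_name: str) -> int:
        # Occurrences of `substring` in the training text of `class_name`.
        if not substring or len(substring) > self.depth:
            return 0
        node = self.root
        for ch in substring:
            if ch not in node.children:
                return 0
            node = node.children[ch]
        return node.freq.get(class_name, 0)


trie = SuffixTrie(depth=3)
trie.add_to_class("banana", "fruit")
trie.add_to_class("bandana", "clothing")
an_in_fruit = trie.frequency("an", "fruit")
ana_in_clothing = trie.frequency("ana", "clothing")
```

Looking up a substring walks one node per character, so a frequency query costs time proportional to the length of the substring rather than the size of the training set.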
The STClassifier Manager controls the flow of messages between the Central Manager
and the STClassifier. Its responsibilities involve converting data to the format that is
accepted by the STClassifier, and converting the output that the STClassifier
passes back to the STClassifier Manager. It is essentially a wrapper class for the
STClassifier.
The suffix tree is built using the contents of documents in a training set. Once a suffix
tree is built it will be cached in an ArrayList that is managed by the STClassifier
Manager. An ArrayList is a collection class provided by .NET. The suffix tree
remains stored in memory until the user activates an event to delete the suffix tree. As a
result the system does not need to create a suffix tree for every subsequent action that
references it. Hence, only methods in the STClassifier Manager are called and it is not
necessary to call methods in the STClassifier.
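The caching behaviour can be sketched as below. The method names echo Create, Contains, Remove and GetModel from the class diagram, but this is a Python illustration under assumptions: a dictionary keyed by name stands in for the .NET collection, and the build function is a placeholder for constructing a real suffix tree.

```python
from typing import Callable, Dict, List


class ClassifierCache:
    """Keeps built models in memory until they are explicitly removed."""

    def __init__(self, build: Callable[[List[str]], object]) -> None:
        self._build = build
        self._models: Dict[str, object] = {}
        self.builds = 0  # number of times a model was actually built

    def create(self, key: str, training_docs: List[str]) -> bool:
        if key in self._models:
            return False  # already cached; nothing is rebuilt
        self._models[key] = self._build(training_docs)
        self.builds += 1
        return True

    def contains(self, key: str) -> bool:
        return key in self._models

    def get_model(self, key: str) -> object:
        return self._models[key]

    def remove(self, key: str) -> None:
        self._models.pop(key, None)


def toy_build(docs: List[str]) -> object:
    # Placeholder "model": the set of distinct words in the training set.
    return set(" ".join(docs).split())


models = ClassifierCache(toy_build)
created = models.create("run1", ["a b", "b c"])
created_again = models.create("run1", ["a b", "b c"])
```

The second create call is a no-op, so repeated actions that reference the same training set reuse the cached model until remove is called.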
The classifier generates output data when a request is invoked to classify and score
documents. These two actions can be time-consuming activities. The Central
Manager decides what type of output data needs to be saved and passes the data from
the classifier to the Results Manager to handle. Section 7.4.6 describes the design
of the Results Manager (see Figure 13).
[Diagram showing the Classifier box containing the Central Manager, the NBClassifier Manager, NNClassifier Manager and STClassifier Manager, and the NBClassifier, NNClassifier and STClassifier models, together with the Graphical User Interface, Command Line, Display Manager, Results Manager, Sampling Set Generator, Pre-processor and Cross-Validation.]
Figure 10. The Classifier and Its Collaborating Components
7.4.4 Data Manipulation and Cleansing
When a corpus is loaded into the system as input data, the user can create sampling
sets from the initial corpus and also prepare the data for experimentation by performing
various types of pre-processing on the data. The input data is given to the classifier,
which sends it to the Sampling Set Generator to handle the generation of sampling sets.
Various sampling methodologies can be plugged into the Sampling Set Generator. For
this project the system will implement random sampling and systematic sampling
methodologies. The Pre-processor provides the functionality for pre-processing data
passed to it. Similarly, various methods of pre-processing can be plugged into the
system with relative ease. Currently, the system provides stemming, stop word removal,
and punctuation removal.
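The two sampling methodologies can be illustrated with the short Python sketch below. The function names, the seed parameter and the fixed-interval rule are assumptions for illustration; the project's own sampling code is not shown in this report.

```python
import random
from typing import List, Sequence


def random_sample(items: Sequence[str], size: int, seed: int = 0) -> List[str]:
    # Draw `size` distinct items uniformly at random (seeded for repeatability).
    return random.Random(seed).sample(list(items), size)


def systematic_sample(items: Sequence[str], size: int, start: int = 0) -> List[str]:
    # Take every k-th item, where k is the sampling interval.
    if not 1 <= size <= len(items):
        raise ValueError("size must be between 1 and len(items)")
    step = len(items) // size
    if not 0 <= start < step:
        raise ValueError("start must lie within the first sampling interval")
    return [items[start + i * step] for i in range(size)]


docs = [f"doc{i}.txt" for i in range(10)]
systematic = systematic_sample(docs, 5)
randomised = random_sample(docs, 3)
```

Systematic sampling is deterministic once the start offset is fixed, whereas random sampling depends on the seed.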
In order to plug into the system, a method class must implement an
IMethod interface, which guarantees the following:
A method class must have a name property that returns the name of the
method. This is necessary so that any new method added to the system
can be identified by its name.
A method class must have a Run method. This method is where all the
work is done.
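The two guarantees listed above could be expressed as in the following sketch. It is a Python rendering of the idea, not the project's C# interface: the registry methods (register, method_names) and the two example methods are assumptions chosen for illustration.

```python
import string
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List


class IMethod(ABC):
    @property
    @abstractmethod
    def name(self) -> str:
        """Name used to identify the method in the system."""

    @abstractmethod
    def run(self, text: str) -> str:
        """Do the actual work on the given text."""


class PunctuationRemoval(IMethod):
    @property
    def name(self) -> str:
        return "Punctuation Removal"

    def run(self, text: str) -> str:
        # Delete every ASCII punctuation character.
        return text.translate(str.maketrans("", "", string.punctuation))


class StopWordRemoval(IMethod):
    def __init__(self, stop_words: Iterable[str]) -> None:
        self._stop_words = {w.lower() for w in stop_words}

    @property
    def name(self) -> str:
        return "Stop Word Removal"

    def run(self, text: str) -> str:
        kept = [w for w in text.split() if w.lower() not in self._stop_words]
        return " ".join(kept)


class Preprocessor:
    """Registry that looks methods up by name, so new ones can be plugged in."""

    def __init__(self) -> None:
        self._methods: Dict[str, IMethod] = {}

    def register(self, method: IMethod) -> None:
        self._methods[method.name] = method

    def method_names(self) -> List[str]:
        return sorted(self._methods)

    def run(self, text: str, method_name: str) -> str:
        return self._methods[method_name].run(text)


pre = Preprocessor()
pre.register(PunctuationRemoval())
pre.register(StopWordRemoval(["the", "and"]))
cleaned = pre.run("Hello, world!", "Punctuation Removal")
```

Adding a further method only requires a new class that implements the interface and one register call; the Preprocessor itself does not change.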
A set of utility classes will provide helper functionalities such as random number
generator, common divisor, and file system.
[Diagram showing the Central Manager, the Sampling Set Generator with Systematic, Random and Snowball sampling, the Pre-processor with Stemmer, Stop Word Removal, Punctuation Removal and Others.., and Utility.]
Figure 11. Data Manipulation and Cleansing Components and Its Collaborating Components
7.4.5 Experimentation
Setting up data for experimentation is the main responsibility of the Cross-validation class.
The Central Manager passes a corpus to the Cross-validation component, which uses
the data to build N-fold cross-validation sets. It divides the given corpus into N
blocks and builds a training set and test set for each of the N runs. The data is stored as an
array that is passed back to the Central Manager.
The methods the Cross-Validation class is expected to perform are:
Set the number of folds (N)
Run N-fold cross-validation on a given source data
Return the cross-validation sets in an array data structure
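A minimal Python sketch of building the N runs is given below. The function name and the round-robin assignment of documents to blocks are assumptions; the bounds of 2 and 10 folds follow the minFold and maxFold values listed for the CrossValidation class in the class diagram.

```python
from typing import List, Sequence, Tuple

MIN_FOLDS, MAX_FOLDS = 2, 10  # bounds as listed in the class diagram

Run = Tuple[List[str], List[str]]  # (training set, test set)


def cross_validation_sets(docs: Sequence[str], folds: int) -> List[Run]:
    if not MIN_FOLDS <= folds <= MAX_FOLDS:
        raise ValueError(f"folds must be between {MIN_FOLDS} and {MAX_FOLDS}")
    if folds > len(docs):
        raise ValueError("cannot have more folds than documents")
    # Deal documents into N blocks round-robin, so sizes differ by at most one.
    blocks = [list(docs[i::folds]) for i in range(folds)]
    runs: List[Run] = []
    for test_idx in range(folds):
        # One block is the test set; the remaining blocks form the training set.
        test = blocks[test_idx]
        train = [d for i, block in enumerate(blocks) if i != test_idx for d in block]
        runs.append((train, test))
    return runs


corpus = [f"d{i}" for i in range(6)]
runs = cross_validation_sets(corpus, 3)
```

Every document appears in exactly one test set across the N runs, and never in the training set of the run that tests it.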
[Diagram showing the Central Manager and Cross-Validation.]
Figure 12. Cross-validation and Its Collaborating Components
7.4.6 Results Manager
The Results Manager handles the output of the classifier and the repository of the
output. The underlying RDBMS of this project is an Access database, which is used to
cache the data generated by the classifier. The OLEDB component is responsible for
direct communication with the database. This class needs to provide the basic
database functionalities such as read, write and delete in a generic fashion. All
communication with the OLEDB library, and all data flow to and from the Results
Manager, goes through the Database Manager object, which manages the OLEDB component.
The green boxes in Figure 13 illustrate that the data store for the system does not
necessarily have to be an Access database. The system is designed to be able to store
the data using a different means with relative ease, e.g. XML files, SQL Server etc.
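The idea of swapping the storage back end can be sketched as follows. This is a Python illustration under assumptions: ResultStore, SqliteStore and XmlStore are hypothetical names, an in-memory SQLite database stands in for the Access database reached through OLEDB, and an in-memory XML tree stands in for XML files.

```python
import sqlite3
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from typing import Optional


class ResultStore(ABC):
    """Generic read/write/delete operations the Results Manager relies on."""

    @abstractmethod
    def write(self, doc: str, score: float) -> None: ...

    @abstractmethod
    def read(self, doc: str) -> Optional[float]: ...

    @abstractmethod
    def delete(self, doc: str) -> None: ...


class SqliteStore(ResultStore):
    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE scores (doc TEXT PRIMARY KEY, score REAL)")

    def write(self, doc: str, score: float) -> None:
        self._conn.execute("INSERT OR REPLACE INTO scores VALUES (?, ?)", (doc, score))

    def read(self, doc: str) -> Optional[float]:
        row = self._conn.execute(
            "SELECT score FROM scores WHERE doc = ?", (doc,)
        ).fetchone()
        return row[0] if row else None

    def delete(self, doc: str) -> None:
        self._conn.execute("DELETE FROM scores WHERE doc = ?", (doc,))


class XmlStore(ResultStore):
    def __init__(self) -> None:
        self._root = ET.Element("scores")

    def _find(self, doc: str) -> Optional[ET.Element]:
        for element in self._root:
            if element.get("doc") == doc:
                return element
        return None

    def write(self, doc: str, score: float) -> None:
        element = self._find(doc)
        if element is None:
            element = ET.SubElement(self._root, "score", {"doc": doc})
        element.text = repr(score)

    def read(self, doc: str) -> Optional[float]:
        element = self._find(doc)
        return float(element.text) if element is not None else None

    def delete(self, doc: str) -> None:
        element = self._find(doc)
        if element is not None:
            self._root.remove(element)


class ResultsManager:
    def __init__(self, store: ResultStore) -> None:
        self._store = store  # any back end that implements ResultStore

    def save(self, doc: str, score: float) -> None:
        self._store.write(doc, score)

    def load(self, doc: str) -> Optional[float]:
        return self._store.read(doc)

    def discard(self, doc: str) -> None:
        self._store.delete(doc)
```

The Results Manager depends only on the abstract interface, so either store can be passed to its constructor without changing any other component.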
[Diagram showing the Central Manager and the Results Manager, with the Database Manager, OLEDB and Database, and the alternative XML File Manager, XML and XML File(s).]
Figure 13. Results Manager and Its Collaborating Components
7.4.7 Error Handling
Adequate error handling for an end-user application is essential. Displays of warnings
and errors should be handled in the higher level of the system, namely by the Display
Manager, and then displayed to the user in a reasonable fashion. Errors that occur in the
other classes should be propagated to the Display Manager. All classes apart from the
User Interface and the Display Manager are expected to implement an IErrorRecord
interface. A class that implements this interface guarantees that it has a property
called error, which returns the error message.
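The error-propagation convention might be sketched as below. The interface name and the error property come from the text; everything else (the report method, the message wording, the fold check used as an example failure) is an assumption made for this Python illustration.

```python
from abc import ABC, abstractmethod
from typing import List


class IErrorRecord(ABC):
    @property
    @abstractmethod
    def error(self) -> str:
        """Last error message, or an empty string if none occurred."""


class CrossValidation(IErrorRecord):
    def __init__(self) -> None:
        self._error = ""

    @property
    def error(self) -> str:
        return self._error

    def run(self, folds: int) -> bool:
        self._error = ""  # clear any previous error
        if folds < 2:
            # Record the problem instead of showing it to the user directly.
            self._error = "Number of folds must be at least 2."
            return False
        return True


class DisplayManager:
    def __init__(self) -> None:
        self.messages: List[str] = []

    def report(self, component: IErrorRecord) -> None:
        # Errors from lower-level classes surface here, in one place.
        if component.error:
            self.messages.append(f"Error: {component.error}")


display = DisplayManager()
validator = CrossValidation()
ok = validator.run(1)
display.report(validator)
```

Lower-level classes never talk to the user; they only record a message, and the Display Manager decides how to present it.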
7.5 Class Diagram
Figure 14 shows a class diagram of the main components of the system discussed above.
Class: MainForm
-tvExplorer
-sTreeView
-rtxtView
-rtxtInfo
-mItemAddRCorpus_Click(in sender : object, in e)
-mitemSelectSampling_Click(in sender : object, in e)
-mitemPreprocess_Click(in sender : object, in e)
-mitemCrossValidation_Click(in sender : object, in e)
-CreateSTree_Click(in sender : object, in e)
-DeleteSTree_Click(in sender : object, in e)
-DisplaySuffixTree_Click(in sender : object, in e)
-AddNewDoc_Click(in sender : object, in e)
-AddClassificationSet_Click(in sender : object, in e)
-ScoreAllDoc_Click(in sender : object, in e)
-ClassifyAllDocs_Click(in sender : object, in e)
Class: Controllers::DisplayManager
-nodeMgr : TreeViewNodeManager
-classifier : CentralManager
-dbProvider : string
-dbUserId : string
-dbPassword : string
-dbName : string
-dbAccessMode : string
+AddNode(in destNode : TreeNode, in nodeNames : string[], in imageIdx : TreeImages, in selectedImageIdx : TreeImages)
+FindNode(in selectedNode : TreeNode, in nodeName : string) : TreeNode
+DisplayBlank()
+DisplayFile(in filePathname : string)
+SelectSampleCorpus(in defaultCorpus : string, in sourceNode : TreeNode, in destNode : TreeNode)
+AddNewClassificationSet(in treeStructure : TreeView, in sourceNode : TreeNode, in destRoot : string)
+PerformPreprocessing(in defaultCorpus : string, in sourceNode : TreeNode, in destNode : TreeNode)
-PerformCrossValidation(in defaultCorpus : string, in sourceNode : TreeNode, in destNode : TreeNode)
+SetupSTree(in defaultCorpus : string, in sourceFilesNode : TreeNode, in STreeNode : TreeNode)
+DisplayScoresByDoc(in displayView : ListView, in sourceNode : TreeNode, in filepath : string)
+ScoreAllDocuments(in sourceDataNode : TreeNode, in sTreeNodeName : string)
+ClassifyAllDocuments(in sourceDataNode : TreeNode, in sTreeNodeName : string)
+FlagMisClassifiedDocuments(in sourceNodePath : string, in sourceDataNode : TreeNode, in sf : int, in mn : int, in tn : int)
+DeleteScores(in parentPath : string)
+DeleteSTree(in STreeNode : TreeNode)
+DisplaySTree(in displayTxt : Label, in diplayView : TreeView, in defaultCorpus : string, in dataSource : TreeNode, in STreeNode : TreeNode)
+GetMatchInfo(in text : string, in STreeNode : TreeNode) : string
+CleanupDatabase()
Class: Controllers::SampleSetGenerator
-error : string
-methodNames : string[] = new string[] {"Census", "Random", "Systematic"}
+ErrorMessage() : string
-CodeToName(in code : int) : string
+Run(in resourcePath : string, in destPath : string, in selectMethod : string)
+MethodNames() : string[]
Class: Classifier::CentralManager
-sampler : SampleSetGenerator
-preprocessor : Preprocessor
-crossValidator : CrossValidation
-dataModelMgr : SuffixTreeManager
-outputMgr : DatabaseManager
-error : string
+Create(in key : string, in classNames : string[], in depth : int, in classFiles : FileInfo[][]) : bool
+Contains(in key : string) : bool
+Remove(in key : string)
+GetClassNames(in key : string) : string[]
+GetClassScores(in key : string, in className : string, in doc : string) : double[,,]
+ErrorMessage() : string
+CentralManager()
+GetModel(in key : string) : EMSTreeClassifier
+GetFrequency(in key : string, in matchText : string, in classIdx : int) : int
+Sampler() : SampleSetGenerator
+Preprocessor() : Preprocessor
+CrossValidator() : CrossValidation
+OutputManager() : DatabaseManager
Class: Controllers::CrossValidation
-folds : Array[]
-noOfFolds : int
-minFold : int = 2
-maxFold : int = 10
-error : string
+ErrorMessage() : string
+CrossValidation(in folds : int)
+Run(in path : string) : Array[]
+FoldCount() : int
Class: Controllers::Preprocessor
-stopWordFile : string
-punctuationFile : string
-methodNames : string[] = new string[methodCount]
-error : string
+ErrorMessage() : string
+Preprocessor()
-SetupMethodNames()
-CodeToName(in code : int) : string
+Run(in content : string, in type : string) : string
+MethodNames() : string[]
Class: Output::DatabaseManager
-dbAccess : OLEDB
-dbProvider : string
-dbUserId : string
-dbPassword : string
-dbName : string
-ScoresTable : string = "Scores"
-ConfigTable : string = "Config"
-ClassWeightsTable : string = "ClassWeights"
-ClassifiedTable : string = "qry3a_MaxWScoreClass"
-MisClassifyFiles : string = "qry2b_MisClassifiedByFile"
-MatchByClass : string = "zqry2b_matchByClass_Crosstab"
-error : string
-bOpen : bool
+ErrorMessage() : string
+DatabaseManager()
+SelectScoresByFile(in parentPathNode : string, in filePath : string) : OleDbDataReader
+SelectMisClassifiedDocuments(in parentPathNode : string, in sf : int, in mn : int, in tn : int) : OleDbDataReader
+SelectClassifiedClass(in sourceNodePath : string, in filepath : string, in sf : int, in mn : int, in tn : int) : OleDbDataReader
+DeleteScores(in ParentNodePath : string)
+Provider() : string
+UserId() : string
+Password() : string
+DatabaseName() : string
Class: Classifier::SuffixTreeManager
-createdSTreeList : SortedList
-error : string
+Create(in key : string, in classNames : string[], in depth : int, in classFiles : FileInfo[][]) : bool
+Contains(in key : string) : bool
+Remove(in key : string)
+GetClassNames(in key : string) : string[]
+GetClassScores(in key : string, in className : string, in doc : string) : double[,,]
+ErrorMessage() : string
+SuffixTreeManager()
-AddSTreeToCache(in key : string, in sTree : EMSTreeClassifier) : bool
+GetModel(in key : string) : EMSTreeClassifier
+GetFrequency(in key : string, in matchText : string, in classIdx : int) : int
Class: DataMining::StopWord
-name : string
-stringList : ArrayList = new ArrayList()
-error : string
+Name() : string
+Run(in text : string) : string
+ErrorMessage() : string
+StopWord(in filePathName : string)
+Add(in filePathName : string)
-AddWord(in targetWord : string)
+Clear()
+Reset()
+Contains(in word : string) : bool
+StringList() : ArrayList
Class: Output::OLEDB
-oleDbDataAdapter : OleDbDataAdapter
-oleDbConnection : OleDbConnection
-oleDbInsertCommand : OleDbCommand
-oleDbDeleteCommand : OleDbCommand
-oleDbUpdateCommand : OleDbCommand
-oleDbSelectCommand : OleDbCommand
+oleDbDataReader : OleDbDataReader
-command : COMMAND
-error : string
-bOpen : bool
+ErrorMessage() : string
+IsOpen() : bool
+InsertCommand() : string
+DeleteCommand() : string
+UpdateCommand() : string
+SelectCommand() : string
+GetReader() : OleDbDataReader
+ExecuteCommand() : bool
-SelectReader() : OleDbDataReader
-UpdateReader() : OleDbDataReader
-InsertReader() : OleDbDataReader
-DeleteReader() : OleDbDataReader
+OLEDB()
+Open(in Provider : string, in UserID : string, in Password : string, in DatabaseName : string, in Mode : string)
+Close()
Class: Controllers::TreeViewNodeManager
-error : string
+ErrorMessage() : string
+ChildNameExist(in TargetNode : TreeNode, in matchName : string) : bool
+GetClassFiles(in classFileParent : TreeNode) : FileInfo[][]
+GetChildrenNodeNames(in targetNode : TreeNode) : string[]
+GetTreeNode(in targetNodeName : string, in Parentnode : TreeNode) : TreeNode
+DisplaySTree(in displayView : TreeView, in sTree : EMSTreeClassifier, in classFreqToDisplay : string[])
+AddItemToTreeView(in root : TreeNode, in childNames : params string[]) : TreeNode
+AddCrossValidationSetsToTreeView(in sourceNode : TreeNode, in content : Array[])
-PopulateRunNode(in content : Array[], in testSetNum : int, in parentNode : TreeNode)
-Combine(in array1 : FileInfo[][], in array2 : FileInfo[][]) : FileInfo[][]
+AddItem(in destNode : TreeNode, in newNodeName : string, in imageIdx : TreeImages) : TreeNode
-CreateNewNode(in nodeName : string, in imageIdx : TreeImages) : TreeNode
Class: EMSTreeClassifier
-className : string[]
-dictionary : string[]
-dictionaryByClass : string[][]
-mergedTree : EMSTreeClassifier.EMSTree
+addToClass(in txt : string, in class : string)
+classIntToName(in classInt : int) : string
+classNameToInt(in className : string) : int
+classScore(in example : string, in class : string, in nsf : int, in nmnf : int, in ntnf : int) : double[,,]
+maxScore(in a : double[]) : static int
+setDepth(in d : int)
+train(in classTrainingFiles : <unspecified>[][]) : bool
Association labels shown in the diagram: Controlled By, Controls, Performs, Has, Access Database, Access (multiplicities 1 and 1..*).
Figure 14. Class Diagram
8 DATABASE
8.1 Entities
All the data in the system is stored in an Access database. The following describes the
organisation of the data that the system will store.
8.1.1 Score Table
When a user calls to score a new document or a set of documents, each document is
scored against 126 configurations for each class. The data is cached in the score table.
8.1.2 Source Table
The source table stores the location properties of documents. This includes the physical
pathname of the document and where it is logically located in the display tree.
8.1.3 Configuration Table
This configuration table stores the 126 combinations of scoring methods used in
Pampapathi et al.'s study. Each configuration consists of a scoring function, a
match normalisation function, and a tree normalisation function.
8.1.4 Score Functions Table
This table contains the names and descriptions of the score functions.
8.1.5 Match Normalisation Functions Table
This table contains the names and descriptions of the match normalisation functions.
8.1.6 Tree Normalisation Functions Table
This table contains the names and descriptions of the tree normalisation functions.
8.1.7 Classification Condition Table
This table stores any classification conditions to be considered when classifying a
document from a particular corpus.
8.1.8 Class Weights Table
This table stores the class weights when classifying documents.
8.1.9 Temporary Max and Min Score Table
This is a temporary table used to cache the maximum and minimum scores for a class,
grouped by document and configuration.
8.2 Views
The following are some of the main views to assist in querying the main tables for data
displayed in the user interface.
8.2.1 Weighted Scores
This view obtains the weighted scores by document and scoring configuration.
8.2.2 Maximum and Minimum Scores
This view obtains the maximum and minimum score by document and scoring
configuration.
8.2.3 Misclassified Documents
This view obtains the misclassified documents and related data.
8.3 Relation Design for the Main Tables
The main table of the database is the Scores table. This table contains the scores for
each document, scored by different configuration combinations (see the Implementation
section for a description of the scoring configurations). Figure 15 shows the relationships
between the main tables.
tTreeNormalisation: Index (PK), Name
tMatchNormalisation: Index (PK), Name
tScoreFunction: Index (PK), Name
Config: ConfigId (PK, I1), SF (FK2), MN (FK3), TN (FK1), SF Name, MN Name, TN Name
Source: SourceId (PK), Node Parent Path, Node Path, File Path
Scores: ScoreId (PK), SourceId (FK2, I4, I3), ConfigId (FK1, I2, I1), Score Class, True Class, Score
tempMaxMinWScores: SourceId (FK2, I2), ConfigId (FK1, I1), True Class, MaxOfWScore, MinOfWScore
Cardinality labels in the diagram: 1..1 (three occurrences) and *..1 (four occurrences).
Figure 15. Table Relations
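The relations in Figure 15 can be tried out with the sketch below. It is an illustration in Python under stated assumptions: SQLite stands in for the Access database, column names are written without spaces, the sample rows are invented, and the query assumes that a document is assigned to the class with the highest score (class weights are not applied).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Source (
        SourceId INTEGER PRIMARY KEY,
        NodeParentPath TEXT, NodePath TEXT, FilePath TEXT);
    CREATE TABLE Config (
        ConfigId INTEGER PRIMARY KEY,
        SF INTEGER, MN INTEGER, TN INTEGER);
    CREATE TABLE Scores (
        ScoreId INTEGER PRIMARY KEY,
        SourceId INTEGER REFERENCES Source(SourceId),
        ConfigId INTEGER REFERENCES Config(ConfigId),
        ScoreClass TEXT, TrueClass TEXT, Score REAL);
    """
)

# Invented sample data: two documents scored against two classes.
conn.execute("INSERT INTO Source VALUES (1, 'run1', 'run1/test', 'a.txt')")
conn.execute("INSERT INTO Source VALUES (2, 'run1', 'run1/test', 'b.txt')")
conn.execute("INSERT INTO Config VALUES (1, 1, 1, 1)")
conn.executemany(
    "INSERT INTO Scores (SourceId, ConfigId, ScoreClass, TrueClass, Score) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "spam", "spam", 0.9),
        (1, 1, "ham", "spam", 0.2),
        (2, 1, "spam", "ham", 0.7),
        (2, 1, "ham", "ham", 0.4),
    ],
)

# A document is misclassified when its best-scoring class is not its true class.
MISCLASSIFIED = """
SELECT s.FilePath, sc.ConfigId, sc.ScoreClass, sc.TrueClass
FROM Scores AS sc
JOIN Source AS s ON s.SourceId = sc.SourceId
WHERE sc.Score = (SELECT MAX(x.Score) FROM Scores AS x
                  WHERE x.SourceId = sc.SourceId AND x.ConfigId = sc.ConfigId)
  AND sc.ScoreClass <> sc.TrueClass
ORDER BY s.FilePath
"""
misclassified = conn.execute(MISCLASSIFIED).fetchall()
```

With the sample rows, only b.txt is returned, because its highest score is for a class other than its true class.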
9 IMPLEMENTATION
Due to the large size of the program, this report will not cover all the different
implementation details, but instead the discussion will focus on the main classes and
highlight some specific implementation. See Appendix B Class Definitions.
9.1 Main User Interface
The main form of the user interface is divided into four resizable panes which each
display different types of information to the user (see Figure 16):
tvExplorer
rtxtView/sTreeView
lblSTreeDetail/listView
rtxtInfo
The tvExplorer is a Windows Forms TreeView control, which displays the different
corpora available in the system. The information is presented as a hierarchy of nodes,
like the way files and folders are displayed in the left pane of Windows Explorer.
The rtxtView is implemented as a Windows Forms RichTextBox control. When the user
selects a child node in tvExplorer that represents a document, rtxtView will display the
content of the document. The rtxtView will also allow users to perform dynamic n-gram
(sub-string) matching on a document (see section 10.3 Dynamic Sub-String Matching).
The sTreeView is implemented as a TreeView control. It shares the same pane as the
rtxtView control and is only made visible on the main form (and the rtxtView becomes
invisible) when the user requests to display a suffix tree that has been created. At the
same time the lblSTreeDetail control, which is implemented as a Windows Forms Label
control, displays a description of the suffix tree currently displayed in the sTreeView
control. The listView is a Windows Forms ListView control which provides information related
to the current content of the rtxtView control.
The rtxtInfo is a RichTextBox control and displays a classification summary for a document.
[Figure showing the main form with its panes labelled tvExplorer, rtxtView/sTreeView, lblSTreeDetail/listView and rtxtInfo.]
Figure 16. Main User Interface
The main form is implemented as a .NET class called MainForm. Figure 17 shows the
class members and class interface.
Note that there are other Windows Forms control classes which were implemented to
control the flow of user-system interaction. Section 10 Implementation Specifics
describes one of them in detail; see Appendix x for all the user interface classes.