3. Data In Digital Era
▷How to turn mountains of data into “Nuggets”
Databases
Machine Learning
Data Mining
Statistics ++
▷An effective way of processing – “FEATURE SELECTION”
▷Feature Selection
Reduces the Number of Features
Removes Noise
Speeds up Data Mining Algorithms
4. Feature Selection
▷Process of Selecting Optimal Subset of Features
Feature Selection is known to be NP-Hard
Data Mining: Classification, Clustering, Association Rule, Regression
▷Subset Generation
Heuristic search, with each state in the search space specifying a candidate subset for
evaluation
Search Strategy – N features give 2^N candidate subsets
- Complete, Sequential, Random
▷Complete Search (no optimal subset is missed)
Guaranteed to find the optimal result according to the evaluation
criterion used.
Order of the search space: O(2^N)
Branch and Bound + Beam Search
6. Objective
▷How to find the best subset of features that optimizes the criterion
function J(X), given a set of measurements of feature variables.
▷The optimization is over Xd, the set of all possible subsets of size d of the p
available measurements x1, …, xp
▷Goal – find the subset X̃d of features which maximizes the criterion J:
J(X̃d) = max_{X ∈ Xd} J(X)
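To make this concrete, here is a minimal exhaustive sketch in Python (the criterion function J is a hypothetical stand-in supplied by the caller): it enumerates every size-d subset, which is exactly the brute-force search that the methods in the following slides try to avoid.

import itertools

def exhaustive_best_subset(measurements, d, J):
    # Evaluate every size-d subset of the p measurements and return the best one.
    # measurements : feature indices x1 ... xp
    # d            : desired subset size
    # J            : criterion function, J(tuple_of_features) -> float
    best_subset, best_value = None, float("-inf")
    for subset in itertools.combinations(measurements, d):   # all C(p, d) candidates
        value = J(subset)
        if value > best_value:
            best_subset, best_value = subset, value
    return best_subset, best_value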
7. Monotonicity
▷An exhaustive exponential search can be avoided when the feature selection criterion is monotonic.
▷Monotonicity identifies the branches that cannot contain the optimal solution for
feature selection
▷Given the full set, only a subset of the features turns out to be optimal.
▷If the feature selection criterion yields a value on a subset that is smaller than the bound
- that subset and its derivations cannot be optimal
J(X1) ≥ J(X2) ≥ … ≥ J(Xj), where Xj = Y \ {y1, y2, …, yj}
[Xj – the set of features obtained by removing the j features y1, y2, …, yj from Y]
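Spelling out why this licenses pruning (a brief restatement of the monotonicity argument, nothing beyond it): deleting a further feature can never increase a monotonic criterion, so if a node Xj already falls below the current bound α, then

J(Xj+k) ≤ … ≤ J(Xj+1) ≤ J(Xj) < α for every k ≥ 1,

and no subset derived from Xj can improve on α; the whole branch below Xj can be discarded.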
9. ▷Select best features from set of N features
▷Introduced by A.H. Land and A.G. Doig for discrete programming and
combinatorial optimization problems.
▷Narendra and Fukunaga applied this concept to feature selection.
▷Follows Divide and Conquer approach
▷Optimal Search that doesn’t involve exhaustive search.
▷Assumption: Criterion function agrees with Monotonicity condition
10. ▷Operations
Branch – Partition full set of features into smaller subsets
Bound – Compute a bound on the best solution within a subset; the subset is discarded if
the bound shows that it cannot contain an optimal solution.
▷Applications
Travelling salesman problem
False Noise Analysis
0/1 Knapsack Problem
K-Nearest Neighbor Search
Integer Programming
Set Inversions
Feature Selection in Machine Learning
13. Terminology
• zj – Index of the feature discarded at level j.
• Sj – List of successors of the node under consideration.
• N – Number of features in the full feature set.
• m – Desired number of features to be selected.
• Jfeature – Feature subset selection criterion (with monotonicity
property)
• α – Bound
• j – Level of the tree.
B&B
14. 1. Root: j = 0, α = −∞ (take the root as level zero and initialize the bound
to minus infinity)
2. Create the successor list Sj for the current level.
The list contains all possible values that zj can take at level j, the maximum
possible feature index being (m + j),
i.e.
successor nodes at this level contain subsets with one more feature deleted
(in ascending index order) from the previous level's parent node.
Analysis of Pseudo-code
15. 3. Select a new node from the current level:
- If Sj (the list of successors) is empty, go to step 5
- Else find the value k with the maximum Jfeature value, set zj = k,
and delete k from the list Sj
4. Check the selected node:
- If its criterion value is less than the bound α, go to step 5 (the branch is pruned)
- If the last level has been reached, move to step 6 (we now have the desired
number of features)
- Else step to the next level (j = j + 1) and go to step 2
16. 5. Return to the previous level (j = j − 1),
- if j = 0, terminate
- otherwise continue with step 3
Whenever the criterion evaluated for any node is less than the bound
α , all nodes that are successors of that node also have criterion
values less than α . So, we prune them.
6. (At the last level) Set α = Jfeature(z1, z2, …, zd), i.e. update the bound with the
criterion value of the current complete subset, and continue from step 5.
The objective is to
- optimize the criterion function while updating the lower bound.
17. Overview - Branch and Bound
▷Construct an ordered tree preserving the following property:
Jk denotes the criterion value with k variables eliminated, and the order of the
eliminated variables zp is determined by a discrimination criterion.
▷Traverse the tree using depth-first search.
▷At each level – evaluate the criterion and sort
▷Prune the tree (any node whose criterion value is less than the bound α is pruned)
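The steps above can be condensed into a short Python sketch (recursive rather than the explicit successor-list bookkeeping of the pseudo-code; the criterion J is supplied by the caller and assumed monotonic; all names are illustrative, not taken from the slides).

def branch_and_bound_select(features, m, J):
    # Select the best m features from `features`, assuming J is monotonic.
    # J maps a tuple of retained features to a criterion value.
    best_subset, best_value = None, float("-inf")            # bound alpha = -inf at the root

    def recurse(current, start):
        nonlocal best_subset, best_value
        if len(current) == m:                                # leaf: complete candidate subset
            value = J(tuple(current))
            if value > best_value:                           # last level: raise the bound alpha
                best_subset, best_value = tuple(current), value
            return
        # Features are deleted in ascending position, so no subset is generated twice;
        # positions beyond m could never be completed to a size-m leaf and are skipped.
        for i in range(start, min(len(current), m + 1)):
            child = current[:i] + current[i + 1:]            # branch: delete one more feature
            if J(tuple(child)) <= best_value:                # bound check: monotonicity => prune
                continue
            recurse(child, i)

    recurse(list(features), 0)
    return best_subset, best_value

For the demonstration that follows, a call such as branch_and_bound_select(range(1, 11), 6, J) would return the best six of ten features for whatever monotonic criterion J is supplied.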
18. Demonstration
▷Selecting the best six features from ten features
▷Number of levels = (10 – 6) = 4
▷Number of leaf nodes = C(n, k) = C(10, 6) = 210
▷Assumption: Initial feature set = {1,2,3,4,5,6,7,8,9,10}
At root: level number = 0
j = 0
zj = 0
α = - ∞
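The leaf count can be checked directly with the Python standard library:

from math import comb
print(comb(10, 6))   # 210 distinct size-6 subsets of 10 features, one per leaf node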
19. ▷Level 0 [1,2,3,4,5,6,7,8,9,10]
▷Level 1
- Contains subsets of the full set at Level 0 with one variable removed
- Create the successor list for the current level, consisting of all possible
values
- Successor list (Sj) = {1,2,3,4,5,6,7,8,9}
20. ▷Level 0 [1,2,3,4,5,6,7,8,9,10]
▷Level 1: 9 features
▷Level 2: 8 features
▷Level 3: 7 features
▷Level 4: 6 features
At level 4 there will be 210 Leaf nodes
24. Constructed Tree
▷ Constructed tree is NOT symmetric.
▷Features are removed ONLY in ascending order of index
This avoids subsets being replicated.
E.g. removing feature 4 and then feature 5 to give the subset (1,2,3) has the same result as removing 5 and then 4, so only the ascending order is generated.
▷Unnecessary repetition in the calculations is removed by applying this rule.
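A small illustration of the de-duplication (a five-feature set used purely for brevity): deleting features in ascending index order is equivalent to choosing an unordered set of deletions, so every retained subset appears exactly once.

from itertools import combinations

features = [1, 2, 3, 4, 5]
deleted = list(combinations(features, 2))                  # (4, 5) appears once; (5, 4) never does
kept = [tuple(f for f in features if f not in d) for d in deleted]
assert len(set(kept)) == len(kept)                         # no retained subset is replicated
print(kept[:3])                                            # [(3, 4, 5), (2, 4, 5), (2, 3, 5)]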
26. Backtracking
▷The search algorithm used on the constructed tree is depth-first search
▷The search proceeds from right to left
From the least dense part of the tree to the part with the most branches
▷Start from the right most set (1,2,3,4,5,6) with a J value of 80.5.
▷The search backtracks to the nearest branching node and proceeds down the
rightmost unexplored branch, evaluating all nodes until a leaf node is reached.
▷If a node's value is less than the stored J value, the branch is not traversed any
further.
▷If the node values are greater than the stored J value, traversal continues until a leaf
node is found; if the J value at the new leaf node is higher than the stored J value, the
stored value is updated. This process repeats recursively.
27. Optimal Feature Subset
▷The optimal J value, i.e. the final updated bound of the backtracking algorithm, is
α = 82.6
▷The Corresponding applicable feature subset = [1,2,3,4,5,10]
▷Hence the optimal feature set = [1,2,3,4,5,10]
▷Classified as a slow algorithm
Worst case – exponential time complexity
Average case – reasonably fast
Algorithm characteristics
28. Branch and Bound in Research Domain
Research
“Evaluation of Feature Selection Techniques for Analysis of
Functional MRI and EEG”
Citation
Burrell, L., Smart, O., Georgoulas, G. K., Marsh, E., &
Vachtsevanos, G. J. (2007, June). Evaluation of Feature
Selection Techniques for Analysis of Functional MRI and EEG.
In DMIN (pp. 256-262).
29. From paper:
In order to classify pathological events in the human body, the
Branch and Bound algorithm was applied to functional MRI and
EEG data.
(Figures: fMRI data, iEEG data)
30. ▷Features extracted from each patient dataset:
- fMRI data : 12 features
- iEEG data : 14 features
▷Features were expressed mathematically. Several analysis domains were
considered in constructing the mathematical expressions (time, frequency, statistics,
information theory)
▷Executed for varying feature subset sizes with the objective function, feature
vector, and classification vector as the algorithm inputs.
▷These extracted features were then used to classify the classes.
Evaluation was done using a K-Nearest Neighbour (k-NN) classifier, which quantifies
the accuracy of the extracted feature set.
31. Observations of the research
▷For a patient with a high signal-to-noise ratio, only a few features are needed.
▷For a patient with a poor signal-to-noise ratio, the Branch and Bound algorithm
achieves the best classification accuracy.
▷However, it still requires 13 of the 14 features (iEEG data) to achieve that
optimal accuracy.
▷Sequential Forward Floating Selection requires only 6-8 features to achieve its optimal
classification accuracy.
▷Fewer features mean a lower computational cost.
▷B&B achieves its optimal classification accuracy at a higher
computational cost (more features must be extracted for the classification). The B&B
algorithm does not outperform the other feature selection methods on either the
fMRI or the iEEG data.
32. Recommendations for Branch and Bound
▷Where an exhaustive, complete traversal would otherwise be needed,
Branch and Bound comes in handy because it omits the
construction of certain search tree branches.
Limitations of Branch and Bound Algorithm
▷In certain circumstances it can be slower than exhaustive
search.
▷Circumstances of weak performance:
Criterion evaluation is slow (the evaluated feature subsets are large)
Sub-tree cut-offs are less frequent near the root (criterion values are still high)
33. Suggested Recommendations
▷Petr Somol and Pavel Pudil introduced the Fast Branch &
Bound principle
▷It incorporates a prediction mechanism
The inaccuracy of this mechanism does NOT affect the optimality of the
result, and it is acceptable speed-wise
Information about each individual feature's contribution to the criterion
value is gathered during the run of the algorithm
35. Introduction
▷Heuristic method for solving combinatorial optimization
problems.
▷Nodes that have high probabilities at each level of the search
tree are selected for further branching, while the remaining
nodes are pruned off permanently.
▷Only a predetermined number of best partial solutions are kept
as candidates at each level
▷Traverses the tree using breadth-first search.
▷Examines a number of alternatives, or beams, in parallel.
▷The beam width can be either fixed or variable.
▷A solution to the excessive memory requirements of best-first
search
36. Applications of Beam Search
▷Speech Recognition via Artificial Intelligence approach.
▷Image processing
▷The job shop scheduling problem with both makespan and mean tardiness
as performance measures.
▷Single machine early/tardy problem.
Feature Extraction in Machine Learning
37. How Beam Search works in Feature Extraction
▷Consists of a truncated branch and bound in which only the β most
promising feature nodes are retained (instead of all feature
nodes)
▷The β parameter is known as the beam width and is fixed to a value
before feature extraction starts.
▷The other feature nodes are simply discarded ( not in the β node
set)
▷No backtracking mechanism is utilized as in the Branch and Bound
algorithm, since the intent of this technique is to extract the features
quickly .
39. Terminology
• kbw– Beam width
• Sbsf – Predefined threshold for feature subset pruning
• B – Partial solution feature set
• C – Children of the partial solution in B
• HEURISTIC – Feature criterion heuristic function
Beam Search
40. Algorithm Analysis
1. The algorithm maintains a set B of partial solutions. In the beginning, B
contains only the empty partial feature solution.
2. Feature set C contains all of the children of the partial feature
solutions in B.
3. Select the best kbw features after each of the n features has been
evaluated individually.
4. Add a new feature to each of these kbw features, forming kbw·(n − 1)
2-tuples of features.
41. 5. Each partial feature solution is then retrieved from C and
evaluated.
6. Form all possible tuples by appending to the kbw retained tuples the other
features (not already in the existing tuples).
7. If the feature criterion value (J) is lower than the threshold, the
partial feature subset is discarded. If it is higher, it is
appended to B.
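These steps can be sketched in a few lines of Python (the criterion J and all names are illustrative assumptions, not taken from the slides or the paper; the step ordering is simplified to "generate children, threshold, keep the kbw best").

def beam_search_select(features, target_size, k_bw, threshold, J):
    # Grow feature subsets one feature at a time, keeping only the k_bw best per level.
    # features    : candidate feature indices
    # target_size : desired subset size
    # k_bw        : beam width
    # threshold   : partial subsets scoring below this J value are discarded
    # J           : criterion function, J(tuple_of_features) -> float
    beam = [tuple()]                                        # B: start from the empty partial solution
    for _ in range(target_size):
        children = set()                                    # C: children of every partial solution in B
        for partial in beam:
            for f in features:
                if f not in partial:
                    children.add(tuple(sorted(partial + (f,))))
        scored = [(J(c), c) for c in children]              # evaluate each child
        scored = [(v, c) for v, c in scored if v >= threshold]   # threshold pruning
        if not scored:
            return None, float("-inf")                      # every child fell below the threshold
        scored.sort(reverse=True)
        beam = [c for _, c in scored[:k_bw]]                # keep the k_bw best; no backtracking
    best = max(beam, key=J)
    return best, J(best)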
42. Demonstration
▷Initial Feature set [1,2,3,4,5]
▷Goal – Extract optimal 3 features out of initial five
▷Assuming that the kbw (Beam width) is 2
▷Predefined threshold for the feature criterion value (J) = 88
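Mapped onto the beam_search_select sketch given earlier, this setup would be invoked roughly as below; the criterion weights are invented purely so the call runs and do not reproduce the J values shown in the demonstration tree.

def J(subset):
    # Hypothetical additive criterion, for illustration only.
    weights = {1: 90, 2: 85, 3: 86, 4: 84, 5: 92}
    return sum(weights[f] for f in subset)

best, value = beam_search_select([1, 2, 3, 4, 5], target_size=3, k_bw=2,
                                 threshold=88, J=J)
print(best, value)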
44. Tree Analysis
▷For each of the five features, an individual feature criterion value
is computed
Feature 1 and Feature 5 surpass the pre-defined J threshold of 88
kbw is defined as two, hence only the Feature 1 and Feature 5 subsets
branch out further.
▷All possible tuples are formed by appending the other features
▷Each branched node is evaluated using the criterion function J
▷The (1,4) and (1,5) subsets branched from the Feature 1 node,
and the (3,5) and (4,5) subsets branched from the Feature 5 node,
have the highest corresponding J values; hence these are
branched out next.
45. Optimal Feature Subset
▷Finally the desired feature set size is reached (leaf
nodes). The optimal feature set (highest J value) is
(2,3,5).
▷The optimal J value at the leaf-node stage of the beam algorithm is
J = 100
▷Corresponding applicable feature subset = [2,3,5]
▷Hence optimal set = [2,3,5]
46. Algorithm characteristics
▷No backtracking is available within the algorithm.
▷Pruned branches might contain the optimal solution (the algorithm does not
always give the optimal solution).
47. Beam Search in Research Context
Research
Beam Search for Feature Selection in Automatic SVM
Defect Classification
Citation
Gupta, P., Doermann, D., & DeMenthon, D. (2002). Beam
search for feature selection in automatic SVM defect
classification. In Pattern Recognition, 2002. Proceedings.
16th International Conference on (Vol. 2, pp. 212-215).
IEEE.
48. ▷Beam search is used with an SVM-based classifier for
automatic defect classification
Reduces the dimensionality of the feature space substantially.
Improves classifier performance
▷Uses the heuristic functionality of beam search to reduce the
search space.
▷Beam search was implemented with an SVM classifier to select the
candidate feature subset for automatic defect classification
▷The performance of the classifier depends on the quality of the
features selected
49. Data & Features
▷Semiconductor industry uses Automatic Defect Classification
▷Categorizing wafer defects into classes based on information
provided by sensing and imaging devices
▷Each defect is described by a high-dimensional feature vector
consisting of about 100 features
▷Attempts to capture features that show high variability between
different classes and thus help in distinguishing between them.
▷A spread factor (η) is defined to measure the power of each feature
to distinguish between classes.
50. Research Results
▷A significant reduction in the size of the feature vector was achieved
with the use of beam search.
▷The time taken to train the SVM classifier was also seen to be
reduced.
▷The size of the feature subset is reduced by at least 70% for all the
binary classifications.
51. Research Results (Cont.)
▷Reduction in computation and memory comes at a cost, in this
case, the algorithm is not guaranteed to find an optimal solution
and cannot recover from wrong decisions.
▷If a node leading to the optimal solution is discarded during the
search, there is no longer any way to reach that optimal solution
▷Optimal solution is NOT Guaranteed.
52. Recommendations for Beam Search
▷Varying the beam width parameter trades off the risk of missing
the optimal goal state against the computational cost of the search
A wider beam considers more candidate solutions, whilst taking up more
memory and processing power.
A narrow beam considers fewer candidate solutions, risking missing a
potentially optimal solution.
▷Hence wider beam width allows greater safety, but at the cost of
increased computational effort.