The document proposes a method called MB Predefined K nearest neighbor to improve K nearest neighbor classification when some property values may be incorrect. It divides properties randomly into small packets to create multiple trees, assigns unique IDs to leaf nodes, and builds a master tree from the leaf nodes. During training, it analyzes data routes through the master tree to create match lists for each leaf node. When classifying a new data point, it uses the data point's leaf node IDs to quickly retrieve the best match list and perform classification, improving accuracy over standard KNN even if some property values differ from the training data.
1. MB Predefined K nearest neighbor
The aim of this article is to propose a novel method to reduce the effect of wrong values in some dimensions or properties of the objects whose k nearest neighbors we want to find. In the proposed method, we first create packets randomly, where each packet contains a subset of the properties. The next step is to build a separate tree for each packet, using the same subset of properties across all elements of the training set. For example, assume we have 50 properties and decide to place 5 properties in each packet; we then get 10 packets covering all properties, and with one tree per packet, 10 trees. Our aim is to find the object with the maximum matching criteria. If we want more accuracy, we can reduce the number of properties per packet, for example using 3 properties instead of 5. In that case, fewer properties are lost when one property holds a wrong value. With 5 properties per packet, a single mismatch causes that packet to fail to reach the appropriate result, even though the other 4 properties agree with the test data. With 3 properties per packet, a single mismatch costs us only 3 properties, and the remaining properties are still under consideration as proper data. Using 5 properties per packet we get 10 packets, and using 3 properties per packet we get 17 packets.
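The random packet construction above can be sketched as follows; `make_packets` is a hypothetical helper name (the source does not fix an API), and the packet counts match the example in the text, 10 packets of 5 properties or 17 packets of 3:

```python
import random

def make_packets(num_properties, packet_size, seed=None):
    """Randomly partition property indices into packets of the given size."""
    rng = random.Random(seed)
    indices = list(range(num_properties))
    rng.shuffle(indices)
    # The last packet may hold fewer properties when the sizes do not divide evenly.
    return [indices[i:i + packet_size] for i in range(0, num_properties, packet_size)]

print(len(make_packets(50, 5)))  # 10 packets of 5 properties
print(len(make_packets(50, 3)))  # 17 packets of 3 (the last packet has 2)
```

One tree per packet would then be trained on only the properties listed in that packet, so a wrong value in one property can mislead at most one tree.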
So, if one mismatch is found among the property values of a packet, that packet's tree will fail to reach the proper data, but the other 9 trees will still be on the right path. For each data point in the training set, we prepare a list recording where that data point resides in each tree. Once we obtain a list of matching data from one tree, we can check each of these data points and count how many of its properties match the test data. When we move to the next tree, our task is to find the section of data that matches the test data by traversing that tree from top to bottom, and then to check whether each data point is already in the list of probable matches. If it is not, that data point was not in the previous tree's leaf node; in that case, we consult its tree list, which records where it resides in each tree. This way, we do not have to check all matched data from scratch. For example, suppose the training set contains 10 million data points, and traversing each tree yields approximately 10,000 data points; that is 100,000 data points altogether. In the worst case, we must check all of them for a possible match, but in the average case the same data points appear in several of the chunks coming from different trees. This is still a huge task, and we should try to reduce it. One possible solution is to sample randomly from these 100,000 data points and stop once 100%, or an acceptable percentage, of matching properties is found.
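The candidate-checking step with random sampling and an acceptance threshold might look like the following sketch; the function name `best_match` and the parameter `accept_ratio` are our own, since the source does not specify an interface:

```python
import random

def best_match(candidate_lists, train_data, test_point, accept_ratio=1.0, seed=None):
    """Score candidates returned by the per-packet trees against the test point.

    candidate_lists: one list of training-set indices per tree (its matching leaf).
    train_data: mapping from training index to the full property vector.
    Checking stops early once a candidate matches accept_ratio of all properties.
    """
    rng = random.Random(seed)
    # Deduplicate: the same data point often appears in leaves of several trees.
    candidates = list({i for leaf in candidate_lists for i in leaf})
    rng.shuffle(candidates)  # random order, so we may stop before checking all

    n_props = len(test_point)
    best_idx, best_score = None, -1
    for idx in candidates:
        score = sum(a == b for a, b in zip(train_data[idx], test_point))
        if score > best_score:
            best_idx, best_score = idx, score
        if score >= accept_ratio * n_props:
            break  # acceptable percentage of matching properties reached
    return best_idx, best_score

train = {0: [1, 2, 3, 4], 1: [1, 2, 3, 9], 2: [7, 7, 7, 7]}
leaves = [[0, 1], [1, 2]]          # matching leaf contents from two trees
print(best_match(leaves, train, [1, 2, 3, 4]))  # index 0 matches all 4 properties
```

The early stop trades a guaranteed best match for speed: with `accept_ratio=1.0` it halts only on a perfect match, while a lower ratio accepts the first sufficiently good candidate.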
The next task is to prepare a list of groups using all the leaf nodes of all trees. Group members are data points that go to the same leaf node across different trees. We can prepare a group of all-match data if available, and also groups with fewer matches. Each data point of the all-match group resides in the same leaf node in every tree. Some group