MB Predefined KNN Method Reduces Wrong Values

MB Predefined K nearest neighbor

The aim of this article is to propose a novel method to reduce the effect of wrong values in some
dimensions or properties of the objects whose k nearest neighbors we want to find. In the proposed
method, we first create packets randomly, where each packet holds some of the properties. The next
step is to create a separate tree for each packet, built from the same properties of all elements in the
training set. For example, assume that we have 50 properties and decide to put 5 properties in each
packet; using all properties, we then have 10 packets. If we assign one tree to each packet, we have 10
trees. Our aim is to find the object that satisfies the most matching criteria. If we want more accuracy,
we can reduce the number of properties per packet, for example to 3 instead of 5. Fewer properties are
then lost when one property holds a wrong value. With 5 properties per packet, one mismatch makes
the whole packet fail to match or return the appropriate result, even though the other 4 values match
the test data. With 3 properties per packet, one mismatch costs only 3 properties, and all the other
properties remain under consideration. Using 5 properties per packet gives 10 packets, and using 3
properties per packet gives 17 packets (16 packets of 3 properties and one packet of 2).
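The packet arithmetic above can be checked with a minimal Python sketch. The function name make_packets and the fixed seed are assumptions made for illustration; the article only requires that properties are assigned to packets at random.

```python
import random

def make_packets(num_properties, packet_size, seed=0):
    """Randomly partition property indices into packets of at most packet_size."""
    indices = list(range(num_properties))
    # Fixed seed (an assumption) so the same partition can be rebuilt later.
    random.Random(seed).shuffle(indices)
    return [indices[i:i + packet_size]
            for i in range(0, num_properties, packet_size)]

packets_5 = make_packets(50, 5)
packets_3 = make_packets(50, 3)
print(len(packets_5))  # 10
print(len(packets_3))  # 17: 16 packets of 3 properties and one of 2
```

Every property lands in exactly one packet, so a single wrong property can disturb at most one tree.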

So, if one mismatch is found among the property values of a packet, one tree will fail to return the
proper data, but the 9 other trees will still be on the right path. For each data item in the training set,
prepare a list of where that item resides in each tree. Once we get a list of matching data from one tree,
we can check each of these items and find out how many of its properties match the test data. When
we move to the next tree, our task is to find the section of data that matches the test data by
traversing the tree from top to bottom. For each item found there, we check whether it is already in the
list of probable matches. If it is not, the item was not in the leaf node reached in the previous tree. In
that case, we check its tree list, which records where the item resides in each tree. This way, we do not
have to check all the matched data again. For example, suppose we have 10 million data items in the
training set and traversing each tree returns approximately 10,000 items. Over 10 trees, that is
100,000 items altogether. In the worst case, we have to check all of them for a possible match. On
average, however, the same item appears in several of the sections returned by different trees. This is
still a huge task, and we have to try to reduce it. One possible solution is to select items randomly from
these 100,000 and stop when a 100% match, or a match above an acceptable percentage, is found.
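The per-item tree list and the counting of matches across trees can be sketched as follows. This is only an illustration under a strong assumption: the article does not fix how each tree is built, so here a "leaf" is simply the exact tuple of a record's values for one packet. The names leaf_id, build_leaf_index and count_matching_trees are hypothetical.

```python
from collections import Counter

def leaf_id(record, packet):
    # Stand-in for traversing one packet's tree (assumption):
    # the leaf is the exact tuple of the packet's property values.
    return tuple(record[i] for i in packet)

def build_leaf_index(training, packets):
    """Return (leaf -> rows) per tree, and each row's list of leaves."""
    leaf_members = [{} for _ in packets]
    row_leaves = []
    for row, record in enumerate(training):
        leaves = [leaf_id(record, packet) for packet in packets]
        for tree, leaf in enumerate(leaves):
            leaf_members[tree].setdefault(leaf, []).append(row)
        row_leaves.append(leaves)
    return leaf_members, row_leaves

def count_matching_trees(query, packets, leaf_members):
    """Count, for each training row, in how many trees it shares the query's leaf."""
    votes = Counter()
    for tree, packet in enumerate(packets):
        for row in leaf_members[tree].get(leaf_id(query, packet), []):
            votes[row] += 1
    return votes

packets = [[0, 1], [2, 3], [4, 5]]
training = [(1, 2, 3, 4, 5, 6), (1, 2, 9, 9, 9, 9), (7, 7, 7, 7, 5, 6)]
leaf_members, row_leaves = build_leaf_index(training, packets)

query = (1, 2, 3, 0, 5, 6)  # same as row 0, but property 3 is wrong
votes = count_matching_trees(query, packets, leaf_members)
print(votes.most_common(1))  # [(0, 2)]
```

Row 0 still wins with 2 of 3 trees even though one of its properties is wrong in the query, which is the effect the method aims for.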

The next task is to prepare a list of groups using all leaf nodes of all trees. The members of a group
fall into the same leaf nodes across the trees. We can prepare an all-match group, if one exists, and also
groups with fewer matches. Every item in an all-match group resides in the same leaf node in every
tree. Other groups could have 9, 8, or 7 matching leaf nodes, and so on down to groups with more than
one match. Now we only have to check one item from each of these sub groups. After this processing,
the 100,000 items that we got from all trees are divided into many sub groups. For example, if we
check one item from the all-match group, we do not have to check the other members of the group,
because they all go to the same leaf nodes. Similarly, for a 9-match group, once we have checked a
single item's values, we do not have to check the 9 shared values of the remaining members of the
group. This reduces the amount of calculation compared with working without groups.

Once we have approximately 10,000 items from each tree, we can start checking the sub group with
the most matches. Say one sub group has matching leaf nodes in all trees. We check only one item from
it and do not have to process the rest. Then we work with the sub groups that have a 90% match. There
we only have to check one tree's data for each item, because the other 9 are the same. Suppose our
policy is to require at least a 50% match, and only 3 of the 9 shared leaf nodes of a 90% group match
the test data. Each member then has just one remaining tree to check, so it can reach at most 4
matches out of 10. We can skip all other items of this group, because no member can meet the
minimum requirement of a 50% match. We only check the items that still have a chance of reaching a
50% or higher match. This avoids many unnecessary calculations.
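The grouping and canceling idea can be illustrated with a small Python sketch. It simplifies the scheme to groups of items that share every leaf node, and it cancels a group as soon as the remaining trees cannot lift it to the threshold. The names group_by_signature and matching_groups, and the integer leaf IDs, are assumptions for illustration.

```python
import math
from collections import defaultdict

def group_by_signature(row_leaves):
    """Group training rows that reach the same leaf in every tree."""
    groups = defaultdict(list)
    for row, leaves in enumerate(row_leaves):
        groups[tuple(leaves)].append(row)
    return dict(groups)

def matching_groups(groups, query_leaves, min_fraction=0.5):
    """Score one representative per group; cancel groups that cannot qualify."""
    num_trees = len(query_leaves)
    needed = math.ceil(min_fraction * num_trees)
    kept = {}
    for signature, rows in groups.items():
        matches = 0
        for tree, (leaf, query_leaf) in enumerate(zip(signature, query_leaves)):
            if leaf == query_leaf:
                matches += 1
            remaining = num_trees - tree - 1
            if matches + remaining < needed:
                break  # even if all remaining trees match, the group fails
        else:
            kept[signature] = (matches, rows)
    return kept

# Leaf IDs per tree for four training rows (4 trees in this toy example).
row_leaves = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 9], [8, 8, 8, 4]]
groups = group_by_signature(row_leaves)
kept = matching_groups(groups, [1, 2, 3, 4], min_fraction=0.5)
print(kept)  # {(1, 2, 3, 4): (4, [0, 1]), (1, 2, 3, 9): (3, [2])}
```

Rows 0 and 1 are scored once because they share a signature, and the group containing row 3 is dropped after three trees without inspecting the fourth.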

We have to give a unique ID number to every leaf node in every tree. Each data item then has 10 ID
numbers, one from the leaf node it reaches in each of the 10 trees. The test data also gets 10 ID
numbers, one from each tree. Now we have to create a trie, or tree, over these IDs. The first level
holds all the leaf nodes of the first tree, the second level holds all the leaf nodes of the second tree, and
so on, with each tree contributing one level. In this example there are 10 levels, and the number of
branches differs from level to level because each level comes from a different tree. Now, for every
route from the top to a leaf node of this combined tree, prepare the data that will be the match. For
example, take the sample route 1234567899. We take all the elements of the first leaf node of the first
tree, since all nodes in the first level come from the first tree. Then we take the elements of the second
leaf node of the second tree. Similarly, we gather all the data from the 10 different leaf nodes. The next
task is to find the data items that are most common across these nodes. Remember that the same item
can appear in the leaf nodes of many trees, because its properties in those packets can match the test
data, and that each item has 50 properties in this specific example. Prepare a list of the items that
reach more than a predefined percentage of match (say, 80%).

We have to do this processing during the training period. If we have a training data set, there is no
reason to postpone this calculation to run time, when the test data is provided. Preparing the
matching data list for every leaf node is a huge task, but because it is done during the training period,
we can take as much time as we need. If the training set holds a large volume of data, we can always
use many machines to perform this task. This is why I am calling it a predefined method to find the
nearest neighbor.
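A toy version of this training-time step might look like the Python sketch below. It enumerates every route through the leaf IDs, which is only feasible for tiny examples, and stores at the end of each route the training rows that share at least min_matches leaf nodes with it. The nested-dictionary layout and the names build_route_trie and lookup are assumptions, not the author's implementation.

```python
from itertools import product

def build_route_trie(row_leaves, min_matches):
    """Precompute, for every route of leaf IDs, the rows that match well enough."""
    num_trees = len(row_leaves[0])
    # Level t of the combined tree holds the leaf IDs of tree t.
    levels = [sorted({leaves[t] for leaves in row_leaves})
              for t in range(num_trees)]
    trie = {}
    for route in product(*levels):
        rows = [row for row, leaves in enumerate(row_leaves)
                if sum(a == b for a, b in zip(leaves, route)) >= min_matches]
        if not rows:
            continue
        node = trie
        for leaf in route[:-1]:
            node = node.setdefault(leaf, {})
        node[route[-1]] = rows  # short list stored at the end of the route
    return trie

def lookup(trie, route):
    """Walk the combined tree along a route; return its stored short list."""
    node = trie
    for leaf in route:
        node = node.get(leaf)
        if node is None:
            return []
    return node

# Three trees with globally unique leaf IDs: {1, 2}, {4, 5}, {6, 7}.
row_leaves = [[1, 4, 6], [1, 5, 6], [2, 5, 7]]
trie = build_route_trie(row_leaves, min_matches=2)
print(lookup(trie, (1, 4, 6)))  # [0, 1]
print(lookup(trie, (2, 4, 7)))  # [2]
```

Because the number of routes is the product of the leaf counts of all trees, a realistic build would need to restrict itself to routes worth storing or spread the work over many machines, as the text suggests.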

Now, at run time, when we have real data to match, we first get its 10 leaf nodes from the 10 different
trees. Then we use the combined tree, which was built from the leaf nodes of each small tree, to find
the desired data according to these 10 values. For example, for a test data item we might get
(1, 10, 20, 35, 45, 56, 77, 85, 90, 95). These are the leaf node numbers, since each leaf node is given a
unique ID. Using these numbers, we go to the appropriate leaf node of the combined tree and get the
data, which was already prepared during the training period. The list at that leaf node should be very
short. We can then use the Euclidean distance method to get the most appropriate result. This way, we
can reach the appropriate result even when some properties have values that differ from the same
data in the training set.

If we want more precision, we have to use smaller packets. Because the properties are assigned to
partitions randomly, one wrong property value disturbs only one packet. That is why more partitions
give less chance of an inappropriate match. If we have 10 partitions and lose one, 9 of the 10 packets,
or 90%, still match. If we have 20 partitions and lose one, 95% still match. Because the partitioning is
random, when several values are wrong, in the best case all of them fall into the same partition, and in
the worst case each of them falls into a different partition. We can use a least-squares method or a
genetic algorithm to find the best combinations for the packets.
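The percentages above, and the best and worst cases for several wrong values, can be worked out with a few lines of Python. The function names are assumptions for illustration, and the best case is generalized to more wrong values than fit into a single packet.

```python
import math

def unaffected_fraction(num_packets, bad_packets):
    """Fraction of packets that still match when bad_packets are disturbed."""
    return (num_packets - bad_packets) / num_packets

def bad_packet_range(num_wrong, packet_size, num_packets):
    """Best and worst case for how many packets the wrong values disturb."""
    best = math.ceil(num_wrong / packet_size)  # wrong values packed together
    worst = min(num_wrong, num_packets)        # each in a different packet
    return best, worst

print(unaffected_fraction(10, 1))   # 0.9
print(unaffected_fraction(20, 1))   # 0.95
print(bad_packet_range(3, 5, 10))   # (1, 3)
```

With 3 wrong values and 10 packets of 5 properties, between 70% and 90% of the packets still match, depending on how the random partition happened to spread the wrong values.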

Summary:
Divide the properties into small packets. For example, 50 properties could be broken down into
packets of 5 random properties each. We could use a least-squares method or a genetic algorithm to
find a better partition for the training set. For more precision, use smaller packets.

Prepare a tree for each packet, and use clustering, small partitioning, or any other dynamic method for
branching at every level. Give a unique ID number to each leaf node in every tree.

Prepare another tree by placing the leaf nodes of each tree in a separate level of this tree.

For every route from the top to a leaf node of this tree, get the section data for each node in the route
and analyze these data to get the best-match list for that route. Store that list in the leaf node. Process
this during the training period.

Use groups for the data; that helps to find matches quickly. Use canceling groups to reduce work.

Use the Euclidean distance method to get the best match from the match list.

Author of this article:
Mutawaqqil Billah
Independent Research Scientist,
B.Sc in Computer Science and Mathematics,
Ramapo College of New Jersey, USA
Address: 906/2, East Shewrapara, Mirpur, Dhaka, Bangladesh
Phone: 8801912479175
Email: mutawaqqil02@yahoo.com
