A brief overview of making recommendations using the K nearest neighbour algorithm and the Euclidean distance. Given at a Forward First Tuesday evening.
I hope to show that it’s not too complicated, very interesting, potentially valuable, and that various parts are quite similar. What kinds of things does machine learning cover?
Increasing piles of data. Machine learning is complementary to data mining: it evolves behaviours from empirical data.
Classic classification.
Product suggestions. List from my Kindle suggestions. Over 850,000 Kindle titles alone. Recommendations based on my purchases and content?
Google employs all kinds of machine learning: query result ranking, news story clustering.
2 searches, one immediately after the other: one in Chrome, one in Safari. There’s a difference!?
Social sites make use of recommendations. Instead of products it’s users to other users. This time it’s pretty good.
I’m going to cover a high-level description of these 3 topics, and then explore some of the details through a classification example.
Classification: how much something is or isn’t part of a group. Assign class labels using a classifier built from predictor values.
16 things. We know there are 4 categories or labels. We want to automate the way we find a category for each thing.
Clustering: group a large number of things into groups of similar things.
24 blobs, not sure what the categories are. We just want groups of similar things.
We’ve got 4 categories.
Let’s take an example of recommending items to users.
3 items, and 2 users.
We can see the recommendations for items from those users.
For example, the red user shares 2 items...
...with the blue user. We can use the blue user’s preferences to identify things that the red user would be interested in.
And for things like Twitter and Facebook, these graphs would be users to users.
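To make the red and blue user example a bit more concrete, here is a minimal Python sketch. The item names, the `recommend` helper and the `min_shared` threshold are all hypothetical, made up for illustration: the idea is just to suggest things the blue user has that the red user doesn’t, provided the two share enough items.

```python
def recommend(target_items, other_items, min_shared=1):
    """Suggest items the other user has that the target user lacks.

    Nothing is suggested unless the two users share at least
    `min_shared` items (a hypothetical threshold for this sketch).
    """
    shared = target_items & other_items
    if len(shared) < min_shared:
        return set()
    return other_items - target_items


# Hypothetical data: 3 items and 2 users, as in the example above.
red = {"item 1", "item 2"}
blue = {"item 1", "item 2", "item 3"}

suggestions = recommend(red, blue, min_shared=2)
print(suggestions)  # {'item 3'}
```

For a user-to-user graph the same set arithmetic would apply, with followed accounts in place of items.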
This brings up an interesting point: how do we model the problem? The first thing we need to look at...
I’ve mentioned similarity quite a lot, but what does that mean?
An interesting example: 2 films, how similar are they? Both star Jim Carrey.
Collaborative filtering: based on the behaviour of multiple people (for example).
How to measure similarity? We can calculate distance...
One way is Euclidean distance. It’s similar to the Pythagorean formula for calculating the sides of triangles. What are q and p? ...
p and q are our vectors:
1) first we calculate the difference between each pair of elements
2) then we square those (ensuring all numbers have the same sign)
3) we sum the squares
4) we take the square root of the sum
So, let’s look at the results for our data.
We can see that item 3 is closer to item 2 than to item 1. This can be seen from the ratings for items 2 and 3 from all users having a similar shape. How does this look in code?
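A minimal Python sketch of the four steps (difference, square, sum, square root). The function name `euclidean_distance` and the ratings below are assumptions for illustration, not the data from the slides; they are chosen so that item 3 again comes out closer to item 2.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two equal-length vectors p and q."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # 1) take the differences and 2) square them
    squared_diffs = [(qi - pi) ** 2 for pi, qi in zip(p, q)]
    # 3) sum the squares and 4) take the square root
    return math.sqrt(sum(squared_diffs))


# Hypothetical ratings of each item by three users.
item1 = [5.0, 1.0, 4.0]
item2 = [2.0, 4.0, 1.0]
item3 = [1.0, 5.0, 1.0]

print(euclidean_distance(item3, item1))  # larger distance: less similar
print(euclidean_distance(item3, item2))  # smaller distance: more similar
```

A smaller distance means the two rating vectors have a more similar shape.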
How about content-based calculations? Well, we break down the content into feature vectors.
This is our previous matrix of user and item ratings. What do we swap users for?
We swap them for features. For example, if items were documents, features may be the words in those documents. Movies might break down into running length, actors, etc. Importantly, we measure similarity in the same way, with distance calculations. Let’s put this in practice.
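As a rough sketch of swapping users for features, the Python below turns documents into word-count vectors and then measures similarity with the same Euclidean distance. The documents, the whitespace tokenising and the helper names (`word_counts`, `to_vector`) are all assumptions for illustration; real content would need more careful feature extraction.

```python
import math
from collections import Counter


def word_counts(text):
    """Count how often each lower-cased word appears in the text."""
    return Counter(text.lower().split())


def to_vector(counts, vocabulary):
    """Lay the counts out in a fixed vocabulary order (0 if absent)."""
    return [counts[word] for word in vocabulary]


def euclidean_distance(p, q):
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


# Hypothetical documents: items are documents, features are words.
docs = {
    "doc 1": "cheap pills buy cheap pills now",
    "doc 2": "lunch meeting moved to noon",
    "doc 3": "buy cheap pills today",
}

counts = {name: word_counts(text) for name, text in docs.items()}
vocabulary = sorted(set().union(*counts.values()))
vectors = {name: to_vector(c, vocabulary) for name, c in counts.items()}

d31 = euclidean_distance(vectors["doc 3"], vectors["doc 1"])
d32 = euclidean_distance(vectors["doc 3"], vectors["doc 2"])
print(d31, d32)  # doc 3 is closer to doc 1 than to doc 2
```

Each row of `vectors` plays the role an item’s ratings played before: only the meaning of the columns has changed.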
We’ve looked at how to represent data, and how to measure similarity. How do we turn that into an algorithm that can classify things?
One really simple one is k-nearest neighbours: find the most common category for our item from the k nearest items.
Our matrix from before shows the calculated distance of items 1 and 2 from item 3. But if we’re classifying, we need to know what the categories are!
We’ve added the labels so we can see that item 1 was spam and item 2 was ham. Items 1 and 2 represent our trained model: data and their labels. Let’s drop the stuff we don’t need any more.
We have just labels and distances from all other items to our new item. Back to our algorithm, kNN. Method: find the most common label from the k nearest items to our item (in this case item 3). So, given the above information, we’d classify it into the “Ham” category. If we had more data we’d just compare more neighbours. Time for some code ...
xs is the vector we’re trying to classify
k is the number of nearest neighbours we’ll look at
m is our trained matrix of data
labels are the labels for the items in the matrix
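A minimal Python sketch of a classifier with those four parameters. The names `xs`, `k`, `m` and `labels` come from the notes above; the function body, the `euclidean_distance` helper and the sample data are assumptions for illustration and are not the code shown in the talk.

```python
import math
from collections import Counter


def euclidean_distance(p, q):
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def knn(xs, k, m, labels):
    """Return the most common label among the k rows of m nearest to xs."""
    if len(m) != len(labels):
        raise ValueError("m and labels must have the same number of rows")
    if not 1 <= k <= len(m):
        raise ValueError("k must be between 1 and the number of rows in m")
    # Distance from xs to every labelled row in the trained matrix.
    distances = [
        (euclidean_distance(xs, row), label) for row, label in zip(m, labels)
    ]
    # Keep the k closest rows and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical trained data: item 1 is spam, item 2 is ham.
m = [[5.0, 1.0, 4.0], [2.0, 4.0, 1.0]]
labels = ["spam", "ham"]
xs = [1.0, 5.0, 1.0]  # the new item we want to classify

print(knn(xs, 1, m, labels))  # ham
```

This sketch does nothing special about tied votes, and it sorts every row on each call, which is fine for a handful of items but slow for large matrices.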
All very well, but how do we know our model is accurately categorising things?
A similar matrix to before. How can we use the empirical data to measure the effectiveness of the algorithm? We can take our data and consider part of it to be testing data...
Item 3 now becomes our test data: we have a calculated label and an observed label. We can then measure how well they match. This is the same for rating movies (for example) as well: how close is our estimated score to the actual measured score? Anyway, that brings us to the end of a whistlestop tour.
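For anyone wanting to try the evaluation idea, here is a small Python sketch of comparing calculated values with observed ones. The `accuracy` and `mean_absolute_error` helpers and all the sample values are hypothetical; they are one simple way to score label matches and rating estimates, not necessarily the measure used in the talk.

```python
def accuracy(predicted, observed):
    """Fraction of test items whose calculated label matches the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return matches / len(observed)


def mean_absolute_error(estimated, actual):
    """Average gap between estimated scores and actual measured scores."""
    if len(estimated) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(e - a) for e, a in zip(estimated, actual)) / len(actual)


# Hypothetical held-out test data: labels our classifier calculated
# versus the labels that were actually observed.
calculated_labels = ["ham", "spam", "ham", "ham"]
observed_labels = ["ham", "spam", "spam", "ham"]
print(accuracy(calculated_labels, observed_labels))  # 0.75

# Hypothetical movie ratings: estimated versus actual scores.
print(mean_absolute_error([4.0, 2.0], [5.0, 4.0]))  # 1.5
```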