A presentation given at the Metadata Perspectives 2011 conference, which explores how Google's changes to its algorithm work, and how they affect publishers.
17. Quick Recap. Driven by: the need to combat spamdexing, big data, the semantic web, machine learning, UX. Leads to: content is king, PageRank is reformed, good websites win. Old tricks won't work; you just have to be relevant, interesting and well done.
18.
19. 1. Playing the system probably isn’t worth it – go for descriptions and metadata that actually reflect the book, not what will boost its sales
20. 2. The good news is, you’re not writing for machines. The bad news is, you’re not writing for machines.
21. 3. Those little details like subject categories, keynotes and so on WILL be judged by the system
What is this? Can anyone identify it? If we were at a computer scientists’ convention everyone would know these. Algorithms. The essence of discovery is based on them. Metadata is all about automating things, making them machine-readable, and the algorithm is the machine that does the reading. This is one of the most famous – PageRank. It ultimately dictates who will be successful on the web, and when.
This means that everyone in publishing needs to be educated about SEO. We have got used to thinking about metadata, but do we think like web companies? They obsess about SEO and analytics, and in practice this means thinking about all of your data, all the time. Books will increasingly rely on intelligent use of metadata as the channel to market. Whether retailing direct or through online retailers, having in place a sense of SEO – what makes things easily found – is going to be key. Publishers generally have not employed SEO specialists.
This is a story about one company above all, and one familiar to many in this room. Sorry if I am going to go over ground that everyone knows, but before we meet Panda it is worth putting search in context and seeing how it has evolved over time.
Google was by no means the first search engine. Those with long memories will remember names such as these, which dominated search in the early days of the consumer web. The problem was that they weren’t very good, and didn’t offer very fine-grained results. Aside from issues around user experience (they were confusing), the SERPs did not contain what people were looking for. This is because the logic behind them was crude – broadly speaking they looked for clusters of search terms, so that density was rewarded. They had a simplistic algorithmic understanding of what was already a tremendously complex system. It was easy to game the system by simply stuffing a web page full of keywords and tags. In metadata terms this was easy and effective.
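To see why density-based ranking was so easy to game, here is a minimal sketch of that kind of scoring. The `density_score` function is invented for illustration; real early engines were more elaborate, but the principle was similar: the more often a term appears, the higher the page scores.

```python
# Illustrative sketch of naive term-density scoring, as used (roughly)
# by early search engines. Not any engine's actual algorithm.
def density_score(page_text: str, query: str) -> float:
    """Fraction of words on the page that match the query term."""
    words = page_text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w == query.lower())
    return hits / len(words)

honest = "quality leather shoes handmade in italy"
stuffed = "shoes shoes shoes buy shoes cheap shoes best shoes"

# Under pure density scoring, the keyword-stuffed page wins easily,
# even though it tells the user nothing.
```

Running this, the stuffed page scores several times higher for the query "shoes" than the honest description, which is exactly the loophole spammers exploited.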
The story then moves here, to Stanford in Palo Alto. Stanford, then as now, was at the heart of the computer industry, and had been since the earliest days of the transistor. Major tech companies had spun out of it for years, like Sun Microsystems – whose name originally stood for Stanford University Network – which went on to develop the Java platform.
Stanford is where Larry Page and Sergey Brin, then Ph.D. students, met in the mid-nineties. They need no introduction. Both were working on techniques to improve computer search when they started to collaborate – research that would eventually bear fruit as the gold standard of algorithm design. Brin was working on data mining, Page on a new way of viewing the web as a graph, and soon they saw the potential of their work, leaving university to found Google in a garage.
The key breakthrough was Google PageRank, named after Larry Page. Whereas previous search engines had looked for search terms, and then employed a few back-end fixes and tweaks, PageRank went for a totally different model. It looked at the links into a website, and saw each link as a vote of support for the content of that website. So, the more pages linked to a given website, the more that website’s own links would be worth. This system of backlinking was, as Brin and Page saw it, the architecture of the internet, and would provide more relevant search information. It didn’t rely on guesswork about what was important, but interpreted the internet as already supplying that information in its very structure.
This diagram reveals the principles behind PageRank. Essentially it looks at the web in terms of a probability distribution. What is the probability that if you click on a random link, you will land at a given website? Probability is measured between 0 and 1; so if half of the links on the web went to one website, it would have a probability, and a PageRank, of 0.5. This kind of algorithmic link analysis lets you gauge the importance of a link, and an element, within any given set. They saw the web as one big set or graph – the webgraph – and with Brin’s data mining skills were able to scale it up. While this wasn’t totally new – it has origins in work on citation analysis – it did fundamentally transform the business of search, and changed what dictated a web page’s search position. This is obviously a simplification of how it works, but gives you a good idea.
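For those who like to see it concretely, the random-surfer idea above can be sketched in a few lines of code. This is a simplified toy, not Google's production system: the three-page graph is invented, and the 0.85 damping factor (the chance the surfer follows a link rather than jumping to a random page) is the value commonly cited in the original PageRank literature.

```python
# Minimal PageRank sketch over a tiny "webgraph" (illustrative only).
# links maps each page to the pages it links out to.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with a uniform distribution
    for _ in range(iterations):
        # Each page keeps a small "random jump" share, then receives a
        # share of rank from every page that links to it.
        new = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += share
        rank = new
    return rank

# Three pages: A and C both link to B; B links back to A.
graph = {"A": ["B"], "B": ["A"], "C": ["B"]}
ranks = pagerank(graph)
# B collects the most "votes" (two inbound links), so it ranks highest;
# the ranks still sum to 1, as a probability distribution should.
```

The point of the sketch is the one made above: rank is not assigned by anyone, it emerges from the link structure itself.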
This wasn’t the end of it, of course. Google still took into account a whole host of other factors based primarily on metadata and content. So it would still look for the density of a keyword on a given page. More to the point, it would look at the metadata as presented in the <meta> part of the HTML, in the head of the web document. Getting this metadata right became a critical part of the SEO mix. You needed to describe the web page accurately, and over the years Google added many tweaks to the main algorithm to make sure that websites were rewarded for accurate metadata. There are over 120 different data points taken into account for each website by Google. This was the bread and butter of SEO specialists. Other factors like a good sitemap, links that work and good HTML were also taken into account.
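As a concrete illustration of what "the <meta> part of the HTML" means, here is a small sketch of how a crawler might read those tags out of a page's head, using only Python's standard library. The sample page and its description text are invented for the example.

```python
# Sketch: extracting <meta name="..." content="..."> tags from a page's
# <head>, the metadata that search engines historically read for SEO.
from html.parser import HTMLParser

class MetaReader(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            # Only keep tags of the <meta name="..." content="..."> form.
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]

page = """<html><head>
<meta name="description" content="Handmade Italian shoes.">
<meta name="keywords" content="shoes, leather, italy">
</head><body>...</body></html>"""

reader = MetaReader()
reader.feed(page)
# reader.meta now maps "description" and "keywords" to their content,
# just as a search spider would index them.
```

This is the metadata that, as described above, had to accurately reflect the page, because Google increasingly checked it against the actual content.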
In short, the system worked. However there were always problems, and over time people gamed the system. Part of the issue was the exponential nature of the web’s expansion. The number of registered domains went from 15k in 1995 to well over 350m in 2011, with over 4.5m URLs being added each month. Google is indexing above 50bn web pages on a regular basis, and in 2008 had indexed over 1 trillion individual pages. That figure is likely to be much higher already. This represents petabytes, if not zettabytes, of raw data. Yet this is only the beginning of the problem.
I mentioned SEO techniques earlier. They weren’t all good. So-called black hat SEO used suspect means to boost a site’s web traffic. Most of the time these were relatively harmless – so-called cloaking, where many unrelated keywords would be hidden in the non-displayed part of the page, tricking the search spiders into indexing the page wrongly. Essentially the search engine would see a different site from the user. So-called doorway pages would be created for the benefit of a search engine, but would then automatically redirect to another site, a practice known as spamdexing. Another technique was keyword stuffing, where content and metadata would be filled with pointless keywords – good for search engines, but largely frustrating and useless for users. Beyond this, the use of various malware that infected computers was and is a big problem. Content creators too were involved – the practice of link baiting, or writing for the sole purpose of being linked to, became common, and the quality of website copy dropped.
Another kind of spamdexing, and possibly the most problematic for Google, was link farming. This was designed to hijack PageRank, and use its fundamental principles to game search engine results. A link farm would often be created by automated programs, sometimes viruses, that would connect a bunch of websites and then have them all link to one another, boosting their PageRank. Many sites would simply crawl Wikipedia, copying its content, and suck up links for no purpose. Both the duplication of content and the link farming became growing problems. Link farming was growing out of all proportion, and the results were becoming very evident in Google SERPs. They were getting worse, and hitting Google’s claim to be the best search engine. Google is painfully aware how easy it would be for users to switch to a rival service like Yahoo or Bing if its results were consistently bad.
Google has a reputation as a company that invests in research, famously allowing employees 20% time to pursue personal projects, as well as hiring many developers to do research. One such engineer was Navneet Panda. He secured a patent for Google in the area of machine learning, finding techniques for allowing machines to act and learn intelligently. This was the start of a huge new project for Google. Over the years they had constantly tweaked their algorithm, making adjustments to try to stay one step ahead of black hat SEO. Now, however, they were going to do something different – they were going to fundamentally change the nature of their secret sauce, and it was Navneet Panda’s work that sparked it. Google had always relied on computers, but now they changed tack and started to do intense user testing of websites. Did people like it? Did they trust it? Did they think the content was good? Was the design pretty? Would they spend money on this website? They collected vast amounts of this data and, using Panda’s breakthrough, created a system whereby machines learned from these very human insights, and essentially replicated them. This was about taking a new breakthrough technology, the semantic web, and a new approach, people not computers, and applying them on a vast scale.
This was known externally as Farmer, internally as Panda, and was to be the most significant change in the Google algorithm since launch. It was phased in earlier this year. Overnight, Panda had a massive effect. Many websites, and indeed businesses, were constructed on the basis of pleasing Google. Suddenly they were cast adrift, and saw their traffic plummet. Panic spread through the world of ecommerce, and confusion reigned in some of the halls of traditional SEO. Even tech-savvy sites like Technorati or The Next Web suffered. What Panda did was, to some extent, invert PageRank. Links were downgraded in importance, and content received a massive upgrade. Websites were to be judged, once again, but in a totally transformed way, on their content. Content, long proclaimed king, was recrowned. Google could tell whether a website was “good” or not, and links alone would not save you.
Let me give you a true example from a friend of mine who works in search marketing, although I won’t name the actual company. This firm sells luxury goods online, including shoes. These are premium, high-value shoes, known for their quality, and so even on the internet they retail for high prices, at around €400. The business has been selling these shoes successfully for over four years. Panda comes in; traffic nose-dives overnight, sales totally stall, and within a matter of days the business is in deep trouble. What happened? It all hinged on the names of the shoes, which were European cities like Sienna, Milan and Lyons. Under the new rules the search engine saw this as a mistake – what was a shoe website doing talking about a bunch of unconnected cities? It wasn’t a good fit, therefore the website would not be helpful to people, therefore it fell in the search rankings. At the flick of a switch a business model is over. Luckily there is a fix. The metadata, the code of the site, needed to be changed. It needed to say: this is a shoe called Sienna – not: this is a page about shoes, and oh look, there is a reference to an Italian city. Once the structure of the metadata was fixed, the site was able, at the next Google algorithm update (updates are now usually phased in on a six-week cycle, with major updates like Panda on a much longer cycle), to win back and even increase the lost traffic. The rebuild also allowed it to put through other fixes in its sitemaps, product descriptions and code to improve SEO for the new era. However many companies, and innumerable link-baiting, link-farmed, copycat websites, have been seriously hit by Panda.
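One way to express "this is a shoe called Sienna" in machine-readable form is structured data of the kind the semantic web promotes, for example schema.org Product markup emitted as JSON-LD. I don't know exactly what fix the company applied, so treat this as a hedged sketch: the product details are invented, and only the general technique (typed metadata that disambiguates the name) is the point.

```python
# Sketch: schema.org-style Product markup, built as a Python dict and
# serialised to JSON-LD. The markup states explicitly that "Sienna" is
# the name of a shoe product, not a reference to the Italian city.
import json

product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Sienna",          # product name, now typed, not free text
    "category": "Shoes",
    "description": "A premium leather shoe from our city collection.",
    "offers": {
        "@type": "Offer",
        "price": "400.00",
        "priceCurrency": "EUR",
    },
}

json_ld = json.dumps(product, indent=2)
# Embedded in the page head inside <script type="application/ld+json">,
# this leaves the search engine no room to misread what "Sienna" means.
```

The design point matches the story above: the words on the page didn't change, only the structure around them, and that structure is what the post-Panda algorithm reads.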
Seeing as this is the Frankfurt Book Fair, back to books. Here are three lessons that I take from what Panda tells us about metadata. If you work in any way with anything on the internet, these changes are worth thinking about at a general level, as they describe the general trajectory of all search technology.
Those descriptions are all-important, and you really have to take time over them. They will be your one shot at getting people’s attention. On the plus side, with the semantic web and Panda things like this don’t have to be technical, or follow strange rules. In fact, the more they stay relevant and appealing to the book, the better. The bad news is, this means there are no guidelines, shortcuts or tricks to getting your book seen.
Eventually all metadata will be judged by its fidelity to the book itself – machines will be able to see whether people have allotted a subject correctly, whether they have filled out an appropriate keynote – they will be able to group and list things in an almost infinite number of permutations. Panda is a significant step in this direction, but it won’t stop there. In this scenario metadata is absolutely critical, and low-quality metadata, as with websites, will see books vanish without trace – and we all know how easy it is for that to happen.