DECONSTRUCTING GOOGLE’S PENGUIN 2.0 | WHITEPAPER
34 Paradise Road | Richmond Upon Thames | United Kingdom | TW9 1SE | TEL // +44(0)8442642960 | VISIT // mathsight.org
White Paper:
Deconstructing Google’s Penguin 2.0
Produced by MathSight
This paper identifies how shifts in traffic sourced by Google’s search engine can be
related to the structural and content-based features of a company’s web pages.
We performed this analysis in order to extract potentially useful insights for the
following groups:
A. Marketing agencies
B. The wider online SEO community
C. Online businesses
D. Those with an interest in big data and analytics
Introduction
We decided to apply our machine learning led predictive SEO models to deconstruct
the Google Penguin 2.0 algorithm update, rolled out on 19th May 2013.
Although it is near impossible to reverse engineer a complete search engine algorithm
such as Google’s, it is possible to show the potential causes of any change in algorithm
methods when it occurs. We look for a step change in a pattern that could be an
underlying increase or decrease in actual Google-sourced traffic as a result of an
algorithm alteration, such as the recent Penguin 2.0 update.
What we did
Once the Google search traffic dataset for our chosen group of web domains had been
obtained from website analytics, de-seasonalised and filtered, the first step in the
reverse engineering process was to confirm that a change in traffic did indeed take
place. This was done using signal processing techniques, a best practice in the oil and
gas exploration industry, to detect the likely point of change in the noisy data.
Following this, we gathered a wide range of standard SEO features from the pages
(title character length, number of meta description words, readability and so on) within
the domain. Finally we applied a variety of statistical methods to identify those
features that were rewarded or penalised in terms of their Google search traffic after
the likely algorithm update time.
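As an illustration of the feature-gathering step, the sketch below pulls a few of the features named above out of a single HTML page using only the Python standard library. It is a minimal sketch, not MathSight's crawler: the class name `SeoFeatureParser`, the function `extract_features`, the feature names and the sample page are all assumptions made for this example.

```python
from html.parser import HTMLParser


class SeoFeatureParser(HTMLParser):
    """Collects the title text, meta description and anchor texts of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_description = ""
        self.anchor_texts = []
        self._in_title = False
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = attrs.get("content") or ""
        elif tag == "a" and "href" in attrs:
            # Only anchors with an href count as hyperlinks here.
            self._in_anchor = True
            self.anchor_texts.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_anchor:
            self.anchor_texts[-1] += data


def extract_features(html):
    """Return a small dictionary of structural and content-based features."""
    parser = SeoFeatureParser()
    parser.feed(html)
    parser.close()
    anchors = [text.strip() for text in parser.anchor_texts]
    return {
        "title_char_length": len(parser.title.strip()),
        "meta_description_words": len(parser.meta_description.split()),
        "hyperlink_count": len(anchors),
        "mean_anchor_words": (
            sum(len(a.split()) for a in anchors) / len(anchors) if anchors else 0.0
        ),
    }


# Hypothetical page used purely for illustration.
page = (
    "<html><head><title>Gold rings</title>"
    '<meta name="description" content="Handmade gold rings and bracelets">'
    '</head><body><a href="/rings">Shop gold rings</a></body></html>'
)
features = extract_features(page)
```

A real crawl would add many more features (readability, rare-word counts, header contents and so on); the point here is only that each page is reduced to a row of numeric feature values.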
Our results showed, with some statistical confidence of around 90-95%, that the main
areas within HTML that Google has probably targeted with this change were:
● Main body text
● Hyperlinks
● Anchor text (clickable text in the hyperlink)
● Meta description text
Methods
Data collection
Websites from the following eight business categories were used for this study, in
order to create a well-rounded dataset:
● Online retailers including the travel, gifts, mobile apps and jewellery sectors;
● Corporate B2B companies including business awards, advertising and PR,
HTML file contents were first gathered by our in-house web crawler, which scanned
the sites in-depth, for structural and content-based ‘features’.
Daily website analytics (page view) data was also imported for each domain above,
spanning a two-month period, from 11 April 2013 through to 11 June 2013. This
period afforded a reasonable window around the time that Google had announced the
‘Penguin 2.0’ algorithm update.
[Figure: Whole site traffic for site E over the last 3-month period. Source: Google; medium: organic; metric: page visits. x-axis: date, 16 Apr 2013 to 25 Jun 2013.]
Cleansing and exploration of the data
The traffic data, in time series form for a single domain, were first smoothed using a
moving average. Weekly seasonal variation was then removed, to reduce the effect of a
repeated site usage pattern across the week (e.g. reduced visits at the weekend). The
resulting series is slightly more insightful than both the moving average and the raw
traffic numbers, as abrupt changes remain clearly defined yet are separated from any
cyclical variation.
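The smoothing and day-of-week removal described above can be sketched as follows. This is one simple way to do it, assuming a centred 7-day moving average and an additive weekday effect; the paper does not state the exact method used, and the function names are illustrative.

```python
def weekly_moving_average(values, window=7):
    """Centred moving average (odd window); edge positions are None."""
    half = window // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        out[i] = sum(values[i - half:i + half + 1]) / window
    return out


def remove_day_effect(values, window=7):
    """Subtract each weekday slot's average deviation from the trend."""
    trend = weekly_moving_average(values, window)
    deviations = {}
    for i, (value, level) in enumerate(zip(values, trend)):
        if level is not None:
            deviations.setdefault(i % window, []).append(value - level)
    effect = {slot: sum(d) / len(d) for slot, d in deviations.items()}
    return [value - effect.get(i % window, 0.0) for i, value in enumerate(values)]


# Hypothetical series: flat traffic with a dip on two days of each week.
daily = [100, 100, 100, 100, 100, 90, 90] * 4
adjusted = remove_day_effect(daily)
```

On this artificial series the weekly dip disappears entirely and every adjusted value sits at the weekly mean level, whereas a genuine step change in traffic would still show up as a step.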
[Figure: Google search traffic for site ‘E’ over time. y-axis: number of organic (Google) page visits; x-axis: date, 16 Apr 2013 to 25 Jun 2013. Series: daily, weekly moving average, without day effect.]
Using this cleaned traffic data, a change point detection algorithm was deployed in
order to detect the most likely timing of a change in traffic levels over the period in
question. For each domain, this gave a probabilistic confirmation that a change had
indeed occurred at the period in question, rather than simply a series of fluctuations
due to ‘noise’ in the traffic data.
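To illustrate the idea of a likelihood curve for a single change in traffic level, the sketch below scores every candidate change date by the Gaussian log-likelihood of a model with one mean before that date and another after it. This is a textbook single-change-point formulation chosen for illustration; it is an assumption, not necessarily the algorithm MathSight deployed, and the function names and sample series are hypothetical.

```python
import math


def change_point_profile(values, min_segment=3):
    """Gaussian log-likelihood of a single mean shift at each candidate index.

    Index k means values[:k] share one mean and values[k:] another,
    with a common variance estimated from the pooled residuals.
    """
    n = len(values)
    profile = {}
    for k in range(min_segment, n - min_segment + 1):
        before, after = values[:k], values[k:]
        mean_before = sum(before) / len(before)
        mean_after = sum(after) / len(after)
        rss = sum((v - mean_before) ** 2 for v in before)
        rss += sum((v - mean_after) ** 2 for v in after)
        variance = max(rss / n, 1e-12)  # guard against log(0) on flat data
        profile[k] = -0.5 * n * (math.log(2 * math.pi * variance) + 1)
    return profile


def most_likely_change(values, min_segment=3):
    """Index at which a single change in level is most likely."""
    profile = change_point_profile(values, min_segment)
    return max(profile, key=profile.get)


# Hypothetical cleaned traffic series with a step up at index 8.
series = [4.0, 4.1, 3.9, 4.0, 4.2, 3.8, 4.0, 4.1,
          5.0, 5.1, 4.9, 5.2, 5.0, 4.8, 5.1, 5.0]
change_index = most_likely_change(series)
```

A sharp peak in the profile at one date suggests a genuine step change, while a flat profile suggests the movement is more plausibly noise.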
Using this method, three of our eight site categories (numbers 2, 5 and 7) were
selected, as they each showed clear evidence (like the pattern in the upper graph) that
a change in daily visitor traffic had occurred.
The lower graph (for the 8th category in our list) shows that it is unlikely that such a
change occurred on 19th May; rather, it took place later, in early June.
[Figure, upper graph: Likelihood of a single change in traffic level occurring, by date (16 Apr 2013 to 25 Jun 2013).]
[Figure, lower graph: Likelihood of a single change in traffic level occurring, by date (27 Mar 2013 to 5 Jul 2013).]
Simple statistical modelling
Following this confirmation that a change had indeed occurred, all the HTML pages of
the chosen domains were classified as either ‘winner’ or ‘loser’ pages with respect to
their mean traffic levels before and after the alleged algorithm update. The traffic
values were normalised, i.e. adjusted so that the difference between the ‘before’ and
‘after’ traffic levels was scaled correctly.
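The winner/loser labelling can be sketched as below. The paper does not say how the traffic difference was normalised, so the symmetric ratio used here (difference divided by the sum of the two means) is an assumption, as are the function name `classify_pages` and the sample data.

```python
def classify_pages(daily_traffic, change_index):
    """Label each page 'winner' or 'loser' from its mean traffic before and after.

    daily_traffic maps a page URL to its list of daily visits;
    change_index is the position of the first post-update day.
    """
    results = {}
    for url, visits in daily_traffic.items():
        before, after = visits[:change_index], visits[change_index:]
        mean_before = sum(before) / len(before)
        mean_after = sum(after) / len(after)
        total = mean_before + mean_after
        # Symmetric normalisation (an assumption): score lies in [-1, 1].
        score = (mean_after - mean_before) / total if total else 0.0
        # Pages with no change are counted as losers in this sketch.
        label = "winner" if score > 0 else "loser"
        results[url] = {"score": score, "label": label}
    return results


# Hypothetical daily visits for two pages, three days either side of the update.
traffic = {
    "/rings": [10, 12, 11, 20, 22, 21],
    "/watches": [30, 28, 32, 15, 14, 16],
}
labels = classify_pages(traffic, change_index=3)
```

Scaling the difference keeps a high-traffic page from dominating a low-traffic one simply because its raw numbers are larger.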
Then, the effect of html page features on traffic difference was analysed using the
Analysis of Variance (ANOVA) method. This enabled us to see if there was any
statistically significant relationship between feature metrics and daily search traffic
variation.
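For readers unfamiliar with ANOVA, the sketch below computes the one-way F statistic for a single feature, comparing its values on ‘winner’ pages with its values on ‘loser’ pages. It is a minimal hand-rolled version for illustration; the grouping by winner/loser and the sample numbers are assumptions, and a statistics package would normally be used to obtain the p-value as well.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df between, df within) for a one-way ANOVA.

    groups is a list of lists of feature values, one list per group.
    Assumes at least two groups and some variation within the groups.
    """
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    ss_within = sum(
        (v - m) ** 2 for group, m in zip(groups, group_means) for v in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical rare-word counts in page titles for each group.
winner_titles = [3, 4, 5]
loser_titles = [1, 2, 3]
f_stat, df_between, df_within = one_way_anova_f([winner_titles, loser_titles])
```

A large F value relative to the F distribution with the returned degrees of freedom indicates that the feature differs between the two groups by more than chance alone would suggest.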
Results
The results below represent a selection of Penguin 2.0 case studies within the overall
data set.
Site A: An online luxury jewellery supplier
CUSTOMER QUERY: They wanted to understand why their daily traffic
jumped up on 19th May.
We found that the average visitor traffic before 19th May was 33.97 per day, while
afterwards it was up to 59.66 per day (an increase of about 75.6% in relative terms;
the 56.31% originally quoted appears to be the difference in natural-log traffic).
There was clear statistical confirmation that a change in traffic took place (see
spike in chart below).
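The size of the change can be checked from the two daily averages quoted above. The short calculation below shows both the ordinary relative change and the difference in natural logarithms; the function name is illustrative, and reading the quoted percentage as a log difference is an inference from the arithmetic rather than something the paper states.

```python
import math


def traffic_change(before, after):
    """Return the relative change and the natural-log difference."""
    relative = (after - before) / before
    log_difference = math.log(after / before)
    return relative, log_difference


relative, log_difference = traffic_change(33.97, 59.66)
print(f"relative change: {relative:.1%}")       # 75.6%
print(f"log difference:  {log_difference:.1%}")  # 56.3%
```

The two measures agree closely for small changes but diverge for large ones, which is why they differ so much here.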
[Figure: Likelihood of a single change in traffic level occurring, by date (27 Mar 2013 to 5 Jul 2013).]
OUR ANSWER:
Firstly, Google’s algorithm seems to have become attentive to the nature of the
title tags in their HTML pages, as this appears to have had an effect on the traffic
level after the change. The following aspects were found to be significant, shown here
in order of importance:
1. The number of syllables per title
2. The number of ‘rare’ words (i.e. those not in the list of 5,000 most commonly
used English language words) present in the title
3. The title length, in characters (less significantly)
Secondly, the nature of overall html body text has had an impact; in this order:
1. The number of words and characters in the document
2. The ratio of ‘rare’ to commonly used words
These were rewarded in the following fashion:
[Figure: Difference in mean feature values (axis from 0 to 0.50) for the following features:]
● Total number of words
● Body total number of rare words
● Title total number of rare words
● Title total number of syllables
Thirdly, but notably less significantly, the following two features
had some influence:
1. The number of hyperlinks present
2. The meta description character length
Site B: A mobile application vendor
Site type: Promotional and catalogue of products
CUSTOMER QUERY: The ecommerce team wanted to understand why their
visitor traffic fluctuated slightly around 19th May.
[Figure: Google search traffic for site ‘E’ over time. y-axis: number of organic (Google) page visits; x-axis: date, 16 Apr 2013 to 25 Jun 2013. Series: daily, weekly moving average, without day effect.]
We found that the average visitor traffic before 19th May was 49,534.53. It initially
rose and then settled, having dropped slightly overall to 49,271.79 (a -0.53% change).
OUR ANSWER:
This site has much higher traffic volumes and many more pages, so the data extracted
was far richer than that obtained from site A. Nevertheless, similarities between this
website and site A quickly became apparent: there seemed to be a focus on the overall
HTML page body text content, meta descriptions and hyperlinks.
That is to say:
● The total number of words; the number of syllables in those words; the ratio of
rare and extremely rare to those commonly used; the number of difficult words;
the number of sentences and the ratio of unique to duplicate words. Indeed,
text readability (which is a combination of almost all the other word-related
features) emerged as slightly significant.
● The number of hyperlinks and those linking to files or html files specifically.
● The meta description; here the number of words and ratio of unique to
duplicate words.
However, the difference here was that there was no hint of title length or content
being significant. Rather, there was also a focus on two other areas (listed in order
of importance):
● The length in characters and number of words in anchor text.
● The number of rare words in the headers.
Thanks to the increased dataset for this client, by comparing pages that ‘won’ post
algorithm change with those that lost, we were able to observe that some features
were quite substantially rewarded as they increased in value; whilst others were
punished. See below:
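The comparison of pages that ‘won’ with those that ‘lost’ can be sketched as a difference in mean feature values between the two groups. The sketch below standardises each feature first so that features measured in different units can be shown on one axis; that standardisation, the function name and the sample pages are assumptions, since the paper does not describe how the plotted values were scaled.

```python
from statistics import mean, pstdev


def mean_feature_differences(pages):
    """Mean standardised feature value for winners minus that for losers.

    pages is a list of dicts with a 'label' ('winner' or 'loser') and a
    'features' dict; both labels must be present at least once.
    """
    differences = {}
    for name in pages[0]["features"]:
        values = [page["features"][name] for page in pages]
        centre, spread = mean(values), pstdev(values)
        if spread == 0:
            differences[name] = 0.0  # a constant feature cannot separate groups
            continue
        scores = [(v - centre) / spread for v in values]
        winners = [s for s, p in zip(scores, pages) if p["label"] == "winner"]
        losers = [s for s, p in zip(scores, pages) if p["label"] == "loser"]
        differences[name] = mean(winners) - mean(losers)
    return differences


# Hypothetical pages: positive results read as 'rewarded', negative as 'punished'.
pages = [
    {"label": "winner", "features": {"hyperlinks": 10, "anchor_words": 2}},
    {"label": "winner", "features": {"hyperlinks": 12, "anchor_words": 3}},
    {"label": "loser", "features": {"hyperlinks": 4, "anchor_words": 5}},
    {"label": "loser", "features": {"hyperlinks": 6, "anchor_words": 6}},
]
differences = mean_feature_differences(pages)
```

In this made-up data the number of hyperlinks comes out positive and the number of anchor-text words negative, mirroring the rewarded/punished reading of the chart.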
[Figure: Difference in mean feature values, on an axis from -0.70 (punished) to 0.70 (rewarded), for the following features:]
● Number of hyperlinks with html
● Number of hyperlinks with file
● Number of hyperlinks
● Metadesc total number of words
● Metadesc number of unique words
● Metadesc number of duplicate words
● h3 total number of rare words
● h1 total number of rare words
● Number of rare words
● Body total number of rare words
● Number of duplicate words
● Number of unique words
● Total number of words
● Total number of syllables
● Total number of 5K common words
● Total number of dictionary (very rare) words
● Average number of words in anchor text
● Average length of anchor text
Site C: An online watch vendor
Site type: extensive online catalogue with product photographs
CUSTOMER QUERY: They wanted to understand a recent increase in daily traffic
sourced from Google search, which started on 19th May.
The site’s daily average traffic before was 429.85, and after this date about 503.0,
thereby showing an increase of about 17% in relative terms (about 16% measured as a
difference in natural-log traffic).
OUR ANSWER:
Again, this was a relatively rich set of data, from which certain significant features
emerged. We start with those that have already appeared in one or more of the previous
two sites examined:
● Number of hyperlinks (and whether they link to a file)
● Anchor text (length and number of words, number of unique and duplicate
words, number of syllables)
● Body text (ratio of commonplace to rarer words, ratio of unique to duplicate
words, readability)
● Headers (and whether or not they contain rare words)
● To a certain extent, meta description character length
Additionally, the following features were significant:
● Number of external CSS styles (cascading style sheets - manages appearance
of the website)
● Number of scripts
● Number of external and absolute internal links
In terms of rewarded and/or punished features, with this client a large quantity of
anchor text (regardless of whether it contained duplicate words) was heavily rewarded,
as was the ‘reading ease level’ of the body text and headers (i.e. the harder or more
complex it was, the better it fared with Google’s algorithm). This relates to the
number of rare words. The full breakdown is shown in the following graph.
[Figure: Difference in mean feature values, on an axis from -0.50 (punished) to 0.50 (rewarded), for the following features:]
● Max number of words in anchor text
● Maximum length of anchor text
● Anchor number of unique words
● Anchor total number of syllables
● Average length of anchor text
● Anchor total number of words
● Anchor number of duplicate words
● h1 total number of rare words
● Number of h1 instances
● Dale-Chall readability
● Ratio of rare words to total
● Number of rare words
● Number of unique 5K common words
● Total number of 5K common words
● Number of hyperlinks
● Number of hyperlinks with file
● Number of absolute internal links
● Number of external links
● Number of external CSS styles
● Number of scripts
Conclusions
Our analysis suggests that there is a possibly significant association between a
site’s search-sourced visitor levels and the values of certain page features, with
traffic gains accompanying higher values of some features and lower values of others.
This is based on two months’ worth of traffic data (observed 10 days, one month and
6 weeks before and after the change).
Overall, there was a large variation between the types of rewarded features for the
different websites analysed. This would suggest that any advice given would be most
effective if tailored to the individual domain, or type of domain if domains can be
grouped or clustered into types. For example, with site B the presence of anchor text
was seemingly punished as a feature, whereas for site C it was heavily rewarded.
However, even with all the above variation - looking at the two larger and more popular
(traffic-wise) sites that clearly exhibited effects of Penguin 2.0, we can draw the
following conclusions:
● In body text, rare words are good and generally rewarded - i.e. those that are
not in the list of 5,000 most common words in the English language. So it is a
good idea to raise the writing level of the page copy (i.e. aim for higher Dale-
Chall readability scores).
● Use of headings will be rewarded; it is also advantageous to use words that are
less commonplace here.
● The number of hyperlinks present appears to have been rewarded – i.e. the
more hyperlinks the greater the increase in traffic (in some cases), although
perhaps this is too vague to take any action upon. Of those hyperlinks, there
was no bias towards external or internal links however.
● Finally, depending on the type of site, and based on our limited survey the
presence and increased character length of meta descriptions and the
increased quantity of words in anchor text are now slightly more rewarded
than previously.
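The first point above refers to Dale-Chall readability, where a higher score indicates harder text. The sketch below applies the published new Dale-Chall formula (0.1579 × percentage of difficult words + 0.0496 × average sentence length, plus 3.6365 when difficult words exceed 5%). The tiny `FAMILIAR_WORDS` set is a hypothetical stand-in for the real familiar-word list, and the sentence and word splitting is deliberately crude.

```python
import re

# Tiny stand-in for a familiar-word list (hypothetical; real lists hold
# thousands of words).
FAMILIAR_WORDS = {"the", "a", "is", "and", "of", "this", "watch", "gold", "made", "in"}


def dale_chall_score(text, familiar=FAMILIAR_WORDS):
    """New Dale-Chall readability score; higher means harder to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        return 0.0
    difficult = sum(1 for word in words if word not in familiar)
    pct_difficult = 100.0 * difficult / len(words)
    avg_sentence_length = len(words) / len(sentences)
    score = 0.1579 * pct_difficult + 0.0496 * avg_sentence_length
    if pct_difficult > 5.0:
        score += 3.6365  # adjustment applied when difficult words exceed 5%
    return score


plain_score = dale_chall_score("This watch is made of gold.")
rare_score = dale_chall_score("This chronograph is made of palladium.")
```

Swapping two familiar words for rarer ones raises the score sharply, which is the sense in which rare words and a higher writing level go together in the findings above.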
In addition to the insights gained on deconstructing Penguin 2.0, we can now use the
models to evaluate the inbound link profiles of sites that may have been affected by
the latest algorithm update. For example, the models may be applied to a site’s inbound
link profile when trying to decide which links to disavow when faced with a search
engine manual penalty notice (i.e. apparent abnormal inbound link activity), or
subsequent loss of traffic in response to a search engine’s algorithm update.
Future work
Our research into analysing search algorithm updates is continuing as our data
partnership community grows. Like the search engines, we are increasing the number of
features we analyse and searching for the “sweet spots” of site optimisation. The
benefit of using machine learning is that our data modelling of algorithm updates is
dynamic and therefore stays up to date with the constantly improving search engines.
Thus MathSight’s models evolve with the search engines.
Further research papers will be released to provide insights into Penguin and Panda
going forward.
About MathSight
MathSight was launched in March 2013; it demystifies the search engine algorithms
using machine learning and big data.
The platform analyses both the qualitative and stylistic aspects of content, web design,
and site architecture, their inter-relationships, traffic data and other key performance
indicators. This enables MathSight to determine the cause of changes in search engine
traffic, be it a change in the algorithm, or the SEO (onsite and offsite) of a client or
competitors. These insights are currently available for integration into bespoke and
best in class, enterprise level, SEO tools.
For more information visit: MathSight.org
Factors-Influencing-Branding-Strategies.pptx
 
Labour Day Celebrating Workers and Their Contributions.pptx
Labour Day Celebrating Workers and Their Contributions.pptxLabour Day Celebrating Workers and Their Contributions.pptx
Labour Day Celebrating Workers and Their Contributions.pptx
 

Google Penguin 2.0 Whitepaper Analysis

  • 1. DECONSTRUCTING GOOGLE’S PENGUIN 2.0 | WHITEPAPER 1 | 34 Paradise Road | Richmond Upon Thames | United Kingdom | TW9 1SE | TEL // +44 (0)8442642960 | VISIT // mathsight.org
  • 2. White Paper: Deconstructing Google’s Penguin 2.0
    Produced by MathSight

    This paper identifies how shifts in traffic sourced by Google’s search engine can be related to the structural and content-based features of a company’s web pages. We performed this analysis in order to extract potentially useful insights for the following groups:
    A. Marketing agencies
    B. The wider online SEO community
    C. Online businesses
    D. Those with an interest in big data and analytics

    Introduction

    We decided to apply our machine-learning-led predictive SEO models to deconstruct the Google Penguin 2.0 algorithm update, rolled out on 19th May 2013. Although it is near impossible to reverse engineer a complete search engine algorithm such as Google’s, it is possible to show the potential causes of any change in algorithm methods when it occurs. We look for a step change in a pattern that could be an underlying increase or decrease in actual Google-sourced traffic as a result of an algorithm alteration, such as the recent Penguin 2.0 update.
  • 3. What we did

    Once the Google search traffic dataset for our chosen group of web domains had been obtained from website analytics, de-seasonalised and filtered, the first step in the reverse engineering process was to confirm that a change in traffic did indeed take place. This was done using signal processing techniques, a best practice in the oil and gas exploration industry, to detect the likely point of change in the noisy data. Following this, we gathered a wide range of standard SEO features from the pages within the domain (title character length, number of meta description words, readability and so on). Finally, we applied a variety of statistical methods to identify those features that were rewarded or penalised in terms of their Google search traffic after the likely algorithm update time.

    Our results showed, with statistical confidence of around 90-95%, that the main areas within HTML that Google has probably targeted with this change were:
    ● Main body text
    ● Hyperlinks
    ● Anchor text (clickable text in the hyperlink)
    ● Meta description text

    Methods

    Data collection

    Websites from eight business categories were used for the purposes of this study, in order to create a well-rounded dataset:
    ● Online retailers including the travel, gifts, mobile apps and jewellery sectors;
    ● Corporate B2B companies including business awards, advertising and PR.

    HTML file contents were first gathered by our in-house web crawler, which scanned the sites in depth for structural and content-based ‘features’. Daily website analytics (page view) data was also imported for each domain above, spanning a two-month period from 11 April 2013 through to 11 June 2013. This period afforded a reasonable window around the time that Google had announced the ‘Penguin 2.0’ algorithm update.
  • 4. [Figure: Whole site traffic for site E over a 3-month period. x-axis: date, 16-Apr-13 to 25-Jun-13; y-axis: page visits (source: Google / medium: organic), ticks 3.8 to 5.8.]

    Cleansing and exploration of the data

    The traffic data, in time-series form for a single domain, were first smoothed using a moving average; seasonal variation was then removed to reduce the effect of a repeated site usage pattern across the week (e.g. reduced visits at the weekend). The result is slightly more insightful than both the moving average and the raw traffic numbers, as abrupt changes are clearly defined yet separated from any cyclical variation.

    [Figure: Google search traffic for site ‘E’ over time. x-axis: date, 16-Apr-13 to 25-Jun-13; y-axis: organic (Google) page visits, ticks 3.8 to 5.8; series: daily, weekly MA, without day effect.]
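The cleansing step above can be illustrated with a minimal sketch. It assumes a trailing 7-day moving average and a simple additive day-of-week effect; the paper does not state which smoothing window or seasonal model was used, so the function names, the window and the sample series are all hypothetical.

```python
from statistics import mean

def moving_average(values, window=7):
    """Trailing moving average; early points average over the days available so far."""
    return [mean(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]

def remove_day_effect(values, period=7):
    """Subtract each weekday's average deviation from the overall mean (additive model)."""
    overall = mean(values)
    day_effect = [mean(values[d::period]) - overall for d in range(period)]
    return [v - day_effect[i % period] for i, v in enumerate(values)]

# Hypothetical daily series: a weekend dip repeated every week,
# with a step up in the underlying level after four weeks.
weekly_pattern = [0, 0, 0, 0, 0, -20, -20]
raw = [100 + weekly_pattern[i % 7] for i in range(28)]
raw += [120 + weekly_pattern[i % 7] for i in range(28)]

weekly_ma = moving_average(raw)          # smooths, but blurs the step over a week
without_day = remove_day_effect(raw)     # keeps the step sharp, removes the weekly cycle
```

In this toy series the de-seasonalised values are flat on either side of day 28 and jump by 20 at the step, which is the property the text describes: the abrupt change stays visible while the weekly cycle disappears.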
  • 5. Using this cleaned traffic data, a change point detection algorithm was deployed in order to detect the most likely timing of a change in traffic levels over the period in question. For each domain, this gave a probabilistic confirmation that a change had indeed occurred at the period in question, rather than simply a series of fluctuations due to ‘noise’ in the traffic data.

    Using this method, three of our eight site categories were selected (numbers 2, 5 and 7), as they each showed clear evidence (like the pattern in the upper graph) that a change in daily visitor traffic had occurred. The lower graph (for the 8th category in our list) shows that it is unlikely that such a change occurred on 19th May; rather, it took place later, in early June.

    [Figure (upper): Likelihood of a single change in traffic level occurring, by date, 16-Apr-13 to 25-Jun-13; likelihood axis -546 to -540.]

    [Figure (lower): Likelihood of a single change in traffic level occurring, by date, 27-Mar-13 to 05-Jul-13; likelihood axis -790 to -755.]
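The paper does not name the change point algorithm it used. As a hedged illustration only, the sketch below profiles the Gaussian log-likelihood of a single shift in mean level at each candidate date, which produces a curve of the same kind as the "likelihood of single change" charts. The model (one mean shift, shared variance), the function names and the sample data are assumptions.

```python
import math

def change_point_profile(values, min_seg=3):
    """Log-likelihood of a single mean shift at each candidate split index.

    Assumes Gaussian noise with one shared variance, estimated by maximum likelihood.
    """
    n = len(values)
    profile = {}
    for tau in range(min_seg, n - min_seg + 1):
        left, right = values[:tau], values[tau:]
        m1, m2 = sum(left) / len(left), sum(right) / len(right)
        rss = sum((v - m1) ** 2 for v in left) + sum((v - m2) ** 2 for v in right)
        sigma2 = max(rss / n, 1e-12)  # guard against log(0) on noise-free data
        profile[tau] = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return profile

def most_likely_change(values, min_seg=3):
    """Return the split index with the highest log-likelihood, and that likelihood."""
    profile = change_point_profile(values, min_seg)
    tau = max(profile, key=profile.get)
    return tau, profile[tau]

# Hypothetical cleaned series: level 4.4 for 20 days, then 5.2, with small fixed noise.
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1]
series = [4.4 + noise[i % 7] for i in range(20)] + [5.2 + noise[i % 7] for i in range(20)]
tau, loglik = most_likely_change(series)
```

A sharp peak in the profile at one date suggests a genuine step change; a flat profile, or a peak at a different date, suggests noise or a change unrelated to the update, which is how the text distinguishes the selected categories from the 8th.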
  • 6. Simple statistical modelling

    Following this confirmation that a change had indeed occurred, all the HTML pages of the chosen domains were classified as either ‘winner’ or ‘loser’ pages with respect to their mean traffic levels before and after the presumed algorithm update. The traffic values were normalised, i.e. adjusted so that the differences between ‘before’ and ‘after’ traffic levels were on a comparable scale. Then, the effect of HTML page features on traffic difference was analysed using the Analysis of Variance (ANOVA) method. This enabled us to see whether there was any statistically significant relationship between feature metrics and daily search traffic variation.

    Results

    The results below represent a selection of Penguin 2.0 case studies within the overall data set.

    Site A: An online luxury jewellery supplier

    CUSTOMER QUERY: They wanted to understand why their daily traffic jumped up on 19th May.

    We found that the average visitor traffic before 19th May was 33.97 per day, while afterwards it was up to 59.66 per day. The paper quotes this as an increase of 56.31%, which matches the log ratio ln(59.66/33.97) ≈ 0.563; as a simple percentage the increase is about 75.6%. There was clear statistical confirmation that a change in traffic took place (see spike in chart below).

    [Figure: Likelihood of a single change in traffic level occurring, by date, 27-Mar-13 to 05-Jul-13; likelihood axis -468 to -462.]
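The modelling step described above, together with the Site A figures, can be sketched as follows. This is an illustration under stated assumptions, not MathSight's method: the log ratio is one possible normalisation (it happens to agree with the Site A figure), the ANOVA here compares a single feature between winner and loser pages, and all page data and names are hypothetical.

```python
import math
from statistics import mean

def log_change(before_mean, after_mean):
    """Log ratio of mean traffic; an assumed normalisation, not confirmed by the paper."""
    return math.log(after_mean / before_mean)

def label_page(before, after):
    """'winner' if mean daily traffic rose after the change date, otherwise 'loser'."""
    return "winner" if mean(after) > mean(before) else "loser"

def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA across groups of feature values."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical pages: daily visits before/after the change date, plus one feature.
pages = [
    {"before": [10, 12, 11], "after": [15, 16, 17], "body_words": 820},
    {"before": [8, 9, 10], "after": [12, 13, 11], "body_words": 760},
    {"before": [20, 22, 21], "after": [25, 27, 26], "body_words": 900},
    {"before": [30, 31, 29], "after": [20, 22, 21], "body_words": 310},
    {"before": [14, 15, 16], "after": [10, 9, 11], "body_words": 420},
    {"before": [7, 8, 6], "after": [5, 4, 6], "body_words": 380},
]

groups = {"winner": [], "loser": []}
for page in pages:
    groups[label_page(page["before"], page["after"])].append(page["body_words"])

f_stat = one_way_anova_f(list(groups.values()))
site_a_change = log_change(33.97, 59.66)  # about 0.563, in line with the quoted 56.31%
```

A large F statistic, compared against the F distribution with (k - 1, n - k) degrees of freedom, indicates that the feature's mean differs between winners and losers by more than within-group scatter would explain.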
  • 7. OUR ANSWER: Firstly, Google’s algorithm seemed to have become attentive to the nature of the title tags in their HTML pages, as this seems to have had an effect on the traffic level after the change. These aspects were found to be significant, shown here in order of importance:
    1. The number of syllables per title
    2. The number of ‘rare’ words (i.e. those not in the list of 5,000 most commonly used English language words) present in the title
    3. The title length, in characters (less significantly)

    Secondly, the nature of the overall HTML body text has had an impact, in this order:
    1. The number of words and characters in the document
    2. The ratio of ‘rare’ to commonly used words

    These were rewarded in the following fashion:

    [Figure: Difference in mean feature values (axis 0 to 0.50) for: total number of words; body total number of rare words; title total number of rare words; title total number of syllables.]

    Thirdly, but notably less significantly, the following two features had some influence:
    1. The number of hyperlinks present
    2. The meta description character length
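The title and body features listed above (characters, words, syllables, ‘rare’ words) can be computed along the following lines. This is a sketch only: `COMMON_WORDS` is a tiny hypothetical stand-in for the 5,000-word list, the syllable counter is a rough vowel-group heuristic rather than whatever the authors' crawler used, and the ratio shown is rare words to total words.

```python
import re

# Hypothetical stand-in for the list of the 5,000 most common English words.
COMMON_WORDS = {"the", "a", "of", "and", "for", "in", "gold", "ring", "with"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def text_features(text):
    """Length, word, syllable and rare-word metrics for a title or body string."""
    words = tokenize(text)
    rare = [w for w in words if w not in COMMON_WORDS]
    return {
        "chars": len(text),
        "words": len(words),
        "syllables": sum(count_syllables(w) for w in words),
        "rare_words": len(rare),
        "rare_to_total": len(rare) / len(words) if words else 0.0,
    }

# Hypothetical page title.
features = text_features("Gold filigree ring with sapphire")
```

The vowel-group heuristic over-counts words with silent vowels, so a dictionary-based syllable count would be the safer choice for a real analysis.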
  • 8. Site B: A mobile application vendor

    Site type: Promotional and catalogue of products

    CUSTOMER QUERY: The ecommerce team wanted to understand why their visitor traffic fluctuated slightly around 19th May.

    [Figure: Google search traffic for site ‘E’ over time. x-axis: date, 16-Apr-13 to 25-Jun-13; y-axis: organic (Google) page visits, ticks 3.8 to 5.8; series: daily, weekly MA, without day effect.]

    We found that the average visitor traffic before 19th May was 49,534.53; it initially rose and then settled, overall having dropped slightly to 49,271.79 (a -0.53% change).

    OUR ANSWER: This site has much higher traffic volumes and many more pages, so the data extracted was far richer than that obtained from site A. Nevertheless, similarities between this website and site A quickly became apparent: there seemed to be a focus on the overall HTML page body text content, meta descriptions and hyperlinks. That is to say:
    ● The total number of words; the number of syllables in those words; the ratio of rare and extremely rare words to those commonly used; the number of difficult words; the number of sentences; and the ratio of unique to duplicate words. Indeed, text readability (which is a combination of almost all the other word-related features) emerged as slightly significant.
    ● The number of hyperlinks, and specifically those linking to files or HTML files.
    ● The meta description: here, the number of words and the ratio of unique to duplicate words.
  • 9. However, the difference here was that there was no hint of title length or content being significant. Rather, there was also a focus on two other areas (listed in order of importance):
    ● The length in characters and number of words in anchor text.
    ● The number of rare words in the headers.

    Thanks to the larger dataset for this client, by comparing pages that ‘won’ after the algorithm change with those that lost, we were able to observe that some features were quite substantially rewarded as they increased in value, whilst others were punished. See below:

    [Figure: Difference in mean feature values, from punished (-0.70) to rewarded (0.70). Features shown: number of hyperlinks with html; number of hyperlinks with file; number of hyperlinks; metadesc total number of words; metadesc number of unique words; metadesc number of duplicate words; h3 total number of rare words; h1 total number of rare words; number of rare words; body total number of rare words; number of duplicate words; number of unique words; total number of words; total number of syllables; total number of 5K common words; total number of dictionary (very rare) words; average number of words in anchor text; average length of anchor text.]
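One way to arrive at a ‘punished’ versus ‘rewarded’ scale like the one in the chart above is to standardise each feature across all pages and take the difference between the winner and loser means. The paper does not say how its feature values were scaled, so the z-score step, the function name and the page data below are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def mean_difference(pages, feature):
    """Winner mean minus loser mean of a z-scored feature.

    Positive values read as 'rewarded', negative as 'punished'.
    Assumes at least one winner page and one loser page.
    """
    values = [p[feature] for p in pages]
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return 0.0  # a constant feature cannot separate winners from losers
    z_scores = {True: [], False: []}
    for p in pages:
        z_scores[p["winner"]].append((p[feature] - mu) / sd)
    return mean(z_scores[True]) - mean(z_scores[False])

# Hypothetical pages, already labelled as winners or losers.
pages = [
    {"winner": True, "anchor_words": 2, "rare_words": 40},
    {"winner": True, "anchor_words": 3, "rare_words": 50},
    {"winner": False, "anchor_words": 5, "rare_words": 20},
    {"winner": False, "anchor_words": 6, "rare_words": 30},
]

differences = {f: mean_difference(pages, f) for f in ("anchor_words", "rare_words")}
```

Standardising first puts features with very different units (character counts, word counts, link counts) on one axis, which is what allows them to share a single chart.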
  • 10. Site C: An online watch vendor

    Site type: Extensive online catalogue with product photographs

    CUSTOMER QUERY: They wanted to understand a recent increase in daily traffic sourced from Google search, which started on 19th May. The site’s daily average traffic was 429.85 before this date and about 503.0 after it, an increase of roughly 16% (ln(503.0/429.85) ≈ 0.157; about 17% as a simple percentage).

    OUR ANSWER: Again, this was a relatively rich set of data, from which certain significant features emerged. To start with, those that have already appeared in one or more of the previous two sites examined:
    ● Number of hyperlinks (and whether they link to a file)
    ● Anchor text (length and number of words, number of unique and duplicate words, number of syllables)
    ● Body text (ratio of commonplace to rarer words, ratio of unique to duplicate words, readability)
    ● Headers (and whether or not they contain rare words)
    ● To a certain extent, meta description character length

    Additionally, the following features were significant:
    ● Number of external CSS styles (cascading style sheets, which manage the appearance of the website)
    ● Number of scripts
    ● Number of external and absolute internal links

    In terms of rewarded and/or punished features, with this client a large quantity of anchor text (regardless of whether it contained duplicate words) was heavily rewarded, as was the ‘reading ease level’ of the body text and headers (i.e. the harder or more complex the text, the better it fared with Google’s algorithm). This relates to the number of rare words. The full breakdown is shown in the following graph.
  • 11. [Figure: Difference in mean feature values, from punished (-0.50) to rewarded (0.50). Features shown: max number of words in anchor text; maximum length of anchor text; anchor number of unique words; anchor total number of syllables; average length of anchor text; anchor total number of words; anchor number of duplicate words; h1 total number of rare words; number of h1 instances; Dale-Chall readability; ratio of rare words to total; number of rare words; number of unique 5K common words; total number of 5K common words; number of hyperlinks; number of hyperlinks with file; number of absolute internal links; number of external links; number of external CSS styles; number of scripts.]

    Conclusions

    Our analysis suggests that there may be a significant, positive association between a site’s search-sourced visitor levels and increasing the values of certain features while reducing the values of others, based on two months’ worth of traffic data (observed 10 days, one month and 6 weeks before and after the change).

    Overall, there was a large variation between the types of rewarded features for the different websites analysed. This would suggest that any advice given would be most effective if tailored to the individual domain, or to the type of domain if domains can be grouped or clustered into types. For example, with site B the presence of anchor text was seemingly punished as a feature, whereas for site C it was heavily rewarded.
  • 12. However, even with all the above variation, looking at the two larger and more popular (traffic-wise) sites that clearly exhibited effects of Penguin 2.0, we can draw the following conclusions:
    ● In body text, rare words (i.e. those that are not in the list of the 5,000 most common words in the English language) are good and generally rewarded. So it is a good idea to raise the writing level of the page copy (i.e. aim for higher Dale-Chall readability scores).
    ● Use of headings will be rewarded; it is also advantageous to use words that are less commonplace here.
    ● The number of hyperlinks present appears to have been rewarded, i.e. the more hyperlinks, the greater the increase in traffic (in some cases), although perhaps this is too vague to take any action upon. Among those hyperlinks, however, there was no bias towards external or internal links.
    ● Finally, depending on the type of site, and based on our limited survey, the presence and increased character length of meta descriptions and the increased quantity of words in anchor text are now slightly more rewarded than previously.

    In addition to the insights gained from deconstructing Penguin 2.0, we can now use the models to evaluate the inbound link profiles of sites that may have been affected by the latest algorithm update. For example, the models may be applied to a site’s inbound link profile when trying to decide which links to disavow when faced with a search engine manual penalty notice (i.e. apparent abnormal inbound link activity), or with a subsequent loss of traffic in response to a search engine’s algorithm update.

    Future work

    Our research into analysing search algorithm updates is continuing as our data partnership community grows. Like search engines, we are adding to the number of features and searching for the “sweet spots” of site optimisation.

    The benefit of using machine learning is that our data modelling of algorithm updates is dynamic and therefore stays up to date with the constantly improving search engines. Thus MathSight’s models evolve with the search engines. Further research papers will be released to provide insights into Penguin and Panda going forward.
  • 13. About MathSight

    MathSight was launched in March 2013; it demystifies the search engine algorithms using machine learning and big data. The platform analyses both the qualitative and stylistic aspects of content, web design and site architecture, their inter-relationships, traffic data and other key performance indicators. This enables MathSight to determine the cause of changes in search engine traffic, be it a change in the algorithm or in the SEO (onsite and offsite) of a client or its competitors. These insights are currently available for integration into bespoke, best-in-class, enterprise-level SEO tools.

    For more information visit: MathSight.org