1. STRUCTURAL PROFILING OF WEB SITES IN THE WILD
LABORATOIRE D’INFORMATIQUE FORMELLE UNIVERSITÉ DU QUÉBEC À CHICOUTIMI
XAVIER CHAMBERLAND-THIBEAULT AND SYLVAIN HALLÉ, ICWE, JUNE 9, 2020
3. DEBUGGING AND FIXING WEB APPLICATIONS
A growing number of tools help analyze, debug, detect errors in, or even process the output of web applications.
Most of these tools focus on analyzing the Document Object Model (DOM) and the Cascading Style Sheets (CSS) of a page.
These tools serve varied purposes:
Fixing cross-browser issues;
Interpreting the DOM;
Detecting responsive web design bugs;
Etc.
4. WHAT DOES A WEB PAGE LOOK LIKE?
The scalability of most of the aforementioned tools, and sometimes even their success, depends on size-related features of the page.
What is the average size of a web page?
Walsh et al. (2015) ran experiments on pages of up to 196 DOM nodes, whereas Choudhary et al. (2013) chose pages of up to 39146 DOM nodes.
This paper addresses this issue through a large-scale analysis of 708 websites, measuring an array of parameters related to the size and structure of web pages.
7. WEBSITE COLLECTION
To obtain a pool of websites representative of what users actually encounter, we needed the sites visited by the most users.
For this, we used the Moz top 500 list of most-visited websites. However, it contained many duplicates in the form of country-specific versions of the same web application.
Out of those 500 sites, only 300 non-duplicates remained.
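One possible way to collapse country-specific duplicates is sketched below. It is not the procedure used for the study: the helpers `siteKey` and `dedupe` and the short `SUFFIXES` list are hypothetical simplifications, and a more careful version would rely on the Public Suffix List.

```javascript
// Hypothetical sketch: collapse country-specific variants such as
// google.com / google.fr / google.co.uk into a single entry.
// Assumption: a site's "brand" is the label right before a known suffix.
const SUFFIXES = ["co.uk", "com.br", "co.jp", "com", "fr", "de", "ca", "org", "net"];

function siteKey(hostname) {
  const host = hostname.toLowerCase().replace(/^www\./, "");
  // Try longer suffixes first so that "co.uk" is matched before shorter ones.
  const sorted = [...SUFFIXES].sort((a, b) => b.length - a.length);
  for (const suffix of sorted) {
    if (host.endsWith("." + suffix)) {
      const rest = host.slice(0, -(suffix.length + 1));
      return rest.split(".").pop(); // keep only the label next to the suffix
    }
  }
  return host; // unknown suffix: treat the whole host as its own key
}

function dedupe(hostnames) {
  const seen = new Map();
  for (const h of hostnames) {
    const key = siteKey(h);
    if (!seen.has(key)) seen.set(key, h); // keep the first occurrence
  }
  return [...seen.values()];
}
```

Note that this heuristic also merges unrelated sites that share a label under different suffixes, so its output would still need a manual check.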
Yet the sites visited by the most users do not tell the whole story, since this notion is orthogonal to the sites an individual user visits most.
Therefore, we informally asked people around us for the list of websites they use daily.
8. DOM HARVESTING
To collect data on the DOM of each of these sites, a JavaScript program was designed to run once a page has finished loading.
The script starts at the body node of a page and performs a preorder traversal of the entire DOM tree, recording and computing various features:
Tag names;
CSS classes;
Visibility status;
Structural information.
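A preorder traversal of this kind can be sketched as follows. This is an illustration rather than the authors' script: the function name `harvest`, the record fields, and the caller-supplied `isVisible` predicate are all assumptions.

```javascript
// Minimal sketch of a preorder DOM traversal (not the authors' actual script).
// Assumption: each node exposes tagName, className and children, as DOM
// elements do; `isVisible` is a hypothetical caller-supplied predicate.
function harvest(root, isVisible = () => true) {
  const records = [];
  function visit(node, depth, parentId) {
    const id = records.length;
    // className is not a plain string on some nodes (e.g. SVG); skip those.
    const classes = typeof node.className === "string"
      ? node.className.split(/\s+/).filter(Boolean)
      : [];
    const children = Array.from(node.children || []);
    // Preorder: record the node before descending into its children.
    records.push({
      id,
      parentId,
      depth,
      tag: String(node.tagName).toLowerCase(),
      classes,
      visible: Boolean(isVisible(node)),
      degree: children.length,
    });
    for (const child of children) visit(child, depth + 1, id);
  }
  visit(root, 0, null);
  return records;
}
```

In a browser this could be called as `harvest(document.body, el => getComputedStyle(el).display !== "none")`. That predicate is only one of several possible definitions of visibility.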
The script then generates two files: a JSON file containing all the data and a DOT file accepted by the Graphviz library, giving us both statistical and visual representations of a web page.
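The two output formats could be produced along these lines. The record fields (`id`, `parentId`, `tag`) and the function `toDot` are illustrative assumptions, and the actual files may be laid out differently.

```javascript
// Hypothetical sketch: serialize harvested nodes to JSON and to Graphviz DOT.
// Assumption: each record has a numeric `id`, a `parentId` (null for the
// root) and a `tag` name; these field names are illustrative.
function toDot(records) {
  const lines = ["digraph dom {"];
  for (const r of records) {
    lines.push(`  n${r.id} [label="${r.tag}"];`);
    if (r.parentId !== null) lines.push(`  n${r.parentId} -> n${r.id};`);
  }
  lines.push("}");
  return lines.join("\n");
}

// Made-up three-node page used only to exercise the function.
const sampleRecords = [
  { id: 0, parentId: null, tag: "body" },
  { id: 1, parentId: 0, tag: "div" },
  { id: 2, parentId: 1, tag: "span" },
];
const sampleJson = JSON.stringify(sampleRecords);
const sampleDot = toDot(sampleRecords);
```

The resulting DOT text can be passed to a Graphviz layout tool such as `dot` to draw the tree.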
9. DOM HARVESTING – RUNNING ON EVERY PAGE
To run the script on every page, we used the TamperMonkey extension.
This extension, available for multiple browsers, lets the user inject and run custom JavaScript code every time a new page is loaded in the browser.
Note that the harvesting was done on the browser-rendered DOM and properties.
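A userscript for this purpose might look like the sketch below. The metadata keys are standard userscript headers, but the script name and the counting logic are assumptions made for illustration, not the authors' harvesting code.

```javascript
// ==UserScript==
// @name         DOM harvester (illustrative)
// @match        *://*/*
// @run-at       document-idle
// @grant        none
// ==/UserScript==

// Hypothetical sketch: count the elements below a root node.
function countElements(root) {
  let count = 1; // the root itself
  for (const child of Array.from(root.children || [])) {
    count += countElements(child);
  }
  return count;
}

// Only run inside a browser, where `document` exists.
if (typeof document !== "undefined" && document.body) {
  console.log(`${location.href}: ${countElements(document.body)} elements`);
}
```

Because the code runs inside the page after loading, it sees the DOM as rendered by the browser, including elements created by the page's own scripts.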
10. DATA PROCESSING
LabPal was used to process the 62 MB of raw data:
Every website was turned into an experiment that processes the associated JSON file;
It was then possible to aggregate all the recovered data and even perform deeper statistical analysis.
Note that some files were not used, since the automated loading also retrieved many pop-ups.
Manually inspecting each recovered file to detect pop-ups would have been tedious. We therefore applied a more generic filter that removes most of these pages: any file with fewer than 5 DOM nodes, or whose URL belongs to a list of known advertisement pages, is discarded.
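The filtering rule can be sketched as follows. The field names `url` and `nodeCount`, the function `keepFile`, and the hosts in `AD_HOSTS` are placeholders, as is the choice to discard files with an unparsable URL.

```javascript
// Minimal sketch of the pop-up filter described above.
// Assumptions: each harvested file is an object with `url` and `nodeCount`
// fields, and AD_HOSTS is a placeholder list (these hosts are made up).
const MIN_NODES = 5;
const AD_HOSTS = new Set(["ads.example.com", "popup.example.net"]);

function keepFile(file) {
  let host;
  try {
    host = new URL(file.url).hostname;
  } catch {
    return false; // assumption: discard files whose URL cannot be parsed
  }
  // Keep pages with at least 5 nodes that are not on the advertisement list.
  return file.nodeCount >= MIN_NODES && !AD_HOSTS.has(host);
}
```

Applying it to a collection is then a matter of `files.filter(keepFile)`.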
12. GRAPHICAL REPRESENTATION OF A WEBSITE
Each color represents a different HTML tag name.
The root of the tree, the body tag, is represented by the black square.
This is the representation of Zippyshare.com.
18. REFERENCES
Walsh, T.A., McMinn, P., Kapfhammer, G.M.: Automatic detection of potential layout faults following changes to responsive web pages (N). In: Cohen, M.B., Grunske, L., Whalen, M. (eds.) Proc. ASE 2015. pp. 709–714. IEEE Computer Society (2015)
Choudhary, S.R., Prasad, M.R., Orso, A.: X-PERT: accurate identification of cross-browser issues in web applications. In: Notkin, D., Cheng, B.H.C., Pohl, K. (eds.) Proc. ICSE 2013. pp. 702–711. IEEE Computer Society (2013)
The Moz top 500 websites, https://moz.com/top500, Accessed October 20th, 2019
All pictures used are licence free