Archiving the Mobile Web
1. Archiving the Mobile Web
Frank McCown, Monica Yarbrough, &
Keith Enlow
Computer Science Dept
Harding University
WADL 2013
Indianapolis, IN
July 25, 2013
4. Two Types of Mobile Web
• Feature Phone Web: cHTML (iMode), WML, WAP, etc.
• Smartphone Web: XHTML, HTML5, etc.
6. Serving Up Mobile Sites
1. Responsive web design
• Same HTML content to desktop and mobile
• CSS media queries alter appearance
<!-- CSS media query on a link element -->
<link rel="stylesheet" media="(max-width: 800px)" href="example.css" />
<!-- CSS media query within a style sheet -->
<style>
@media (max-width: 600px) {
.sidebar { display: none; }
}
</style>
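To see why the same HTML can look different on a phone, it helps to evaluate a media query by hand. The following is a minimal sketch that handles only the simple `(max-width: Npx)` form used above; the function name `matchesMaxWidth` is hypothetical, and real browsers support a much richer media query grammar.

```javascript
// Hypothetical evaluator for the simple "(max-width: Npx)" form only.
// A max-width query matches when the viewport is at most N pixels wide.
function matchesMaxWidth(query, viewportWidthPx) {
  const m = /\(\s*max-width\s*:\s*(\d+)px\s*\)/i.exec(query);
  if (!m) {
    throw new Error('unsupported media query: ' + query);
  }
  return viewportWidthPx <= Number(m[1]);
}

// A 320px-wide phone viewport picks up example.css; a 1024px desktop does not.
console.log(matchesMaxWidth('(max-width: 800px)', 320));
console.log(matchesMaxWidth('(max-width: 800px)', 1024));
```

Because the decision depends only on the viewport, an archive that stores the HTML and all style sheets once can replay either appearance.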
8. Serving Up Mobile Sites
1. Responsive web design
• Same HTML content to desktop and mobile
• CSS media queries alter appearance
2. Redirect mobile user agent to mobile site
• Client-side redirection
• Server-side redirection
9. Client-Side Redirection
• JavaScript detects mobile user agent
// From www.harding.edu
var ua = navigator.userAgent.toLowerCase();
// queryString is not defined in the excerpt; assumed here to be the URL's query string
var queryString = window.location.search;
if (queryString.match('version=mobile') ||
    ua.match(/IEMobile|Windows CE|NetFront|PlayStation|like Mac OS X|MIDP|UP.Browser|Symbian|Nintendo|BlackBerry|mobile/i)) {
  // iPads get the desktop site
  if (!ua.match('ipad')) {
    if (window.location.pathname.match('.html'))
      window.location = window.location.pathname.replace('.html', '.m.html');
    else
      window.location = window.location.pathname + 'index.m.html';
  }
}
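The redirect decision can be separated from the browser globals, which makes it easier to see what a crawler would have to emulate. This is a minimal sketch, not the code used by www.harding.edu: the function name `mobileRedirectTarget` and its parameters are hypothetical, and it checks for a trailing `.html` rather than matching the string anywhere in the path.

```javascript
// Hypothetical helper: returns the path a mobile visitor would be sent to,
// or null when no redirect applies. The pattern list follows the excerpt above.
const MOBILE_UA_PATTERN = /IEMobile|Windows CE|NetFront|PlayStation|like Mac OS X|MIDP|UP\.Browser|Symbian|Nintendo|BlackBerry|mobile/i;

function mobileRedirectTarget(userAgent, pathname, search) {
  const forced = search.includes('version=mobile');
  if (!forced && !MOBILE_UA_PATTERN.test(userAgent)) {
    return null;
  }
  // iPads are treated as desktop clients, as in the excerpt.
  if (/ipad/i.test(userAgent)) {
    return null;
  }
  if (pathname.endsWith('.html')) {
    return pathname.replace(/\.html$/, '.m.html');
  }
  return pathname + 'index.m.html';
}

const iphoneUa = 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) Mobile/8B117';
console.log(mobileRedirectTarget(iphoneUa, '/academics/index.html', ''));
console.log(mobileRedirectTarget('Mozilla/5.0 (Windows NT 6.1)', '/', ''));
```

A crawler that does not run JavaScript never reaches this branch, so it only ever sees the desktop page.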
11. Server-Side Redirection
• Server routes mobile user agent to different page
Apache Example:
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} (android|bb\d+|meego).+mobile|avantgo|bada\/|blackberry|blazer|etc…|zte\-) [NC]
RewriteRule ^$ http://detectmobilebrowser.com/mobile [R,L]
https://developers.google.com/webmasters/smartphone-sites/details
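The same routing decision can be sketched outside Apache to show what the crawler observes: a redirect status and a `Location` header that depend on the `User-Agent` request header. The function `routeRequest`, the shortened pattern, and the host `m.example.com` are assumptions for illustration, not the full detection list.

```javascript
// Hypothetical sketch of server-side routing. The pattern is a shortened
// stand-in for the much longer list elided in the Apache example.
const SERVER_MOBILE_PATTERN = /(android|bb\d+|meego).+mobile|avantgo|blackberry|iphone|ipod/i;

function routeRequest(path, headers) {
  const ua = headers['user-agent'] || '';
  if (path === '/' && SERVER_MOBILE_PATTERN.test(ua)) {
    // Like "RewriteRule ^$ ... [R,L]": only the site root is redirected,
    // and the R flag without a code issues a 302.
    return { status: 302, headers: { Location: 'http://m.example.com/' } };
  }
  return { status: 200, headers: {} };
}

const androidUa = 'Mozilla/5.0 (Linux; Android 4.2; Nexus 4) AppleWebKit/535.19 Mobile Safari/535.19';
console.log(routeRequest('/', { 'user-agent': androidUa }));
console.log(routeRequest('/', { 'user-agent': 'Mozilla/5.0 (Windows NT 6.1)' }));
```

Unlike client-side redirection, this is visible to any crawler that sends a mobile user agent, with or without a JavaScript engine.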
13. Serving Up Mobile Sites
1. Responsive web design
• Same HTML content to desktop and mobile
• CSS media queries alter appearance
2. Redirect mobile user agent to mobile site
• Client-side redirection
• Server-side redirection
3. User-agent content negotiation
• Dynamically serving different HTML for the same URL
14. User-Agent Content Negotiation
• Server serves up different content for same URL
• Use Vary: User-Agent header in response
• Best method for serving content quickly
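A small sketch of what such a server does may help: one URL, two bodies, and a `Vary: User-Agent` header so that caches know the body depends on the request's user agent. The function `negotiate` and the detection pattern are hypothetical simplifications.

```javascript
// Hypothetical handler: one URL, two representations, chosen by User-Agent.
function negotiate(userAgent) {
  const isMobile = /mobile|iphone|android/i.test(userAgent || '');
  return {
    status: 200,
    headers: {
      'Content-Type': 'text/html; charset=utf-8',
      // Signals that the response body depends on the User-Agent request header.
      'Vary': 'User-Agent',
    },
    body: isMobile ? '<h1>Mobile page</h1>' : '<h1>Desktop page</h1>',
  };
}

const mobileResponse = negotiate('Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) Mobile/10A403');
const desktopResponse = negotiate('Mozilla/5.0 (Windows NT 6.1)');
console.log(mobileResponse.body, desktopResponse.body);
```

No redirect occurs here, so the URL alone cannot tell an archive which version it captured.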
15. Archiving Mobile Sites
1. Responsive web design
• Easy: Crawl like normal
• Use client tools to view page formatted for mobile
2. Redirect mobile user agent to mobile site
• Need to crawl with mobile user agent
• Need JavaScript-enabled crawler to handle client-side redirection
3. User-agent content negotiation
• Need to crawl with mobile user agent
• Need to distinguish mobile vs. desktop for same URL
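One way to picture the last requirement is an index that keys each capture by both the URL and the class of user agent used for the crawl. This is only an illustrative sketch; the class `CaptureIndex` and the labels `desktop` and `mobile` are hypothetical and do not describe any existing archive's storage format.

```javascript
// Hypothetical in-memory index: captures of the same URL are kept apart by
// the class of user agent that was used to crawl them.
class CaptureIndex {
  constructor() {
    this.captures = new Map();
  }

  static key(url, deviceClass) {
    return deviceClass + ' ' + url;
  }

  add(url, deviceClass, body) {
    this.captures.set(CaptureIndex.key(url, deviceClass), body);
  }

  get(url, deviceClass) {
    return this.captures.get(CaptureIndex.key(url, deviceClass));
  }
}

const index = new CaptureIndex();
index.add('http://www.example.com/', 'desktop', '<h1>Desktop page</h1>');
index.add('http://www.example.com/', 'mobile', '<h1>Mobile page</h1>');
console.log(index.get('http://www.example.com/', 'mobile'));
```

Without the extra key component, the second capture would overwrite the first whenever the server negotiates content on a single URL.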
16. How are we doing archiving mobile sites so far?
25. Google’s Suggestions for SEO
• Vary HTTP Header
• Annotations within the HTML:
• On desktop page:
• <link rel="alternate" media="only screen and (max-width: 640px)" href="http://m.example.com/page-1" >
• On mobile page:
• <link rel="canonical" href="http://www.example.com/page-1" >
• Media queries
https://developers.google.com/webmasters/smartphone-sites/
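A crawler can read these annotations directly from the fetched HTML. The sketch below uses regular expressions for brevity, which is an assumption that only holds for simple, double-quoted attributes; the helper names are hypothetical, and a production crawler would use a real HTML parser.

```javascript
// Hypothetical helper: collects <link> elements and returns their attributes.
function linkAnnotations(html) {
  const links = [];
  for (const tag of html.match(/<link\b[^>]*>/gi) || []) {
    const attrs = {};
    for (const m of tag.matchAll(/([\w-]+)\s*=\s*"([^"]*)"/g)) {
      attrs[m[1].toLowerCase()] = m[2];
    }
    links.push(attrs);
  }
  return links;
}

// On a desktop page: the mobile alternate, if one is annotated.
function findAlternate(html) {
  const hit = linkAnnotations(html).find(l => l.rel === 'alternate' && l.media && l.href);
  return hit ? hit.href : null;
}

// On a mobile page: the canonical desktop URL, if one is annotated.
function findCanonical(html) {
  const hit = linkAnnotations(html).find(l => l.rel === 'canonical' && l.href);
  return hit ? hit.href : null;
}

const desktopHtml = '<head><link rel="alternate" media="only screen and (max-width: 640px)" href="http://m.example.com/page-1" ></head>';
const mobileHtml = '<head><link rel="canonical" href="http://www.example.com/page-1" ></head>';
console.log(findAlternate(desktopHtml), findCanonical(mobileHtml));
```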
26. How Mobile Finder Works
• Use both desktop and mobile user agents
• Look for:
• Redirect
• Different content
• Different stylesheets
• Media queries
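The checks listed above can be sketched as a comparison of two fetch results, one per user agent. Everything here is an assumption for illustration: the input shape `{ finalUrl, html }`, the function names, and in particular the content test, which simply compares visible text and is not Mobile Finder's actual measure of difference.

```javascript
// Hypothetical helpers for comparing a desktop fetch with a mobile fetch.
function stylesheetHrefs(html) {
  return (html.match(/<link\b[^>]*rel="stylesheet"[^>]*>/gi) || [])
    .map(tag => (tag.match(/href="([^"]*)"/i) || [])[1])
    .filter(Boolean)
    .sort();
}

function hasMediaQuery(html) {
  return /@media\b/i.test(html) || /<link\b[^>]*\bmedia="[^"]*\(/i.test(html);
}

function visibleText(html) {
  return html
    .replace(/<style[\s\S]*?<\/style>|<script[\s\S]*?<\/script>/gi, '')
    .replace(/<[^>]*>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Returns the list of signals that suggest a mobile version exists.
function mobileSignals(desktop, mobile) {
  const signals = [];
  if (desktop.finalUrl !== mobile.finalUrl) signals.push('redirect');
  if (visibleText(desktop.html) !== visibleText(mobile.html)) signals.push('different content');
  if (stylesheetHrefs(desktop.html).join() !== stylesheetHrefs(mobile.html).join()) {
    signals.push('different stylesheets');
  }
  if (hasMediaQuery(desktop.html)) signals.push('media queries');
  return signals;
}

const desktopFetch = { finalUrl: 'http://www.example.com/', html: '<link rel="stylesheet" href="a.css"><p>Hi</p>' };
const mobileFetch = { finalUrl: 'http://m.example.com/', html: '<link rel="stylesheet" href="m.css"><p>Hi</p>' };
console.log(mobileSignals(desktopFetch, mobileFetch));
```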
27. How Mobile Finder Works
• Change the URL to fit common mobile URL patterns
ex: www.t-mobile.com → m.t-mobile.com
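The trial-and-error step can be sketched as a candidate generator. Only the rewrite from `www.` to `m.` comes from the slide; the other prefixes and the function name `mobileCandidates` are assumed examples of what "common patterns" might include.

```javascript
// Hypothetical candidate generator for mobile host names.
// Only the www -> m rewrite is taken from the slide; the rest are assumptions.
function mobileCandidates(url, prefixes = ['m', 'mobile', 'wap']) {
  const original = new URL(url);
  const bareHost = original.hostname.replace(/^www\./, '');
  return prefixes.map(prefix => {
    const candidate = new URL(original.href);
    candidate.hostname = prefix + '.' + bareHost;
    return candidate.href;
  });
}

console.log(mobileCandidates('http://www.t-mobile.com/'));
```

Each candidate would still have to be fetched to confirm that it exists and actually serves a mobile page.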
28. PhantomJS
• Headless WebKit (browser)
• Well-known and widely used
• Used to get the content of a page
• Takes snapshots of the sites it visits
• Scriptable with CoffeeScript or JavaScript
29. Web Service
• Query string with 2 parameters
• url (required)
• useragent (optional)
• http://cs.harding.edu/mobilefinder/service.php?url=URL&useragent=USER_AGENT
• Default useragent = Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; mediaqueries/1.0; +http://cs.harding.edu)
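Both parameters need to be percent-encoded, since the target URL and the user agent string contain characters such as `:`, `/`, `;` and spaces. The sketch below only builds the request URL; the helper name `serviceUrl` is hypothetical, and the response format is not described in the slides, so it is not modeled here.

```javascript
// Hypothetical helper that builds a Mobile Finder request URL.
const SERVICE_ENDPOINT = 'http://cs.harding.edu/mobilefinder/service.php';

function serviceUrl(targetUrl, userAgent) {
  if (!targetUrl) {
    throw new Error('url is required');
  }
  const params = new URLSearchParams({ url: targetUrl });
  // useragent is optional; when omitted, the service applies its default.
  if (userAgent) {
    params.set('useragent', userAgent);
  }
  return SERVICE_ENDPOINT + '?' + params.toString();
}

console.log(serviceUrl('http://www.t-mobile.com/'));
console.log(serviceUrl('http://www.t-mobile.com/', 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us)'));
```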
33. Analysis Results
• Accuracy (of 100 random hand-checked results)
• 96% accurate overall
• 1% incorrectly reported as not found when there is in fact a mobile version
• 3% incorrectly reported as mobile found when there is no mobile version
36. Are Google’s Suggestions Used?
• 28% found a mobile version following Google’s suggestions
• 85% found as having some sort of mobile version
37. Are Google’s Suggestions Used?
• 28% found following Google’s suggestions
• Of the 82% that were found as not following the rules:
• 93% missing Vary HTTP header
• 89% missing alternate and canonical links
38. Are Google’s Suggestions Used?
• 28% found following Google’s suggestions
• 85% found as having some sort of mobile version
• Redirect: 35%
• “Significantly” different content: 28%
• Stylesheets alone: 9%
• Stylesheets and media queries: 11%
• Media queries alone: 6%
• Differing URLs (trial and error): 11%
39. End Result
• As a whole, mobile web pages do not adhere to Google’s standards
• There is no truly consistent way to find a mobile version of a site
41. Introduction
• Heritrix 3.1
• Mobile Finder Web Service
• 2 Options
• Crawl desktop web pages (default)
• Crawl mobile web pages using Mobile Finder and exclude mobile web pages that use media queries
42. Experiment
• Decision Making Heritrix
• Web Service (Mobile Finder) Heritrix
• Modified Heritrix 3.1 to include two options for crawling
• Option 0: Crawl with desktop user agent
• Option 1: Crawl with mobile user agent using Mobile Finder
• Added built-in mobile user agent adapted from Googlebot
• Crawled a small set of URLs
• Used Mobile Finder to find if the given URL has a mobile version
• Wrote a small script to discover differences between the mobile and desktop versions
43.
<property name="userAgentTemplate"
 value="Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/>
<property name="userAgentTemplateMobile"
 value="Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/>
<!-- Option # = Description
 0 [Default] Crawl using desktop user agent
 1 Crawl using mobile user agent + Mobile Finder Web Service -->
<property name="CrawlOption" value="0" />
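The effect of the `CrawlOption` switch can be illustrated with a small model of the template substitution. This is a sketch in JavaScript, not the modified Heritrix code (Heritrix itself is written in Java); the function `userAgentFor`, the object layout, and the sample version and contact URL are all hypothetical.

```javascript
// Hypothetical model of the two templates and the CrawlOption switch.
const crawlerConfig = {
  userAgentTemplate:
    'Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)',
  userAgentTemplateMobile:
    'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) ' +
    'AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 ' +
    'Safari/6531.22.7 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)',
  CrawlOption: '0',
};

// Picks the template for the configured option and fills in the placeholders.
function userAgentFor(config, version, contactUrl) {
  const template = config.CrawlOption === '1'
    ? config.userAgentTemplateMobile
    : config.userAgentTemplate;
  return template
    .replace('@VERSION@', version)
    .replace('@OPERATOR_CONTACT_URL@', contactUrl);
}

console.log(userAgentFor(crawlerConfig, '3.1.0', 'http://example.org/bot'));
console.log(userAgentFor({ ...crawlerConfig, CrawlOption: '1' }, '3.1.0', 'http://example.org/bot'));
```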
54. Redirection/Delivery
• 200 Response (server side redirect)
• 302 “Temporary” relocation
• 301 “Permanent” relocation
• JavaScript Redirection (client side redirect)
• Media Queries
• Style Sheets
55. Tiny Limits
• No JavaScript Engine
• Heritrix is unable to execute JavaScript code
• Unable to catch client-side redirection and will instead continue to crawl the desktop version of the web page
Note: The Mobile Finder Web Service will find the mobile page and therefore Heritrix will continue the crawl.
• www.nasa.gov
• www.ssa.gov
• www.cornell.edu
56. Total Link Count
Huffington  Fox News  NBC News  NASA  SSA   White House  Stanford  Cornell  MIT
56774       12703     8894      4960  2380  8121         2351      2901     120
2134        110       3545      63    53    570          116       94       124
57. HTML Distribution
Huffington  Fox News  NBC News  NASA  SSA   White House  Stanford  Cornell  MIT
11550       2681      2302      851   20    3251         385       596      12
493         35        488       18    0     76           16        31       26