1. Frontera: open source, large scale web crawling framework
Alexander Sibiryakov, May 20, 2016, PyData Berlin 2016
sibiryakov@scrapinghub.com
2. About myself
• Software Engineer @ Scrapinghub
• Born in Yekaterinburg, RU
• 5 years at Yandex, search quality department: social and QA search, snippets.
• 2 years at Avast! antivirus, research team: automatic false positive solving, large scale prediction of malicious download attempts.
3. We help turn web content into useful data
• Over 2 billion requests per month (~800/sec.)
• Focused crawls & broad crawls
{
  "content": [
    {
      "title": {
        "text": "'Extreme poverty' to fall below 10% of world population for first time",
        "href": "http://www.theguardian.com/society/2015/oct/05/world-bank-extreme-poverty-to-fall-below-10-of-world-population-for-first-time"
      },
      "points": "9 points",
      "time_ago": {
        "text": "2 hours ago",
        "href": "https://news.ycombinator.com/item?id=10352189"
      },
      "username": {
        "text": "hliyan",
        "href": "https://news.ycombinator.com/user?id=hliyan"
      }
    },
4. Broad crawl usages
• Lead generation (extracting contact information)
• News analysis
• Topical crawling
• Plagiarism detection
• Sentiment analysis (popularity, likability)
• Due diligence (profile/business data)
• Tracking criminal activity & finding lost persons (DARPA)
5. Saatchi Global Gallery Guide
www.globalgalleryguide.com
• Discover 11K online galleries.
• Extract general information, art samples, descriptions.
• NLP-based extraction.
• Find more galleries on the web.
6. Frontera recipes
• Automating data collection from multiple websites
• «Grep» of an internet segment
• Topical crawling
• Extracting data from arbitrary documents
7. Automating data collection from multiple websites
• Scrapers for multiple websites.
• Data items are collected and kept up to date.
• Frontera can be used to
  • crawl in parallel and scale the process,
  • schedule revisits (within a fixed time window),
  • prioritize URLs during crawling (integration sketch below).
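As a rough illustration, here is a minimal sketch of wiring Frontera into a Scrapy project. The module paths follow the Frontera 0.x documentation and the project/backend names are assumptions; check them against the version you install.

    # settings.py of a Scrapy project (sketch; paths per the Frontera 0.x docs)
    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

    SPIDER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
    }
    DOWNLOADER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
    }

    # Frontera's own settings live in a separate module (hypothetical name)
    FRONTERA_SETTINGS = 'myproject.frontera_settings'

    # myproject/frontera_settings.py (sketch)
    BACKEND = 'frontera.contrib.backends.sqlalchemy.FIFO'  # path per the 0.x docs
    MAX_NEXT_REQUESTS = 256                                 # batch size per scheduling round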
8. «Grep» of an internet segment
• An alternative to Google:
• collect the zone files from registrars (.com/.net/.org),
• set up Frontera in distributed mode,
• implement the text matching in spider code,
• output items for matched pages (see the spider sketch below).
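A minimal sketch of the spider-side matching, assuming the seed URLs come from the collected zone files; the pattern and item fields are illustrative only.

    import re
    import scrapy


    class GrepSpider(scrapy.Spider):
        """Emit an item for every page whose text matches a pattern (sketch)."""
        name = 'grep'
        # In practice the seed list is generated from the registrars' zone files.
        start_urls = ['http://example.com/']
        pattern = re.compile(r'web\s+crawling', re.I)  # illustrative pattern

        def parse(self, response):
            text = ' '.join(response.xpath('//body//text()').extract())
            if self.pattern.search(text):
                yield {'url': response.url, 'pattern': self.pattern.pattern}
            # Extracted links are handed to the scheduler (Frontera, when enabled).
            for href in response.xpath('//a/@href').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse)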
9. Topical crawling
• a document topic classifier & a seed URL list,
• if a document is classified as positive, the crawler schedules its extracted links,
• Frontera in distributed mode,
• topic classifier code lives in the spider (sketch below).
Extensions: link classifier, follow/final classifiers.
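A sketch of putting the topic classifier into the spider: links are only followed from pages classified as on-topic. The classifier object is a stand-in (e.g. a pickled scikit-learn pipeline); its file name and interface are assumptions.

    import pickle
    import scrapy


    class TopicalSpider(scrapy.Spider):
        """Follow links only from documents the classifier marks as on-topic (sketch)."""
        name = 'topical'
        start_urls = ['http://example.com/']            # seed URL list

        def __init__(self, *args, **kwargs):
            super(TopicalSpider, self).__init__(*args, **kwargs)
            # Hypothetical pre-trained text classifier exposing predict().
            with open('topic_classifier.pkl', 'rb') as f:
                self.clf = pickle.load(f)

        def parse(self, response):
            text = ' '.join(response.xpath('//body//text()').extract())
            if self.clf.predict([text])[0] == 1:        # positive -> on-topic
                yield {'url': response.url}
                for href in response.xpath('//a/@href').extract():
                    yield scrapy.Request(response.urljoin(href), callback=self.parse)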
10. Extracting data from arbitrary documents
• A tough problem; it can't be solved completely.
• Can be seen as a structured prediction problem:
  • Conditional Random Fields (CRF) or
  • Hidden Markov Models (HMM).
• A tagged sequence of tokens and HTML tags can be used to predict the data field boundaries (toy example below).
• Tools: the Webstruct library and the WebAnnotator Firefox extension.
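To make the structured-prediction idea concrete, here is a tiny sketch using sklearn-crfsuite rather than Webstruct itself: tokens get hand-made features and BIO labels that mark field boundaries. The features, labels and the single training sequence are invented for illustration; real training needs many annotated pages.

    import sklearn_crfsuite

    # One training "document": a token sequence with per-token features and BIO labels.
    tokens = ['<h1>', 'Gallery', 'Name', '</h1>', 'Open', 'daily']
    features = [{'token': t.lower(), 'is_tag': t.startswith('<')} for t in tokens]
    labels = ['O', 'B-TITLE', 'I-TITLE', 'O', 'O', 'O']   # field boundaries as BIO tags

    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
    crf.fit([features], [labels])
    print(crf.predict([features])[0])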
11. Task
• Spanish web: statistics on hosts and their sizes.
• Only the .es ccTLD.
• Breadth-first strategy:
  • first the 1-click environment,
  • then 2 clicks,
  • then 3,
  • …
• Finishing condition: at most 100 docs per host, across all hosts.
• Low cost.
12. Spanish, Russian, German and world Web in 2012

                          Domains   Web servers   Hosts   DMOZ*
  Spanish (.es)           1.5M      280K          4.2M    122K
  Russian (.ru, .рф, .su) 4.8M      2.6M          ?       105K
  German (.de)            15.0M     3.7M          20.4M   466K
  World                   233M      62M           890M    3.9M

Sources: OECD Communications Outlook 2013, statdom.ru
* current period (October 2015)
13. Solution
• Scrapy (based on Twisted) - network operations.
• Apache Kafka - data bus (offsets, partitioning).
• Apache HBase - storage (random access, linear scanning, scalability).
• twisted.internet - async primitives used in the workers.
• Snappy - efficient compression algorithm for IO-bound applications.
15. 1. Big and small hosts problem
• The queue gets flooded with URLs from the same host,
• → underuse of spider resources.
• Solution: an additional per-host (per-IP) queue and a metering algorithm (sketch below).
• URLs from big hosts are cached in memory.
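A toy sketch of the per-host metering idea: URLs are bucketed by host and dequeued round-robin with a per-host cap, so one big host cannot monopolise a spider batch. All names and limits are illustrative.

    from collections import defaultdict, deque


    class PerHostQueue(object):
        """Round-robin over hosts with a per-host cap per batch (illustrative)."""

        def __init__(self, per_host_limit=10):
            self.per_host_limit = per_host_limit
            self.queues = defaultdict(deque)    # host -> pending URLs

        def push(self, host, url):
            self.queues[host].append(url)

        def next_batch(self, max_urls=100):
            batch = []
            for host, q in list(self.queues.items()):
                taken = 0
                while q and taken < self.per_host_limit and len(batch) < max_urls:
                    batch.append(q.popleft())
                    taken += 1
                if len(batch) >= max_urls:
                    break
            return batch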
16. 2. DDoS of the Amazon AWS DNS service
• Breadth-first strategy → first visits of unknown hosts → a huge number of DNS requests.
• Solution: a recursive DNS server on every spider node,
• with upstream to Verizon & OpenDNS.
• We used dnsmasq (config sketch below).
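A minimal dnsmasq configuration along these lines; the upstream addresses shown are the public OpenDNS resolvers, the Verizon upstreams are left as placeholders, and the cache size is only an example.

    # /etc/dnsmasq.conf (sketch)
    listen-address=127.0.0.1    # spiders resolve against the local cache
    no-resolv                   # ignore /etc/resolv.conf, use only the servers below
    cache-size=10000            # cache many records for a broad crawl
    server=208.67.222.222       # OpenDNS
    server=208.67.220.220       # OpenDNS
    # server=<verizon-upstream>   placeholder for the Verizon resolvers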
17. 3. Tuning the Scrapy thread pool for efficient DNS resolution
• OS DNS resolver,
• blocking calls,
• a thread pool to resolve DNS names to IPs,
• numerous errors and timeouts 🆘
• Solution: a patch allowing thread pool size and timeout adjustment (settings sketch below).
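With such a patch in place, the relevant knobs are exposed as Scrapy settings; both REACTOR_THREADPOOL_MAXSIZE and DNS_TIMEOUT exist in later Scrapy releases, and the values below are only examples.

    # settings.py (example values)
    REACTOR_THREADPOOL_MAXSIZE = 20   # bigger thread pool for blocking DNS lookups
    DNS_TIMEOUT = 60                  # seconds before a lookup is abandoned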
18. 4. Overloaded HBase region servers during state check
• 10^3 links per doc,
• state check: CRAWLED / NOT CRAWLED / ERROR,
• HDDs.
• Small volumes 🆗
• With ⬆ table size, response times ⬆
• Disk queue ⬆
• Solution: a host-local fingerprint function for keys in HBase (sketch below).
• Tuning the HBase block cache to fit an average host's states into one block.
3 TB of metadata: URLs, timestamps, … (275 bytes/doc)
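The idea of a host-local fingerprint is that all keys of one host share a prefix, so a host's states land next to each other and are likely served from the same HBase block. A minimal sketch (Python 2.7, as used in the project), with hash lengths chosen arbitrarily; this is not Frontera's actual fingerprint function.

    import hashlib
    from urlparse import urlparse


    def host_local_fingerprint(url):
        """Key = hash(host) prefix + hash(full URL); keeps a host's rows adjacent."""
        host = urlparse(url).netloc
        host_part = hashlib.sha1(host).hexdigest()[:8]   # groups rows by host
        url_part = hashlib.sha1(url).hexdigest()         # uniqueness within the host
        return host_part + url_part


    print(host_local_fingerprint('http://example.com/page'))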
19. 5. Intensive network traffic from workers to services
• Throughput between workers and Kafka/HBase ~ 1 Gbit/s.
• Solution: Thrift compact protocol for HBase,
• message compression in Kafka with Snappy (example below).
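On the Kafka side, enabling Snappy compression is a producer-side setting; a sketch with kafka-python (the python-snappy package must be installed for 'snappy' to be accepted, and the broker address and topic name are illustrative).

    from kafka import KafkaProducer

    # Producer used by workers to publish crawl messages (illustrative names).
    producer = KafkaProducer(bootstrap_servers='localhost:9092',
                             compression_type='snappy')   # requires python-snappy
    producer.send('frontier-spider-log', b'serialized message')
    producer.flush()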
20. 6. Further query and traffic optimizations to HBase
• State check: lots of requests and network traffic,
• consistency is required.
• Solution: a local state cache in the strategy worker.
• For consistency, the spider log was partitioned by host.
21. State cache
• All ops are batched:
  – if a key is not in the cache → read from HBase,
  – every ~4K docs → flush.
• Close to 3M elements (~1 GB) → flush & cleanup.
• Least-Recently-Used (LRU) eviction 👍 (sketch below)
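A toy version of such a state cache: an LRU dict with batched flushes, using the capacity and batch sizes from the slide; the HBase reads and writes are stubbed out, and the class name is invented.

    from collections import OrderedDict


    class StateCache(object):
        """LRU cache of per-URL crawl states with batched flush (illustrative)."""

        def __init__(self, capacity=3000000, flush_every=4000):
            self.capacity = capacity
            self.flush_every = flush_every
            self.cache = OrderedDict()   # fingerprint -> state, ordered by recency
            self.dirty = {}              # pending writes

        def get(self, fp):
            if fp in self.cache:
                self.cache[fp] = self.cache.pop(fp)        # mark as recently used
            else:
                self.cache[fp] = self._read_from_hbase(fp)
            return self.cache[fp]

        def set(self, fp, state):
            self.cache[fp] = state
            self.dirty[fp] = state
            if len(self.dirty) >= self.flush_every:
                self._flush()
            while len(self.cache) > self.capacity:         # LRU eviction
                self.cache.popitem(last=False)

        def _flush(self):
            # In the real system this is a batched write to HBase.
            self.dirty.clear()

        def _read_from_hbase(self, fp):
            return 'NOT CRAWLED'                           # stub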
22. Spider priority queue (slot)
• Cell: an array of
  - fingerprint,
  - CRC32(hostname),
  - URL,
  - score.
• Dequeue the top N (sketch below).
• Prone to huge hosts.
• Scoring model: document count per host.
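A sketch of the cell layout and a top-N dequeue by score; the "prone to huge hosts" problem is visible here, since nothing stops one host from filling the top of the heap. Python 2.7 style, hostnames assumed to be byte strings.

    import heapq
    from zlib import crc32

    queue = []   # one slot: entries of (negated score, fingerprint, crc32(host), url)


    def enqueue(fingerprint, hostname, url, score):
        # heapq is a min-heap, so negate the score to pop the best-scored URLs first.
        heapq.heappush(queue, (-score, fingerprint, crc32(hostname), url))


    def dequeue(n):
        batch = []
        while queue and len(batch) < n:
            neg_score, fp, host_crc, url = heapq.heappop(queue)
            batch.append((url, -neg_score))
        return batch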
23. 7. Problem of big and small hosts (strikes back!)
• Discovered a few very huge hosts (>20M docs),
• all queue partitions were flooded with URLs from those hosts.
• Solution: two MapReduce jobs:
  – queue shuffling,
  – limiting every host to 100 docs max (reducer sketch below).
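The second job is essentially a group-by-host with a cap; a Hadoop-streaming-style reducer sketch, assuming the input is sorted by host and arrives as tab-separated host/URL pairs (the field layout is an assumption).

    import sys

    # Reducer: keep at most 100 queued docs per host (Hadoop streaming sketch).
    MAX_DOCS_PER_HOST = 100
    current_host, kept = None, 0

    for line in sys.stdin:
        host, url = line.rstrip('\n').split('\t', 1)
        if host != current_host:
            current_host, kept = host, 0
        if kept < MAX_DOCS_PER_HOST:
            print('%s\t%s' % (host, url))
            kept += 1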
24. Spanish (.es) internet crawl results
• fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es are the biggest websites,
• 68.7K domains found (~600K expected),
• 46.5M crawled pages overall,
• 1.5 months,
• 22 websites with more than 50M pages.
29. Hardware requirements (distributed backend + spiders)
• A single-threaded Scrapy spider gives 1200 pages/min. from about 100 websites crawled in parallel.
• Spider-to-worker ratio is 4:1 (without content).
• 1 GB of RAM for every strategy worker (SW) (state cache, tunable).
• Example:
  • 12 spiders ~ 14.4K pages/min.,
  • 3 SW and 3 DB workers,
  • 18 cores total.
30. Software requirements

                   Single process              Distributed spiders         Distributed backend
  Python / Scrapy  Python 2.7+, Scrapy 1.0.4+  Python 2.7+, Scrapy 1.0.4+  Python 2.7+, Scrapy 1.0.4+
  Storage          sqlite or any other RDBMS   sqlite or any other RDBMS   HBase/RDBMS
  Message bus      -                           ZeroMQ or Kafka             ZeroMQ or Kafka
  DNS service      -                           -                           DNS service
31. Main features
• Online operation: scheduling of new batches, updating of DB state.
• Storage abstraction: write your own backend (sqlalchemy and HBase are included) - see the skeleton below.
• Run modes: single process, distributed spiders, distributed backend.
• Scrapy ecosystem: good documentation, big community, ease of customization.
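For the storage abstraction, a custom backend is a class implementing Frontera's backend interface. The method names below follow the Frontera 0.x docs as I recall them and should be verified against frontera.core.components.Backend in the installed version; the bodies are deliberately empty.

    class CustomBackend(object):
        """Skeleton of a custom Frontera backend (method names per the 0.x
        interface; check frontera.core.components.Backend for your version)."""

        def frontier_start(self):
            pass                                   # open connections

        def frontier_stop(self):
            pass                                   # flush and close

        def add_seeds(self, seeds):
            pass                                   # store seed requests

        def page_crawled(self, response, links):
            pass                                   # update state, queue extracted links

        def request_error(self, request, error):
            pass                                   # mark a failed request

        def get_next_requests(self, max_next_requests, **kwargs):
            return []                              # return the next batch to crawl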
32. Main features
• Message bus abstraction (ZeroMQ and Kafka are available out of the box).
• Crawling strategy abstraction: the crawling goal, URL ordering and scoring model are coded in a separate module.
• Polite by design: each website is downloaded by at most one spider.
• Canonical URL resolution abstraction: each document has many URLs - which one to use?
• Python: workers, spiders.