SpazioDati collects public information about all Italian companies from many different sources, the most challenging being the World Wide Web. Our Internet Data Gathering project crawls and processes data from the entire Italian web, using distributed frameworks such as Hadoop, Nutch, Elasticsearch and Spark. This talk gives an overview of the extraction pipeline and presents some of the issues we tackled during and after development.
2. Your speaker
● Born in Trento
● Studied at UniTN and Georgia Tech
● PhD in Large Scale Graph Analytics
● Teaches Algorithms and Data Structures
● Data Scientist at SpazioDati
In my spare time:
● {Read|Watch|Play} {Science Fiction|Fantasy} {Novels|TV|Board Games}
3. SpazioDati S.R.L.
● Born in 2012
● Data integration
● Focus on corporate world:
○ Official data from Camera di Commercio
○ Open data
● Atoka
○ B2B database of company information
○ Sales intelligence
○ API
● Data analytics
○ Portfolio analysis
○ Lead generation
○ Risk evaluation
Always hard at work!
4. Internet Data Gathering (IDG)
IDG is an internal project to gather, process
and organize internet data about Italian
companies.
It uses many different technologies for Big
Data Gathering and Processing.
The entire pipeline runs on Amazon AWS.
A representation of the Internet
5. Internet Data Gathering (IDG)
Takeaways:
● Web data is HORRIBLE
● OSS can help!
● For Big Data, you need a Big Framework
8. Apache Nutch
● Distributed crawler runnable on Hadoop
● Highly configurable
Each iteration:
1. Injector adds new URLs
2. Generator runs the scoring function to select URLs
3. URLs are divided into segments
4. Each segment is downloaded in parallel
5. Pages are parsed
6. Newly discovered URLs are added to the CrawlDB
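The iteration above can be sketched as a single-machine loop. This is a hedged illustration, not Nutch's actual implementation: Nutch runs each step as a distributed Hadoop job, and `fetch_fn`, `parse_fn` and `score_fn` here stand in for its pluggable fetcher, parser and scoring function.

```python
from urllib.parse import urljoin, urlparse

# Minimal in-memory sketch of one Nutch-style crawl iteration.
# In Nutch each step is a separate Hadoop job over the CrawlDB.
def crawl_iteration(crawldb, fetch_fn, parse_fn, score_fn,
                    topn=1000, num_segments=4):
    # 1-2. Generator: score unfetched URLs and select the best ones.
    candidates = [u for u, meta in crawldb.items() if not meta["fetched"]]
    selected = sorted(candidates, key=score_fn, reverse=True)[:topn]

    # 3. Partition selected URLs into segments (Nutch groups by host,
    #    so each domain ends up fetched by a single task).
    segments = [[] for _ in range(num_segments)]
    for url in selected:
        segments[hash(urlparse(url).netloc) % num_segments].append(url)

    # 4-5. Fetch each segment (in parallel on the cluster) and parse pages.
    for segment in segments:
        for url in segment:
            html = fetch_fn(url)
            crawldb[url]["fetched"] = True
            # 6. Newly discovered links go back into the CrawlDB.
            for link in parse_fn(url, html):
                crawldb.setdefault(urljoin(url, link), {"fetched": False})
    return crawldb
```

On each call the CrawlDB grows with newly discovered links, which become candidates for the next iteration.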
10. Nutch in SpazioDati
● Restricted to:
○ .it domains
○ domains registered in Italy (through
whois)
● Runs weekly:
○ Cluster of 15 machines
○ Use Elastic MapReduce service
○ 12M pages each week
● Keep complete history
○ 5.3 TB downloaded
○ Pages older than 4 months are no longer processed
11. Crawling is not easy!
Issues with crawling:
● People who do not want to be crawled
○ Be polite!
○ We follow robots.txt specification and
use unique User Agent
● Avoid accidental DDoS attacks
○ Each domain should be crawled sequentially
● Never crawl too deeply
○ Filters on depth, url length and queries
○ Try to avoid crawling any single domain too heavily
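A minimal sketch of these politeness rules, using the standard library's `urllib.robotparser`. The user-agent name and the filter thresholds are illustrative, not the values we actually use.

```python
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "ExampleBot"   # a unique, identifiable user agent (illustrative)
MAX_DEPTH = 10              # maximum path depth before we stop descending
MAX_URL_LENGTH = 512        # overly long URLs are often crawler traps

def allowed_by_robots(rp: robotparser.RobotFileParser, url: str) -> bool:
    # Respect the site's robots.txt rules for our user agent.
    return rp.can_fetch(USER_AGENT, url)

def passes_filters(url: str) -> bool:
    # Cheap URL filters applied before fetching anything.
    parsed = urlparse(url)
    depth = len([p for p in parsed.path.split("/") if p])
    return (len(url) <= MAX_URL_LENGTH
            and depth <= MAX_DEPTH
            and not parsed.query)   # e.g. skip query-string URLs entirely
```

Filtering on query strings is a blunt example rule; the point is that every URL goes through robots.txt plus cheap syntactic filters before it is ever fetched.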
“The crawlers delved too greedily and too deep”
https://www.amazon.it/s/ref=lp_1345828031_nr_p_n_binding_browse-b_0
?fst=as%3Aoff&rh=n%3A411663031%2Cn%3A%21411664031%2Cn%3A
1345828031%2Cp_n_binding_browse-bin%3A509801031&bbn=1345828
031&ie=UTF8&qid=1504078452&rnid=509800031
13. Extracting data from Crawl
The crawler gives us compressed JSON records of HTML with metadata
● Structured, useful information
● Domain based
● Distributed processing
Easy information: Text, Links, Codici Fiscali
Medium information: Social Accounts, Logo, Language
Complex information: Technologies, Entities, People
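As an example of the "easy" tier, Codici Fiscali can be pulled out of page text with regular expressions. This is a simplified sketch: the patterns below match the 11-digit company Partita IVA / numeric Codice Fiscale and the 16-character personal Codice Fiscale by shape only, with no checksum validation.

```python
import re

# Simplified patterns (shape only, no checksum validation):
# - companies: 11-digit Partita IVA / numeric Codice Fiscale
# - individuals: 16-character alphanumeric Codice Fiscale
PIVA_RE = re.compile(r"\b\d{11}\b")
CF_RE = re.compile(r"\b[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]\b")

def extract_fiscal_codes(text: str) -> set[str]:
    """Return candidate Italian fiscal codes found in page text."""
    text = text.upper()
    return set(PIVA_RE.findall(text)) | set(CF_RE.findall(text))
```

In practice candidates still need validation (checksum digit, match against official registries) before being attached to a company.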
14. Hadoop for data processing:
● The user writes User Defined Functions (UDFs)
● Hadoop framework
○ Stores input data
○ Divides it into chunks
○ Makes it available to all machines
○ Runs the UDFs on all chunks
○ Guarantees fault tolerance
○ Collects output
Hadoop
This guy does not have the energy to implement
fault tolerance...
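What the framework does can be modelled in a few lines. This toy single-process version is only for intuition: it divides the input into chunks, runs the map UDF on each chunk, shuffles intermediate pairs by key, and runs the reduce UDF per key. Real Hadoop additionally handles distribution across machines and fault tolerance.

```python
from collections import defaultdict
from itertools import islice

# Toy single-process model of the Hadoop execution pattern.
def map_reduce(records, map_udf, reduce_udf, chunk_size=2):
    it = iter(records)
    shuffled = defaultdict(list)
    while chunk := list(islice(it, chunk_size)):  # divide input into chunks
        for record in chunk:                      # run the map UDF per chunk
            for key, value in map_udf(record):
                shuffled[key].append(value)       # shuffle by key
    # Run the reduce UDF once per key over its grouped values.
    return {k: reduce_udf(k, vs) for k, vs in shuffled.items()}
```

Word count, for instance, is a map UDF emitting `(word, 1)` pairs and a reduce UDF summing the values per word.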
15. PIG
Scripting language for Hadoop
● Scripts are written in Pig Latin
● Looks kinda like SQL
● Easy-to-build pipelines
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
16. Pig in SpazioDati
Our pipeline:
1. Computes domain for each page
2. Groups by domain
3. Extracts information for each domain
4. Integrates data from other sources (e.g. whois)
5. Exports a json for each domain
● Runs (roughly) monthly
● Cluster of 30 machines
● AWS’s Elastic MapReduce service
● Difficult to test :(
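The numbered steps above can be sketched as plain Python, which is also how we reason about them when testing locally. This is an illustrative single-machine version of the Pig pipeline, not the production script; `extract_info` and `whois_data` stand in for the real extraction UDFs and the external whois source.

```python
import json
from collections import defaultdict
from urllib.parse import urlparse

# Single-machine sketch of the domain pipeline (Pig does this distributed).
def run_pipeline(pages, extract_info, whois_data):
    # 1-2. Compute the domain for each page and group pages by domain.
    by_domain = defaultdict(list)
    for url, html in pages:
        by_domain[urlparse(url).netloc].append(html)

    results = []
    for domain, htmls in by_domain.items():
        record = {"domain": domain}
        record.update(extract_info(htmls))         # 3. extract per-domain info
        record.update(whois_data.get(domain, {}))  # 4. integrate whois data
        results.append(json.dumps(record))         # 5. one JSON per domain
    return results
```

Grouping by domain first is what lets step 3 look at all of a site's pages together, e.g. to pick the most frequent Codice Fiscale across the whole site.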
18. Requirements
We want to index our extracted data.
● We should access it easily
● We should explore it efficiently
We will be able to:
● Match it with official data about
companies
● Serve it in the backend of our services
5M JSON files without indexing
19. Elasticsearch
Open source search engine
● Based on Lucene index
○ Highly efficient index
○ Mostly on disk
● Full text search
● Nested fields support
● Cluster structure
● Web interface
● Allows (very) complex queries
5M indexed JSON files
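To give a flavour of those complex queries, here is a sketch of the kind of query body Elasticsearch accepts: a full-text match combined with a nested-field filter inside a `bool` query. The field names (`text`, `social`, `social.network`) are illustrative, not our real mapping.

```python
# Build an Elasticsearch query body combining full-text search
# with a filter on a nested field. Field names are illustrative.
def build_query(text_query, social_network=None):
    must = [{"match": {"text": text_query}}]
    if social_network:
        # Nested queries match objects inside arrays correctly,
        # instead of mixing fields across different array elements.
        must.append({
            "nested": {
                "path": "social",
                "query": {"term": {"social.network": social_network}},
            }
        })
    return {"query": {"bool": {"must": must}}}
```

The resulting dict is what would be sent to the `_search` endpoint; composing query bodies programmatically like this is how the complex queries stay maintainable.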
25. The rest of the IDG pipeline
IDG is much more:
● Finding the correct domains for each
company
● Extracting information from social networks
● Validating emails collected on the web
● etc.
The real IDG pipeline
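For the email-validation step, a first syntactic pass weeds out obvious non-addresses before any expensive checks. This is a deliberately simplified sketch; real validation would also involve DNS/mailbox verification, which is out of scope here.

```python
import re

# Simplified syntactic check for emails scraped from the web
# (illustrative; not a full RFC 5322 validator).
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

def plausible_email(addr: str) -> bool:
    """Cheap sanity check before any DNS/SMTP verification."""
    return bool(EMAIL_RE.match(addr))
```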
26. Conclusions
● There is a lot of Open Source
Software for Big Data processing
● You’ll need to tinker with available
features
● Web data is often:
○ Outdated
○ Badly formatted
○ Ambiguous
27. Thanks for your attention!
Questions?
Interested?
see www.spaziodati.eu/jobs for opportunities!