Crawling and Processing the
Italian Corporate Web
Alessio Guerrieri
SpazioDati S.R.L.
Your speaker
● Born in Trento
● Studied at UniTN and Georgia Tech
● PhD in Large Scale Graph Analytics
● Teaches Algorithms and Data Structures
● Data Scientist at SpazioDati
In my spare time:
● {Read|Watch|Play} {Science
Fiction|Fantasy} {Novels|TV|Board
Games}
SpazioDati S.R.L.
● Born in 2012
● Data integration
● Focus on corporate world:
○ Official data from Camera di Commercio
○ Open data
● Atoka
○ B2B database of company information
○ Sales intelligence
○ API
● Data analytics
○ Portfolio analysis
○ Lead generation
○ Risk evaluation
Always hard at work!
Internet Data Gathering (IDG)
IDG is an internal project to gather, process
and organize internet data about Italian
companies.
It uses many different technologies for Big
Data Gathering and Processing.
The entire pipeline runs on Amazon AWS.
A representation of the Internet
Internet Data Gathering (IDG)
Takeaways:
● Web data is HORRIBLE
● OSS can help!
● For Big Data, you need a Big Framework
Crawling the Corporate Web
Web Crawler
Image from https://en.wikipedia.org/wiki/Web_crawler
Apache Nutch
● Distributed crawler runnable on Hadoop
● Highly configurable
Each iteration:
1. Injector adds new URLs
2. Generator runs the Scoring Function to
select URLs
3. URLs are divided into segments
4. Each segment is downloaded in parallel
5. Pages are parsed
6. Newly discovered URLs are added to
CrawlDB
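The six steps above can be sketched as a toy loop. This is only an illustration of the inject / generate / fetch / parse / update cycle, not Nutch's actual implementation: the in-memory `FAKE_WEB`, the fixed scores and all function names are assumptions made for the example, and segments are processed one after another here rather than in parallel.

```python
# Hypothetical in-memory "web": URL -> list of outgoing links.
FAKE_WEB = {
    "http://a.it/": ["http://a.it/about", "http://b.it/"],
    "http://a.it/about": ["http://a.it/"],
    "http://b.it/": ["http://c.it/"],
    "http://c.it/": [],
}


def inject(crawl_db, seeds):
    """Step 1: add seed URLs to the crawl database as unfetched."""
    for url in seeds:
        crawl_db.setdefault(url, {"fetched": False, "score": 1.0})


def generate(crawl_db, top_n):
    """Step 2: select the top_n unfetched URLs, highest score first."""
    candidates = [u for u, meta in crawl_db.items() if not meta["fetched"]]
    candidates.sort(key=lambda u: (-crawl_db[u]["score"], u))
    return candidates[:top_n]


def split_into_segments(urls, n_segments):
    """Step 3: partition the selected URLs into segments."""
    return [urls[i::n_segments] for i in range(n_segments)]


def fetch_and_parse(segment):
    """Steps 4-5: 'download' each page and extract its outgoing links."""
    return {url: FAKE_WEB.get(url, []) for url in segment}


def update_db(crawl_db, parsed):
    """Step 6: mark pages as fetched and add newly discovered URLs."""
    for url, outlinks in parsed.items():
        crawl_db[url]["fetched"] = True
        for link in outlinks:
            # Discovered links get a lower (made-up) score than seeds.
            crawl_db.setdefault(link, {"fetched": False, "score": 0.5})


def crawl(seeds, iterations, top_n=2, n_segments=2):
    crawl_db = {}
    inject(crawl_db, seeds)
    for _ in range(iterations):
        selected = generate(crawl_db, top_n)
        # A real distributed crawler would fetch segments in parallel.
        for segment in split_into_segments(selected, n_segments):
            update_db(crawl_db, fetch_and_parse(segment))
    return crawl_db
```

Calling `crawl(["http://a.it/"], iterations=3)` walks the whole toy web; with `iterations=1` only the seed is fetched and its two links are left waiting in the database for the next round.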
Nutch in SpazioDati
● Restricted to:
○ .it domains
○ domains registered in Italy (through
whois)
● Runs weekly:
○ Cluster of 15 machines
○ Uses the Elastic MapReduce service
○ 12M pages each week
● Keeps complete history
○ 5.3 TB downloaded
○ After 4 months, pages are no longer processed
Crawling is not easy!
Issues with crawling:
● People who do not want to be crawled
○ Be polite!
○ We follow the robots.txt specification and
use a unique User-Agent
● Avoid accidental DDoS attacks
○ Each domain should be crawled
sequentially
● Never crawl too deeply
○ Filters on depth, URL length and queries
○ Try to avoid crawling a single domain
too much
“The crawlers delved too greedily and too deep”
https://www.amazon.it/s/ref=lp_1345828031_nr_p_n_binding_browse-b_0?fst=as%3Aoff&rh=n%3A411663031%2Cn%3A%21411664031%2Cn%3A1345828031%2Cp_n_binding_browse-bin%3A509801031&bbn=1345828031&ie=UTF8&qid=1504078452&rnid=509800031
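A minimal sketch of the politeness and depth checks listed above, using only the Python standard library. The thresholds, the `example-crawler` User-Agent and the function names are hypothetical, not SpazioDati's real settings, and path depth is used here as a simple stand-in for crawl depth.

```python
from urllib.parse import parse_qsl, urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical limits chosen for illustration only.
MAX_PATH_DEPTH = 5
MAX_URL_LENGTH = 200
MAX_QUERY_PARAMS = 3
USER_AGENT = "example-crawler"  # hypothetical unique User-Agent


def path_depth(url):
    """Number of non-empty path components in the URL."""
    return len([part for part in urlsplit(url).path.split("/") if part])


def passes_filters(url):
    """Reject URLs that are too long, too deep or carry too many query parameters."""
    if len(url) > MAX_URL_LENGTH:
        return False
    if path_depth(url) > MAX_PATH_DEPTH:
        return False
    query = urlsplit(url).query
    if len(parse_qsl(query, keep_blank_values=True)) > MAX_QUERY_PARAMS:
        return False
    return True


def allowed_by_robots(robots_txt, url):
    """Check a URL against the text of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)
```

With these limits the long faceted-search URL shown above would be rejected, since it carries six query parameters. Crawling each domain sequentially is a scheduling concern and is not covered by this sketch.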
Processing the Corporate Web
Extracting data from Crawl
The crawler gives us compressed JSON containing each page's HTML and metadata. We want:
● Structured, useful information
● Domain based
● Distributed processing
Easy information             Medium information    Complex information
Text                         Social Accounts       Technologies
Links                        Logo                  Entities
Codici Fiscali (tax codes)   Language              People
Hadoop for data processing:
● The user writes User Defined Functions (UDFs)
● The Hadoop framework
○ Stores input data
○ Divides it into chunks
○ Makes it available to all machines
○ Runs UDFs on all chunks
○ Guarantees fault tolerance
○ Collects output
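To make the division of labour concrete, here is a single-process sketch of what such a framework does around a UDF: split the input into chunks, run the UDF on each chunk, retry a chunk when it fails, and collect the output. It is not Hadoop's API; `run_job`, `run_with_retries` and `make_flaky` are hypothetical names, and the simulated failure stands in for a machine going down.

```python
def run_with_retries(task, chunk, max_attempts=3):
    """Run task(chunk), retrying the whole chunk if it raises."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(chunk)
        except Exception as error:  # a real framework would be more selective
            last_error = error
    raise RuntimeError(f"chunk failed after {max_attempts} attempts") from last_error


def run_job(records, udf, chunk_size=2, max_attempts=3):
    """Divide records into chunks, apply the UDF to each and collect the output."""
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    output = []
    for chunk in chunks:
        result = run_with_retries(lambda c: [udf(r) for r in c], chunk, max_attempts)
        output.extend(result)
    return output


def make_flaky(udf, failures):
    """Wrap a UDF so that its first `failures` calls raise, simulating crashes."""
    state = {"remaining": failures}

    def wrapped(record):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise IOError("simulated machine failure")
        return udf(record)

    return wrapped
```

The UDF author only writes the per-record function; `run_job([1, 2, 3, 4, 5], make_flaky(lambda x: x * x, 2))` still returns every square because the failed chunk is simply run again.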
Hadoop
This guy does not have the energy to implement
fault tolerance...
PIG
Scripting language for Hadoop
● Scripts are written in Pig Latin
● Looks kinda like SQL
● Pipelines are easy to build
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
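For readers who do not know Pig Latin, the script above corresponds roughly to the following single-machine Python. It is an approximation made for comparison: splitting on whitespace is simpler than Pig's TOKENIZE, the alphabetical tie-break is added only to make the order deterministic, and `word_count` is a name chosen for this example.

```python
import re
from collections import Counter


def word_count(lines):
    """Count tokens made entirely of word characters, most frequent first."""
    counts = Counter()
    for line in lines:                         # LOAD ... AS (line)
        for token in line.split():             # FLATTEN(TOKENIZE(line))
            if re.fullmatch(r"\w+", token):    # MATCHES tests the whole token
                counts[token] += 1             # GROUP ... BY word + COUNT
    # ORDER ... BY count DESC (ties broken alphabetically)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

The difference is where it runs: Pig compiles the same logic into Hadoop jobs spread over the cluster, while this function needs all the lines to pass through one process.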
Pig in SpazioDati
Our pipeline:
1. Computes domain for each page
2. Groups by domain
3. Extracts information for each domain
4. Integrates data from other sources (e.g.
whois)
5. Exports a JSON document for each domain
● Runs (roughly) monthly
● Cluster of 30 machines
● AWS’s Elastic MapReduce service
● Difficult to test :(
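The five pipeline steps can be sketched on a single machine as follows. This is an illustration in Python rather than the actual Pig scripts: the page and whois record shapes, the e-mail extraction chosen as the example "information", and the function names are all assumptions.

```python
import json
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Deliberately simple e-mail pattern, good enough for a sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def domain_of(url):
    """Step 1: compute the domain (here simply the host name) of a page URL."""
    return urlsplit(url).hostname or ""


def build_domain_records(pages, whois):
    """Steps 2-4: group pages by domain, extract information, merge whois data."""
    grouped = defaultdict(list)
    for page in pages:
        grouped[domain_of(page["url"])].append(page)

    records = []
    for domain, domain_pages in sorted(grouped.items()):
        emails = sorted(
            {email for page in domain_pages for email in EMAIL_RE.findall(page["html"])}
        )
        records.append(
            {
                "domain": domain,
                "pages": len(domain_pages),
                "emails": emails,
                "whois": whois.get(domain),  # None when no whois data is known
            }
        )
    return records


def export_json_lines(records):
    """Step 5: serialize one JSON document per domain."""
    return [json.dumps(record, sort_keys=True) for record in records]
```

Keeping each step a plain function over small in-memory inputs is one way to make this kind of logic testable before it is run on a cluster.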
Querying the Corporate Web
Requirements
We want to index our extracted data.
● We should access it easily
● We should explore it efficiently
We will be able to:
● Match it with official data about
companies
● Serve it in the backend of our services
5M JSON documents without indexing
Elasticsearch
Open source search engine
● Based on Lucene index
○ Highly efficient index
○ Mostly on disk
● Full text search
● Nested fields support
● Cluster structure
● Web interface
● Allows (very) complex queries
5M indexed JSON documents
Sample query
Domains that contain the word ‘speck’ in the
text:
{
  "_source": false,
  "query": {
    "term": {
      "text": "speck"
    }
  },
  "size": 5
}

{
  "hits": {
    "total": 15069,
    "max_score": 11.716405,
    "hits": [
      { "_id": "www.titospeck.it", "_score": 11.716405 },
      { "_id": "derpsairer.it", "_score": 11.6602 },
      { "_id": "www.speck.it", "_score": 11.626965 },
      { "_id": "www.bayona-music.com", "_score": 11.607182 },
      { "_id": "www.salumificiocoati.it", "_score": 11.560882 }
    ]
  }
}
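Behind a full-text query like this sits an inverted index, a map from each token to the documents that contain it. The toy version below is a sketch under simplifying assumptions (lowercasing and splitting on word characters as the only analysis, no relevance scoring, hypothetical function names). It also shows the general Elasticsearch distinction between a `term` query, which looks up one exact token, and a `match` query, which analyzes the query text first.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Build an inverted index: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def term_query(index, term):
    """Look up one exact token; the query string is NOT analyzed."""
    return set(index.get(term, set()))


def match_query(index, text):
    """Analyze the query text and return documents containing any of its tokens."""
    result = set()
    for token in analyze(text):
        result |= index.get(token, set())
    return result
```

With this index a term lookup for `speck` finds every document whose text contains that word, while a term lookup for a multi-word string finds nothing, because no single token equals the whole string.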
Sample query (2)
Domains whose text contains phrases similar to
‘speck and tech’:
{
  "_source": false,
  "query": {
    "match": {
      "text": "speck and tech"
    }
  },
  "size": 3
}

{
  "hits": {
    "total": 1003897,
    "max_score": 19.871191,
    "hits": [
      { "_id": "speckand.tech", "_score": 19.871191 },
      { "_id": "www.speckietechies.com", "_score": 19.674822 },
      { "_id": "francescobonadiman.com", "_score": 17.935522 }
    ]
  }
}
Complex query
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "term": { "technologies.cms.name": "WordPress" } },
        { "term": { "technologies.cms.version": "3.0" } }
      ]
    }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 10,
    "successful": 10,
    "failed": 0
  },
  "hits": {
    "total": 211,
    "max_score": 0,
    "hits": []
  }
}
Complex query (2)
Compute the distribution of the most used CMS software:
{
  "size": 0,
  "aggregations": {
    "aggs": {
      "terms": {
        "field": "technologies.cms.name",
        "size": 20
      }
    }
  }
}

{
  "aggregations": {
    "aggs": {
      "doc_count_error_upper_bound": 997,
      "sum_other_doc_count": 43403,
      "buckets": [
        { "key": "WordPress", "doc_count": 590133 },
        { "key": "Joomla", "doc_count": 163595 },
        { "key": "Drupal", "doc_count": 33727 },
        { "key": "DM Polopoly", "doc_count": 30455 },
        { "key": "Weebly", "doc_count": 9861 }
      ]
    }
  }
}
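What a terms aggregation computes can be reproduced locally on a handful of documents. The sketch below is an assumption-laden illustration, not Elasticsearch code: it counts the values of a nested field, keeps the `size` largest buckets and reports the rest as `sum_other_doc_count`. It does not model `doc_count_error_upper_bound`, which arises because the real counts are merged from several shards.

```python
from collections import Counter


def get_path(doc, field_path):
    """Follow a dotted path such as 'technologies.cms.name' through nested dicts."""
    current = doc
    for part in field_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def terms_aggregation(docs, field_path, size):
    """Count field values and return the `size` most frequent as buckets."""
    counts = Counter()
    for doc in docs:
        value = get_path(doc, field_path)
        if value is not None:  # documents without the field are skipped
            counts[value] += 1
    top = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:size]
    shown = sum(count for _, count in top)
    return {
        "buckets": [{"key": key, "doc_count": count} for key, count in top],
        "sum_other_doc_count": sum(counts.values()) - shown,
    }
```

On the real index the same idea runs over millions of documents without loading them into one process, which is what makes the query practical.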
Getting value from the Corporate Web
The rest of the IDG pipeline
IDG is much more:
● Finding the correct domains for each
company
● Extracting information from social networks
● Validating emails collected from the web
● etc.
The real IDG pipeline
Conclusions
● There is a lot of Open Source
Software for Big Data processing
● You’ll need to tinker with available
features
● Web data is often:
○ Outdated
○ Badly formatted
○ Ambiguous
Thanks for your attention!
Questions?
Interested?
see www.spaziodati.eu/jobs for opportunities!
