Outline            Motivation             Algorithms       Experiments      Summary              References




                   Scheduling Algorithms for Web Crawling

               C. Castillo, M. Marin, A. Rodríguez and R. Baeza-Yates

                                             Center for Web Research
                                                   www.cwr.cl


                                                LA-WEB 2004



C. Castillo, M. Marin, A. Rodríguez and R. Baeza-Yates                   Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling




      Motivation


      Algorithms


      Experiments


      Summary


      References








Motivation



              Web search generates more than 13% of the traffic to Web
              sites [StatMarket, 2003].
              No search engine indexes more than one third of the publicly
              available Web [Lawrence and Giles, 1998].
              If we cannot download all of the pages, we should at least
              download the most “important” ones.








The problem of Web crawling


      We must download pages with sizes given by P_i, over a connection
      of bandwidth B. Trivial solution: we download all the pages
      simultaneously at a speed proportional to the size of each page:

      $$B_i = \frac{P_i}{T^*}$$

      T^* is the optimal time to use all the available bandwidth:

      $$T^* = \frac{\sum_i P_i}{B}$$
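
As a small worked example of the two formulas on this slide, the sketch below computes T* and the per-page rates B_i. The function name, the page sizes and the bandwidth are illustrative assumptions, not values from the talk.

```python
def optimal_allocation(page_sizes, bandwidth):
    """Return (t_star, rates) for the idealized, unrestricted scenario.

    t_star = sum(P_i) / B and rates[i] = P_i / t_star, so the rates
    add up to the full bandwidth B and every download ends at t_star.
    """
    total = sum(page_sizes)
    t_star = total / bandwidth
    rates = [size / t_star for size in page_sizes]
    return t_star, rates


# Illustrative values only: sizes in KB, bandwidth in KB/s.
sizes = [10.0, 30.0, 60.0]
t_star, rates = optimal_allocation(sizes, bandwidth=50.0)
```

With these made-up numbers every page finishes at the same instant T*, and the individual rates sum to B, which is why no bandwidth is wasted in this scenario.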








Optimal scenario








Restrictions



              Robot exclusion protocol [Koster, 1995]
              Waiting time between accesses to the same Web site ≈ 10–30 seconds
              Web site bandwidth B_i^MAX is lower than the crawler bandwidth B
              The distribution of Web site sizes is very skewed








Distribution of site sizes








Realistic scenario








Number of active robots in a batch








Goal




      If each page has a certain score, capture most of the total value of
      this score by downloading just a fraction of the pages.
      We will use the total Pagerank of the downloaded set vs. the
      fraction of downloaded pages as a measure of quality.
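
A minimal sketch of this quality measure, assuming the score of every page is known after the fact: it walks the pages in the order a strategy downloaded them and records the fraction of the total score captured so far. The function name and the example scores are hypothetical.

```python
def cumulative_score_curve(scores_in_download_order):
    """Return (fraction_of_pages, fraction_of_total_score) points."""
    total = sum(scores_in_download_order)
    n = len(scores_in_download_order)
    curve = []
    running = 0.0
    for i, score in enumerate(scores_in_download_order, start=1):
        running += score
        curve.append((i / n, running / total))
    return curve


# Made-up scores, listed in the order the pages were downloaded.
scores = [0.5, 0.25, 0.125, 0.125]
curve = cumulative_score_curve(scores)
```

A strategy is better under this measure when its curve rises faster, that is, when a small fraction of downloaded pages already accounts for a large share of the total score.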








Algorithms




      Algorithms are based on a scheduler with two levels of queues:
              Queue of Web sites
              Queue of Web pages in each Web site
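
The two-level structure can be sketched as follows. This is only an illustration under stated assumptions: a higher score means higher priority, a site's priority is the score of its best pending page, and the site level is a simple linear scan (a real crawler would likely use a heap there too). The class name, method names and example hosts are hypothetical.

```python
import heapq
from collections import defaultdict


class TwoLevelScheduler:
    """Queue of Web sites, each holding its own queue of Web pages."""

    def __init__(self):
        # site -> heap of (-score, url); negated so the best page is first
        self._pages = defaultdict(list)

    def add_page(self, site, url, score):
        heapq.heappush(self._pages[site], (-score, url))

    def next_page(self):
        """Pop the best page of the best site, or None if nothing is queued."""
        if not self._pages:
            return None
        # Site level: pick the site whose top page has the best score.
        site = min(self._pages, key=lambda s: self._pages[s][0])
        # Page level: take that site's best pending page.
        neg_score, url = heapq.heappop(self._pages[site])
        if not self._pages[site]:
            del self._pages[site]
        return site, url, -neg_score


sched = TwoLevelScheduler()
sched.add_page("a.example", "http://a.example/", 0.9)
sched.add_page("a.example", "http://a.example/x", 0.2)
sched.add_page("b.example", "http://b.example/", 0.5)
order = [sched.next_page()[1] for _ in range(3)]
```

Keeping the page queues separate per site is what lets a scheduler apply per-site rules, such as a waiting time, without blocking pages from other sites.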








Queues used for the scheduling








Algorithms based on Pagerank


              Optimal/Oracle: crawler asks for the Pagerank value of each
              page in the frontier using an “Oracle”. This is not available in
              a real crawl as we do not have the entire graph
              The average relative error for estimating the Pagerank four
              months ahead is about 78% [Cho and Adams, 2004], so
              historical information from previous crawls is not too useful
              Batch-Pagerank: Pagerank calculations are executed over
              the subset of known pages [Cho et al., 1998]
              Partial-Pagerank: a “temporary” Pagerank value is assigned
              to the pages in between batch-Pagerank calculations
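
To make the batch idea concrete, here is a hedged sketch of a Pagerank computation restricted to the pages known so far. It is a plain power iteration, not necessarily the computation used in the talk; the damping factor of 0.85, the iteration count, the uniform handling of pages without known out-links and the toy graph are all assumptions.

```python
def batch_pagerank(links, damping=0.85, iterations=50):
    """Power-iteration Pagerank over the known pages only.

    links maps each known page to the pages it links to. Links that
    point outside the known set are ignored, and the rank of pages
    with no known out-links is spread uniformly over all known pages.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = [q for q in links[p] if q in links]
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank


# Toy graph of known pages (hypothetical names).
known = {"home": ["a", "b"], "a": ["home"], "b": ["home", "a"]}
rank = batch_pagerank(known)
```

In a crawl, a computation like this would be rerun each time a batch of new pages has been downloaded, since the known subgraph keeps growing.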







Algorithms not based on Pagerank



              Depth: pages are given a priority based on their depths. This
              is graph traversal in breadth-first ordering
              [Najork and Wiener, 2001]
              Length: pages from the Web sites which seem to be bigger
              are crawled first. We do not know which are really the bigger
              Web sites until the end of the crawl. We use partial
              information
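
The two orderings can be sketched as simple sort keys over the frontier. This is an illustration only: the dictionary layout, the function names and the example hosts are hypothetical, and estimating a site's size by the number of its pages currently known is an assumption about what "partial information" means here.

```python
from collections import Counter


def order_by_depth(frontier):
    """Depth strategy: shallower pages first (breadth-first order)."""
    return sorted(frontier, key=lambda page: page["depth"])


def order_by_length(frontier):
    """Length strategy: pages from the sites that currently look
    biggest (most pages known so far) come first."""
    known_size = Counter(page["site"] for page in frontier)
    return sorted(frontier, key=lambda page: -known_size[page["site"]])


frontier = [
    {"url": "big.example/a/b", "site": "big.example", "depth": 2},
    {"url": "small.example/", "site": "small.example", "depth": 0},
    {"url": "big.example/a", "site": "big.example", "depth": 1},
]
by_depth = [p["url"] for p in order_by_depth(frontier)]
by_length = [p["url"] for p in order_by_length(frontier)]
```

Neither ordering needs link-analysis scores, which makes both cheap to maintain while the crawl is running.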








Experiments



              Download a sample of pages using the WIRE crawler
              [Baeza-Yates and Castillo, 2002]
              3.5 million pages from over 50,000 Web sites in .CL
              At most 25,000 pages from each Web site
              Strategies are simulated on a graph built using actual data
              Simulation includes: bandwidth saturation, network speed of
              different Web sites, page sizes, waiting time, latency, etc.








Simulation parameters



              Algorithm
              Waiting time between pages from the same Web site, w
              Number of pages downloaded per connection when re-using
              the HTTP connection, k
              Number of robots, r
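
The parameters above can be grouped into a small configuration object, with a rough lower bound showing why w and k matter. This is a sketch under assumptions: the class and function names are hypothetical, and it assumes the wait w is applied between consecutive connections to a site, each of which fetches up to k pages. The example values (w = 15 s, 25,000 pages) fall within the ranges mentioned on other slides.

```python
import math
from dataclasses import dataclass


@dataclass
class SimulationParams:
    algorithm: str  # scheduling strategy name, e.g. "length"
    w: float        # waiting time in seconds between accesses to one site
    k: int          # pages fetched per (re-used) HTTP connection
    r: int          # number of robots running in parallel


def min_site_waiting_time(n_pages, params):
    """Lower bound on the time spent waiting on a single Web site.

    Assumes ceil(n_pages / k) connections with a pause of w seconds
    between consecutive connections; transfer time is ignored.
    """
    connections = math.ceil(n_pages / params.k)
    return max(connections - 1, 0) * params.w


one_page = SimulationParams(algorithm="length", w=15.0, k=1, r=1)
reuse = SimulationParams(algorithm="length", w=15.0, k=100, r=1)
wait_one_page = min_site_waiting_time(25000, one_page)
wait_reuse = min_site_waiting_time(25000, reuse)
```

Under these assumptions a single large site takes days of pure waiting when k = 1, but about an hour when 100 pages share a connection, regardless of how many robots are available.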








Results with one robot








Results with many robots








Speed-ups with the “Length” strategy








Crawling the real Web using the “Length” strategy








Pagerank vs day of crawl








Depth is not correlated with Pagerank
      When depth is ≥ 2 links from the home page








Summary



              The restrictions, especially waiting time, create a difficult
              problem for scheduling
              A strategy with an “oracle” was too greedy
              We try to keep Web sites in the frontier for as long as
              possible, so we always have several Web sites to choose from
              Simulation ensures the same conditions, which is critical
              because the Web is very dynamic








Open problems




              Scheduling using historical information
              Exploiting the Web’s structure
              Adversarial IR: Spam detection before downloading the pages








References

             Baeza-Yates, R. and Castillo, C. (2002).
             Balancing volume, quality and freshness in web crawling.
             In Soft Computing Systems - Design, Management and
             Applications, pages 565–572, Santiago, Chile. IOS Press,
             Amsterdam.
             Cho, J. and Adams, R. (2004).
             Page quality: In search of an unbiased Web ranking.
             Technical report, UCLA Computer Science.
             Cho, J., García-Molina, H., and Page, L. (1998).
             Efficient crawling through URL ordering.
             In Proceedings of the Seventh Conference on World Wide Web,
             Brisbane, Australia.
             Koster, M. (1995).
             Robots in the web: threat or treat?
             ConneXions, 9(4).




             Lawrence, S. and Giles, C. L. (1998).
             Searching the World Wide Web.
             Science, 280(5360):98–100.
             Najork, M. and Wiener, J. L. (2001).
             Breadth-first crawling yields high-quality pages.
             In Proceedings of the Tenth Conference on World Wide Web,
             pages 114–118, Hong Kong. Elsevier Science.
             StatMarket (2003).
             Search engine referrals nearly double worldwide.
             http://websidestory.com/pressroom/pressreleases.html?id=181




C. Castillo, M. Marin, A. Rodr´
                              ıguez and R. Baeza-Yates                 Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline            Motivation             Algorithms     Experiments      Summary              References




C. Castillo, M. Marin, A. Rodr´
                              ıguez and R. Baeza-Yates                 Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline            Motivation             Algorithms     Experiments      Summary              References




C. Castillo, M. Marin, A. Rodr´
                              ıguez and R. Baeza-Yates                 Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling

Más contenido relacionado

La actualidad más candente

Applications of Artificial Intelligence
Applications of Artificial IntelligenceApplications of Artificial Intelligence
Applications of Artificial IntelligenceMehr Un Nisa Manjotho
 
Three big questions about AI in financial services
Three big questions about AI in financial servicesThree big questions about AI in financial services
Three big questions about AI in financial servicesWhite & Case
 
How Does Google Use Artificial Intelligence?
How Does Google Use Artificial Intelligence?How Does Google Use Artificial Intelligence?
How Does Google Use Artificial Intelligence?Bernard Marr
 
Artificial Intelligence in Project Management by Dr. Khaled A. Hamdy
Scheduling Algorithms for Web Crawling

  • 1. Scheduling Algorithms for Web Crawling. C. Castillo, M. Marin, A. Rodríguez and R. Baeza-Yates, Center for Web Research (www.cwr.cl). LA-WEB 2004.
  • 2. Outline: Motivation; Algorithms; Experiments; Summary; References.
  • 3. Motivation: Web search generates more than 13% of the traffic to Web sites [StatMarket, 2003]. No search engine indexes more than one third of the publicly available Web [Lawrence and Giles, 1998]. If we cannot download all of the pages, we should at least download the most “important” ones.
  • 6. The problem of Web crawling: We must download pages with sizes given by $P_i$ over a connection of bandwidth $B$. Trivial solution: we download all the pages simultaneously, each at a speed proportional to its size, $B_i = P_i / T^*$, where $T^*$ is the optimal time needed to use all the available bandwidth, $T^* = \sum_i P_i / B$.
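The trivial solution can be checked numerically. The following is a minimal sketch, assuming hypothetical page sizes in bytes and a bandwidth in bytes per second; the function name `trivial_schedule` is illustrative and does not come from the slides.

```python
def trivial_schedule(page_sizes, bandwidth):
    """Return (T*, per-page speeds B_i) for the all-at-once schedule."""
    t_star = sum(page_sizes) / bandwidth        # T* = sum_i P_i / B
    speeds = [p / t_star for p in page_sizes]   # B_i = P_i / T*
    return t_star, speeds


# Hypothetical values, chosen only to illustrate the formulas.
sizes = [10_000, 30_000, 60_000]   # P_i, in bytes
t_star, speeds = trivial_schedule(sizes, bandwidth=20_000)

# Every page finishes at the same instant T*, and the speeds add up to B.
finish_times = [p / s for p, s in zip(sizes, speeds)]
```

Because each speed is proportional to the page size, all downloads end together at $T^*$ and the connection is fully used for the whole crawl.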
  • 7. [Figure: Optimal scenario]
  • 8. Restrictions: Robot exclusion protocol [Koster, 1995]. Waiting time ≈ 10–30 seconds. Web sites' bandwidth $B_i^{MAX}$ is lower than the crawler bandwidth $B$. The distribution of Web site sizes is very skewed.
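The first three restrictions can be expressed as small checks. This is only a sketch: the 15-second wait is an assumed value inside the 10–30 second range, the robots.txt text and the `examplebot` user agent are made up, and the helper names are hypothetical. It uses Python's standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser

WAIT_SECONDS = 15.0  # assumed value within the 10-30 second range

# Made-up robots.txt content, parsed locally without any network access.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"


def allowed(robots_txt, user_agent, url):
    """Robot exclusion protocol: may this user agent fetch this URL?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def next_allowed_time(last_access, wait=WAIT_SECONDS):
    """Earliest time at which the same Web site may be contacted again."""
    return last_access + wait


def effective_speed(crawler_bandwidth, site_max_bandwidth):
    """A download cannot go faster than the slower of the two ends."""
    return min(crawler_bandwidth, site_max_bandwidth)
```

With these restrictions the crawler can no longer give every page the speed $P_i / T^*$, which is why the order in which pages are downloaded matters.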
  • 12. [Figure: Distribution of site sizes]
  • 13. [Figure: Realistic scenario]
  • 14. [Figure: Number of active robots in a batch]
  • 15. Goal: If each page has a certain score, capture most of the total value of this score while downloading just a fraction of the pages. We will use the total Pagerank of the downloaded set vs. the fraction of downloaded pages as a measure of quality.
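The quality measure can be sketched as a curve of captured score against the fraction of pages downloaded. The scores and page names below are invented for illustration, and `cumulative_score_curve` is a hypothetical helper, not code from the talk.

```python
def cumulative_score_curve(download_order, score):
    """Return (fraction of pages downloaded, fraction of score captured) pairs."""
    total = sum(score.values())
    n = len(score)
    curve = []
    captured = 0.0
    for i, page in enumerate(download_order, start=1):
        captured += score[page]
        curve.append((i / n, captured / total))
    return curve


# Invented Pagerank-like scores for four pages.
score = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}

good = cumulative_score_curve(["a", "b", "c", "d"], score)  # best pages first
bad = cumulative_score_curve(["d", "c", "b", "a"], score)   # best pages last
```

A better strategy is one whose curve rises faster: here the first ordering has captured about 80% of the score after half of the pages, while the second has captured about 20%.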
  • 16. Algorithms: Algorithms are based on a scheduler with two levels of queues: a queue of Web sites, and a queue of Web pages in each Web site.
  • 19. [Figure: Queues used for the scheduling]
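A scheduler with two levels of queues might look like the sketch below. It is not the authors' implementation: ordering the site queue by the earliest time a site may be contacted again is an assumption made here for simplicity, and the class and method names are hypothetical.

```python
import heapq


class TwoLevelScheduler:
    """Queue of Web sites, plus a priority queue of pages inside each site."""

    def __init__(self, wait):
        self.wait = wait            # waiting time between accesses to a site
        self.pages = {}             # site -> heap of (-priority, url)
        self.sites = []             # heap of (next allowed time, site)
        self.queued_sites = set()   # sites currently in the site queue

    def add_page(self, site, url, priority, now=0.0):
        heapq.heappush(self.pages.setdefault(site, []), (-priority, url))
        if site not in self.queued_sites:
            heapq.heappush(self.sites, (now, site))
            self.queued_sites.add(site)

    def next_page(self, now):
        """Return (site, url) for the next download, or None if all sites must wait."""
        if not self.sites or self.sites[0][0] > now:
            return None
        _, site = heapq.heappop(self.sites)
        _, url = heapq.heappop(self.pages[site])
        if self.pages[site]:
            # The site has more pages: it becomes available again after the wait.
            heapq.heappush(self.sites, (now + self.wait, site))
        else:
            self.queued_sites.discard(site)
            del self.pages[site]
        return site, url


scheduler = TwoLevelScheduler(wait=15.0)
scheduler.add_page("a.cl", "a.cl/1", priority=0.9)
scheduler.add_page("a.cl", "a.cl/2", priority=0.5)
scheduler.add_page("b.cl", "b.cl/1", priority=0.1)
```

The page-level priority is where the strategies differ (Pagerank estimate, depth, or site length); the site-level queue is what enforces the waiting time.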
  • 20. Algorithms based on Pagerank: Optimal/Oracle: the crawler asks an “Oracle” for the Pagerank value of each page in the frontier. This is not available in a real crawl, as we do not have the entire graph. The average relative error for estimating the Pagerank four months ahead is about 78% [Cho and Adams, 2004], so historical information from previous crawls is not too useful. Batch-Pagerank: Pagerank calculations are executed over the subset of known pages [Cho et al., 1998]. Partial-Pagerank: a “temporary” Pagerank value is assigned to the pages in between batch-Pagerank calculations.
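The Batch-Pagerank idea, computing Pagerank over only the pages known so far, can be sketched with a plain power iteration. The damping factor of 0.85, the fixed iteration count, and the treatment of pages without known out-links are assumptions of this sketch, and the tiny graph is invented.

```python
def batch_pagerank(links, damping=0.85, iterations=50):
    """Power-iteration Pagerank over the currently known subgraph.

    links maps each crawled page to the list of pages it links to;
    pages that appear only as targets are known but not yet crawled.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # No known out-links: spread this page's rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Invented subgraph of known pages.
known = {"home": ["a", "b"], "a": ["home"], "b": ["home", "a"]}
ranks = batch_pagerank(known)
```

Between two such batch computations, newly discovered pages have no value yet, which is the gap that the Partial-Pagerank strategy fills with a temporary estimate.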
  • 23. Algorithms not based on Pagerank: Depth: pages are given a priority based on their depths. This is graph traversal in breadth-first ordering [Najork and Wiener, 2001]. Length: pages from the Web sites which seem to be bigger are crawled first. We do not know which Web sites are really the bigger ones until the end of the crawl, so we use partial information.
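Both strategies can be sketched in a few lines. For "Length", this sketch assumes that the partial information is the number of pages currently pending for each site; that choice, the function names, and the sample data are illustrative assumptions.

```python
from collections import Counter, deque


def depth_order(links, home):
    """Depth strategy: breadth-first traversal starting at the home page."""
    seen = {home}
    order = []
    queue = deque([home])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


def length_order(frontier):
    """Length strategy: pages of the sites that currently look bigger go first.

    frontier is a list of (site, url) pairs; a site's apparent size is the
    number of its pages pending in the frontier (partial information).
    """
    pending = Counter(site for site, _ in frontier)
    # sorted() is stable, so pages of equally sized sites keep their order.
    return sorted(frontier, key=lambda item: -pending[item[0]])


site_links = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c"]}
bfs = depth_order(site_links, "/")

frontier = [("small.cl", "s1"), ("big.cl", "b1"), ("big.cl", "b2")]
by_length = length_order(frontier)
```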
  • 25. Experiments: Download a sample of pages using the WIRE crawler [Baeza-Yates and Castillo, 2002]: 3.5 million pages from over 50,000 Web sites in .CL, with at most 25,000 pages from each Web site. Strategies are simulated on a graph built using actual data. The simulation includes bandwidth saturation, network speed of different Web sites, page sizes, waiting time, latency, etc.
  • 30. Simulation parameters: the algorithm; the waiting time between pages from the same Web site, $w$; the number of pages downloaded per connection when re-using the HTTP connection, $k$; the number of robots, $r$.
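The parameters can be grouped in a small record, and a rough lower bound shows how $w$ and $k$ interact. The sketch assumes that the wait applies between consecutive connections to the same site and ignores transfer time; the class, the function, and the sample values (a 15-second wait, the "length" label) are illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class SimulationParameters:
    algorithm: str   # name of the scheduling strategy
    w: float         # waiting time between accesses to the same site, seconds
    k: int           # pages downloaded per HTTP connection
    r: int           # number of robots


def min_waiting_time(pages_in_site, params):
    """Lower bound on the time spent waiting while crawling one site."""
    connections = math.ceil(pages_in_site / params.k)
    return max(connections - 1, 0) * params.w


one_page = SimulationParameters(algorithm="length", w=15.0, k=1, r=1)
hundred_pages = SimulationParameters(algorithm="length", w=15.0, k=100, r=1)

# A site with 25,000 pages, the per-site cap used in the experiments.
slow = min_waiting_time(25_000, one_page)
fast = min_waiting_time(25_000, hundred_pages)
```

Under these assumptions the largest sites dominate the total crawl time unless several pages are fetched per connection, which is consistent with the waiting time being described as the hardest restriction.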
  • 34. [Figure: Results with one robot]
  • 35. [Figure: Results with many robots]
  • 36. [Figure: Speed-ups with the “Length” strategy]
  • 37. [Figure: Crawling the real Web using the “Length” strategy]
  • 38. [Figure: Pagerank vs day of crawl]
  • 39. Depth is not correlated with Pagerank when depth is ≥ 2 links from the home page.
  • 40. Summary: The restrictions, especially the waiting time, make scheduling a difficult problem. A strategy with an “oracle” was too greedy. We try to keep Web sites in the frontier for as long as possible, so that we always have several Web sites to choose from. Simulation ensures the same conditions across experiments, which is critical because the Web is very dynamic.
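The third point of the summary can be sketched as a site-selection rule. This is a hedged illustration only: it assumes that a queue-length-based choice, serving the available site with the most pending pages first, is one way to avoid ending the crawl with a few large sites that each force a wait. The function name `pick_site` and the data layout are hypothetical, and these slides do not restate the exact definition of the “Length” strategy.

```python
def pick_site(pending, next_allowed, now):
    """Choose the next Web site to contact, or None if none is available.

    pending:      dict mapping a site name to its list of pending URLs
    next_allowed: dict mapping a site name to the earliest time it may be
                  contacted again (sites that are absent may be contacted now)
    now:          current simulation time
    """
    available = [
        site
        for site, urls in pending.items()
        if urls and next_allowed.get(site, 0.0) <= now
    ]
    if not available:
        return None
    # Prefer the site with the most pending pages, so large sites are drained
    # early; ties are broken by site name to keep the choice deterministic.
    return max(available, key=lambda site: (len(pending[site]), site))


if __name__ == "__main__":
    pending = {"a.example": ["/1", "/2", "/3"], "b.example": ["/1"]}
    next_allowed = {"a.example": 20.0}
    print(pick_site(pending, next_allowed, now=0.0))   # b.example
    print(pick_site(pending, next_allowed, now=20.0))  # a.example
```

Draining the largest queues first keeps more distinct sites in the frontier toward the end of the crawl, so a robot is less likely to sit idle while a single site's waiting time expires.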
  • 44. Open problems: scheduling using historical information; exploiting the Web’s structure; adversarial IR (spam detection before downloading the pages).
  • 47. References: Baeza-Yates, R. and Castillo, C. (2002). Balancing volume, quality and freshness in web crawling. In Soft Computing Systems - Design, Management and Applications, pages 565–572, Santiago, Chile. IOS Press Amsterdam. Cho, J. and Adams, R. (2004). Page quality: In search of an unbiased Web ranking. Technical report, UCLA Computer Science. Cho, J., García-Molina, H., and Page, L. (1998). Efficient crawling through URL ordering. In Proceedings of the seventh conference on World Wide Web, Brisbane, Australia. Koster, M. (1995). Robots in the web: threat or treat? ConneXions, 9(4).
  • 48. References (continued): Lawrence, S. and Giles, C. L. (1998). Searching the World Wide Web. Science, 280(5360):98–100. Najork, M. and Wiener, J. L. (2001). Breadth-first crawling yields high-quality pages. In Proceedings of the Tenth Conference on World Wide Web, pages 114–118, Hong Kong. Elsevier Science. StatMarket (2003). Search engine referrals nearly double worldwide. http://websidestory.com/pressroom/pressreleases.html?id=181.