Outline                    Introduction           Models             Experiments                 Summary




                                    Crawling the Infinite Web:
                                     Five Levels are Enough

                                 Ricardo Baeza-Yates and Carlos Castillo

                                           Center for Web Research
                                                 www.cwr.cl


                                               WAW 2004


R. Baeza-Yates and C. Castillo                                                     Center for Web Research
Crawling the Infinite Web




          1 Introduction


          2 Models


          3 Experiments


          4 Summary








Introduction



               Dynamic page: “a page which is created on request”
               Dynamic pages with links to other dynamic pages
               Malicious: loops and/or near-duplicates
               Legitimate: recommendation systems, calendars, iterative
               algorithms, etc.
               The number of pages on the Web can be considered infinite








Conflicting interests



               Web site administrator: would like to have all of the Web
               site indexed
               Search engine administrator: would like to use the available
               network and storage capacity efficiently
               Search engine user: would like to find what they are looking for








Our approach



               Users do not go very deep inside Web sites
               If something is important, it has to be easily reachable
               We will download only a few levels of each Web site
               How many levels?
               How much do you lose?








Models
Navigating a tree ≈ Moving through levels








Actions
Possible actions at a given level








Type of models we study



               There is a set of atomic actions
               $A = \{next, start/jump, back, stay, prev, fwd\}$
               $\Pr(action \mid \ell)$ is the probability of taking an action at level $\ell$
               $\sum_{action \in A} \Pr(action \mid \ell) = 1$
               The probability $\Pr(next \mid \ell)$ is constant across levels
               Stationary distribution → how much time users spend at each
               level
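
As a rough illustration of the last point, the stationary distribution of a small chain over levels can be approximated by power iteration. The 3-level transition matrix below is hypothetical (its probabilities are not taken from the talk), and the function name `stationary_distribution` is an assumption made for this sketch.

```python
def stationary_distribution(P, iterations=10_000, tol=1e-12):
    """Power iteration: apply x <- x P until x stops changing."""
    n = len(P)
    x = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x


# Hypothetical site truncated to 3 levels.
# Row i holds Pr(move to level j | currently at level i); each row sums to 1.
P = [
    [0.5, 0.5, 0.0],  # level 0: stay 0.5, next 0.5
    [0.3, 0.2, 0.5],  # level 1: back 0.3, stay 0.2, next 0.5
    [0.6, 0.4, 0.0],  # level 2 (deepest kept): start 0.6, back 0.4
]

pi = stationary_distribution(P)
for level, share in enumerate(pi):
    print(f"level {level}: {share:.4f}")
```

Each entry of `pi` can be read as the long-run fraction of time a user of this toy chain spends at that level.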








Model A
Forwards and backwards one level at a time




                                          Birth and death process
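
A possible worked derivation, under assumptions that the slide does not spell out: the user moves to the next level with probability $q$, moves back one level with probability $1-q$, and at level 0 a "back" step stays at level 0. The symbols $x_i$ (stationary probability of level $i$) and $r$ are introduced here for illustration only.

```latex
% Balance across the cut between levels i and i+1:
q\,x_i = (1-q)\,x_{i+1}
\quad\Longrightarrow\quad
x_i = x_0\,r^{\,i}, \qquad r = \frac{q}{1-q}.

% Normalization (requires r < 1, i.e. q < 1/2):
\sum_{i \ge 0} x_i = 1
\quad\Longrightarrow\quad
x_0 = 1 - r = \frac{1-2q}{1-q}.

% Cumulative probability of levels 0..k:
\sum_{i=0}^{k} x_i = 1 - r^{\,k+1}.
```

Under these assumptions a stationary distribution exists only for $q < 1/2$.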








Model B
Back to first level




                                 Birth and death process with extinction
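
A possible worked derivation, assuming that from every level the user either advances with probability $q$ or returns to the first level (level 0) with probability $1-q$. The symbol $x_i$ for the stationary probability of level $i$ is introduced here for illustration.

```latex
% Level 0 is entered from every level with probability 1-q:
x_0 = (1-q)\sum_{j \ge 0} x_j = 1-q.

% Level i >= 1 is entered only from level i-1:
x_i = q\,x_{i-1}
\quad\Longrightarrow\quad
x_i = (1-q)\,q^{\,i}.

% Cumulative probability of levels 0..k:
\sum_{i=0}^{k} x_i = 1 - q^{\,k+1}.
```

For example, with $q = 0.5$ this gives $1 - 0.5^{5} \approx 0.97$ for levels 0 to 4.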







Model C
Back to any previous level




                      Birth and death process with extinction and disaster?








Cumulative probability of levels 0 . . . k
Based on solutions given in the paper








Experiments




               Anonymized access logs for 13 Web sites
               Educational - Commercial - Reference - Organization - Blogs
               Analysis of access logs to extract ≈ 250,000 user sessions








Distribution of visits per level








Model fitting
          Code                Type             Country   Model      q         Error
           E1              Educational          Chile     B        0.51      0.88%
           E2              Educational          Spain     B        0.51      2.29%
           E3              Educational           US       B        0.64      0.72%
           C1              Commercial           Chile     B        0.55      0.39%
           C2              Commercial           Chile     B        0.62      5.17%
           R1               Reference           Chile     B        0.54      2.96%
           R2               Reference           Chile     B        0.59      2.75%
           O1             Organization          Italy     C        0.35      2.27%
           O2             Organization           US       B        0.62      2.31%
          OB1          Organization + Blog      Chile     B        0.65      2.07%
          OB2          Organization + Blog      Chile     B        0.72      0.35%
           B1                 Blog              Chile     C        0.79      0.88%
           B2                 Blog              Chile     C        0.63      1.01%
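
To get a feel for what the fitted q values imply, the sketch below takes the sites the table assigns to Model B and asks how deep one must go to cover 90% of visits. It assumes that, under Model B (advance with probability q, otherwise return to the first level), the share of visits at levels 0..k is 1 - q^(k+1); this form and the helper names `coverage` and `depth_for` are assumptions of the sketch, not quoted from the paper.

```python
# Fitted q for the sites listed with Model B in the table above.
MODEL_B_Q = {
    "E1": 0.51, "E2": 0.51, "E3": 0.64, "C1": 0.55, "C2": 0.62,
    "R1": 0.54, "R2": 0.59, "O2": 0.62, "OB1": 0.65, "OB2": 0.72,
}


def coverage(q, k):
    """Assumed share of visits at levels 0..k: 1 - q**(k + 1)."""
    return 1.0 - q ** (k + 1)


def depth_for(q, target=0.9):
    """Smallest level k such that levels 0..k reach the target share."""
    k = 0
    while coverage(q, k) < target:
        k += 1
    return k


depths = {code: depth_for(q) for code, q in MODEL_B_Q.items()}
for code, k in depths.items():
    print(f"{code}: q={MODEL_B_Q[code]:.2f} -> levels 0..{k}")
```

Larger q means users keep clicking deeper, so the required depth grows with q.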




Observed distribution of transitions

          Level         Obs.          Next    Start      Jump    Back      Stay        Prev
            0         247985          0.457     –        0.527     –       0.008        –
            1         120482          0.459     –        0.332   0.185     0.017        –
            2          70911          0.462   0.111      0.235   0.171     0.014        –
            3          42311          0.497   0.065      0.186   0.159     0.017      0.069
            4          27129          0.514   0.057      0.157   0.171     0.009      0.088
            5          17544          0.549   0.048      0.138   0.143     0.009      0.108
            6         10296           0.555   0.037      0.133   0.155     0.009      0.106
            7          6326           0.596   0.033      0.135   0.113     0.006      0.113
            8          4200           0.637   0.024      0.104   0.127     0.006      0.096
            9          2782           0.663   0.015      0.108   0.113     0.006      0.089
           10           2089          0.662   0.037      0.084   0.120     0.005      0.086


          Pr(next) is not constant: if you have already spent some time in the
                    Web site, you are likely to spend some more
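
As a quick check on the table, the sketch below accumulates the Obs. column by level. It treats the observation counts as a proxy for visits at each level and only covers levels 0 to 10 as listed, so the shares are relative to those rows; the helper name `cumulative_share` is an assumption of this sketch.

```python
# "Obs." column of the table above, levels 0..10.
OBS = [247985, 120482, 70911, 42311, 27129, 17544,
       10296, 6326, 4200, 2782, 2089]


def cumulative_share(counts):
    """Running fraction of all observations seen at levels 0..k."""
    total = sum(counts)
    running, shares = 0, []
    for count in counts:
        running += count
        shares.append(running / total)
    return shares


shares = cumulative_share(OBS)
# First level at which the cumulative share reaches 90%.
first_level_90 = next(k for k, s in enumerate(shares) if s >= 0.9)

for level, share in enumerate(shares):
    print(f"levels 0..{level}: {share:.3f}")
```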





PageRank and depth
Cumulative PageRank by levels in the Chilean Web








PageRank and depth
Correlation of PageRank and depth is low at deeper levels








Summary



               90% of the visits are at most 4-5 clicks away from the home
               page, except in blogs
               Simple models try to explain this behavior
               In the paper: explicit methodology, closed solutions to the
               models, references








Open problems


               A model which better fits empirical data
               Analyzing blogs
               Analyzing the textual content of pages to decide when to stop
               Relationship of this with the spam detection problem
               Try adaptive strategies: which are the factors that affect the
               desired crawling depth in a Web site?
               There are other ways of defining which pages to download
               from an infinite set








          Questions and comments . . .




A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...
 
Introduction to ASP.NET MVC
Introduction to ASP.NET MVCIntroduction to ASP.NET MVC
Introduction to ASP.NET MVC
 
Why the Data Train Needs Semantic Rails -- The Case of Linked Scientometrics ...
Why the Data Train Needs Semantic Rails -- The Case of Linked Scientometrics ...Why the Data Train Needs Semantic Rails -- The Case of Linked Scientometrics ...
Why the Data Train Needs Semantic Rails -- The Case of Linked Scientometrics ...
 
Minning www
Minning wwwMinning www
Minning www
 
RDA Web service discoverability workshop
RDA Web service discoverability workshopRDA Web service discoverability workshop
RDA Web service discoverability workshop
 
Inverted textindexing
Inverted textindexingInverted textindexing
Inverted textindexing
 
Beyond the Page
Beyond the PageBeyond the Page
Beyond the Page
 
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench ToolEvaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
 

Crawling the Infinite Web (WAW 2004 Rome)

  • 1. Outline Introduction Models Experiments Summary Crawling the Infinite Web: Five Levels are Enough Ricardo Baeza-Yates and Carlos Castillo Center for Web Research www.cwr.cl WAW 2004 R. Baeza-Yates and C. Castillo Center for Web Research Crawling the Infinite Web
  • 2. Outline: 1 Introduction; 2 Models; 3 Experiments; 4 Summary
  • 3. Introduction. Dynamic page: “a page which is created on request”. Dynamic pages with links to other dynamic pages. Malicious: loops and/or near-duplicates. Legitimate: recommendation systems, calendars, iterative algorithms, etc. The number of pages on the Web can be considered infinite.
  • 8. Conflicting interests. Web site administrator: would like to have all of the Web site indexed. Search engine administrator: would like to use the available network and storage capacity efficiently. Search engine user: would like to find what they are looking for.
  • 11. Our approach. Users do not go very deep inside Web sites. If something is important, it has to be easily reachable. We will download only a few levels of each Web site. How many levels? How much do you lose?
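
The idea of downloading only a few levels of each Web site can be sketched as a breadth-first crawl with a depth cut-off. This is a minimal illustration, not the authors' crawler: the function names, the `max_levels` parameter and the synthetic calendar site are all assumptions made for the example, and no network access is involved.

```python
from collections import deque


def crawl_limited(home, get_links, max_levels=5):
    """Breadth-first crawl that keeps only pages on levels 0..max_levels-1.

    `get_links` is a hypothetical callback returning the links found on a page.
    The result maps each discovered page to its level (clicks from `home`).
    """
    depth = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        if depth[page] + 1 >= max_levels:
            continue  # links from this page would land beyond the cut-off
        for target in get_links(page):
            if target not in depth:  # first visit gives the shortest path
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


def calendar_links(page):
    """Hypothetical infinite site: every month page links to the next month."""
    month = int(page.rsplit("/", 1)[1])
    return [f"/calendar/{month + 1}"]


pages = crawl_limited("/calendar/0", calendar_links, max_levels=5)
print(sorted(pages.items(), key=lambda item: item[1]))
```

Without the cut-off the calendar site would never be exhausted; with it, the crawl stops after a fixed number of levels regardless of how many pages the site can generate.
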
  • 16. Models. Navigating a tree ≈ moving through levels.
  • 17. Actions. Possible actions at a given level.
  • 18. Type of models we study. There is a set of atomic actions $A = \{\text{next}, \text{start/jump}, \text{back}, \text{stay}, \text{prev}, \text{fwd}\}$. $\Pr(\text{action} \mid \ell)$ is the probability of taking an action at level $\ell$, with $\sum_{\text{action} \in A} \Pr(\text{action} \mid \ell) = 1$. The probability $\Pr(\text{next} \mid \ell)$ is constant. Stationary distribution → how much time users spend at each level.
  • 24. Model A. Forwards and backwards one level at a time. Birth and death process.
  • 26. Model B. Back to first level. Birth and death process with extinction.
  • 28. Model C. Back to any previous level. Birth and death process with extinction and disaster?
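
The three models can be explored numerically as Markov chains over levels. The sketch below is one possible reading of the slides, not the paper's formulation: it assumes a single parameter `q` for the probability of "next", truncates the chain at a finite number of levels, and makes boundary choices (marked in the comments) that the slides do not specify. For Model C it further assumes that the user goes back to a level chosen uniformly among levels 0 up to the current one. The stationary distribution is approximated by power iteration.

```python
def model_a(q, levels):
    """Model A: next with probability q, back one level otherwise."""
    P = [[0.0] * levels for _ in range(levels)]
    for i in range(levels):
        P[i][min(i + 1, levels - 1)] += q   # assumption: "next" at the cut-off stays put
        P[i][max(i - 1, 0)] += 1.0 - q      # assumption: "back" at level 0 stays at 0
    return P


def model_b(q, levels):
    """Model B: next with probability q, back to the first level otherwise."""
    P = [[0.0] * levels for _ in range(levels)]
    for i in range(levels):
        P[i][i + 1 if i + 1 < levels else 0] += q  # assumption: wrap to 0 at the cut-off
        P[i][0] += 1.0 - q
    return P


def model_c(q, levels):
    """Model C: next with probability q, otherwise a uniform level in 0..i (assumption)."""
    P = [[0.0] * levels for _ in range(levels)]
    for i in range(levels):
        P[i][min(i + 1, levels - 1)] += q
        for j in range(i + 1):
            P[i][j] += (1.0 - q) / (i + 1)
    return P


def stationary(P, iterations=1000):
    """Approximate the stationary distribution by repeated multiplication x <- xP."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(iterations):
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
    return x


def cumulative(x, k):
    """Probability mass on levels 0..k."""
    return sum(x[: k + 1])


LEVELS, Q = 30, 0.4  # illustrative values, not taken from the talk
for name, build in (("A", model_a), ("B", model_b), ("C", model_c)):
    x = stationary(build(Q, LEVELS))
    print(f"Model {name}: mass on levels 0..4 = {cumulative(x, 4):.4f}")
```

Note that under this reading Model A only has a meaningful stationary distribution on an unbounded chain when `q` is below 0.5; for larger values the result is dominated by the truncation.
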
  • 29. Cumulative probability of levels 0…k. Based on solutions given in the paper.
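
As a worked example of such a cumulative probability, consider Model B under the assumption that the user moves to the next level with a constant probability $q$ and otherwise returns to level 0. The derivation below follows from that reading alone and is not copied from the paper; $x_i$ denotes the stationary probability of level $i$ and is a symbol introduced here.

```latex
% Assumption: Model B = "next" with probability q, otherwise back to level 0.
% Level i+1 can only be entered from level i, hence the balance equation:
\[ x_{i+1} = q \, x_i \qquad (i \ge 0) \]
% Normalising over all levels gives a geometric distribution:
\[ \sum_{i \ge 0} x_i = 1 \;\Longrightarrow\; x_i = (1 - q) \, q^{i} \]
% Cumulative probability of levels 0..k:
\[ \sum_{i=0}^{k} x_i = 1 - q^{k+1} \]
```

For instance, with $q = 0.5$ and $k = 4$ the cumulative probability is $1 - 0.5^{5} = 0.96875$, so about 97% of the time would be spent on the first five levels.
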
  • 30. Experiments. Anonymized access logs for 13 Web sites: Educational, Commercial, Reference, Organization, Blogs. Analysis of the access logs to extract ≈ 250,000 user sessions.
  • 33. Distribution of visits per level.
  • 34. Model fitting.

    Code  Type                 Country  Model  q     Error
    E1    Educational          Chile    B      0.51  0.88%
    E2    Educational          Spain    B      0.51  2.29%
    E3    Educational          US       B      0.64  0.72%
    C1    Commercial           Chile    B      0.55  0.39%
    C2    Commercial           Chile    B      0.62  5.17%
    R1    Reference            Chile    B      0.54  2.96%
    R2    Reference            Chile    B      0.59  2.75%
    O1    Organization         Italy    C      0.35  2.27%
    O2    Organization         US       B      0.62  2.31%
    OB1   Organization + Blog  Chile    B      0.65  2.07%
    OB2   Organization + Blog  Chile    B      0.72  0.35%
    B1    Blog                 Chile    C      0.79  0.88%
    B2    Blog                 Chile    C      0.63  1.01%
  • 36. Observed distribution of transitions.

    Level  Obs.    Next   Start  Jump   Back   Stay   Prev
    0      247985  0.457  –      0.527  –      0.008  –
    1      120482  0.459  –      0.332  0.185  0.017  –
    2      70911   0.462  0.111  0.235  0.171  0.014  –
    3      42311   0.497  0.065  0.186  0.159  0.017  0.069
    4      27129   0.514  0.057  0.157  0.171  0.009  0.088
    5      17544   0.549  0.048  0.138  0.143  0.009  0.108
    6      10296   0.555  0.037  0.133  0.155  0.009  0.106
    7      6326    0.596  0.033  0.135  0.113  0.006  0.113
    8      4200    0.637  0.024  0.104  0.127  0.006  0.096
    9      2782    0.663  0.015  0.108  0.113  0.006  0.089
    10     2089    0.662  0.037  0.084  0.120  0.005  0.086

    Pr(next) is not constant: if you have already spent some time in the Web site, you are more likely to spend some more.
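
The figures in this table can be turned into a cumulative share of observations per level, and the trend in Pr(next) can be checked directly. The sketch below uses only the "Obs." and "Next" columns as transcribed above; it assumes that "Obs." counts the observations made at each level, and the shares it prints are relative to levels 0 to 10 only, since deeper levels are not listed on the slide.

```python
# "Obs." and "Next" columns for levels 0..10, copied from the slide.
observations = [247985, 120482, 70911, 42311, 27129, 17544,
                10296, 6326, 4200, 2782, 2089]
pr_next = [0.457, 0.459, 0.462, 0.497, 0.514, 0.549,
           0.555, 0.596, 0.637, 0.663, 0.662]

# Cumulative share of the listed observations found at levels 0..k.
total = sum(observations)
cumulative_share = []
running = 0
for count in observations:
    running += count
    cumulative_share.append(running / total)

for level, (share, p) in enumerate(zip(cumulative_share, pr_next)):
    print(f"level {level:2d}: cumulative share {share:.3f}, Pr(next) {p:.3f}")

# How many level-to-level steps show Pr(next) staying equal or increasing.
non_decreasing_steps = sum(1 for a, b in zip(pr_next, pr_next[1:]) if b >= a)
print(f"{non_decreasing_steps} of {len(pr_next) - 1} steps are non-decreasing")
```

Pr(next) rises at almost every step, which is the behavior a constant-probability model cannot reproduce.
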
  • 37. PageRank and depth. Cumulative PageRank by levels in the Chilean Web.
  • 38. PageRank and depth. Correlation of PageRank and depth is low at deeper levels.
  • 39. Summary. 90% of the visits are 4-5 clicks away from the home page, except in blogs. Simple models try to explain this behavior. In the paper: explicit methodology, closed-form solutions to the models, references.
  • 42. Open problems. A model which better fits empirical data. Analyzing blogs. Analyzing the textual content of pages to decide when to stop. Relationship of this with the spam detection problem. Try adaptive strategies: which are the factors that affect the desired crawling depth in a Web site? There are other ways of defining which pages to download from an infinite set.
  • 48. Questions and comments…