Crawling the Infinite Web (WAW 2004 Rome)
1. Outline Introduction Models Experiments Summary
Crawling the Infinite Web:
Five Levels are Enough
Ricardo Baeza-Yates and Carlos Castillo
Center for Web Research
www.cwr.cl
WAW 2004
R. Baeza-Yates and C. Castillo Center for Web Research
Crawling the Infinite Web
1 Introduction
2 Models
3 Experiments
4 Summary
Introduction
Dynamic page: “a page which is created on request”
Dynamic pages with links to other dynamic pages
Malicious: loops and/or near-duplicates
Legitimate: recommendation systems, calendars, iterative
algorithms, etc.
The number of pages on the Web can be considered infinite
Conflicting interests
Web site administrator: would like to have all of the Web
site indexed
Search engine administrator: would like to use the available
network and storage capacity efficiently
Search engine user: would like to find what he is looking for
Our approach
Users do not go very deep inside Web sites
If something is important it has to be easily reachable
We will download only a few levels of each Web site
How many levels?
How much do you lose?
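A minimal sketch of what "download only a few levels" could look like, assuming a breadth-first crawl from the home page. The in-memory `links` dictionary, the function names, and the toy calendar site are hypothetical stand-ins for real page fetching; they are not taken from the talk.

```python
from collections import deque

def crawl_to_depth(links, start, max_depth):
    """Breadth-first crawl of an in-memory link graph, limited to max_depth levels.

    links: dict mapping a page to the pages it links to (hypothetical
           stand-in for fetching and parsing a page).
    Returns a dict page -> depth, with depth 0 for the start page.
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] >= max_depth:
            continue  # pages at the last allowed level are kept but not expanded
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

def calendar_links(n_days):
    """Toy site with a long calendar: every day page links to the next day."""
    links = {"home": ["about", "day0"]}
    for d in range(n_days):
        links[f"day{d}"] = [f"day{d + 1}"]
    return links

site = calendar_links(1000)
crawled = crawl_to_depth(site, "home", max_depth=4)
print(sorted(crawled.items(), key=lambda kv: kv[1]))
```

The depth limit bounds the crawl regardless of how long the calendar chain is, which is the point of the approach; how to choose `max_depth` is the question the rest of the talk addresses.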
Models
Navigating a tree ≈ Moving through levels
Actions
Possible actions at a given level
Type of models we study
There is a set of atomic actions
A = {next, start/jump, back, stay, prev, fwd}
$\Pr(\mathrm{action} \mid \ell)$ is the probability of taking an action at level $\ell$
$\sum_{\mathrm{action} \in A} \Pr(\mathrm{action} \mid \ell) = 1$
The probability $\Pr(\mathrm{next} \mid \ell)$ is constant
Stationary distribution → how much time users spend at each
level
Model A
Forwards and backwards one level at a time
Birth and death process
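A sketch of why a birth-and-death chain yields geometrically decaying level probabilities. The notation is an assumption made here, not the slides': $q$ is the constant probability of "next", the only other action is going back one level with probability $1-q$ (staying put at level 0), and $x_i$ is the stationary probability of level $i$. The closed form in the paper may account for details omitted in this simplification.

```latex
\begin{align*}
x_i \, q &= x_{i+1}\,(1-q)
  && \text{(flow balance between levels $i$ and $i+1$)} \\
x_i &= x_0 \left(\frac{q}{1-q}\right)^{i},
  \qquad x_0 = 1 - \frac{q}{1-q} = \frac{1-2q}{1-q}
  && \text{(requires $q < 1/2$)}
\end{align*}
```

Under these assumptions a stationary distribution exists only when users are more likely to go back than forward.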
Model B
Back to first level
Birth and death process with extinction
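A worked sketch for this model under simplifying assumptions introduced here: from every level the user either goes to the next level with constant probability $q$ or returns to level 0 with probability $1-q$, and $x_i$ denotes the stationary probability of level $i$. The paper's own solution may differ in details.

```latex
\begin{align*}
x_0 &= (1-q) \sum_{j \ge 0} x_j = 1 - q, \qquad
x_i = q \, x_{i-1} \;\Rightarrow\; x_i = (1-q)\, q^{i} \\
\sum_{i=0}^{k} x_i &= 1 - q^{k+1}
  \qquad \text{e.g. } q = 0.6,\ k = 4:\quad 1 - 0.6^{5} \approx 0.922
\end{align*}
```

With a forward probability around 0.6, levels 0 to 4 would already account for about 92% of the time spent under this simplified model.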
Model C
Back to any previous level
Birth and death process with extinction and disaster?
Cumulative probability of levels 0, ..., k
Based on solutions given in the paper
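The cumulative probabilities can also be approximated numerically, without the closed-form solutions. The sketch below is an illustration under assumed, simplified dynamics (only "next" versus one kind of "back" per model, with `q` standing for the constant probability of "next"); the function names and the truncation to 60 levels are choices made here, not the paper's method.

```python
def step(x, q, model):
    """One step of the level chain, truncated to len(x) levels.

    Assumed dynamics (a simplification of the three models):
      A: next with prob. q, else back one level (stay put at level 0)
      B: next with prob. q, else back to level 0
      C: next with prob. q, else jump uniformly to one of levels 0..i
    At the deepest truncated level, "next" is treated as staying in place.
    """
    n = len(x)
    y = [0.0] * n
    for i, mass in enumerate(x):
        y[min(i + 1, n - 1)] += q * mass
        back = (1.0 - q) * mass
        if model == "A":
            y[max(i - 1, 0)] += back
        elif model == "B":
            y[0] += back
        elif model == "C":
            share = back / (i + 1)
            for j in range(i + 1):
                y[j] += share
        else:
            raise ValueError(f"unknown model {model!r}")
    return y

def stationary(q, model, n_levels=60, iterations=2000):
    """Approximate the stationary distribution by repeated stepping."""
    x = [1.0] + [0.0] * (n_levels - 1)  # every session starts at level 0
    for _ in range(iterations):
        x = step(x, q, model)
    return x

def cumulative(x, k):
    """Probability mass in levels 0..k."""
    return sum(x[: k + 1])

for model, q in [("A", 0.4), ("B", 0.6), ("C", 0.6)]:
    x = stationary(q, model)
    print(model, q, [round(cumulative(x, k), 3) for k in range(8)])
```

Repeated stepping is used instead of a linear solver to keep the example dependency-free; the values of `q` are illustrative, with `q < 0.5` chosen for model A so that the chain settles.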
Experiments
Anonymized access logs for 13 Web sites
Educational - Commercial - Reference - Organization - Blogs
Analysis of access logs to extract ≈ 250,000 user sessions
Distribution of visits per level
Model fitting
Code  Type                 Country  Model  q     Error
E1    Educational          Chile    B      0.51  0.88%
E2    Educational          Spain    B      0.51  2.29%
E3    Educational          US       B      0.64  0.72%
C1    Commercial           Chile    B      0.55  0.39%
C2    Commercial           Chile    B      0.62  5.17%
R1    Reference            Chile    B      0.54  2.96%
R2    Reference            Chile    B      0.59  2.75%
O1    Organization         Italy    C      0.35  2.27%
O2    Organization         US       B      0.62  2.31%
OB1   Organization + Blog  Chile    B      0.65  2.07%
OB2   Organization + Blog  Chile    B      0.72  0.35%
B1    Blog                 Chile    C      0.79  0.88%
B2    Blog                 Chile    C      0.63  1.01%
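To get a feel for what the fitted values imply, the sketch below estimates how many levels would cover 90% of the visits for the sites fitted with Model B. It rests on assumptions made here: that the q column is the constant probability of "next", and that Model B reduces to a cumulative probability of 1 - q^n for the first n levels. The function name and the 90% target are illustrative.

```python
# Fitted q for the sites that the table assigns to Model B.
fitted_q = {"E1": 0.51, "E2": 0.51, "E3": 0.64, "C1": 0.55, "C2": 0.62,
            "R1": 0.54, "R2": 0.59, "O2": 0.62, "OB1": 0.65, "OB2": 0.72}

def levels_needed(q, coverage=0.90):
    """Smallest number of levels n (levels 0..n-1) with 1 - q**n >= coverage.

    Uses the assumed simplified Model B, where level i has probability
    (1 - q) * q**i.
    """
    n = 1
    while 1.0 - q ** n < coverage:
        n += 1
    return n

for code, q in fitted_q.items():
    print(code, q, levels_needed(q))
```

The sites fitted with Model C are left out because this closed form does not apply to them.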
Observed distribution of transitions
Level  Obs.    Next   Start  Jump   Back   Stay   Prev
0      247985  0.457  –      0.527  –      0.008  –
1      120482  0.459  –      0.332  0.185  0.017  –
2      70911   0.462  0.111  0.235  0.171  0.014  –
3      42311   0.497  0.065  0.186  0.159  0.017  0.069
4      27129   0.514  0.057  0.157  0.171  0.009  0.088
5      17544   0.549  0.048  0.138  0.143  0.009  0.108
6      10296   0.555  0.037  0.133  0.155  0.009  0.106
7      6326    0.596  0.033  0.135  0.113  0.006  0.113
8      4200    0.637  0.024  0.104  0.127  0.006  0.096
9      2782    0.663  0.015  0.108  0.113  0.006  0.089
10     2089    0.662  0.037  0.084  0.120  0.005  0.086
Pr(next) is not constant: users who have already spent some time in a
Web site are more likely to keep going
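A small sketch that reads two things off the table of observed transitions: how Pr(next) changes with the level, and what share of the listed observations falls within the first levels. The shares are relative to levels 0 to 10 only, since deeper levels are not listed, so they slightly overstate the true shares.

```python
# (level, observations, Pr(next)) copied from the table of observed transitions.
rows = [(0, 247985, 0.457), (1, 120482, 0.459), (2, 70911, 0.462),
        (3, 42311, 0.497), (4, 27129, 0.514), (5, 17544, 0.549),
        (6, 10296, 0.555), (7, 6326, 0.596), (8, 4200, 0.637),
        (9, 2782, 0.663), (10, 2089, 0.662)]

total = sum(obs for _, obs, _ in rows)  # only levels 0..10 are listed

running = 0
for level, obs, p_next in rows:
    running += obs
    print(f"level {level:2d}  Pr(next)={p_next:.3f}  "
          f"cumulative share={running / total:.3f}")
```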
Pagerank and depth
Cumulative Pagerank by levels in the Chilean Web
Correlation of Pagerank and depth is low at deeper levels
Summary
90% of the visits are at most 4-5 clicks away from the home page,
except in blogs
Simple models try to explain this behavior
In the paper: explicit methodology, closed-form solutions to the
models, references
Open problems
A model which better fits empirical data
Analyzing blogs
Analyzing the textual content of pages to decide when to stop
Relationship of this with the spam detection problem
Try adaptive strategies: what are the factors that affect the
desired crawling depth in a Web site?
There are other ways of defining which pages to download
from an infinite set
Questions and comments . . .