Outline
Intro
PageRank
The early days of Internet search engines
Term Spam
The structure of the Web
Avoiding dead ends and spider traps
The solution: teleport
Link Spam. Spam Farm

Intro
One of the most revolutionary technologies, one that has changed
our lives forever, is without doubt the Internet, and with it
search engines such as Google. Google was not the first search
engine, but it was the first able to fight effectively against
the spammers who try to subvert the Web's original purpose as a
source of information. We will discuss Google's biggest
innovation, PageRank.
The battle between those who want to make the Web more useful and
those who want to manipulate it for their own benefit seems never
to end. We will look at ways of defeating and manipulating
PageRank by building small webs of pages, known as link spam.

Early search engines
They maintained an inverted index to quickly find every
place where each word occurs.
When a query (a list of terms) arrived, the index was used
to select the pages that contain the search terms.
There were several strategies for ranking the results, such
as the number of times a term appears in the document.
Other ideas were added as well, such as giving more weight
to a term that appears in the page header, in bold, or with
other such syntactic features.
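
The index-then-rank scheme described on this slide can be sketched in a few lines. This is a minimal illustration, not how any real search engine was built: the function names, the whitespace tokenizer, and the three toy documents are all assumptions made for the example. It also shows why counting term occurrences is easy to game.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index

def search(index, query):
    """Return ids of documents containing every query term, best match first."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(t, {}) for t in terms]
    # Filter: keep only the documents that contain all the query terms.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)

    def score(doc_id):
        # Rank: more occurrences of the query terms means a higher position.
        return sum(p[doc_id] for p in postings)

    return sorted(candidates, key=lambda d: (-score(d), d))

# Hypothetical toy collection; "shoes" repeats an unrelated term many times.
docs = {
    "reviews": "movie review of a classic movie",
    "shoes": "sneakers for sale movie movie movie movie movie",
    "news": "local news and weather",
}
index = build_inverted_index(docs)
ranking = search(index, "movie")
```

With this scoring, the page that merely repeats the term outranks the page that is actually about movies, which is the weakness exploited by term spam.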

Term Spam
Unethical people saw an opportunity to fool search engines
into sending visitors to their own pages.
Imagine adding a term such as "movie" to your page many
times, thousands of times. The search engine will then
believe that your page is about cinema and that it must be
very important, since it talks about almost nothing else.
When a visitor asks the search engine about some movie, you
show up on the first page.
It looks simple: you set the text color of the movie-related
terms to match the background, so that visitors do not see
them.
But you sell sneakers; you know nothing about movie
reviews.

PageRank
To fight term spam, Google introduced 2 innovations:
1 The content to be indexed is taken not from the pages
themselves but from the hypertext of the links that point to them.
2 PageRank: a relevance ranking of pages that simulates the
behavior of Web surfers.
Random surf: if you start browsing the Internet from any
page and keep following a randomly chosen out-link, which
page do you visit most often?

Ranking nodes in the graph
Web pages are not all equally important.

Links as votes
Idea: links as votes
A page is more important if it has more links.
Incoming links?
Outgoing links?
Think of in-links as votes:
www.stanford.edu has 23,400 in-links
www.joe-schmoe.com has 1 in-link
Are all in-links equally important?
Links from important pages should count more
A recursive question!

Recursive formulation
Each link's vote is proportional to the importance of its
source page.
If a page p with importance x has n out-links, each link
carries x/n votes.
The importance of page p is the sum of the votes on its
in-links.
Jure Leskovec, Stanford CS246: Mining Massive Datasets
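
The recursive definition on this slide can be written as one equation per page. The symbols are notation assumed here for illustration, not taken from the slide: r_j is the importance of page j, d_i is the number of out-links of page i, and the sum runs over every page i that links to j.

```latex
r_j = \sum_{i \to j} \frac{r_i}{d_i}
```

Each term r_i / d_i is the x/n vote carried by one in-link, so the sum is the total of the votes on the in-links of page j.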

A simple Web
Think of the Web as a directed graph, where pages are the nodes, and there
is an arc from page p1 to page p2 if there are one or more links from p1 to p2.
Fig. 5.1 is an example of a tiny version of the Web, where there are only four
pages. Page A has links to each of the other three pages; page B has links to
A and D only; page C has a link only to A, and page D has links to B and C
only.
Figure 5.1: A hypothetical example of the Web (nodes A, B, C, D)
Suppose a random surfer starts at page A in Fig. 5.1. There are links to B,
C, and D, so this surfer will next be at each of those pages with probability
1/3, and has zero probability of being at A. A random surfer at B has, at the
next step, probability 1/2 of being at A, 1/2 of being at D, and 0 of being at
B or C.

Transition matrix
In general, we can define the transition matrix of the Web to describe what
happens to random surfers after one step. This matrix M has n rows and
columns, if there are n pages. The element m_ij in row i and column j has
value 1/k if page j has k arcs out, and one of them is to page i. Otherwise,
m_ij = 0.
Example 5.1: The transition matrix for the Web of Fig. 5.1 is
\[
M = \begin{bmatrix}
0 & 1/2 & 1 & 0 \\
1/3 & 0 & 0 & 1/2 \\
1/3 & 0 & 0 & 1/2 \\
1/3 & 1/2 & 0 & 0
\end{bmatrix}
\]
In this matrix, the order of the pages is the natural one, A, B, C, and D.
The first column expresses the fact, already discussed, that a surfer at A has
a 1/3 probability of next being at each of the other pages. The second column
expresses the fact that a surfer at B has a 1/2 probability of being next at A
and the same of being at D. The third column says a surfer at C is certain to
be at A next. The last column says a surfer at D has a 1/2 probability of being
next at B and the same at C.
M is a column-stochastic matrix.
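
As a hedged illustration of the 1/k rule, the sketch below builds the matrix from an adjacency list. The function name `transition_matrix` and the dictionary encoding of the graph are assumptions for this example, and each target list is assumed to contain no duplicates. Exact fractions are used so the result can be compared directly with the matrix of Example 5.1.

```python
from fractions import Fraction

def transition_matrix(links, pages):
    """Column-stochastic matrix: entry [i][j] is 1/k if page j has k
    out-links and one of them goes to page i, otherwise 0."""
    pos = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for src, targets in links.items():
        for dst in targets:
            # Column = source page, row = destination page.
            M[pos[dst]][pos[src]] = Fraction(1, len(targets))
    return M

# Link structure of Fig. 5.1.
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
pages = ["A", "B", "C", "D"]
M = transition_matrix(links, pages)

# Every column should sum to 1 because no page is a dead end.
column_sums = [sum(M[i][j] for i in range(4)) for j in range(4)]
```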

PageRank
The probability distribution for the location of the random
surfer can be described by a vector whose j-th component is the
probability that the surfer is at page j. This probability is
(ideally) the PageRank function.
The random surfer starts the session at any page with equal
probability.
Hence the initial vector v0 has the value 1/n in every
component.
If M is the transition matrix of the Web, then after one
step the distribution of the surfer is Mv0. After 2 steps
it is M(Mv0) = M^2 v0, and so on.
In general, multiplying the vector v0 by M a total of i
times gives the distribution of the random surfer after i
steps.
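
A small sketch of this repeated multiplication, using the matrix of Fig. 5.1. The helper names `mat_vec` and `distribution_after` are made up for the example, and exact fractions are used only to keep the arithmetic readable.

```python
from fractions import Fraction as F

def mat_vec(M, v):
    """Multiply matrix M (a list of rows) by the column vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def distribution_after(M, steps):
    """Surfer distribution after `steps` jumps, starting from the uniform v0."""
    n = len(M)
    v = [F(1, n)] * n          # v0: every page equally likely
    for _ in range(steps):
        v = mat_vec(M, v)      # one more jump: v <- M v
    return v

# Transition matrix of Fig. 5.1, pages in the order A, B, C, D.
M = [
    [F(0),    F(1, 2), F(1), F(0)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
]
v1 = distribution_after(M, 1)   # M v0
v2 = distribution_after(M, 2)   # M^2 v0
```

Because M is column-stochastic, the components of every vector in the sequence still sum to 1.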

Computing PageRank
We reach the limit when multiplying by M one more time no longer
changes the distribution.
A solution multiplied by any fixed constant c is another solution to the same
equation. When we include the constraint that the sum of the components is 1,
as we have done, then we get a unique solution.
For the Web of Fig. 5.1, the sequence of approximations to the limit that we
get by multiplying at each step by M is:
\[
\begin{bmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{bmatrix},\;
\begin{bmatrix} 9/24 \\ 5/24 \\ 5/24 \\ 5/24 \end{bmatrix},\;
\begin{bmatrix} 15/48 \\ 11/48 \\ 11/48 \\ 11/48 \end{bmatrix},\;
\begin{bmatrix} 11/32 \\ 7/32 \\ 7/32 \\ 7/32 \end{bmatrix},\;
\cdots,\;
\begin{bmatrix} 3/9 \\ 2/9 \\ 2/9 \\ 2/9 \end{bmatrix}
\]
Notice that in this example, the probabilities for B, C, and D remain equal
to one another. B and C must always have the same value at every iteration,
because their rows in M are identical.
In practice, for the Web, some 50-75 iterations are enough to compute the
limit to within double-precision error.
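
The stopping rule "multiply again until the distribution no longer changes" might be coded as below. This is a sketch under assumptions: the function name, the L1 measure of change, and the tolerance of 1e-12 are choices made for the example, not values from the slides.

```python
def pagerank_power_iteration(M, tol=1e-12, max_iter=200):
    """Repeat v <- M v from the uniform vector until the L1 change is below
    tol. Returns the final vector and the number of multiplications done."""
    n = len(M)
    v = [1.0 / n] * n
    for iteration in range(1, max_iter + 1):
        new_v = [sum(m * x for m, x in zip(row, v)) for row in M]
        change = sum(abs(a - b) for a, b in zip(new_v, v))
        v = new_v
        if change < tol:
            return v, iteration
    return v, max_iter

# Transition matrix of Fig. 5.1, pages in the order A, B, C, D.
M = [
    [0.0, 1/2, 1.0, 0.0],
    [1/3, 0.0, 0.0, 1/2],
    [1/3, 0.0, 0.0, 1/2],
    [1/3, 1/2, 0.0, 0.0],
]
v, iterations = pagerank_power_iteration(M)
```

On this four-page graph the vector approaches (3/9, 2/9, 2/9, 2/9), the limit shown on the slide, well before the iteration cap is reached.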

The structure of the Web
Figure 5.2: The "bowtie" picture of the Web (regions: strongly connected component (SCC), in-component, out-component, tendrils in and out, tubes, disconnected components)
1. The in-component, consisting of pages that could reach the SCC by following
links, but were not reachable from the SCC.
2. The out-component, consisting of pages reachable from the SCC but unable
to reach the SCC.
3. Tendrils, which are of two types. Some tendrils consist of pages reachable
from the in-component but not able to reach the in-component. The
other tendrils can reach the out-component, but are not reachable from
the out-component.
In addition, there were small numbers of pages found either in tubes or in
disconnected components.
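
To make the definitions of the bowtie regions concrete, here is a rough sketch that classifies the nodes of a small graph relative to the strongly connected component containing a chosen seed page. The function names, the `seed` parameter, and the toy graph are hypothetical. Two breadth-first searches are used, one along the arcs and one against them; tendrils, tubes, and disconnected pages are lumped together as "other".

```python
from collections import deque

def reachable(graph, start):
    """Set of nodes reachable from any node in `start` by following arcs."""
    seen = set(start)
    queue = deque(start)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bowtie_regions(graph, seed):
    """Split nodes into SCC, in-component, out-component and the rest,
    relative to the strongly connected component that contains `seed`."""
    nodes = set(graph) | {t for ts in graph.values() for t in ts}
    reverse = {n: [] for n in nodes}
    for src, targets in graph.items():
        for dst in targets:
            reverse[dst].append(src)
    forward = reachable(graph, [seed])      # pages the seed can reach
    backward = reachable(reverse, [seed])   # pages that can reach the seed
    scc = forward & backward
    return {
        "scc": scc,
        "in": backward - scc,
        "out": forward - scc,
        "other": nodes - forward - backward,  # tendrils, tubes, disconnected
    }

# Hypothetical toy graph: s1 and s2 link to each other and form the SCC.
graph = {
    "i": ["s1", "t"],   # in-component page; "t" hangs off it as a tendril
    "s1": ["s2"],
    "s2": ["s1", "o"],
    "o": [],            # out-component page
    "x": ["y"],         # disconnected component
}
regions = bowtie_regions(graph, "s1")
```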

PageRank: 3 questions
r = Mr
Does this converge?
Does it converge to what we want?
Are the results reasonable?

Dead ends
Recall that a page with no link out is called a dead end. If we allow dead
ends, the transition matrix of the Web is no longer stochastic, since some of
the columns will sum to 0 rather than 1. A matrix whose column sums are at
most 1 is called substochastic. If we compute M^i v for increasing powers of a
substochastic matrix M, then some or all of the components of the vector go
to 0. That is, importance "drains out" of the Web, and we get no information
about the relative importance of pages.
Example 5.3: In Fig. 5.3 we have modified Fig. 5.1 by removing the arc from
C to A. Thus, C becomes a dead end. In terms of random surfers, when
a surfer reaches C they disappear at the next round. The matrix M that
describes Fig. 5.3 is
\[
M = \begin{bmatrix}
0 & 1/2 & 0 & 0 \\
1/3 & 0 & 0 & 1/2 \\
1/3 & 0 & 0 & 1/2 \\
1/3 & 1/2 & 0 & 0
\end{bmatrix}
\]
Figure 5.3: C is now a dead end (nodes A, B, C, D)
Note that M is substochastic, but not stochastic, because the sum of the third
column, for C, is 0, not 1.
Footnote: spider traps are so called because the programs that crawl the Web,
recording pages and links, are often referred to as "spiders." Once a spider
enters a spider trap, it can never leave.

PageRank with dead ends
C is a dead end.
Here is the sequence of vectors that results from starting with
the vector with each component 1/4, and repeatedly multiplying the vector by M:
\[
\begin{bmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{bmatrix},\;
\begin{bmatrix} 3/24 \\ 5/24 \\ 5/24 \\ 5/24 \end{bmatrix},\;
\begin{bmatrix} 5/48 \\ 7/48 \\ 7/48 \\ 7/48 \end{bmatrix},\;
\begin{bmatrix} 21/288 \\ 31/288 \\ 31/288 \\ 31/288 \end{bmatrix},\;
\cdots,\;
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
\]
As we see, the probability that the random surfer is at any page tends
to 0 as the number of steps increases.
There are two approaches to dealing with dead ends.
One is to drop the dead ends from the graph, and also drop their incoming
arcs. The other is teleporting, the solution these slides develop.
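
The "draining" effect can be checked numerically. The sketch below uses the matrix of Fig. 5.3; `dead_ends` and `iterate` are illustrative names, and exact fractions are used so the totals can be read off directly.

```python
from fractions import Fraction as F

def dead_ends(M):
    """Indices of columns that sum to 0, i.e. pages with no out-links."""
    n = len(M)
    return [j for j in range(n) if sum(M[i][j] for i in range(n)) == 0]

def iterate(M, steps):
    """Apply v <- M v `steps` times, starting from the uniform vector."""
    n = len(M)
    v = [F(1, n)] * n
    for _ in range(steps):
        v = [sum(m * x for m, x in zip(row, v)) for row in M]
    return v

# Transition matrix of Fig. 5.3 (order A, B, C, D); the column for C is zero.
M = [
    [F(0),    F(1, 2), F(0), F(0)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
]
leaks = dead_ends(M)                              # column index of page C
totals = [sum(iterate(M, k)) for k in range(4)]   # probability still in the Web
```

At each step the total drops by exactly the probability that was sitting on C, since surfers at a dead end have nowhere to go.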

Spider traps
A spider trap is a set of nodes with no arcs leading out. These structures can
appear intentionally or unintentionally on the Web, and they cause the
PageRank calculation to place all the PageRank within the spider traps.
Example 5.5: Consider Fig. 5.6, which is Fig. 5.1 with the arc out of C
changed to point to C itself. That change makes C a simple spider trap of one
node. Note that in general spider traps can have many nodes, and as we shall
see in Section 5.4, there are spider traps with millions of nodes that spammers
construct intentionally.
Figure 5.6: A graph with a one-node spider trap (nodes A, B, C, D)
The transition matrix for Fig. 5.6 is
\[
M = \begin{bmatrix}
0 & 1/2 & 0 & 0 \\
1/3 & 0 & 0 & 1/2 \\
1/3 & 0 & 1 & 1/2 \\
1/3 & 1/2 & 0 & 0
\end{bmatrix}
\]

PageRank with spider traps
Figure 5.6: A graph with a one-node spider trap
If we perform the usual iteration to compute the PageRank of the nodes,
we get:
\[
\begin{bmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{bmatrix},\;
\begin{bmatrix} 3/24 \\ 5/24 \\ 11/24 \\ 5/24 \end{bmatrix},\;
\begin{bmatrix} 5/48 \\ 7/48 \\ 29/48 \\ 7/48 \end{bmatrix},\;
\begin{bmatrix} 21/288 \\ 31/288 \\ 205/288 \\ 31/288 \end{bmatrix},\;
\cdots,\;
\begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}
\]
As predicted, all the PageRank ends up at C, since once there a random
surfer can never leave.
To avoid the problem illustrated by Example 5.5, we modify the calculation of
PageRank by allowing each random surfer a small probability of teleporting
to a random page, rather than following an out-link from their current page.

Solution: random teleport
At each step the random surfer has 2 options:
With probability β, follow a random out-link.
With probability 1 − β, jump to a random page.
The most common values for β ...
Teleport
Since the term (1 − β)e/n does not depend on the sum of the components of
the vector v, there will always be some fraction of a surfer operating on the
Web. That is, when there are dead ends, the sum of the components of v may
be less than 1, but it will never reach 0.
Example 5.6: Let us see how the new approach to computing PageRank
fares on the graph of Fig. 5.6. We shall use β = 0.8 in this example. Thus, the
equation for the iteration becomes
\[
v' = \begin{bmatrix}
0 & 2/5 & 0 & 0 \\
4/15 & 0 & 0 & 2/5 \\
4/15 & 0 & 4/5 & 2/5 \\
4/15 & 2/5 & 0 & 0
\end{bmatrix} v +
\begin{bmatrix} 1/20 \\ 1/20 \\ 1/20 \\ 1/20 \end{bmatrix}
\]
Notice that we have incorporated the factor β into M by multiplying each of
its elements by 4/5. The components of the vector (1 − β)e/n are each 1/20,
since 1 − β = 1/5 and n = 4. Here are the first few iterations:
\[
\begin{bmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{bmatrix},\;
\begin{bmatrix} 9/60 \\ 13/60 \\ 25/60 \\ 13/60 \end{bmatrix},\;
\begin{bmatrix} 41/300 \\ 53/300 \\ 153/300 \\ 53/300 \end{bmatrix},\;
\begin{bmatrix} 543/4500 \\ 707/4500 \\ 2543/4500 \\ 707/4500 \end{bmatrix},\;
\cdots,\;
\begin{bmatrix} 15/148 \\ 19/148 \\ 95/148 \\ 19/148 \end{bmatrix}
\]
By being a spider trap, C has managed to get more than half of the PageRank
for itself. However, the effect has been limited, and each of the nodes gets some
of the PageRank.
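
The iteration of Example 5.6 can be reproduced with a short sketch. The function name `teleport_step` is an assumption for this example; the matrix is that of Fig. 5.6 and β = 4/5 as in the text. Exact fractions keep the first iterate comparable with the slide.

```python
from fractions import Fraction as F

def teleport_step(M, v, beta):
    """One iteration of v' = beta * M v + (1 - beta) * e / n."""
    n = len(M)
    jump = (1 - beta) / n      # mass every page receives from teleports
    return [beta * sum(m * x for m, x in zip(row, v)) + jump for row in M]

# Transition matrix of Fig. 5.6 (order A, B, C, D); C links only to itself.
M = [
    [F(0),    F(1, 2), F(0), F(0)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(0),    F(1), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
]
beta = F(4, 5)
v = [F(1, 4)] * 4
first = teleport_step(M, v, beta)      # the first iterate after v0
for _ in range(100):                   # iterate long enough to settle
    v = teleport_step(M, v, beta)
approx = [float(x) for x in v]
```

After enough iterations the vector settles near (15/148, 19/148, 95/148, 19/148): C still gets the largest share, but no longer all of it.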

Why does teleport solve the problem?
Markov chains
r(t+1) = M r(t)
Set of states X
Transition matrix P, where P_ij = P(X_t = i | X_(t-1) = j)
π specifies the probability of being in each state x ∈ X
The goal is to find π such that π = Pπ

Markov chains
For any initial vector, iterating with the transition matrix P
converges to a unique stationary (or equilibrium) distribution
when P is stochastic, irreducible, and aperiodic.
Let us turn M into a P that is stochastic, irreducible, and aperiodic.
Stochastic: every column sums to 1.
Irreducible: every state has a non-zero probability of
moving to any other state.
Aperiodic: a chain is periodic if there exists a k > 1 such that
the intervals between visits to some state s are always a
multiple of k; an aperiodic chain has no such k.
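
One way to see why teleporting yields such a P is to fold the random jump into the matrix itself, A = βM + (1 − β)/n in every entry, and check the properties directly. This is a sketch under assumptions: the function names are made up, M is assumed to have no dead ends, and "all entries strictly positive" is used as a sufficient (not necessary) condition for irreducibility and aperiodicity.

```python
def teleport_matrix(M, beta):
    """A = beta * M + (1 - beta) / n added to every entry."""
    n = len(M)
    return [[beta * M[i][j] + (1 - beta) / n for j in range(n)]
            for i in range(n)]

def is_column_stochastic(A, eps=1e-12):
    """True if every column of A sums to 1 (up to rounding)."""
    n = len(A)
    return all(abs(sum(A[i][j] for i in range(n)) - 1.0) < eps
               for j in range(n))

def all_positive(A):
    """True if every entry is strictly positive, so any state can be reached
    from any other state, including itself, in a single step."""
    return all(x > 0 for row in A for x in row)

# Fig. 5.6: C is a one-node spider trap, so M alone is not irreducible.
M = [
    [0.0, 1/2, 0.0, 0.0],
    [1/3, 0.0, 0.0, 1/2],
    [1/3, 0.0, 1.0, 1/2],
    [1/3, 1/2, 0.0, 0.0],
]
A = teleport_matrix(M, beta=0.8)
```

Since every entry of A is positive, the surfer can leave C in one step, and can return to any state after a single step, which rules out any period k > 1.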

Topic-Specific PageRank
One Web, many webs? One PageRank, many PageRanks?
Goal: evaluate the relevance of a page with respect to a
particular topic, e.g. sports or history.
This lets the user run queries according to different
interests.
Example: the query "Jaguar" could produce different
rankings depending on whether the user is interested in
animals or in cars.

Topics in Big Data - Link Analysis