Hashing

Ing. Juan Ignacio Zamora M., M.Sc.
School of Engineering
Bachelor's Degree in Computer Engineering with Emphasis in Software Development
Universidad Latinoamericana de Ciencia y Tecnología

What Is Hashing?
•  Hashing is a programming concept that refers to the
addressing performed from a value (the key) to a
position in a data structure of static or dynamic
composition.

Hashing – Direct Address
•  A hashing technique used when the universe U of keys is small.
•  For now, we assume that the elements of the universe are distinct.
•  A direct-address table is denoted T[0..m-1], where each position is a
slot k that points to the element whose key is k.
•  If k is not in the table, T[k] = NIL.
•  Operations → each takes O(1) time
  •  Search(T, k) { return T[k] }
  •  Insert(T, x) { T[x.key] = x }
  •  Delete(T, x) { T[x.key] = NIL }
11.1 Direct-address tables
Direct addressing is a simple technique that works well when the universe U of keys is reasonably small. Suppose that an application needs a dynamic set in which each element has a key drawn from the universe U = {0, 1, ..., m-1}, where m is not too large. We shall assume that no two elements have the same key.
To represent the dynamic set, we use an array, or direct-address table, denoted by T[0..m-1], in which each position, or slot, corresponds to a key in the universe U. Figure 11.1 illustrates the approach; slot k points to an element in the set with key k. If the set contains no element with key k, then T[k] = NIL.
The dictionary operations are trivial to implement:
DIRECT-ADDRESS-SEARCH(T, k)
    return T[k]
DIRECT-ADDRESS-INSERT(T, x)
    T[x.key] = x
DIRECT-ADDRESS-DELETE(T, x)
    T[x.key] = NIL
Each of these operations takes only O(1) time.
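[Figure 11.1: a direct-address table T. Each key in the universe U = {0, 1, ..., 9} corresponds to an index of T; the actual keys K = {2, 3, 5, 8} determine the slots that point to elements (key plus satellite data).]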
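The three direct-address operations can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the names `Element` and `DirectAddressTable` are made up for the example, and `None` plays the role of NIL.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Element:
    key: int          # key drawn from the universe {0, ..., m-1}
    data: Any = None  # satellite data


class DirectAddressTable:
    """Hypothetical direct-address table: one slot per possible key."""

    def __init__(self, m: int) -> None:
        self.slots: List[Optional[Element]] = [None] * m  # None stands for NIL

    def search(self, k: int) -> Optional[Element]:
        return self.slots[k]          # O(1)

    def insert(self, x: Element) -> None:
        self.slots[x.key] = x         # O(1)

    def delete(self, x: Element) -> None:
        self.slots[x.key] = None      # O(1)


table = DirectAddressTable(10)
for key in (2, 3, 5, 8):
    table.insert(Element(key, f"data-{key}"))
print(table.search(5))  # the element stored under key 5
print(table.search(4))  # None: no element has key 4
```

Every operation is a single array access, which is why the table needs one slot for each key in the universe.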

Hashing - Hash Table
•  Unlike direct addressing, hashing finds the slot for a given key
through a hash function.
•  It is used when the universe of keys is very large and the table
cannot be as large as the universe.
•  m denotes the size of the hash table.
•  This addressing technique introduces the concept of a
collision.
11.2 Hash tables
The downside of direct addressing is obvious: if the universe U is large, storing a table T of size |U| may be impractical, or even impossible, given the memory available on a typical computer. Furthermore, the set K of keys actually stored may be so small relative to U that most of the space allocated for T would be wasted.
When the set K of keys stored in a dictionary is much smaller than the universe U of all possible keys, a hash table requires much less storage than a direct-address table. Specifically, we can reduce the storage requirement to Θ(|K|) while we maintain the benefit that searching for an element in the hash table still requires only O(1) time. The catch is that this bound is for the average-case time, whereas for direct addressing it holds for the worst-case time.
With direct addressing, an element with key k is stored in slot k. With hashing, this element is stored in slot h(k); that is, we use a hash function h to compute the slot from the key k. Here, h maps the universe U of keys into the slots of a hash table T[0..m-1]:
    h : U → {0, 1, ..., m-1},
where the size m of the hash table is typically much less than |U|. We say that an element with key k hashes to slot h(k); we also say that h(k) is the hash value of key k. Figure 11.2 illustrates the basic idea. The hash function reduces the range of array indices and hence the size of the array. Instead of a size of |U|, the array can have size m.
[Figure 11.2: a hash function h maps the actual keys K = {k1, ..., k5} from the universe U to slots 0 to m-1 of the hash table T. Keys k2 and k5 collide because h(k2) = h(k5).]

Collisions
•  A collision occurs when two or more keys hash to the
same slot.
•  Ideally collisions are avoided; however, not
every* implementation achieves this…
•  The goal is therefore a hash function that behaves
randomly enough to spread the keys evenly over the slots
and keep collisions to a minimum. When the universe has
more keys than the table has slots, collisions cannot be
ruled out entirely.
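A short sketch makes collisions concrete. It uses a placeholder hash function, `toy_hash(k, m) = k mod m`, chosen only for this illustration; the function names and the sample keys are assumptions, not part of the slides.

```python
from collections import defaultdict


def toy_hash(k: int, m: int) -> int:
    # Placeholder hash function used only for this illustration.
    return k % m


def find_collisions(keys, m):
    """Group keys by slot and return only the slots that received two or more keys."""
    slots = defaultdict(list)
    for k in keys:
        slots[toy_hash(k, m)].append(k)
    return {slot: ks for slot, ks in slots.items() if len(ks) > 1}


print(find_collisions([10, 22, 31, 4, 15, 28, 17, 88, 59], m=11))
# {0: [22, 88], 4: [4, 15, 59], 6: [28, 17]}
```

Nine keys go into eleven slots and three slots still receive more than one key, so a collision strategy is needed even when the table is not full.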

Randomness
•  Does it really exist?
•  How many keys are there on your keyboard?
  •  26 letters
  •  14 punctuation keys
  •  More for numbers and additional commands
•  Does a coin really land on a given side 50% of the time?
•  Is the bounce of a ball random?
•  Is time a random measure?
•  What is the probability of rolling a 5 in a game of dice? *

Collision Resolution
by Chaining
•  Two or more keys hash to the same slot.
•  Inserted elements are therefore added to a doubly linked
list kept in each slot.
•  Operations
  •  Insert(T, x) { insert x at the head of list T[h(x.key)] }  // O(1)
  •  Search(T, k) { search for key k in list T[h(k)] }  // traverses the linked list
  •  Delete(T, x)  // first Search, then remove x from its list
•  What happens to the asymptotic running times of Search and Delete?
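The chaining operations above might look like this in Python. This is a sketch under stated assumptions: a `collections.deque` stands in for the doubly linked list of each slot, `key % m` is a placeholder hash function, and the class name `ChainedHashTable` is hypothetical.

```python
from collections import deque


class ChainedHashTable:
    def __init__(self, m: int) -> None:
        self.m = m
        # One chain per slot; a deque stands in for the doubly linked list.
        self.slots = [deque() for _ in range(m)]

    def _h(self, key: int) -> int:
        return key % self.m  # placeholder hash function (assumption)

    def insert(self, key: int, value) -> None:
        # Insert at the head of the chain in O(1); no check for an existing key.
        self.slots[self._h(key)].appendleft((key, value))

    def search(self, key: int):
        # Walk the chain of the slot the key hashes to.
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key: int) -> bool:
        # Search first, then remove the matching entry from its chain.
        chain = self.slots[self._h(key)]
        for item in chain:
            if item[0] == key:
                chain.remove(item)
                return True
        return False


t = ChainedHashTable(4)
t.insert(1, "a")
t.insert(5, "b")      # 5 mod 4 = 1, so it shares a chain with key 1
print(t.search(5))    # b
print(t.delete(1))    # True
print(t.search(1))    # None
```

Insert never looks at the chain, while Search and Delete may have to walk all of it, which is why their running time depends on the chain length.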

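Search & Delete
•  We do not know how many elements will end up in each
slot…
•  We therefore assume simple uniform hashing: any given
element is equally likely to hash into any of the m slots,
independently of where the other elements hash. Under this
assumption, the access time of hashing with chaining is
determined by the load factor.
[Figure: collision resolution by chaining. Each slot of T heads a linked list of the keys from K (k1 to k8) that hash to it.]
•  Load factor: α = n / m (n stored elements, m slots), the average chain length; traversing a chain takes O(α).
•  Computing h(k) takes O(1).
•  Search therefore takes Θ(1 + α) on average.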
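The Θ(1 + α) bound can be checked with a small simulation. The sketch below models simple uniform hashing by drawing a uniformly random slot for each key (an assumption of the example) and reports the average number of chain elements examined in a successful search. The function name is hypothetical.

```python
import random


def average_successful_search_cost(n: int, m: int, seed: int = 0) -> float:
    """Average number of chain elements examined when searching for each of n stored keys."""
    rng = random.Random(seed)
    chain_lengths = [0] * m
    for _ in range(n):
        chain_lengths[rng.randrange(m)] += 1  # every slot is equally likely
    # In a chain of length L the L searches examine 1, 2, ..., L elements: L(L+1)/2 in total.
    total = sum(length * (length + 1) // 2 for length in chain_lengths)
    return total / n


n, m = 10_000, 1_000
alpha = n / m
print(alpha, average_successful_search_cost(n, m))
```

With α = 10 the measured cost should come out close to 1 + α/2, which grows linearly in α as the bound predicts.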

Where the Magic Is:
The Hash Function [h(k)]
Division
Multiplication
Universal Hashing
Perfect Hashing

Method: Division
•  Most hash functions assume that the universe of keys is the set of
natural numbers N = {0, 1, 2, …}.
•  A key that is not a natural number, such as a letter, is therefore first interpreted as one.
For example, letters can be indexed by their numeric ASCII codes.
•  The division-method hash function h(k) is defined as
h(k) = k mod m
•  For example, if the table has size m = 12 and the key is k = 100, then
100 mod 12 = 4, so the value is stored in slot 4.
•  Certain values of m should be avoided. In particular, m should not be a
power of 2: if m = 2^p, then h(k) is just the p lowest-order bits of
k, which tends to increase the number of collisions.
•  Recommendation: a prime number not too close to an exact power of 2.
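A minimal sketch of the division method follows. The helper `string_to_key`, which reads an ASCII string as a radix-128 number, and the table size 701 are illustrative assumptions, not part of the slides.

```python
def hash_division(k: int, m: int) -> int:
    """Division method: h(k) = k mod m."""
    return k % m


def string_to_key(s: str) -> int:
    """Interpret an ASCII string as a natural number written in radix 128."""
    key = 0
    for ch in s:
        key = key * 128 + ord(ch)
    return key


print(hash_division(100, 12))                   # 4
print(string_to_key("pt"))                      # 14452 = 112 * 128 + 116
print(hash_division(string_to_key("pt"), 701))  # 432

# With m = 2**4 the hash is just the 4 lowest-order bits of k.
print(hash_division(0b10110111, 16) == 0b0111)  # True
```

The last line shows why a power of 2 is a poor choice here: the high-order bits of the key have no influence on the slot.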

Method: Multiplication
•  First, multiply k by a constant A,
where 0 < A < 1.
•  Then take the fractional part of kA and
multiply it by m.
•  Finally, take the floor of the result.
•  Unlike the division method, here m is typically chosen
to be a power of 2.
h(k) = ⌊m (kA mod 1)⌋
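The steps of the multiplication method can be sketched as below. The constant A = (√5 − 1)/2 ≈ 0.618 is a commonly suggested value and is an assumption of this example, since the slides only require 0 < A < 1. Floating-point arithmetic is used for brevity.

```python
import math

# A commonly suggested constant (assumption, not from the slides).
A = (math.sqrt(5) - 1) / 2


def hash_multiplication(k: int, m: int, a: float = A) -> int:
    """Multiplication method: h(k) = floor(m * (k*a mod 1))."""
    fractional = (k * a) % 1.0         # fractional part of k*A
    return math.floor(m * fractional)  # scale to [0, m) and take the floor


print(hash_multiplication(123456, 2**14))  # 67
```

Because only the fractional part of kA is used, the result always lies in 0..m-1, whatever value of m is chosen.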

Universal Hashing
Random selection of hash functions

Universal Hashing
•  The hash function is chosen at random from a finite family of
hash functions, independently of the keys that will be stored.
•  Let H be a finite collection of hash functions that map a
universe U of keys into the range {0, 1, …, m-1}.
•  The collection H is said to be universal if, for each pair of distinct keys k and l,
the probability that h(k) = h(l), for h chosen at random from H, is at most 1/m.
•  The idea behind universal hashing is to prevent a malicious process or person
from forcing collisions onto a specific slot.
•  The universal class is the family that contains the functions; one of them is
selected at random and then used for every key…

Universal Class
•  Choose a prime number p large enough
that every key k lies in the range
[0..p-1].
•  Since the number of keys in the universe is assumed to be
greater than the number of slots,
p > m.
•  The hash function is then defined as follows, where
a belongs to {1, 2, …, p-1} and b
belongs to {0, 1, …, p-1}:
h_ab(k) = ((a·k + b) mod p) mod m
Designing a universal class of hash functions
It is quite easy to design a universal class of hash functions, as a little number theory will help us prove. You may wish to consult Chapter 31 first if you are unfamiliar with number theory.
We begin by choosing a prime number p large enough so that every possible key k is in the range 0 to p-1, inclusive. Let Z_p denote the set {0, 1, ..., p-1}, and let Z_p* denote the set {1, 2, ..., p-1}. Since p is prime, we can solve equations modulo p with the methods given in Chapter 31. Because we assume that the size of the universe of keys is greater than the number of slots in the hash table, we have p > m.
We now define the hash function h_ab for any a ∈ Z_p* and any b ∈ Z_p using a linear transformation followed by reductions modulo p and then modulo m:
    h_ab(k) = ((a·k + b) mod p) mod m.    (11.3)
For example, with p = 17 and m = 6, we have h_{3,4}(8) = 5. The family of all such hash functions is
    H_pm = { h_ab : a ∈ Z_p* and b ∈ Z_p }.    (11.4)
Each hash function h_ab maps Z_p to Z_m. This class of hash functions has the nice […]
With p = 17 and m = 6, we have
h_{3,4}(8) = ((3·8 + 4) mod 17) mod 6 = 11 mod 6 = 5
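The family h_ab(k) = ((ak + b) mod p) mod m can be sketched in a few lines. The function names are hypothetical, and `random.Random` is used here only to pick a and b. The final loop checks by brute force, for the small case p = 17 and m = 6, that no pair of distinct keys collides under more than a 1/m fraction of the functions.

```python
import random
from itertools import combinations


def h_ab(a: int, b: int, p: int, m: int, k: int) -> int:
    """One member of the family: ((a*k + b) mod p) mod m."""
    return ((a * k + b) % p) % m


def choose_hash(p: int, m: int, rng: random.Random):
    """Pick a in {1, ..., p-1} and b in {0, ..., p-1} at random and return h_ab."""
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)
    return lambda k: h_ab(a, b, p, m, k)


print(h_ab(3, 4, 17, 6, 8))  # 5

# Brute-force check of universality for p = 17, m = 6.
p, m = 17, 6
family_size = (p - 1) * p
worst = max(
    sum(h_ab(a, b, p, m, k) == h_ab(a, b, p, m, l)
        for a in range(1, p) for b in range(p))
    for k, l in combinations(range(p), 2)
)
print(worst / family_size <= 1 / m)  # True
```

In use, `choose_hash` would be called once when the table is created, so an adversary cannot know in advance which keys will collide.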

Open Addressing
•  In this technique all keys are stored in the
hash table itself. The table therefore has a
fixed size, and its empty slots hold NIL.
•  To insert an element, slots are probed
until an empty one is found, which can take
O(n) time.
•  Each slot holds either a key or NIL,
so a new key is inserted into the first
empty slot of its probe sequence…
•  If the hash table is full, an error is
raised.
[…] be a permutation of ⟨0, 1, ..., m-1⟩ […] considered as a slot for a new key as […] we assume that the elements in the […]mation; the key k is identical to the […] either a key or NIL (if the slot is empty). […] input a hash table T and a key k. It […] key k or flags an error because the h[…]
HASH-INSERT(T, k)
    i = 0
    repeat
        j = h(k, i)
        if T[j] == NIL
            T[j] = k
            return j
        else i = i + 1
    until i == m
    error "hash table overflow"
The algorithm for searching for ke[…] insertion algorithm examined when […]

Open Addressing
[…] terminate (unsuccessfully) when it finds an empty slot, since k would have been inserted there and not later in its probe sequence. (This argument assumes that keys are not deleted from the hash table.) The procedure HASH-SEARCH takes as input a hash table T and a key k, returning j if it finds that slot j contains key k, or NIL if key k is not present in table T.
HASH-SEARCH(T, k)
    i = 0
    repeat
        j = h(k, i)
        if T[j] == k
            return j
        i = i + 1
    until T[j] == NIL or i == m
    return NIL
Deletion from an open-address hash table is difficult. When we delete a key from slot i, we cannot simply mark that slot as empty by storing NIL in it. If we did, we might be unable to retrieve any key k during whose insertion we had probed slot i and found it occupied. We can solve this problem by marking […]
What would happen with Hash-Delete?
Discussion area
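As a starting point for the discussion, here is a sketch of HASH-INSERT and HASH-SEARCH in Python, plus one possible answer for deletion. Everything beyond the two pseudocode procedures is an assumption of this example: the class name, the default probe sequence `(k mod m + i) mod m`, and the `DELETED` sentinel, which marks a removed key without storing NIL.

```python
NIL = None
DELETED = object()  # sentinel for a slot whose key was removed (assumption)


class OpenAddressTable:
    def __init__(self, m, probe=None):
        self.m = m
        self.slots = [NIL] * m
        # probe(k, i) should visit every slot once for i = 0..m-1.
        self.probe = probe or (lambda k, i: (k % m + i) % m)  # placeholder sequence

    def insert(self, k):
        for i in range(self.m):
            j = self.probe(k, i)
            if self.slots[j] is NIL or self.slots[j] is DELETED:
                self.slots[j] = k
                return j
        raise OverflowError("hash table overflow")

    def search(self, k):
        for i in range(self.m):
            j = self.probe(k, i)
            if self.slots[j] is NIL:
                return None  # k would have been inserted here
            if self.slots[j] is not DELETED and self.slots[j] == k:
                return j
        return None

    def delete(self, k):
        j = self.search(k)
        if j is not None:
            self.slots[j] = DELETED  # not NIL, so later searches keep probing past j
        return j


t = OpenAddressTable(5)
print(t.insert(1), t.insert(6), t.insert(11))  # 1 2 3
t.delete(6)
print(t.search(11))  # 3: the search continues past the DELETED slot 2
```

Had `delete` stored NIL in slot 2, the search for 11 would have stopped there and wrongly reported the key as missing.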

Probes – Open Addressing
•  To insert into a hash table with open addressing,
slots are visited and probed until an available one is
found.
•  To avoid O(n) scans for an available
slot, the sequence of slots to probe is computed by a function of
the key.
•  Probing methods (in the end, none of them satisfies the uniform hashing assumption)
  •  Linear (p. 272)
  •  Quadratic
  •  Double hashing*

Linear Probing
•  Linear probing uses a hash function of
the form
h(k, i) = (h'(k) + i) mod m, where h' is an auxiliary hash function.
•  First T[h'(k)] is probed, then T[h'(k) + 1],
and so on up to slot T[m-1]; probing then wraps around to
T[0], T[1], …, T[h'(k) - 1]. The initial probe determines the
entire sequence, so there are only m distinct probe sequences.
•  Long runs of occupied slots build up (primary clustering),
and a search that lands in such a run has to scan it
linearly.
•  An empty slot preceded by i full slots is filled next with probability (i + 1)/m, so long runs tend to get longer.
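Primary clustering is easy to reproduce. The sketch below inserts four keys with linear probing, h(k, i) = (h'(k) + i) mod m, using `k % m` as a placeholder for the auxiliary function h'. The function names and sample keys are assumptions of the example.

```python
def linear_probe(k: int, i: int, m: int) -> int:
    h_aux = k % m  # placeholder auxiliary hash function h'(k)
    return (h_aux + i) % m


def insert(table: list, k: int) -> int:
    m = len(table)
    for i in range(m):
        j = linear_probe(k, i, m)
        if table[j] is None:
            table[j] = k
            return j
    raise OverflowError("hash table overflow")


table = [None] * 13
for key in (10, 23, 36, 11):
    print(key, "->", insert(table, key))
# 10 -> 10, 23 -> 11, 36 -> 12, 11 -> 0
```

Keys 10, 23, and 36 all start at slot 10 and build a run over slots 10 to 12; key 11, which starts at a different slot, is caught by the same run and ends up wrapping around to slot 0.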
Linear probing
Given an ordinary hash function h' : U → {0, 1, ..., m-1} […] an auxiliary hash function, the method of linear probing […]
    h(k, i) = (h'(k) + i) mod m
for i = 0, 1, ..., m-1. Given key k, we first probe T[h'(k)] […] by the auxiliary hash function. We next probe slot T[h'(k) + 1] […] slot T[m-1]. Then we wrap around to slots T[0], T[1], … […] slot T[h'(k) - 1]. Because the initial probe determines th[…] there are only m distinct probe sequences.
Linear probing is easy to implement, but it suffers from […] primary clustering. Long runs of occupied slots build up […] search time. Clusters arise because an empty slot preceded […] next with probability (i + 1)/m. Long runs of occupied […] and the average search time increases.
Quadratic probing
Quadratic probing uses a hash function of the form
Quadratic Probing
•  The hash function has the form
h(k, i) = (h'(k) + c1·i + c2·i²) mod m
•  c1, c2, and m are constants; h' is an auxiliary hash function.
•  This technique performs better than
linear probing.
•  It also introduces secondary clustering:
if two keys have the same initial probe position, their
entire probe sequences coincide, h(k1, i) = h(k2, i). The effect
is milder than the primary clustering of linear probing.
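Quadratic probing, h(k, i) = (h'(k) + c1·i + c2·i²) mod m, only visits every slot for suitable parameters. The sketch below assumes one known-good combination that is not taken from the slides: c1 = c2 = 1/2 with m a power of 2, so the offset is i(i + 1)/2. The auxiliary function `k % m` is a placeholder.

```python
def quadratic_probe_sequence(k: int, m: int) -> list:
    h_aux = k % m  # placeholder auxiliary hash function h'(k)
    # c1 = c2 = 1/2, so c1*i + c2*i**2 = i*(i + 1)/2, which is always an integer.
    return [(h_aux + i * (i + 1) // 2) % m for i in range(m)]


print(quadratic_probe_sequence(3, 8))   # [3, 4, 6, 1, 5, 2, 0, 7]
print(quadratic_probe_sequence(11, 8))  # the same sequence: 11 and 3 share the initial slot
```

The second call shows secondary clustering: two keys with the same initial position follow exactly the same probe sequence.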
Quadratic probing
Quadratic probing uses a hash function of the form
    h(k, i) = (h'(k) + c1·i + c2·i²) mod m,
where h' is an auxiliary hash function, c1 […] and i = 0, 1, ..., m-1. The initial posi[…] probed are offset by amounts that depend in […]ber i. This method works much better than […] the hash table, the values of c1, c2, and m […] one way to select these parameters. Also, if two keys have t[…] position, then their probe sequences are the same, since h(k[…] implies h(k1, i) = h(k2, i). This property leads to a milder form […] secondary clustering. As in linear probing, the initial probe […] sequence, and so only m distinct probe sequences are used.

Double Hashing
•  The hash function is defined as
h(k, i) = (h1(k) + i·h2(k)) mod m
•  where h1 and h2 are auxiliary hash functions.
•  Unlike the other techniques, two auxiliary hash
functions are used, which increases the randomness in the
choice of probe sequences.
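•  For the whole table to be searched, h2(k) must be relatively prime to m. One convenient choice is to let m be a power of 2 and have h2 always return an odd number.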
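•  Estimated cost: under uniform hashing with load factor α = n/m < 1, an unsuccessful search takes at most 1/(1 - α) probes on average.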
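A sketch of insertion with double hashing, h(k, i) = (h1(k) + i·h2(k)) mod m, follows. The auxiliary functions are assumptions of this example: h1(k) = k mod m and h2(k) = 1 + (k mod m2), with m = 13 prime and m2 = 11, so h2(k) is never 0 and is always relatively prime to m.

```python
def double_hash_probe(k: int, i: int, m: int, m2: int) -> int:
    h1 = k % m         # first auxiliary hash (assumed form)
    h2 = 1 + (k % m2)  # second auxiliary hash (assumed form), always >= 1
    return (h1 + i * h2) % m


def insert(table: list, k: int, m2: int) -> int:
    m = len(table)
    for i in range(m):
        j = double_hash_probe(k, i, m, m2)
        if table[j] is None:
            table[j] = k
            return j
    raise OverflowError("hash table overflow")


table = [None] * 13
for key in (79, 69, 72, 98, 50, 14):
    print(key, "->", insert(table, key, m2=11))
# 79 -> 1, 69 -> 4, 72 -> 7, 98 -> 5, 50 -> 11, 14 -> 9
```

Key 14 finds slots 1 and 5 occupied before landing in slot 9. Its step size h2(14) = 4 differs from the step size of 79, which shares its initial slot, so the two keys do not follow the same probe sequence.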
Double hashing
Double hashing offers one of the best methods available for […] because the permutations produced have many of the charac[…] chosen permutations. Double hashing uses a hash function o[…]
    h(k, i) = (h1(k) + i·h2(k)) mod m,
where both h1 and h2 are auxiliary hash functions. The initi[…]tion T[h1(k)]; successive probe positions are offset from pre[…]
[Figure 11.5: a hash table with slots 0 to 12 holding the keys 79, 69, 98, 72, 14, and 50. The surviving caption fragments mention k mod 13 and the key 14.]

Homework: Hashing
•  What is perfect hashing (as opposed to the traditional
approach of collisions plus chaining)?
•  How is it related to universal hashing?
•  How is O(1) time guaranteed?
•  Under which scenarios can perfect hashing be
implemented?
•  How large must m be to guarantee this?
Section 11.5, MIT
