Web & Social Media Analytics Previous Year Question Paper.pdf
Advances in discrete energy minimisation for computer vision
1. String algorithms
ast • Knuth-Morris-Pratt algorithm
ime: - find pattern P in string T
- preprocess P (construct Fail array)
- search time: O(n+m) n=|T|, m=|P|
oday: • Data structures for strings
- trie
- suffix tree & generalized suffix tree
- suffix array
• Many applications
- e.g. computing common longest substring
2. Trie data structure
• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”
• Rooted tree h
• Edges labeled with letters
a e i
• Leafs correspond to input strings
d l hi
had d p
{ had, held, help, hi }
held help
3. Trie data structure
• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”
• Rooted tree had
held
• Edges labeled with letters help
• Leafs correspond to input strings hi
{ had, held, help, hi }
4. Trie data structure
• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”
• Rooted tree h
• Edges labeled with letters
had
• Leafs correspond to input strings held
help
hi
{ had, held, help, hi }
5. Trie data structure
• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”
• Rooted tree h
• Edges labeled with letters
a e i
• Leafs correspond to input strings
had held hi
help
{ had, held, help, hi }
6. Trie data structure
• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”
• Rooted tree h
• Edges labeled with letters
a e i
• Leafs correspond to input strings
d held hi
help
had
{ had, held, help, hi }
7. Trie data structure
• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”
• Rooted tree h
• Edges labeled with letters
a e i
• Leafs correspond to input strings
d l hi
had held
{ had, held, help, hi } help
8. Trie data structure
• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”
• Rooted tree h
• Edges labeled with letters
a e i
• Leafs correspond to input strings
d l hi
had d p
{ had, held, help, hi }
held help
9. Trie data structure
• Cannot represent strings T and TT'
• Solution: append special symbol $
to the end of each word
h
a e i
d l hi
had d p
{ had, held, help, hi }
held help
10. Trie data structure
• Cannot represent strings T and TT'
• Solution: append special symbol $
to the end of each word
h
$ a e i
h$
$
d l
hi$
$ d p
{ had$, held$, help$, hi$, h$ }
had$ $ $
held$ help$
11. Trie data structure
• Application: storing dictionaries
• Space: O(m1+...+mk), mi=length of string Ti
• Looking up word of length n: O(n)
- assuming alphabet has constant size h
$ a
• Alternative to binary trees e i
- pros and cons (see wikipedia) h$
$
d l
had
hi$
h held d p
$
help hi
had$ $ $
- all nodes store items, not just leaves
- unlabeled edges held$ help$
12. Tries for storing suffixes
• This lecture: use tries for storing the set of suffixes of string T
c a o $
a c o $ $
cacao$
acao$
c o a o$
$ cao$
a o ao$ ao$
$
o$
cao$
o $ $
$ acao$
cacao$
14. Reducing space requirements
• # leafs: n+1
• # internal nodes: at most n
• => O(n) storage
# edges: at most 2n edges
• Trick 2: Store edge labels via two pointers into the string
(begin & end)
cacao$
ca a o$ $
acao$
cao$
o$
cao$ o$ cao$ o$ $ ao$
o$
cao$ acao$ ao$
cacao$ $
15. Suffix trees
• Called the suffix tree for string T [Weiner’73]
• D. Knuth: “algorithm of the year 1973”
• Can be constructed in linear time
- e.g. [McCreight’76], [Ukkonen’95]
- algorithms are complicated [will not be covered]
• Used for all sorts of string problems
cacao$
ca a o$ $
acao$
cao$
o$
cao$ o$ cao$ o$ $ ao$
o$
cao$ acao$ ao$
cacao$ $
16. Suffix trees for string matching
Does text T contain pattern P?
• Construct suffix tree for T in O(|T|) time
• Query takes O(|P|) time!
• same as Knuth-Morris-Pratt, but...
cacao$
ca a o$ $
acao$
cao$
o$
cao$ o$ cao$ o$ $ ao$
o$
cao$ acao$ ao$
cacao$ $
17. Suffix trees for string matching
• Scenario: text T is fixed, patterns P1,...,Pk come one at a time
- T is very long
• Time: O(|T| + |P1| + |P2| + ... + |Pk|)
• Knuth-Morris-Pratt can be generalized to multiple patterns
(Aho-Corasick alg.). Same complexity as above, but requires
knowledge of P1,...,Pk in advance
cacao$
ca a o$ $
acao$
cao$
o$
cao$ o$ cao$ o$ $ ao$
o$
cao$ acao$ ao$
cacao$ $
18. Getting extra information
Task: Get all occurrences of P in T (give a list of start positions)
• Can be done in O(|P|+c) time where c is # of occurences
- Locate the node corresponding to P
- Traverse all leaves of the subtree rooted at this node
cacao$
ca a o$ $
acao$
cao$
o$
cao$ o$ cao$ o$ $ ao$
o$
cao$ acao$ ao$
cacao$ $
19. Getting extra information
Task: Get # of occurrences of P in T
• Can be done in O(|P|) time
- With O(|T|) preprocessing
cacao$
ca a o$ $
acao$
cao$
o$
cao$ o$ cao$ o$ $ ao$
o$
cao$ acao$ ao$
cacao$ $
20. Generalized suffix tree
• Input: a set of strings {T1,...,Tk}
• Generalized suffix tree: tree containing all suffixes of all strings
• Tree for { abba , bac } : abba$ 1
bba$1
a $2
b c$2 $1 ba$1
a$1
c$2 $1 $2
bba$1 $1 a $1
c$2 ba$1
abba$1 ac$2 a$1 bba$1 bac$2
$1 c$2 ac$2
ba$1 bac$2
c$2
21. Construction of generalized suffix tree for {X, Y}
• Construct suffix tree for X $1Y $2
• Edges leading to red leaves are labeled as “...$1Y $2”; delete “Y $2”
abba$1bac$2
bba$1bac$2
$1bac$2 ba$1bac$2
a$1bac$2
...$1bac$2 $1bac$2 $1bac$2 $1bac$2
...$1bac$1
abba$1bac$2 a$1bac$2 bba$1bac$2 bac$2
ac$2
$1bac$2
c$2
ba$1bac$2 $2
22. Construction of generalized suffix tree for {X, Y}
• Construct suffix tree for X $1Y $2
• Edges leading to red leaves are labeled as “...$1Y $2”; delete “Y $2”
abba$1bac$2
bba$1bac$2
$1 ba$1bac$2
a$1bac$2
...$1 $1 $1 $1bac$2
...$1
abba$1 a$1 bba$1 bac$2
ac$2
$1
c$2
ba$1 $2
23. Construction of generalized suffix tree {T 1,...,Tk}
• Can use the same trick
- construct suffix tree for T1 $1 ... Tk $k
• Linear runtime: O(|T1|+...+|Tk|)
• In practice, modify the algorithm (e.g. Ukkonen’s alg.)
to get a slightly faster performance
24. Computing common longest subsequence of {T1,T2}
• Mark each internal node with red (blue)
if it has at least one red (blue) child
• Nodes with both colors contain common subsequences
abba$1
bba$1
a $2
b c$2 $1 ba$1
a$1
c$2 $1 $2
bba$1 $1 a $1
c$2 ba$1
abba$1 ac$2 a$1 bba$1 bac$2
$1 c$2 ac$2
ba$1 bac$2
c$2
25. Computing common longest subsequence of {T1,...,TK}
Task: For each r = 2,...,K compute the length of the longest sequence
contained in at least r strings
1. Construct generalized suffix tree for {T1,...,TK}
...
...
26. Computing common longest subsequence of {T1,...,TK}
Task: For each r = 2,...,K compute the length of the longest sequence
contained in at least r strings
1. Construct generalized suffix tree for {T1,...,TK}
2. Mark each node v with color k=1,...,K if it has a leaf with that color
3. Compute c(v) = # of colors at v
- Naive computation: O(nK), n=|T1|+...+|TK|. Can also be done in O(n)
String corresponding to v is contained in exactly c(v) input strings
...
...
27. Suffix array for text T[1..n]
SA[0..n]
mississippi 0 ε 11 • SA[i] = id of the i’th
ississippi 1 i 10 smallest suffix
ssissippi 2 ippi 7
sissippi 3 issippi 4
issippi 4 ississippi 1
• Number k corresponds
ssippi 5 mississippi 0 to suffix T[k+1..n]
sippi 6 pi 9
ippi 7 ppi 8
ppi 8 sippi 6
pi 9 sissippi 3 • Lexicographic order:
i 10 ssippi 5 - X < Xα...
ε 11 ssissippi 2 - Xα... < Xβ... if α<β
Suffix array: stores suffixes of string T in a lexicographic order
28. Pattern search from a suffix array
Task: Find pattern P = iss in text T = mississippi
m n
ε 11
i 10 • All occurrences appear contiguously in the array
ippi 7
issippi 4 • Computing start & end indexes: binary search
ississippi 1
mississippi 0 • Worst-case: O(m log n)
pi 9
ppi 8
sippi 6 • In practice, often O(m + log n)
sissippi 3
ssippi 5 • Can be improved to O(m + log n) worst-case
ssissippi 2 - need to store extra 2n numbers
29. Construction of suffix arrays
ε 11 • SA[0..n] can be constructed in linear time
i 10
ippi 7 - from a suffix tree
issippi 4
- or directly
ississippi 1
mississippi 0 [Kim,Sim,Park,Park’03]
pi 9 [Kärkkäinen&Sanders’03]
ppi 8 [Ko & Aluru’03]
sippi 6
sissippi 3
ssippi 5
ssissippi 2
30. Summary
• Suffix trees & suffix arrays: two powerful data structures for strings
• Space requirements:
- suffix trees: linear but with a large constant, e.g. 20 |T|
- suffix arrays: more efficient, e.g. 5 |T|
• Suffix arrays are more suited to large alphabets
• Practitioners like suffix arrays (simplicity, space efficiency)
• Theoreticians like suffix trees (explicit structure) ε 11
i 10
ippi 7
issippi 4
ississippi 1
mississippi 0
pi 9
ppi 8
sippi 6
sissippi 3
ssippi 5
ssissippi 2
31. Further information on string algorithms
• Dan Gusfield, “Algorithms on Strings, Trees and Sequences:
Computer Science and Computational Biology”, 1997
• Slides mentioning more recent results:
http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt