2. Query-by-Example Spoken-Term Detection
Given a spoken query, we search for instances of it at the
lexical level within spoken documents.
It is similar to Spoken Term Detection (NIST STD2006,
Babel 2013) but…
Queries are spoken
Different speakers
Different acoustic conditions
No prior knowledge of the
language might be available
4. Information Retrieval-based DTW
• Inspired by the Subsequence Dynamic Time Warping (S-DTW)
algorithm by Müller [1]
• It performs a ‘sparse’ matching of two signals, as in
Jansen [2]
• Uses ideas borrowed from Information Retrieval to
save memory (lots of it)
• It can take advantage of pre-indexing all reference
data and thus perform a fast frame-level matching
(described in [3])
[1] Meinard Müller, “Information Retrieval for Music and Motion”, Springer-Verlag, ISBN 978-3-540-74047-6, pp. 147-150, 2010
[2] Aren Jansen, Benjamin Van Durme, “Indexing Raw Acoustic Features for Scalable Zero Resource Search”, in Proc. Interspeech 2012
[3] Gautam Mantena, Xavier Anguera, “Speed Improvements to Information Retrieval-based Dynamic Time Warping Using Hierarchical K-means Clustering”, in Proc. ICASSP 2013
12. From 2D to 1D: Memory efficient matching
We borrow an alignment
algorithm used for
Information Retrieval
It finds unconstrained start-end locations but does not
allow any time warping
With IRDTW we modified this algorithm to allow for
time-warped matching
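The non-warping alignment described on this slide can be sketched as a count of matching frame pairs per offset. This is a minimal illustration, not the author's code: the function name `count_matches_by_offset` and the toy `matches` list are hypothetical, and the list is assumed to hold (query frame, reference frame) pairs already judged similar.

```python
from collections import defaultdict

def count_matches_by_offset(matches):
    """Count matching frame pairs per offset tq - tr (no time warping)."""
    counts = defaultdict(int)
    for tq, tr in matches:
        # All points on the same diagonal of the similarity matrix share
        # one offset, so a 1D vector of counts replaces the 2D matrix.
        counts[tq - tr] += 1
    return dict(counts)

# Toy input (hypothetical): three matches on one diagonal, one stray match.
matches = [(0, 10), (1, 11), (2, 12), (5, 3)]
counts = count_matches_by_offset(matches)
best_offset = max(counts, key=counts.get)
```

Because every match on a diagonal lands in the same bin, this scheme cannot follow a path that drifts across diagonals, which is the limitation the time-warped variant addresses.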
13. We use the ‘matching counts’ vector in the dynamic
programming instead of the similarity matrix.
The end position of each
path defines its location
in the 1D vector
The new matching point defines
a target location where one of
the paths will warp to
14. What information is stored in this vector?
ΔT = tqi - trj
For each path we store:
• query(start, end)
• reference(start,end)
• Accumulated Distance
• #matching points
• Only paths with #matches > 1 are stored in the ΔT vector
• Size(ΔT) = size_query + size_ref (can be constrained using a circular buffer)
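One possible layout for the per-path record and the ΔT vector is sketched below. The field names, the `Path` class, and the `delta_t_index` helper are hypothetical; the sketch only assumes the four items listed above and a vector of size size_query + size_ref, without the circular buffer.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Path:
    q_start: int       # query(start, end)
    q_end: int
    r_start: int       # reference(start, end)
    r_end: int
    accum_dist: float  # accumulated distance
    n_matches: int     # number of matching points

def delta_t_index(tq: int, tr: int, size_ref: int) -> int:
    """Map the offset tq - tr (which can be negative) to a list index."""
    return (tq - tr) + size_ref

# Hypothetical sizes, in frames.
size_query, size_ref = 50, 400
delta_t: List[Optional[Path]] = [None] * (size_query + size_ref)

# Store a path whose last matching point is (tq=3, tr=120).
delta_t[delta_t_index(3, 120, size_ref)] = Path(0, 3, 117, 120, 0.8, 4)
```

Shifting by `size_ref` is just one way to keep indices non-negative; a circular buffer, as mentioned on the slide, would bound the vector further.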
15. Applying warping constraints in 1D
Constraints in the similarity matrix translate as:
1. Consider all paths within range (Wrange)
2. Check for local constraints
• Basic warping:
Δq > 0
Δr > 0
• 2X warping:
Δq ≥ Δr/2
Δq ≤ 2*Δr
[Figure: search range on the ΔT vector, labeled Wrange and Wrange/2; Δq and Δr measured along the Query and Reference axes]
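The local constraints on this slide can be written as two small predicates. This is a sketch under assumptions: Δq and Δr are taken to be the steps in query and reference frames from a path's end point to the new matching point, the function names are hypothetical, and the 2X check is assumed to also require the basic forward-progress condition.

```python
def satisfies_basic(dq: int, dr: int) -> bool:
    """Basic warping: the path must advance in both query and reference."""
    return dq > 0 and dr > 0

def satisfies_2x(dq: int, dr: int) -> bool:
    """2X warping: the query step is within a factor of 2 of the reference step."""
    # Assumption: forward progress is required in addition to the slope bounds.
    return satisfies_basic(dq, dr) and dr / 2 <= dq <= 2 * dr

# A step of 2 query frames over 1 reference frame is allowed; 3 over 1 is not.
allowed = satisfies_2x(2, 1)
rejected = satisfies_2x(3, 1)
```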
16. Best matching path selection
We select the path with the largest number of matches. It is
then warped to end at the current matching point
ΔT = tqi - trj
New path info:
• q_end = tqi
• r_end = trj
• Accum. Distance += d(qi, rj)
• #matches++
We can dynamically save memory by eliminating obsolete paths
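The selection and update step can be sketched as follows. The dictionary keys and the function name `extend_best_path` are hypothetical, and starting a fresh path when no candidate is available is an assumption of this sketch, not something the slide states.

```python
def extend_best_path(candidates, tq, tr, dist):
    """Warp the candidate with the most matches so it ends at (tq, tr)."""
    if not candidates:
        # Assumption: with no candidate in range, a new path starts here.
        return {"q_start": tq, "q_end": tq, "r_start": tr, "r_end": tr,
                "accum_dist": dist, "n_matches": 1}
    best = max(candidates, key=lambda p: p["n_matches"])
    # The start points are kept; only the end point and counters change.
    return {**best,
            "q_end": tq,
            "r_end": tr,
            "accum_dist": best["accum_dist"] + dist,
            "n_matches": best["n_matches"] + 1}

# Two hypothetical candidate paths that passed the range and local checks.
candidates = [
    {"q_start": 0, "q_end": 3, "r_start": 10, "r_end": 14,
     "accum_dist": 1.2, "n_matches": 4},
    {"q_start": 1, "q_end": 3, "r_start": 12, "r_end": 13,
     "accum_dist": 0.5, "n_matches": 2},
]
new_path = extend_best_path(candidates, tq=5, tr=16, dist=0.3)
```

The updated path would then be stored at the position given by its new end point, tq - tr, in the 1D vector.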
18. Acoustic features
• Posteriorgram features are used (Zhang-Glass
2010)
– MFCC-39 -> GMM-64 Posterior probability vectors
• Distance between features:
d(x_m, y_n) = -\log\left( \sum_{i=0}^{N-1} \frac{x_m[i] \cdot y_n[i]}{|x_m| \, |y_n|} \right)
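A small sketch of this distance between two posterior vectors, assuming the denominator is the product of the two vector norms (that is, the negative log of a normalized dot product). The function name and the `eps` floor, which avoids log(0) for vectors with no overlap, are additions for illustration.

```python
import math

def neg_log_cosine(x, y, eps=1e-12):
    """-log of the normalized dot product of two posterior vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Posterior vectors sum to 1, so the norms are non-zero.
    # eps is an assumption of this sketch: it keeps the log finite.
    return -math.log(max(dot / (norm_x * norm_y), eps))

# Toy 3-dimensional posteriors (the slides use 64 Gaussians).
d_same = neg_log_cosine([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])
d_close = neg_log_cosine([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
d_far = neg_log_cosine([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
```

Identical vectors give a distance near zero, and the distance grows as the two posterior distributions diverge.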
19. Query-by-example Spoken Term
Detection system*
[Figure: system block diagram with two modes, Index mode and Search mode. Blocks: Background model training, Background model, Search corpus, Feature extractor, Query, Feature extractor, VAD models training, Development dataset, Energy-based VAD, VAD model, IR-DTW, Local S-DTW refinement, Overlap pruning]
*X. Anguera, “Telefonica system for the Spoken Web Search Task at Mediaeval 2012”,
Mediaeval 2012 Workshop, Pisa, Italy
20. Performance evaluation
• Database: Mediaeval SWS 2012 data (4
African languages, subset of the Lwazi database*)
– ~4h development corpus + 100 queries
– ~4h evaluation corpus + 100 queries
• Metrics:
– Maximum Term Weighted Value (MTWV)
– Memory usage
*E. Barnard, M. Davel, C. V. Heerden, “ASR Corpus Design for Resource-Scarce Languages”, in
Proc. Interspeech 2009
21. Maximum Term Weighted Value
System          Dev. Set    Eval Set
Diagonal        0.258       0.276
IR-DTW          0.394       0.394
S-DTW           0.443       0.450
Rails system    0.381       0.384
Contrastive systems:
• Diagonal: Substitute IR-DTW by only allowing diagonal matches
• S-DTW: Implementation as in [1]
• Rails system: scores from [2] on the same database
[1] X. Anguera and M. Ferrarons, “Memory-Efficient Subsequence-DTW for Query-by-Example
Spoken Term Detection”, in Proc. ICME, 2013
[2] A. Jansen, B. V. Durme and P. Clark, “The JHU-HLTCOE Spoken Web Search System for
Mediaeval 2012”, in Proc. Mediaeval Workshop 2012, Pisa, Italy
22. Memory usage analysis
System     Dev. Set (mean/std)    Eval Set (mean/std)
S-DTW      506.2MB/342.8MB        568.1MB/326.4MB
IR-DTW     91.7MB/15MB            112.3MB/21.8MB
23. Conclusions and Future Work
• We have introduced the IR-DTW algorithm and
demonstrated its potential in the QbE-STD task.
– Its main advantage is its low memory usage
– Accuracy still falls short of an exhaustive/traditional
search
Not anymore!
• We are testing IR-DTW in other tasks
– Large volumes of data, where building similarity
matrices is not feasible
– Non-speech applications that can benefit from
sparse matching
24. Thanks for your attention
Questions?
Xavier Anguera
xanguera@tid.es
Download the code from here:
http://www.xavieranguera.com/resources/resources.html#IRDTW