Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Information Retrieval-based
Dynamic Time Warping
Xavier Anguera
Telefonica Research
Spain

Query-by-Example Spoken-Term Detection
Given a spoken query, we search for instances at the lexical
level within spoken documents.
It is similar to Spoken Term Detection (NIST STD2006,
Babel 2013) but…
• Queries are spoken
• Different speakers
• Different acoustic conditions
• No prior knowledge of the language might be available

Information Retrieval-based
Dynamic Time Warping Algorithm
(IRDTW)

Information Retrieval-based DTW
• Inspired by the subsequence dynamic time warping (S-DTW)
algorithm by Müller [1]
• It performs a ‘sparse’ matching of two signals, as in
Jansen [2]
• Uses ideas borrowed from Information Retrieval to
save memory (lots of it)
• It can take advantage of pre-indexing all reference
data and thus perform a fast frame-level matching
(described in [3])
[1] Meinard Müller, “Information Retrieval for Music and Motion”, Springer-Verlag, ISBN 978-3-540-74047-6, pp. 147-150, 2010
[2] Aren Jansen, Benjamin Van Durme, “Indexing Raw Acoustic Features for Scalable Zero Resource Search”, in Proc. Interspeech 2012
[3] Gautam Mantena, Xavier Anguera, “Speed Improvements to Information Retrieval-based Dynamic Time Warping Using Hierarchical K-means Clustering”, in Proc. ICASSP 2013

Subsequence-DTW algorithm (review)
[Figure: S-DTW alignment between the query term and the reference term]

‘Sparse’ frame matching
Only the closest (lowest distance) query-reference pairs are
considered. These can be found through…
• Exhaustive comparison
• Efficient retrieval using indexing techniques
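
The exhaustive-comparison option can be sketched as follows. This is only an illustration, not the presented implementation: the function name `sparse_matches`, the threshold `max_dist`, and the use of a negative-log normalised dot product as the frame distance are assumptions made for the example.

```python
import numpy as np

def sparse_matches(query, reference, max_dist):
    """Return (q_idx, r_idx, dist) for every frame pair with dist <= max_dist.

    query:     array of shape (n_query_frames, n_dims)
    reference: array of shape (n_ref_frames, n_dims)
    max_dist:  hypothetical threshold that makes the matching 'sparse'
    """
    query = np.asarray(query, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Normalise every frame to unit length (frames are assumed non-zero).
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    # Exhaustive comparison: the full query-by-reference similarity matrix.
    sim = np.clip(q @ r.T, 1e-12, None)  # clip avoids log(0)
    dist = -np.log(sim)
    # Keep only the closest pairs.
    q_idx, r_idx = np.nonzero(dist <= max_dist)
    return [(int(i), int(j), float(dist[i, j])) for i, j in zip(q_idx, r_idx)]
```

Note that this sketch still builds the full distance matrix before discarding most of it; an index-based retrieval would avoid that step.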
[Figure: two frame-matching plots; axis ticks run 10-70 and 100-800]

‘Sparse’ dynamic programming
[Figure: IR-DTW and S-DTW matching paths, query versus reference]

IR-DTW warping constraints
[Figure: IR-DTW search region over the query, labelled WRange = maxQDist / 2 and WRange = maxQDist]

Possible constraints:
• Amount of warping:
– basic warping
– 2X warping
• Length of the match

From 2D to 1D: Memory efficient matching
We borrow an alignment algorithm used for Information Retrieval.
It finds unconstrained start-end locations but does not
allow any time-warping.

With IRDTW we modified this algorithm to allow for
time-warped matching.

We use the ‘matching counts’ vector in the dynamic
programming instead of the similarity matrix.
• The end position of each path defines its location in the 1D vector
• The new matching point defines a target location to which one of the paths will warp
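
As a point of reference, a non-warping offset-counting alignment of the kind described above might look like the sketch below. It is an assumption about the general technique, not the borrowed algorithm itself: each matching pair votes for its offset t_q - t_r, and offsets with many votes indicate diagonal (unwarped) alignments. The name `diagonal_match_counts` is hypothetical.

```python
from collections import Counter

def diagonal_match_counts(matches):
    """Count matching points per offset t_q - t_r.

    matches: iterable of (t_q, t_r) frame-index pairs that were found
             to be close. No time-warping is allowed here: a pair only
             supports the diagonal it lies on.
    """
    counts = Counter()
    for t_q, t_r in matches:
        counts[t_q - t_r] += 1
    return counts
```

With this structure, memory grows with the number of distinct offsets rather than with the query-by-reference matrix.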

What information is stored in this vector?
ΔT = t_qi - t_rj

For each path we store:
• query(start, end)
• reference(start, end)
• Accumulated distance
• #matching points

• Only paths with #matches > 1 are stored in the ΔT vector
• Size(ΔT) = size_query + size_ref (can be constrained using a circular buffer)
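
One possible layout for the per-path record and the ΔT vector is sketched below. The names `Path`, `delta_t_index` and `vector_size` are hypothetical, and the shift by `ref_len - 1` is simply one way to turn a possibly negative ΔT into an array index.

```python
from dataclasses import dataclass

@dataclass
class Path:
    """Information stored for each path, following the list above."""
    q_start: int      # query start frame
    q_end: int        # query end frame
    r_start: int      # reference start frame
    r_end: int        # reference end frame
    acc_dist: float   # accumulated distance
    n_matches: int    # number of matching points

def vector_size(query_len: int, ref_len: int) -> int:
    """Size of the ΔT vector as stated on the slide."""
    return query_len + ref_len

def delta_t_index(t_q: int, t_r: int, ref_len: int) -> int:
    """Map ΔT = t_q - t_r to a non-negative slot in the vector.

    ΔT ranges from -(ref_len - 1) to query_len - 1, so adding
    ref_len - 1 gives an index between 0 and query_len + ref_len - 2.
    """
    return t_q - t_r + (ref_len - 1)
```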

Applying warping constraints in 1D
Constraints in the similarity matrix translate as:
1. Consider all paths within range of ΔT
[Figure: search range around ΔT, labelled WRange and WRange/2]
2. Check for local constraints
• Basic warping: Δq > 0, Δr > 0
• 2X warping: Δq ≥ Δr/2, Δq ≤ 2*Δr
[Figure: Δq and Δr measured along the query and reference axes]
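
The local constraints can be written as two small predicates. This is a sketch under assumptions: Δq and Δr are taken to be the displacement from a path's current end point to the new matching point, and the 2X check also requires both to be positive. The function names are hypothetical.

```python
def displacement(path_q_end, path_r_end, t_q, t_r):
    """Return (Δq, Δr) from a path's end point to the new matching point."""
    return t_q - path_q_end, t_r - path_r_end

def basic_warping_ok(dq, dr):
    """Basic warping: the path must move forward in both signals."""
    return dq > 0 and dr > 0

def warping_2x_ok(dq, dr):
    """2X warping: forward movement with a slope between 1/2 and 2."""
    return dq > 0 and dr > 0 and dr / 2 <= dq <= 2 * dr
```

For instance, a step of Δq = 1, Δr = 3 passes the basic check but fails the 2X check, because the reference advances more than twice as fast as the query.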

Best matching path selection
We select the path with the largest number of matches. It is
then warped to end at the current matching point, ΔT = t_qi - t_rj.

New path info:
• q_end = t_qi
• r_end = t_rj
• Accum. Distance += d(qi, rj)
• #matches++

We can dynamically save memory by eliminating obsolete paths.
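
The selection and update step might be sketched as below. Several simplifications are assumed and are not taken from the slides: the ΔT vector is a dictionary keyed by ΔT, the search window is symmetric (±`w_range`), only the basic warping constraint is checked, a point with no compatible path starts a new single-match path, and any path already sitting in the target slot is overwritten. All names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Path:
    q_start: int
    q_end: int
    r_start: int
    r_end: int
    acc_dist: float
    n_matches: int

def extend_best_path(paths, t_q, t_r, dist, w_range):
    """Warp the best compatible path to the new matching point (t_q, t_r).

    paths: dict mapping ΔT -> Path (a sparse stand-in for the ΔT vector).
    dist:  frame distance d(q_i, r_j) of the new matching point.
    """
    dt_new = t_q - t_r
    best_dt, best = None, None
    # 1. Consider all paths within range of the new ΔT.
    for dt in range(dt_new - w_range, dt_new + w_range + 1):
        p = paths.get(dt)
        if p is None:
            continue
        # 2. Local constraint (basic warping): move forward in both signals.
        if t_q - p.q_end <= 0 or t_r - p.r_end <= 0:
            continue
        # Keep the candidate with the largest number of matches.
        if best is None or p.n_matches > best.n_matches:
            best_dt, best = dt, p
    if best is None:
        new = Path(t_q, t_q, t_r, t_r, dist, 1)  # start a new path
    else:
        del paths[best_dt]  # the path moves to its new end position
        new = Path(best.q_start, t_q, best.r_start, t_r,
                   best.acc_dist + dist, best.n_matches + 1)
    paths[dt_new] = new
    return new
```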

Query-by-Example
Spoken-Term detection
system

Acoustic features
• Posteriorgram features are used (Zhang-Glass
2010)
– MFCC-39 -> GMM-64 Posterior probability vectors

• Distance between features:
d(x_m, y_n) = -\log\left( \sum_{i=0}^{N-1} \frac{x_m[i] \cdot y_n[i]}{|x_m| \, |y_n|} \right)
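
A direct transcription of this distance into code could look as follows. It assumes the denominator is the product of the two vectors' Euclidean norms; the floor `eps` and the function name `posteriorgram_distance` are additions for the example, there only to avoid taking the logarithm of zero.

```python
import numpy as np

def posteriorgram_distance(x_m, y_n, eps=1e-12):
    """Negative log of the normalised dot product of two posterior vectors."""
    x_m = np.asarray(x_m, dtype=float)
    y_n = np.asarray(y_n, dtype=float)
    # Normalised dot product (cosine similarity) of the two frames.
    sim = np.dot(x_m, y_n) / (np.linalg.norm(x_m) * np.linalg.norm(y_n))
    # Posteriors are non-negative, so sim lies in [0, 1]; floor it at eps.
    return float(-np.log(max(sim, eps)))
```

Identical frames give a distance close to 0, while frames with no overlapping posterior mass give a large value bounded by -log(eps).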

Query-by-example Spoken Term Detection system*
[Block diagram with two operating modes, index mode and search mode. Blocks shown: background model training, background model, search corpus, query, feature extractor (x2), VAD models training, development dataset, VAD model, energy-based VAD (x2), IR-DTW, local S-DTW refinement, overlap pruning]
*X. Anguera, “Telefonica system for the Spoken Web Search Task at MediaEval 2012”,
MediaEval 2012 Workshop, Pisa, Italy

Performance evaluation
• Database: MediaEval SWS 2012 data (4
African languages, subset of the Lwazi database*)
– ~4h development corpus + 100 queries
– ~4h evaluation corpus + 100 queries

• Metrics:
– Maximum Term Weighted Value (MTWV)
– Memory usage

*E. Barnard, M. Davel, C. V. Heerden, “ASR Corpus Design for Resource-Scarce Languages”, in
Proc. Interspeech 2009

Maximum Term Weighted Value

System          Dev. Set    Eval Set
Diagonal        0.258       0.276
IR-DTW          0.394       0.394
S-DTW           0.443       0.450
Rails system    0.381       0.384

Contrastive systems:
• Diagonal: IR-DTW replaced by a matcher that allows only diagonal matches
• S-DTW: Implementation as in [1]
• Rails system: scores from [2] on the same database
[1] X. Anguera and M. Ferrarons, “Memory-Efficient Subsequence-DTW for Query-by-Example Spoken Term Detection”, in Proc. ICME, 2013
[2] A. Jansen, B. V. Durme and P. Clark, “The JHU-HLTCOE Spoken Web Search System for MediaEval 2012”, in Proc. MediaEval Workshop 2012, Pisa, Italy

Memory usage analysis

System    Dev. Set (mean/std)     Eval Set (mean/std)
S-DTW     506.2 MB / 342.8 MB     568.1 MB / 326.4 MB
IR-DTW    91.7 MB / 15 MB         112.3 MB / 21.8 MB

Conclusions and Future Work
• We have introduced the IR-DTW algorithm and
demonstrated its potential in the QbE-STD task.
– Its main advantage is its low memory usage
– Accuracy still falls short of an exhaustive/traditional
search
Not anymore!

• We are testing IR-DTW in other tasks
– Large volumes of data for which building similarity
matrices is not feasible
– Applications outside speech that can benefit from
sparse matching
Thanks for your attention

Questions?
Xavier Anguera
xanguera@tid.es
Download the code from here:
http://www.xavieranguera.com/resources/resources.html#IRDTW
