6. Considered prior art
• Avery Wang, “An industrial strength audio search
algorithm,” in Proc. ISMIR, 2003.
• Jaap Haitsma and Antonius Kalker, “A highly robust
audio fingerprinting system,” in Proc. ISMIR, 2002.
• Shumeet Baluja and Michele Covell, “Waveprint:
Efficient wavelet-based audio fingerprinting”, Proc.
Pattern Recognition 41 (2008)
7.
8. 10ms, 100ms Hamming window
FFT 1024, bandwidth
limited to 300-3KHz
18 or 34 MEL-spectrum bands
9. Selection of salient spectral points
Mel
spectrum
1. Detect all maximas
2. Trim to the desired
number
(∆ t) 2
T hr [n] = α E [n − 1] exp −
2 ∗ σ2
where ∆ t is the distance (in frames) between the prev
peak and the considered peak, E [n − 1] is the energy o
α=0.98 σ=40
previously selected peak and α, σ are two parameters use
set the threshold falling rate and its width, respectively. In
proposed implementation we set them to 0.98 and 40 fol
ing a similar implementation used in the Shazam fingerp
by Dan Ellis1.
∆t
(“beautiful”, Timit-female)
time
2.3. Spectr ogram Masking Around Salient Points
Once the salient points have been selected we apply a m
10. Spectral masking around salient points
Si
Sj
ì S > S ® b[n] = 1
ï
i
j
í
ï otherwise ® b[n] = 0
î
Spectral Regions
• Include one or several timefrequency values
• Overlaps are allowed
• The number of comparisons
defines the size of the fingerprint
• Designed manually (for now)
16. database
NIST-TRECVID 2010-2011 data for video-copy detection
– 400h reference videos
– 1400 audio queries per year (201 unique videos X 7)
– 7 audio transformations
•
•
•
•
•
•
•
original
MP3 compression
MP3 compression + multiband companding
Bandwidth limit (500-3K) + single-band companding
Mixed with speech
Mixed with speech + multiband companding
Mixed with speech + bandwidth limit + monoband companding
17. Metric & baseline
• normalized detection cost rate (NDCR) in
balanced profile
NDCRBALANCED = PMISS + 200RFA
• Compare results with a similar fingerprint to
the Philips fingerprint
Jaap Haitsma and Antonius Kalker, “A highly robust audio fingerprinting system,” in Proc.
International Symposium on Music Information Retrieval (ISMIR), 2002.
22. Conclusions
• A novel binary fingerprint is proposed to
improve on some shortcomings from well
reputed prior art.
• We show that we can extract the FP and use it
for VCD with excellent results
Notas del editor
Discriminability: separate between audios that are not the same, and join together those audios that are the sameRobustness: Even if we attack it with transformations we still can have similar discriminabilityCompactness: It does not take much space to storeEfficiency: to compute and to compare with
CONS: encodes tuples of points, which needs to encode more pairs in order to have the same error probability than when encoding only oneEncodes the exact frequency location, which then to compare you need to transform from binary to integer and substiture
CONS:Noisy areas are also encoded, leading to big errorsOnly adjacent energies are compared, not being very flexibleSingle energy points are encoded, leading to big problems when these are very similar and there is noise
CONS:High computational cost to extract the fingerprint
Individual encoding: we encode the surroundings of each spectral maxima, which allows us to encode less keypoints and obtain quite a low probability of errors, as errors in the detection of one maxima do not transform onto the others.Truly binary fingerprint: we can compare the fingerprints totally in the binary diomain or with a comparison table (for the bands)Variable complexity: we can easily define the number of bits in the FPVariable rate: depending on the application we might be interested in more or less FP per secondEasy scalable implementation: inverted file indices
MASK extraction process
Trimming inspired on Dan Ellis’ implementation of the Shazam Algorithm
The method is universal for any kind of audioSee salient points with a maxima and a decay, or clear maximas
Or maximas stable over time in the same frequency
Or varying maximas over time
Or varying maximas overtimeTHIS INDICATES HOW IMPORTANT IT IS TO DETECT THE MAXIMAS,BUT ALSO TO ENCODE HOW THE ENERGY FLOWS AROUND THESE MAXIMA
Spectral region Si: groups of spectral time-frequency energy values grouped together and whose energy is averagedSpectral regions are compared and information is encoded by comparing their values
If the maxima is on the N-2 or 2 bands we replicate the last one to make the mask fit
Results when returning the 20-best and only the 1-best:interesting depending on the application (Trecvid versus search application)20-best gets a big hit due to the importance of FA in Trecvid
Min NDCR sets the optimum threshold for each transformation -> it is not a realistic caseWe compute the actual NDCR setting the threshold to the mean of the optimum thresholds for all transformsThe threshold STD is shown -> much smaller for MASK
Results in more detailThe only bad one is transform 6 (mix with speech and multiband compression)
Histogram for the 1-best scoresGround truth says 70% of matchesVery flat Philips scores vs. bimodal MASK scoresThreshold used of 0.27 is a good one in this figure