SlideShare una empresa de Scribd logo
1 de 22
MASK: Robust Local Features for
Audio Fingerprinting
Xavier Anguera, Antonio Garzón and
Tomasz Adamek
Telefonica Research
Outline
•
•
•
•

What is audio fingerprinting
MASK proposal
Experiments
Conclusions
?
watermarking

fingerprinting
What makes a good audio
fingerprint?
Discriminability

Robustness

MASK
FIngerprint

Compactness

Efficiency

MASK == Masked Audio Spectral Keypoints
Considered prior art
• Avery Wang, “An industrial strength audio search
algorithm,” in Proc. ISMIR, 2003.
• Jaap Haitsma and Antonius Kalker, “A highly robust
audio fingerprinting system,” in Proc. ISMIR, 2002.
• Shumeet Baluja and Michele Covell, “Waveprint:
Efficient wavelet-based audio fingerprinting”, Proc.
Pattern Recognition 41 (2008)
10ms, 100ms Hamming window

FFT 1024, bandwidth
limited to 300-3KHz

18 or 34 MEL-spectrum bands
Selection of salient spectral points
Mel
spectrum

1. Detect all maximas
2. Trim to the desired
number

(∆ t) 2
T hr [n] = α E [n − 1] exp −
2 ∗ σ2
where ∆ t is the distance (in frames) between the prev
peak and the considered peak, E [n − 1] is the energy o
α=0.98 σ=40
previously selected peak and α, σ are two parameters use
set the threshold falling rate and its width, respectively. In
proposed implementation we set them to 0.98 and 40 fol
ing a similar implementation used in the Shazam fingerp
by Dan Ellis1.
∆t

(“beautiful”, Timit-female)

time

2.3. Spectr ogram Masking Around Salient Points

Once the salient points have been selected we apply a m
Spectral masking around salient points

Si

Sj

ì S > S ® b[n] = 1
ï
i
j
í
ï otherwise ® b[n] = 0
î

Spectral Regions
• Include one or several timefrequency values
• Overlaps are allowed
• The number of comparisons
defines the size of the fingerprint
• Designed manually (for now)
Current MASK regions
19 frames = 190ms

5 MEL bands
Fingerprint encoding

• 4-5 bits for the MEL band where the maxima is
located
• 22 Bits obtained from spectral regions comparison
Indexing and retrieval
Reference inverted file index
011…00
001…01
…
100…11

MASK FP

(movie1, 10); (movie6, 4); (movie9, 1) ; (movie7, 34)
(movie5, 35); (movie7, 80)
…
(movie9, 24); (movie3, 5); (movie8, 11)

Exact
matching

DDBB

Content
ID

Time
offset

-Nq
Nq

QUERY MASK
0->011…00
1->100…11
…
13->111…01
14->011…01
15->000…10

0

Hamming > 4 -> 0
Hamming < 4 -> 1

Matching
segment start

Final score

Matching
Segment end
Experimental section
database
NIST-TRECVID 2010-2011 data for video-copy detection
– 400h reference videos
– 1400 audio queries per year (201 unique videos X 7)
– 7 audio transformations
•
•
•
•
•
•
•

original
MP3 compression
MP3 compression + multiband companding
Bandwidth limit (500-3K) + single-band companding
Mixed with speech
Mixed with speech + multiband companding
Mixed with speech + bandwidth limit + monoband companding
Metric & baseline
• normalized detection cost rate (NDCR) in
balanced profile
NDCRBALANCED = PMISS + 200RFA

• Compare results with a similar fingerprint to
the Philips fingerprint
Jaap Haitsma and Antonius Kalker, “A highly robust audio fingerprinting system,” in Proc.
International Symposium on Music Information Retrieval (ISMIR), 2002.
Comparison per transformation
Scores histogram
Conclusions
• A novel binary fingerprint is proposed to
improve on some shortcomings from well
reputed prior art.
• We show that we can extract the FP and use it
for VCD with excellent results

Más contenido relacionado

Similar a MASK: Robust Local Features for Audio Fingerprinting

Vanguard-Equipment-Brochure
Vanguard-Equipment-BrochureVanguard-Equipment-Brochure
Vanguard-Equipment-Brochure
Donna M. Pepe
 
lesson 2 digital data acquisition and data processing
lesson 2 digital data acquisition and data processinglesson 2 digital data acquisition and data processing
lesson 2 digital data acquisition and data processing
Mathew John
 
DEF CON 23: Spread Spectrum Satcom Hacking: Attacking The GlobalStar Simplex ...
DEF CON 23: Spread Spectrum Satcom Hacking: Attacking The GlobalStar Simplex ...DEF CON 23: Spread Spectrum Satcom Hacking: Attacking The GlobalStar Simplex ...
DEF CON 23: Spread Spectrum Satcom Hacking: Attacking The GlobalStar Simplex ...
Synack
 
Deep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech EnhancementDeep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech Enhancement
NAVER Engineering
 

Similar a MASK: Robust Local Features for Audio Fingerprinting (20)

[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...
[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...
[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...
 
Vanguard-Equipment-Brochure
Vanguard-Equipment-BrochureVanguard-Equipment-Brochure
Vanguard-Equipment-Brochure
 
Speaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet VocoderSpeaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet Vocoder
 
Acoustic echo cancellation using nlms adaptive algorithm ranbeer
Acoustic echo cancellation using nlms adaptive algorithm ranbeerAcoustic echo cancellation using nlms adaptive algorithm ranbeer
Acoustic echo cancellation using nlms adaptive algorithm ranbeer
 
Listening at the Cocktail Party with Deep Neural Networks and TensorFlow
Listening at the Cocktail Party with Deep Neural Networks and TensorFlowListening at the Cocktail Party with Deep Neural Networks and TensorFlow
Listening at the Cocktail Party with Deep Neural Networks and TensorFlow
 
The FE-I4 Pixel Readout System-on-Chip for ATLAS Experiment Upgrades
The FE-I4 Pixel Readout System-on-Chip  for ATLAS Experiment UpgradesThe FE-I4 Pixel Readout System-on-Chip  for ATLAS Experiment Upgrades
The FE-I4 Pixel Readout System-on-Chip for ATLAS Experiment Upgrades
 
Track 1 session 3 - st dev con 2016 - smart home and building
Track 1   session 3 - st dev con 2016 - smart home and buildingTrack 1   session 3 - st dev con 2016 - smart home and building
Track 1 session 3 - st dev con 2016 - smart home and building
 
mems poster
mems postermems poster
mems poster
 
lesson 2 digital data acquisition and data processing
lesson 2 digital data acquisition and data processinglesson 2 digital data acquisition and data processing
lesson 2 digital data acquisition and data processing
 
極紫外線散射儀於先進製程檢測應用
極紫外線散射儀於先進製程檢測應用極紫外線散射儀於先進製程檢測應用
極紫外線散射儀於先進製程檢測應用
 
Black Hat '15: Spread Spectrum Satcom Hacking: Attacking The GlobalStar Simpl...
Black Hat '15: Spread Spectrum Satcom Hacking: Attacking The GlobalStar Simpl...Black Hat '15: Spread Spectrum Satcom Hacking: Attacking The GlobalStar Simpl...
Black Hat '15: Spread Spectrum Satcom Hacking: Attacking The GlobalStar Simpl...
 
DEF CON 23: Spread Spectrum Satcom Hacking: Attacking The GlobalStar Simplex ...
DEF CON 23: Spread Spectrum Satcom Hacking: Attacking The GlobalStar Simplex ...DEF CON 23: Spread Spectrum Satcom Hacking: Attacking The GlobalStar Simplex ...
DEF CON 23: Spread Spectrum Satcom Hacking: Attacking The GlobalStar Simplex ...
 
add9.5.ppt
add9.5.pptadd9.5.ppt
add9.5.ppt
 
HVTpaperDI2003v5
HVTpaperDI2003v5HVTpaperDI2003v5
HVTpaperDI2003v5
 
NAMICON - New Advances in Development of NDT Technologies and Equipment - Oct...
NAMICON - New Advances in Development of NDT Technologies and Equipment - Oct...NAMICON - New Advances in Development of NDT Technologies and Equipment - Oct...
NAMICON - New Advances in Development of NDT Technologies and Equipment - Oct...
 
Finalreport
FinalreportFinalreport
Finalreport
 
Project - Sound Model Similarity Search
Project - Sound Model Similarity SearchProject - Sound Model Similarity Search
Project - Sound Model Similarity Search
 
Deep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech EnhancementDeep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech Enhancement
 
Sampling
SamplingSampling
Sampling
 
Design and Simulation Microstrip patch Antenna using CST Microwave Studio
Design and Simulation Microstrip patch Antenna  using CST Microwave StudioDesign and Simulation Microstrip patch Antenna  using CST Microwave Studio
Design and Simulation Microstrip patch Antenna using CST Microwave Studio
 

Más de Xavier Anguera

Más de Xavier Anguera (6)

Kaldi-voice: Your personal speech recognition server using open source code
Kaldi-voice: Your personal speech recognition server using open source codeKaldi-voice: Your personal speech recognition server using open source code
Kaldi-voice: Your personal speech recognition server using open source code
 
Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation
Information Retrieval Dynamic Time Warping - Interspeech 2013 presentationInformation Retrieval Dynamic Time Warping - Interspeech 2013 presentation
Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation
 
Mediaeval 2013 Spoken Web Search results slides
Mediaeval 2013 Spoken Web Search results slidesMediaeval 2013 Spoken Web Search results slides
Mediaeval 2013 Spoken Web Search results slides
 
Daghstuhl Seminar 13451 (Computational Audio Analysis) Inspirational Talk
Daghstuhl Seminar 13451 (Computational Audio Analysis) Inspirational TalkDaghstuhl Seminar 13451 (Computational Audio Analysis) Inspirational Talk
Daghstuhl Seminar 13451 (Computational Audio Analysis) Inspirational Talk
 
Time Machine session @ ICME 2012 - DTW's New Youth
Time Machine session @ ICME 2012 - DTW's New YouthTime Machine session @ ICME 2012 - DTW's New Youth
Time Machine session @ ICME 2012 - DTW's New Youth
 
Multimodal pattern matching algorithms and applications
Multimodal pattern matching algorithms and applicationsMultimodal pattern matching algorithms and applications
Multimodal pattern matching algorithms and applications
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 

MASK: Robust Local Features for Audio Fingerprinting

  • 1. MASK: Robust Local Features for Audio Fingerprinting Xavier Anguera, Antonio Garzón and Tomasz Adamek Telefonica Research
  • 2. Outline • • • • What is audio fingerprinting MASK proposal Experiments Conclusions
  • 4. What makes a good audio fingerprint?
  • 6. Considered prior art • Avery Wang, “An industrial strength audio search algorithm,” in Proc. ISMIR, 2003. • Jaap Haitsma and Antonius Kalker, “A highly robust audio fingerprinting system,” in Proc. ISMIR, 2002. • Shumeet Baluja and Michele Covell, “Waveprint: Efficient wavelet-based audio fingerprinting”, Proc. Pattern Recognition 41 (2008)
  • 7.
  • 8. 10ms, 100ms Hamming window FFT 1024, bandwidth limited to 300-3KHz 18 or 34 MEL-spectrum bands
  • 9. Selection of salient spectral points Mel spectrum 1. Detect all maximas 2. Trim to the desired number (∆ t) 2 T hr [n] = α E [n − 1] exp − 2 ∗ σ2 where ∆ t is the distance (in frames) between the prev peak and the considered peak, E [n − 1] is the energy o α=0.98 σ=40 previously selected peak and α, σ are two parameters use set the threshold falling rate and its width, respectively. In proposed implementation we set them to 0.98 and 40 fol ing a similar implementation used in the Shazam fingerp by Dan Ellis1. ∆t (“beautiful”, Timit-female) time 2.3. Spectr ogram Masking Around Salient Points Once the salient points have been selected we apply a m
  • 10. Spectral masking around salient points Si Sj ì S > S ® b[n] = 1 ï i j í ï otherwise ® b[n] = 0 î Spectral Regions • Include one or several timefrequency values • Overlaps are allowed • The number of comparisons defines the size of the fingerprint • Designed manually (for now)
  • 11. Current MASK regions 19 frames = 190ms 5 MEL bands
  • 12.
  • 13. Fingerprint encoding • 4-5 bits for the MEL band where the maxima is located • 22 Bits obtained from spectral regions comparison
  • 14. Indexing and retrieval Reference inverted file index 011…00 001…01 … 100…11 MASK FP (movie1, 10); (movie6, 4); (movie9, 1) ; (movie7, 34) (movie5, 35); (movie7, 80) … (movie9, 24); (movie3, 5); (movie8, 11) Exact matching DDBB Content ID Time offset -Nq Nq QUERY MASK 0->011…00 1->100…11 … 13->111…01 14->011…01 15->000…10 0 Hamming > 4 -> 0 Hamming < 4 -> 1 Matching segment start Final score Matching Segment end
  • 16. database NIST-TRECVID 2010-2011 data for video-copy detection – 400h reference videos – 1400 audio queries per year (201 unique videos X 7) – 7 audio transformations • • • • • • • original MP3 compression MP3 compression + multiband companding Bandwidth limit (500-3K) + single-band companding Mixed with speech Mixed with speech + multiband companding Mixed with speech + bandwidth limit + monoband companding
  • 17. Metric & baseline • normalized detection cost rate (NDCR) in balanced profile NDCRBALANCED = PMISS + 200RFA • Compare results with a similar fingerprint to the Philips fingerprint Jaap Haitsma and Antonius Kalker, “A highly robust audio fingerprinting system,” in Proc. International Symposium on Music Information Retrieval (ISMIR), 2002.
  • 18.
  • 19.
  • 22. Conclusions • A novel binary fingerprint is proposed to improve on some shortcomings from well reputed prior art. • We show that we can extract the FP and use it for VCD with excellent results

Notas del editor

  1. Discriminability: separate between audios that are not the same, and join together those audios that are the sameRobustness: Even if we attack it with transformations we still can have similar discriminabilityCompactness: It does not take much space to storeEfficiency: to compute and to compare with
  2. CONS: encodes tuples of points, which needs to encode more pairs in order to have the same error probability than when encoding only oneEncodes the exact frequency location, which then to compare you need to transform from binary to integer and substiture
  3. CONS:Noisy areas are also encoded, leading to big errorsOnly adjacent energies are compared, not being very flexibleSingle energy points are encoded, leading to big problems when these are very similar and there is noise
  4. CONS:High computational cost to extract the fingerprint
  5. Individual encoding: we encode the surroundings of each spectral maxima, which allows us to encode less keypoints and obtain quite a low probability of errors, as errors in the detection of one maxima do not transform onto the others.Truly binary fingerprint: we can compare the fingerprints totally in the binary diomain or with a comparison table (for the bands)Variable complexity: we can easily define the number of bits in the FPVariable rate: depending on the application we might be interested in more or less FP per secondEasy scalable implementation: inverted file indices
  6. MASK extraction process
  7. Trimming inspired on Dan Ellis’ implementation of the Shazam Algorithm
  8. The method is universal for any kind of audioSee salient points with a maxima and a decay, or clear maximas
  9. Or maximas stable over time in the same frequency
  10. Or varying maximas over time
  11. Or varying maximas overtimeTHIS INDICATES HOW IMPORTANT IT IS TO DETECT THE MAXIMAS,BUT ALSO TO ENCODE HOW THE ENERGY FLOWS AROUND THESE MAXIMA
  12. Spectral region Si: groups of spectral time-frequency energy values grouped together and whose energy is averagedSpectral regions are compared and information is encoded by comparing their values
  13. If the maxima is on the N-2 or 2 bands we replicate the last one to make the mask fit
  14. Results when returning the 20-best and only the 1-best:interesting depending on the application (Trecvid versus search application)20-best gets a big hit due to the importance of FA in Trecvid
  15. Min NDCR sets the optimum threshold for each transformation -&gt; it is not a realistic caseWe compute the actual NDCR setting the threshold to the mean of the optimum thresholds for all transformsThe threshold STD is shown -&gt; much smaller for MASK
  16. Results in more detailThe only bad one is transform 6 (mix with speech and multiband compression)
  17. Histogram for the 1-best scoresGround truth says 70% of matchesVery flat Philips scores vs. bimodal MASK scoresThreshold used of 0.27 is a good one in this figure