MASK: Robust Local Features for Audio Fingerprinting

•Descargar como PPTX, PDF•

2 recomendaciones•1,461 vistas

Xavier Anguera

Best paper award at ICME 2012

Tecnología

MASK: Robust Local Features for
Audio Fingerprinting
Xavier Anguera, Antonio Garzón and
Tomasz Adamek
Telefonica Research

Outline
•
•
•
•

What is audio fingerprinting
MASK proposal
Experiments
Conclusions

Discriminability

Robustness

MASK
FIngerprint

Compactness

Efficiency

MASK == Masked Audio Spectral Keypoints

Considered prior art
• Avery Wang, “An industrial strength audio search
algorithm,” in Proc. ISMIR, 2003.
• Jaap Haitsma and Antonius Kalker, “A highly robust
audio fingerprinting system,” in Proc. ISMIR, 2002.
• Shumeet Baluja and Michele Covell, “Waveprint:
Efficient wavelet-based audio fingerprinting”, Proc.
Pattern Recognition 41 (2008)

10ms, 100ms Hamming window

FFT 1024, bandwidth
limited to 300-3KHz

18 or 34 MEL-spectrum bands

Selection of salient spectral points
Mel
spectrum

1. Detect all maximas
2. Trim to the desired
number

(∆ t) 2
T hr [n] = α E [n − 1] exp −
2 ∗ σ2
where ∆ t is the distance (in frames) between the prev
peak and the considered peak, E [n − 1] is the energy o
α=0.98 σ=40
previously selected peak and α, σ are two parameters use
set the threshold falling rate and its width, respectively. In
proposed implementation we set them to 0.98 and 40 fol
ing a similar implementation used in the Shazam ﬁngerp
by Dan Ellis1.
∆t

(“beautiful”, Timit-female)

time

2.3. Spectr ogram Masking Around Salient Points

Once the salient points have been selected we apply a m

Spectral masking around salient points

Si

Sj

ì S > S ® b[n] = 1
ï
i
j
í
ï otherwise ® b[n] = 0
î

Spectral Regions
• Include one or several timefrequency values
• Overlaps are allowed
• The number of comparisons
defines the size of the fingerprint
• Designed manually (for now)

Current MASK regions
19 frames = 190ms

5 MEL bands

Fingerprint encoding

• 4-5 bits for the MEL band where the maxima is
located
• 22 Bits obtained from spectral regions comparison

Indexing and retrieval
Reference inverted file index
011…00
001…01
…
100…11

MASK FP

(movie1, 10); (movie6, 4); (movie9, 1) ; (movie7, 34)
(movie5, 35); (movie7, 80)
…
(movie9, 24); (movie3, 5); (movie8, 11)

Exact
matching

DDBB

Content
ID

Time
offset

-Nq
Nq

QUERY MASK
0->011…00
1->100…11
…
13->111…01
14->011…01
15->000…10

0

Hamming > 4 -> 0
Hamming < 4 -> 1

Matching
segment start

Final score

Matching
Segment end

database
NIST-TRECVID 2010-2011 data for video-copy detection
– 400h reference videos
– 1400 audio queries per year (201 unique videos X 7)
– 7 audio transformations
•
•
•
•
•
•
•

original
MP3 compression
MP3 compression + multiband companding
Bandwidth limit (500-3K) + single-band companding
Mixed with speech
Mixed with speech + multiband companding
Mixed with speech + bandwidth limit + monoband companding

Conclusions
• A novel binary fingerprint is proposed to
improve on some shortcomings from well
reputed prior art.
• We show that we can extract the FP and use it
for VCD with excellent results

Más contenido relacionado

Similar a MASK: Robust Local Features for Audio Fingerprinting

[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...

NUGU developers

Vanguard-Equipment-Brochure

Donna M. Pepe

Speaker Dependent WaveNet Vocoder

Akira Tamamori

Acoustic echo cancellation using nlms adaptive algorithm ranbeer

Ranbeer Tyagi

Many people are amazing at focusing their attention on one person or one voice in a multi speaker scenario, and ‘muting’ other people and background noise. This is known as the cocktail party effect. For other people it is a challenge to separate audio sources. In this presentation I will focus on solving this problem with deep neural networks and TensorFlow. I will share technical and implementation details with the audience, and talk about gains, pains points, and merits of the solutions as it relates to: * Preparing, transforming and augmenting relevant data for speech separation and noise removal. * Creating, training and optimizing various neural network architectures. * Hardware options for running networks on tiny devices. * And the end goal : Real-time speech separation on a small embedded platform. I will present a vision of future smart air pods, smart headsets and smart hearing aids that will be running deep neural networks . Participants will get an insight into some of the latest advances and limitations in speech separation with deep neural networks on embedded devices in regards to: * Data transformation and augmentation. * Deep neural network models for speech separation and for removing noise. * Training smaller and faster neural networks. * Creating a real-time speech separation pipeline.

Listening at the Cocktail Party with Deep Neural Networks and TensorFlow

Databricks

Novel pixel readout system-on-chip (SoC) has been designed to meet the ever increasing demands of the present and future generation of LHC pixel detectors. The FE-I4 architecture has higher luminosity and rate capability as well as a smaller single pixel area compared to its predecessors and is currently the most complex chip designed for particle physics applications. The IC has been designed in 130nm CMOS technology. The state of the art of the FE-I4 will be presented, including the architecture overview, simulation results, preliminary measurements and a global design flow.

The FE-I4 Pixel Readout System-on-Chip for ATLAS Experiment Upgrades

themperek

Track 1 session 3 - st dev con 2016 - smart home and building

ST_World

mems poster

Albert Patterson

lesson 2 digital data acquisition and data processing

Mathew John

極紫外線散射儀於先進製程檢測應用

CHENHuiMei

Black Hat 2015 Recently, there have been several highly publicized talks about satellite hacking. However, most only touch on the theoretical rather than demonstrate actual vulnerabilities and real world attack scenarios. This talk will demystify some of the technologies behind satellite communications and do what no one has done before - take the audience step-by-step from reverse engineering to exploitation of the GlobalStar simplex satcom protocol and demonstrate a full blown signals intelligence collection and spoofing capability. I will also demonstrate how an attacker might simulate critical conditions in satellite connected SCADA systems. In recent years, Globalstar has gained popularity with the introduction of its consumer focused SPOT asset-tracking solutions. During the session, I'll deconstruct the transmitters used in these (and commercial) solutions and reveal design and implementation flaws that result in the ability to intercept, spoof, falsify, and intelligently jam communications. Due to design tradeoffs these vulnerabilities are realistically unpatchable and put millions of devices, critical infrastructure, emergency services, and high value assets at risk.

Black Hat '15: Spread Spectrum Satcom Hacking: Attacking The GlobalStar Simpl...

Synack

DEF CON 23 Recently there have been several highly publicized talks about satellite hacking. However, most only touch on the theoretical rather than demonstrate actual vulnerabilities and real world attack scenarios. This talk will demystify some of the technologies behind satellite communications and do what no one has done before - take the audience step-by-step from reverse engineering to exploitation of the GlobalStar simplex satcom protocol and demonstrate a full blown signals intelligence collection and spoofing capability. I will also demonstrate how an attacker might simulate critical conditions in satellite connected SCADA systems. In recent years, Globalstar has gained popularity with the introduction of its consumer focused SPOT asset-tracking solutions. During the session, I’ll deconstruct the transmitters used in these (and commercial) solutions and reveal design and implementation flaws that result in the ability to intercept, spoof, falsify, and intelligently jam communications. Due to design tradeoffs these vulnerabilities are realistically unpatchable and put millions of devices, critical infrastructure, emergency services, and high value assets at risk. Colby Moore is Synack's Manager of Special Activities. He works on the oddball and difficult problems that no one else knows how to tackle and strives to embrace the attacker mindset during all engagements. He is a former employee of VRL and has identified countless 0day vulnerabilities in embedded systems and major applications. In his spare time you will find him focusing on that sweet spot where hardware and software meet, usually resulting in very interesting consequences.

DEF CON 23: Spread Spectrum Satcom Hacking: Attacking The GlobalStar Simplex ...

Synack

add9.5.ppt

AshenafiGirma5

HVTpaperDI2003v5

Mark Stockfisch

NAMICON - New Advances in Development of NDT Technologies and Equipment - Oct...

argebit

Finalreport

columbose

This project involved the design and implementation of an Automation tool that would search a database of sound models for a given audio input query. The sound model to which the algorithm points to is the one that can best model / can produce sounds very similar to the input sound query. It is a growing application in the Games, Music production industry where there is a growing need to have an easy access to sounds without having to develop / code them every single time there is a need. Searching is always easier than developing from scratch! This project was performed at Interactive Digital Media Institute, Singapore

Project - Sound Model Similarity Search

Sudarshan Bala

발표자: 김준태 (KAIST 박사과정) 발표일: 2018.10 Voice activity detection (VAD) and speech enhancement (SE) are important front-end technologies for noise robust speech recognition system. From incoming noisy signal, VAD detects the speech signal only and SE removes the noise signal while conserving the speech signal. For VAD and SE, this presentation will cover the traditional methods, deep learning based methods, and our papers as follows: 1. J. Kim and M. Hahn, "Voice Activity Detection Using an Adaptive Context Attention Model," in IEEE Signal Processing Letters, vol. 25, no. 8, pp. 1181-1185, Aug. 2018. 2. J. Kim and M. Hahn, "Speech Enhancement Using a Two Step Network," submitted to IEEE Signal Processing Letters, 2018. Also, this presentation will briefly introduce some experimental results in real-world environment (far-field, noisy environment), conducted on the embedded board. For VAD, Traditional VAD methods. Deep learning based VAD methods. Paper presentation: J. Kim and M. Hahn, "Voice Activity Detection Using an Adaptive Context Attention Model," in IEEE Signal Processing Letters, vol. 25, no. 8, pp. 1181-1185, Aug. 2018. End point detection based on VAD. Experimental results of DNN-EPD on embedded board in real-world environment. For SE, Traditional SE methods. Deep learning based SE methods. Paper presentation: J. Kim and M. Hahn, "Speech Enhancement Using a Two Step Network," submitted to IEEE Signal Processing Letters, 2018. Experimental results in real-world environment.

Deep Learning Based Voice Activity Detection and Speech Enhancement

NAVER Engineering

Sampling

Muhammad Uzair Rasheed

Design and Simulation Microstrip patch Antenna using CST Microwave Studio

Aymen Al-obaidi

Similar a MASK: Robust Local Features for Audio Fingerprinting (20)

[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...

Vanguard-Equipment-Brochure

Speaker Dependent WaveNet Vocoder

Acoustic echo cancellation using nlms adaptive algorithm ranbeer

Listening at the Cocktail Party with Deep Neural Networks and TensorFlow

The FE-I4 Pixel Readout System-on-Chip for ATLAS Experiment Upgrades

Track 1 session 3 - st dev con 2016 - smart home and building

mems poster

lesson 2 digital data acquisition and data processing

極紫外線散射儀於先進製程檢測應用

Black Hat '15: Spread Spectrum Satcom Hacking: Attacking The GlobalStar Simpl...

DEF CON 23: Spread Spectrum Satcom Hacking: Attacking The GlobalStar Simplex ...

add9.5.ppt

HVTpaperDI2003v5

NAMICON - New Advances in Development of NDT Technologies and Equipment - Oct...

Finalreport

Project - Sound Model Similarity Search

Deep Learning Based Voice Activity Detection and Speech Enhancement

Sampling

Design and Simulation Microstrip patch Antenna using CST Microwave Studio

Más de Xavier Anguera

Kaldi-voice: Your personal speech recognition server using open source code

Xavier Anguera

Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Xavier Anguera

Mediaeval 2013 Spoken Web Search results slides

Xavier Anguera

Daghstuhl Seminar 13451 (Computational Audio Analysis) Inspirational Talk

Xavier Anguera

Time Machine session @ ICME 2012 - DTW's New Youth

Xavier Anguera

In this presentation I focus on 3 projects I have been working in the last year. The first one is a novel pattern matching algorithm, based on the well known Dynamic Time Warping. The presented algorithm can be used to find real-valued subsequences within a longer sequence, without prior knowledge of their start-end points. I have applied the algorithm for the task of acoustic matching, for which I will show some preliminary results. Then I will continue to explain a second DTW-based algorithm, this one being able do an online of two musical pieces. One of the music pieces can be input life or be retrieved from an audio file, while the second one is extracted from an online music video. The online alignment allows for the music video to be played in total synchrony with the corresponding ambient/recorded audio. Finally, I will talk about video copy detection, which is the task of finding video duplicate segments within a big database. I will explain our multimodal approach, based on audio-visual change-based features.

Multimodal pattern matching algorithms and applications

Xavier Anguera

Más de Xavier Anguera (6)

Kaldi-voice: Your personal speech recognition server using open source code

Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Mediaeval 2013 Spoken Web Search results slides

Daghstuhl Seminar 13451 (Computational Audio Analysis) Inspirational Talk

Time Machine session @ ICME 2012 - DTW's New Youth

Multimodal pattern matching algorithms and applications

Último

Partners Life - Insurer Innovation Award 2024

The Digital Insurer

Tech Trends Report 2024 Future Today Institute.pdf

hans926745

Histor y of HAM Radio presentation slide

vu2urc

AWS Community Day CPH - Three problems of Terraform

Andrey Devyatkin

This presentations targets students or working professionals. You may know Google for search, YouTube, Android, Chrome, and Gmail, but did you know Google has many developer tools, platforms & APIs? This comprehensive yet still high-level overview outlines the most impactful tools for where to run your code, store & analyze your data. It will also inspire you as to what's possible. This talk is 50 minutes in length.

Powerful Google developer tools for immediate impact! (2023-24 C)

wesley chun

2024: Domino Containers - The Next Step. News from the Domino Container commu...

Martijn de Jong

[2024]Digital Global Overview Report 2024 Meltwater.pdf

hans926745

Strategies for Landing an Oracle DBA Job as a Fresher

Remote DBA Services

As privacy and data protection regulations evolve rapidly, organizations operating in multiple jurisdictions face mounting challenges to ensure compliance and safeguard customer data. With state-specific privacy laws coming up in multiple states this year, it is essential to understand what their unique data protection regulations will require clearly. How will data privacy evolve in the US in 2024? How to stay compliant? Our panellists will guide you through the intricacies of these states' specific data privacy laws, clarifying complex legal frameworks and compliance requirements. This webinar will review: - The essential aspects of each state's privacy landscape and the latest updates - Common compliance challenges faced by organizations operating in multiple states and best practices to achieve regulatory adherence - Valuable insights into potential changes to existing regulations and prepare your organization for the evolving landscape

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

TrustArc

Created by Mozilla Research in 2012 and now part of Linux Foundation Europe, the Servo project is an experimental rendering engine written in Rust. It combines memory safety and concurrency to create an independent, modular, and embeddable rendering engine that adheres to web standards. Stewardship of Servo moved from Mozilla Research to the Linux Foundation in 2020, where its mission remains unchanged. After some slow years, in 2023 there has been renewed activity on the project, with a roadmap now focused on improving the engine’s CSS 2 conformance, exploring Android support, and making Servo a practical embeddable rendering engine. In this presentation, Rakhi Sharma reviews the status of the project, our recent developments in 2023, our collaboration with Tauri to make Servo an easy-to-use embeddable rendering engine, and our plans for the future to make Servo an alternative web rendering engine for the embedded devices industry. (c) Embedded Open Source Summit 2024 April 16-18, 2024 Seattle, Washington (US) https://events.linuxfoundation.org/embedded-open-source-summit/ https://ossna2024.sched.com/event/1aBNF/a-year-of-servo-reboot-where-are-we-now-rakhi-sharma-igalia

A Year of the Servo Reboot: Where Are We Now?

Igalia

Three things you will take away from the session: • How to run an effective tenant-to-tenant migration • Best practices for before, during, and after migration • Tips for using migration as a springboard to prepare for Copilot in Microsoft 365 Main ideas: Migration Overview: The presentation covers the current reality of cross-tenant migrations, the triggers, phases, best practices, and benefits of a successful tenant migration Considerations: When considering a migration, it is important to consider the migration scope, performance, customization, flexibility, user-friendly interface, automation, monitoring, support, training, scalability, data integrity, data security, cost, and licensing structure Next Wave: The next wave of change includes the launch of Copilot, which requires businesses to be prepared for upcoming changes related to Copilot and the cloud, and to consolidate data and tighten governance ShareGate: ShareGate can help with pre-migration analysis, configurable migration tool, and automated, end-user driven collaborative governance

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff

sammart93

🐬 The future of MySQL is Postgres 🐘

RTylerCroy

Join our latest Connector Corner webinar to discover how UiPath Integration Service revolutionizes API-centric automation in a 'Quote to Cash' process—and how that automation empowers businesses to accelerate revenue generation. A comprehensive demo will explore connecting systems, GenAI, and people, through powerful pre-built connectors designed to speed process cycle times. Speakers: James Dickson, Senior Software Engineer Charlie Greenberg, Host, Product Marketing Manager

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...

DianaGray10

Developing An App To Navigate The Roads of Brazil

V3cube

Data Cloud, More than a CDP by Matt Robison

Anna Loughnan Colquhoun

Finology Group – Insurtech Innovation Award 2024

The Digital Insurer

Imagine a world where information flows as swiftly as thought itself, making decision-making as fluid as the data driving it. Every moment is critical, and the right tools can significantly boost your organization’s performance. The power of real-time data automation through FME can turn this vision into reality. Aimed at professionals eager to leverage real-time data for enhanced decision-making and efficiency, this webinar will cover the essentials of real-time data and its significance. We’ll explore: FME’s role in real-time event processing, from data intake and analysis to transformation and reporting An overview of leveraging streams vs. automations FME’s impact across various industries highlighted by real-life case studies Live demonstrations on setting up FME workflows for real-time data Practical advice on getting started, best practices, and tips for effective implementation Join us to enhance your skills in real-time data automation with FME, and take your operational capabilities to the next level.

From Event to Action: Accelerate Your Decision Making with Real-Time Automation

Safe Software

GenAI Risks & Security Meetup 01052024.pdf

lior mazor

Discord is a free app offering voice, video, and text chat functionalities, primarily catering to the gaming community. It serves as a hub for users to create and join servers tailored to their interests. Discord’s ecosystem comprises servers, each functioning as a distinct online community with its own channels dedicated to specific topics or activities. Users can engage in text-based discussions, voice calls, or video chats within these channels. Understanding Discord Servers Discord servers are virtual spaces where users congregate to interact, share content, and build communities. Servers may revolve around gaming, hobbies, interests, or fandoms, providing a platform for like-minded individuals to connect. Communication Features Discord offers a range of communication tools, including text channels for messaging, voice channels for real-time audio conversations, and video channels for face-to-face interactions. These features facilitate seamless communication and collaboration. What Does NSFW Mean? The acronym NSFW stands for “Not Safe For Work,” indicating content that may be inappropriate for professional or public settings. NSFW Content NSFW content encompasses material that is sexually explicit, violent, or otherwise graphic in nature. It often includes nudity, profanity, or depictions of sensitive topics.

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf

UK Journal

Effective data discovery is crucial for maintaining compliance and mitigating risks in today's rapidly evolving privacy landscape. However, traditional manual approaches often struggle to keep pace with the growing volume and complexity of data. Join us for an insightful webinar where industry leaders from TrustArc and Privya will share their expertise on leveraging AI-powered solutions to revolutionize data discovery. You'll learn how to: - Effortlessly maintain a comprehensive, up-to-date data inventory - Harness code scanning insights to gain complete visibility into data flows leveraging the advantages of code scanning over DB scanning - Simplify compliance by leveraging Privya's integration with TrustArc - Implement proven strategies to mitigate third-party risks Our panel of experts will discuss real-world case studies and share practical strategies for overcoming common data discovery challenges. They'll also explore the latest trends and innovations in AI-driven data management, and how these technologies can help organizations stay ahead of the curve in an ever-changing privacy landscape.

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

TrustArc

MASK: Robust Local Features for Audio Fingerprinting

1. MASK: Robust Local Features for Audio Fingerprinting Xavier Anguera, Antonio Garzón and Tomasz Adamek Telefonica Research

2. Outline • • • • What is audio fingerprinting MASK proposal Experiments Conclusions

3. ? watermarking fingerprinting

4. What makes a good audio fingerprint?

5. Discriminability Robustness MASK FIngerprint Compactness Efficiency MASK == Masked Audio Spectral Keypoints

6. Considered prior art • Avery Wang, “An industrial strength audio search algorithm,” in Proc. ISMIR, 2003. • Jaap Haitsma and Antonius Kalker, “A highly robust audio fingerprinting system,” in Proc. ISMIR, 2002. • Shumeet Baluja and Michele Covell, “Waveprint: Efficient wavelet-based audio fingerprinting”, Proc. Pattern Recognition 41 (2008)

8. 10ms, 100ms Hamming window FFT 1024, bandwidth limited to 300-3KHz 18 or 34 MEL-spectrum bands

9. Selection of salient spectral points Mel spectrum 1. Detect all maximas 2. Trim to the desired number (∆ t) 2 T hr [n] = α E [n − 1] exp − 2 ∗ σ2 where ∆ t is the distance (in frames) between the prev peak and the considered peak, E [n − 1] is the energy o α=0.98 σ=40 previously selected peak and α, σ are two parameters use set the threshold falling rate and its width, respectively. In proposed implementation we set them to 0.98 and 40 fol ing a similar implementation used in the Shazam ﬁngerp by Dan Ellis1. ∆t (“beautiful”, Timit-female) time 2.3. Spectr ogram Masking Around Salient Points Once the salient points have been selected we apply a m

10. Spectral masking around salient points Si Sj ì S > S ® b[n] = 1 ï i j í ï otherwise ® b[n] = 0 î Spectral Regions • Include one or several timefrequency values • Overlaps are allowed • The number of comparisons defines the size of the fingerprint • Designed manually (for now)

11. Current MASK regions 19 frames = 190ms 5 MEL bands

12.

13. Fingerprint encoding • 4-5 bits for the MEL band where the maxima is located • 22 Bits obtained from spectral regions comparison

14. Indexing and retrieval Reference inverted file index 011…00 001…01 … 100…11 MASK FP (movie1, 10); (movie6, 4); (movie9, 1) ; (movie7, 34) (movie5, 35); (movie7, 80) … (movie9, 24); (movie3, 5); (movie8, 11) Exact matching DDBB Content ID Time offset -Nq Nq QUERY MASK 0->011…00 1->100…11 … 13->111…01 14->011…01 15->000…10 0 Hamming > 4 -> 0 Hamming < 4 -> 1 Matching segment start Final score Matching Segment end

15. Experimental section

16. database NIST-TRECVID 2010-2011 data for video-copy detection – 400h reference videos – 1400 audio queries per year (201 unique videos X 7) – 7 audio transformations • • • • • • • original MP3 compression MP3 compression + multiband companding Bandwidth limit (500-3K) + single-band companding Mixed with speech Mixed with speech + multiband companding Mixed with speech + bandwidth limit + monoband companding

17. Metric & baseline • normalized detection cost rate (NDCR) in balanced profile NDCRBALANCED = PMISS + 200RFA • Compare results with a similar fingerprint to the Philips fingerprint Jaap Haitsma and Antonius Kalker, “A highly robust audio fingerprinting system,” in Proc. International Symposium on Music Information Retrieval (ISMIR), 2002.

18.

19.

20. Comparison per transformation

21. Scores histogram

22. Conclusions • A novel binary fingerprint is proposed to improve on some shortcomings from well reputed prior art. • We show that we can extract the FP and use it for VCD with excellent results

Notas del editor

Discriminability: separate between audios that are not the same, and join together those audios that are the sameRobustness: Even if we attack it with transformations we still can have similar discriminabilityCompactness: It does not take much space to storeEfficiency: to compute and to compare with
CONS: encodes tuples of points, which needs to encode more pairs in order to have the same error probability than when encoding only oneEncodes the exact frequency location, which then to compare you need to transform from binary to integer and substiture
CONS:Noisy areas are also encoded, leading to big errorsOnly adjacent energies are compared, not being very flexibleSingle energy points are encoded, leading to big problems when these are very similar and there is noise
CONS:High computational cost to extract the fingerprint
Individual encoding: we encode the surroundings of each spectral maxima, which allows us to encode less keypoints and obtain quite a low probability of errors, as errors in the detection of one maxima do not transform onto the others.Truly binary fingerprint: we can compare the fingerprints totally in the binary diomain or with a comparison table (for the bands)Variable complexity: we can easily define the number of bits in the FPVariable rate: depending on the application we might be interested in more or less FP per secondEasy scalable implementation: inverted file indices
MASK extraction process
Trimming inspired on Dan Ellis’ implementation of the Shazam Algorithm
The method is universal for any kind of audioSee salient points with a maxima and a decay, or clear maximas
Or maximas stable over time in the same frequency
Or varying maximas over time
Or varying maximas overtimeTHIS INDICATES HOW IMPORTANT IT IS TO DETECT THE MAXIMAS,BUT ALSO TO ENCODE HOW THE ENERGY FLOWS AROUND THESE MAXIMA
Spectral region Si: groups of spectral time-frequency energy values grouped together and whose energy is averagedSpectral regions are compared and information is encoded by comparing their values
If the maxima is on the N-2 or 2 bands we replicate the last one to make the mask fit
Results when returning the 20-best and only the 1-best:interesting depending on the application (Trecvid versus search application)20-best gets a big hit due to the importance of FA in Trecvid
Min NDCR sets the optimum threshold for each transformation -> it is not a realistic caseWe compute the actual NDCR setting the threshold to the mean of the optimum thresholds for all transformsThe threshold STD is shown -> much smaller for MASK
Results in more detailThe only bad one is transform 6 (mix with speech and multiband compression)
Histogram for the 1-best scoresGround truth says 70% of matchesVery flat Philips scores vs. bimodal MASK scoresThreshold used of 0.27 is a good one in this figure

MASK: Robust Local Features for Audio Fingerprinting

Recomendados

Recomendados

Más contenido relacionado

Similar a MASK: Robust Local Features for Audio Fingerprinting

Similar a MASK: Robust Local Features for Audio Fingerprinting (20)

Más de Xavier Anguera

Más de Xavier Anguera (6)

Último

Último (20)

MASK: Robust Local Features for Audio Fingerprinting

Notas del editor