L3-Net Deep Audio
Embeddings to Improve
COVID-19 Detection
from Smartphone Data
Mattia G. Campana (IIT-CNR)
Andrea Rovati (UniMi)
Franca Delmastro (IIT-CNR)
Elena Pagani (UniMi)
IEEE SMARTCOMP 2022, June 20-24
Aalto University, Espoo, Finland
AI response to the COVID-19 pandemic
Help the healthcare system
• Machine Learning (ML) classifiers for blood test results
• Deep Learning (DL) models to analyze chest X-ray and lung Computed Tomography (CT) images
Track behaviours in public places
• Monitoring social distancing
• Face mask detection systems
m-health systems based on respiratory sounds
Diagnosis
• Pervasive & low-cost solution for fast screening
• Support the healthcare system in identifying new cases (prevention of new outbreaks)
• Track the disease evolution
COVID-19 Detection from respiratory sounds
Handcrafted acoustic features (HC)
Pipeline: feature extraction, then a shallow classifier outputs COVID-19 positive/negative.
Time domain:
• RMS Energy (how loud the signal is)
• Zero-crossing rate (how fast the signal changes)
Frequency domain:
• Spectral centroid
• Period (frequency with the highest amplitude)
Time-frequency representations:
• Spectrogram
• Mel-Frequency Cepstral Coefficients (MFCC)
Main drawbacks:
• Difficult to find the best set of features
• Typically outperformed by Deep Learning models
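The two time-domain descriptors above are cheap to compute. A minimal sketch (numpy only; the 440 Hz test tone and frame length are illustrative, not from the paper):

```python
import numpy as np

def rms_energy(frame: np.ndarray) -> float:
    """Root-mean-square energy: how loud the frame is."""
    return float(np.sqrt(np.mean(frame ** 2)))

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of adjacent sample pairs whose sign differs."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))

# Example: 1 s of a pure 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

print(rms_energy(tone))          # ~0.707 (RMS of a unit-amplitude sine)
print(zero_crossing_rate(tone))  # ~2*440/16000 = 0.055
```

A pure tone crosses zero twice per cycle, so the zero-crossing rate directly tracks how fast the signal changes, as the slide notes.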
COVID-19 Detection from respiratory sounds
DL-based approach
Representative work: E. A. Mohammed et al., “An ensemble learning approach to digital corona virus preliminary screening from cough sounds”, Scientific Reports, 2021.
Pipeline: a graphical representation of the audio (i.e., a spectrogram-like image) is fed to a Convolutional Neural Network (CNN); an ensemble of CNNs with different audio representations outputs COVID-19 positive/negative.
Main drawback: requires large-scale datasets, especially for complex models.
COVID-19 Detection from respiratory sounds
“Hybrid” approach
Representative work: Brown, Chloë, et al. "Exploring automatic diagnosis of COVID-19 from crowdsourced respiratory sound data." In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
Pipeline: HC features and deep audio embeddings from a pre-trained DL model are concatenated and fed to a shallow classifier, which outputs COVID-19 positive/negative.
477 HC features + VGGish (trained with AudioSet, ~2 million samples).
Improving the Hybrid approach
Investigation of an alternative embedding model: L3-Net
Same pipeline: HC features + deep audio embeddings from a pre-trained DL model, fed to a shallow classifier for COVID-19 positive/negative.
L3-Net: Look, Listen and Learn
Arandjelovic, Relja, and Andrew Zisserman. "Look, listen and learn." Proceedings of the IEEE International Conference on Computer Vision, 2017.
[Architecture diagram: a video sub-network encodes a video frame image into video embeddings, and an audio sub-network encodes a Mel-spectrogram (1 s window) into audio embeddings; fully-connected fusion layers predict whether the image and the audio come from the same video.]
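The audio branch consumes a spectrogram of a 1 s window. As a rough sketch of the framing-plus-FFT step (numpy only; the mel filterbank is omitted, and the FFT/hop sizes are illustrative rather than L3-Net's actual parameters):

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=512, hop=256):
    """Frame the signal, apply a Hann window, take |FFT| per frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i*hop : i*hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft//2 + 1)

sr = 16000
x = np.sin(2 * np.pi * 1000 * np.arange(sr) / sr)  # 1 s of a 1 kHz tone
S = magnitude_spectrogram(x)
print(S.shape)                       # (61, 257)
peak_bin = S.mean(axis=0).argmax()
print(peak_bin * sr / 512)           # 1000.0 Hz
```

A mel-scale filterbank applied to each column of `S` would yield the Mel-spectrogram the slide refers to.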
L3-Net for COVID-19 Detection
[Pipeline diagram: the cough/breath audio sample is split into audio frames; each frame's Mel-spectrogram is fed to the audio sub-network to obtain a frame embedding; frame embeddings are combined (mean + std) into audio-file embeddings; these, together with HC features, go through dimensionality reduction (PCA) and a shallow classifier that outputs COVID-19 positive/negative.]
OpenL3 model trained with AudioSet.
Cramer, Jason, et al. "Look, listen, and learn more: Design choices for deep audio embeddings." ICASSP 2019, IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2019.
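The mean + std combination of frame embeddings reduces to a few lines of numpy. A minimal sketch (the frame count and 512-d embedding size are illustrative assumptions):

```python
import numpy as np

def summarize_embeddings(frame_embs: np.ndarray) -> np.ndarray:
    """Collapse a (n_frames, dim) array of per-frame embeddings into one
    fixed-size audio-file embedding by concatenating the per-dimension
    mean and standard deviation."""
    return np.concatenate([frame_embs.mean(axis=0), frame_embs.std(axis=0)])

rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 512))   # e.g. ten frames, 512-d each
file_emb = summarize_embeddings(frames)
print(file_emb.shape)  # (1024,)
```

The fixed output size is what makes a variable-length recording usable as input to PCA and a shallow classifier.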
Experimental Evaluation: Goals
1) Improve the classification performance with respect to:
- Brown et al. (2020): same approach but different embedding model (i.e., VGGish vs L3-Net)
- Mohammed et al. (2021): ensemble model (CNN trained from scratch vs pre-trained model)
2) Can we perform the classification task directly on the mobile device?
- Memory footprint evaluation
Datasets
Cambridge: crowdsourced breath and cough audio samples (data agreement); www.covid-19-sounds.org
COSWARA: crowdsourced cough samples; coswara.iisc.ac.in, https://github.com/iiscleap/Coswara-Data
Virufy: cough samples collected in hospital, with labels based on COVID-19 PCR test results; https://github.com/virufy/virufy-covid
[Bar chart: Healthy vs COVID-19 sample counts for Cambridge, COSWARA, and Virufy; reported values: 62, 860, 282 and 7, 2758, 752.]
Evaluation protocol
[Diagram: the balanced dataset (obtained via under-sampling) is split into dev and test sets; the dev set is further split into train and validation sets for training & tuning; the best model is then evaluated on the test set in terms of AUC, Precision, and Recall.]
5-fold nested cross-validation with stratified user-based splits
PCA explained variance: [0.7, 0.8, 0.9, 0.95, 0.99]
Shallow classifiers: Logistic Regression (LR), Support Vector Machines (SVM), AdaBoost (AB), Random Forest (RF)
Feature sets:
F1: deep audio embeddings
F2: embeddings + Period, Tempo, Duration
F3: embeddings + HC features, except Δ-MFCC and Δ2-MFCC
F4: embeddings + all HC features (i.e., 477 HC)
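User-based splits keep every sample from a given subject on the same side of the train/test boundary, which is what makes the evaluation subject-independent. A minimal sketch (stdlib only; the function name and test fraction are illustrative, and class stratification is omitted for brevity):

```python
import random
from collections import defaultdict

def user_based_split(sample_users, test_frac=0.2, seed=0):
    """Split sample indices so that each user's samples land entirely in
    either the train set or the test set (subject-independent split)."""
    by_user = defaultdict(list)
    for idx, user in enumerate(sample_users):
        by_user[user].append(idx)
    users = sorted(by_user)
    random.Random(seed).shuffle(users)
    n_test = max(1, int(len(users) * test_frac))
    test_users = set(users[:n_test])
    train = [i for u in users[n_test:] for i in by_user[u]]
    test = [i for u in test_users for i in by_user[u]]
    return train, test

# Seven samples from five users: no user may appear on both sides.
users = ["u1", "u1", "u2", "u3", "u3", "u4", "u5"]
train_idx, test_idx = user_based_split(users)
assert not ({users[i] for i in train_idx} & {users[i] for i in test_idx})
```

Without this constraint a classifier can score optimistically by recognizing the speaker rather than the disease.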
Classification Results vs Brown et al. (2020)
Dataset: Cambridge (cough & breath audio samples)
3 classification tasks:
1: COVID-positive vs COVID-negative
2: COVID-positive with cough vs COVID-negative
3: COVID-positive with cough vs COVID-negative with asthma and cough

TABLE III: Classification results, Mean (± std)
Task | Method | Modality | Features | Classifier | PCA | AUC | Precision | Recall
1 | baseline | Cough + Breath | F2 | LR | .95 | .80 (.07) | .72 (.06) | .69 (.11)
1 | our (same) | Cough + Breath | F2 | LR | .95 | .76 (.092) | .69 (.095) | .68 (.158)
1 | our (best) | Cough + Breath | F2 | SVM | .70 | .80 (.068) | .77 (.096) | .68 (.139)
2 | baseline | Cough | F2 | SVM | .90 | .82 (.18) | .80 (.16) | .72 (.23)
2 | our (same) | Cough | F2 | SVM | .90 | .69 (.227) | .74 (.187) | .61 (.276)
2 | our (best) | Breath | F1 | LR | .80 | .84 (.168) | .92 (.106) | .60 (.237)
3 | baseline | Breath | F3 | SVM | .70 | .80 (.14) | .69 (.20) | .69 (.26)
3 | our (same) | Breath | F3 | SVM | .70 | .64 (.254) | .69 (.154) | .66 (.269)
3 | our (best) | Breath | F1 | AB | .70 | .88 (.066) | .82 (.152) | .79 (.192)
COSWARA + Virufy | baseline | Cough | Top 4 | Ensemble CNN | - | .77 | .80 | .71
COSWARA + Virufy | our | Cough | F3 | LR | .99 | .99 (.001) | .99 (.006) | .99 (.007)

Gain (%) of our (best) over the baseline:
Metric | Task 1 | Task 2 | Task 3
AUC | 0 | 2 | 8
Precision | 5 | 12 | 13
Recall | -1 | -12 | 10
Classification Results vs Mohammed et al. (2021)
Dataset: COSWARA + Virufy (cough audio samples)
Classification task: COVID-positive vs COVID-negative
Baseline: ensemble of 4 CNNs with different inputs: Power Spectrum, MFCC, Spectrogram, Mel-spectrogram
Baseline results (Top 4 Ensemble CNN): AUC .77, Precision .80, Recall .71

Our results, Mean (± std):
Classifier | AUC | Precision | Recall
LR (F3, PCA .99) | .99 (.001) | .99 (.006) | .99 (.007)
SVM | .99 (.001) | .99 (.002) | .98 (.01)
RF | .81 (.024) | .79 (.031) | .59 (.07)
AB | .85 (.011) | .77 (.021) | .75 (.03)

Gain (%) over the baseline: AUC +22, Precision +19, Recall +28
Memory footprint
Cambridge Task 1: SVM with PCA 70%: 48 KB
Cambridge Task 2: LR with PCA 80%: 1.03 KB
Cambridge Task 3: AB with PCA 70%: 17 KB
COSWARA + Virufy: LR with PCA 99%: 7.19 KB
Low memory impact in all the experiments
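A back-of-envelope check of why these models are so small: a logistic regression stores one weight per input dimension plus an intercept, so the model size is essentially linear in the number of PCA components kept. A sketch (the component count 130 is a hypothetical illustration, not the paper's actual PCA output):

```python
def lr_size_kb(n_features: int, bytes_per_weight: int = 8) -> float:
    """Approximate size of a logistic-regression model: one float64
    weight per feature plus one intercept."""
    return (n_features + 1) * bytes_per_weight / 1024

# e.g. if PCA kept ~130 dimensions, the classifier would weigh about:
print(round(lr_size_kb(130), 2))  # 1.02 (KB)
```

At kilobyte scale, the classifier itself is negligible on a smartphone; the dominant on-device cost is the embedding model, which is why L3-Net's smaller parameter count matters.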
Contributions
• We investigated the use of a pre-trained instance of L3-Net (OpenL3) to improve COVID-19 detection from respiratory sound data
• Evaluation: subject-independent experiments on 3 datasets
• Results: +8% AUC vs VGGish, +22% AUC vs an ensemble of end-to-end CNNs
• Low memory footprint: the whole task can run on resource-constrained devices
Future Work
• Distinguish between COVID-19 and other respiratory diseases (e.g., asthma)
• Fine-tune OpenL3 (fixed CNN layers, train the FC layers) to obtain a single model for both feature extraction and diagnosis
• Extensive comparison of different audio embedding models
Mattia G. Campana
Ubiquitous Internet Research Unit
Institute of Informatics and Telematics
National Research Council of Italy
mattiacampana.github.io
mattia.campana@iit.cnr.it
linkedin.com/in/mattiacampana
Handcrafted acoustic Features
• The audio sample is re-sampled to a standard value for audio tasks (e.g., 16 kHz or 22 kHz)
• Features are extracted from both the frame (i.e., audio chunks) and the segment (whole sample) perspectives
• We used the same 477 HC features (including statistics) considered by Brown et al. (2020)

Feature: Description
Duration: Total length (in seconds) of the audio sample
Onset: Number of pitch onsets (i.e., “events”) in the audio signal
Tempo: Rate of beats occurring at regular intervals throughout the entire audio signal
Period: The frequency with the highest amplitude among those obtained from the Fast Fourier Transform (FFT)
RMS Energy: Root-mean-square of the signal power (i.e., the magnitude of the short-time Fourier transform)
Spectral Centroid: The centroid of the frame-wise magnitude spectrogram; distinguishes percussive and sustained sounds
Roll-off Frequency: The frequency below which 85% of the total energy of the frame-wise spectrum is contained
Zero-crossing rate: The number of times the signal crosses the zero axis, computed for each frame
MFCC: Shape of the cosine transform of the signal's logarithmic spectrum, expressed in Mel bands
Δ-MFCC and Δ2-MFCC: The first- and second-order derivatives of MFCC along time
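The two spectral-shape features in the table can be read off a single FFT of a frame. A minimal sketch (numpy only; the 500 Hz test tone, frame length, and 85% roll-off threshold follow the table's definition, with magnitude used as an energy proxy for brevity):

```python
import numpy as np

def spectral_centroid_and_rolloff(frame, sr, roll_percent=0.85):
    """Centroid: magnitude-weighted mean frequency of the frame spectrum.
    Roll-off: frequency below which roll_percent of the spectrum lies."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    centroid = float((freqs * mag).sum() / mag.sum())
    cumulative = np.cumsum(mag)
    rolloff_bin = np.searchsorted(cumulative, roll_percent * cumulative[-1])
    return centroid, float(freqs[rolloff_bin])

sr = 16000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 500 * t)  # pure tone: centroid and roll-off near 500 Hz
c, r = spectral_centroid_and_rolloff(frame, sr)
```

For a pure tone the whole spectrum concentrates in one bin, so both values land at the tone's frequency; real cough/breath frames spread energy across bins, and the two features summarize that spread.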
L3-Net vs VGGish
Number of parameters:
• L3-Net: 4.7M
• VGGish: 62M
Cramer, Jason, et al. "Look, listen, and learn more: Design choices for deep audio embeddings." ICASSP 2019, IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2019.

Más contenido relacionado

Similar a L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data

/conferences/spr2002/presentations/ssimmons/simmons.ppt
/conferences/spr2002/presentations/ssimmons/simmons.ppt/conferences/spr2002/presentations/ssimmons/simmons.ppt
/conferences/spr2002/presentations/ssimmons/simmons.ppt
Videoguy
 
SImOS
SImOSSImOS
SImOS
dalgetty
 
AMATH582_Final_Poster
AMATH582_Final_PosterAMATH582_Final_Poster
AMATH582_Final_Poster
Mark Chang
 
Biolelemetry1
Biolelemetry1Biolelemetry1
Biolelemetry1
Samuely
 
Krishna thesis presentation
Krishna thesis presentationKrishna thesis presentation
Krishna thesis presentation
ahyaimie
 
EECS452EMGFinalProjectReportPDF
EECS452EMGFinalProjectReportPDFEECS452EMGFinalProjectReportPDF
EECS452EMGFinalProjectReportPDF
Angie Zhang
 

Similar a L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data (20)

AUTOMATIC COVID DETECTION USING COUGH SIGNAL ANALYSIS
AUTOMATIC COVID DETECTION USING COUGH SIGNAL ANALYSISAUTOMATIC COVID DETECTION USING COUGH SIGNAL ANALYSIS
AUTOMATIC COVID DETECTION USING COUGH SIGNAL ANALYSIS
 
Employing deep learning for lung sounds classification
Employing deep learning for lung sounds classificationEmploying deep learning for lung sounds classification
Employing deep learning for lung sounds classification
 
/conferences/spr2002/presentations/ssimmons/simmons.ppt
/conferences/spr2002/presentations/ssimmons/simmons.ppt/conferences/spr2002/presentations/ssimmons/simmons.ppt
/conferences/spr2002/presentations/ssimmons/simmons.ppt
 
EEG based security
EEG based security EEG based security
EEG based security
 
MobiDE’2012, Phoenix, AZ, United States, 20 May, 2012
MobiDE’2012, Phoenix, AZ, United States, 20 May, 2012MobiDE’2012, Phoenix, AZ, United States, 20 May, 2012
MobiDE’2012, Phoenix, AZ, United States, 20 May, 2012
 
SImOS
SImOSSImOS
SImOS
 
IRJET- Hearing Loss Detection through Audiogram in Mobile Devices
IRJET-  	  Hearing Loss Detection through Audiogram in Mobile DevicesIRJET-  	  Hearing Loss Detection through Audiogram in Mobile Devices
IRJET- Hearing Loss Detection through Audiogram in Mobile Devices
 
bayesImageS: Bayesian computation for medical Image Segmentation using a hidd...
bayesImageS: Bayesian computation for medical Image Segmentation using a hidd...bayesImageS: Bayesian computation for medical Image Segmentation using a hidd...
bayesImageS: Bayesian computation for medical Image Segmentation using a hidd...
 
AMATH582_Final_Poster
AMATH582_Final_PosterAMATH582_Final_Poster
AMATH582_Final_Poster
 
Biolelemetry1
Biolelemetry1Biolelemetry1
Biolelemetry1
 
OPTE: Online Per-title Encoding for Live Video Streaming
OPTE: Online Per-title Encoding for Live Video StreamingOPTE: Online Per-title Encoding for Live Video Streaming
OPTE: Online Per-title Encoding for Live Video Streaming
 
OPTE: Online Per-title Encoding for Live Video Streaming.pdf
OPTE: Online Per-title Encoding for Live Video Streaming.pdfOPTE: Online Per-title Encoding for Live Video Streaming.pdf
OPTE: Online Per-title Encoding for Live Video Streaming.pdf
 
Krishna thesis presentation
Krishna thesis presentationKrishna thesis presentation
Krishna thesis presentation
 
F5242832
F5242832F5242832
F5242832
 
Biometric presentation attack detection
Biometric presentation attack detectionBiometric presentation attack detection
Biometric presentation attack detection
 
EECS452EMGFinalProjectReportPDF
EECS452EMGFinalProjectReportPDFEECS452EMGFinalProjectReportPDF
EECS452EMGFinalProjectReportPDF
 
Enhancing image based data hiding method using reduced difference expansion a...
Enhancing image based data hiding method using reduced difference expansion a...Enhancing image based data hiding method using reduced difference expansion a...
Enhancing image based data hiding method using reduced difference expansion a...
 
14 00-20171207 rance-piv_c
14 00-20171207 rance-piv_c14 00-20171207 rance-piv_c
14 00-20171207 rance-piv_c
 
Automatic COVID-19 lung images classification system based on convolution ne...
Automatic COVID-19 lung images classification system based  on convolution ne...Automatic COVID-19 lung images classification system based  on convolution ne...
Automatic COVID-19 lung images classification system based on convolution ne...
 
Air quality forecasting using convolutional neural networks
Air quality forecasting using convolutional neural networksAir quality forecasting using convolutional neural networks
Air quality forecasting using convolutional neural networks
 

Último

Último (10)

Deciding The Topic of our Magazine.pptx.
Deciding The Topic of our Magazine.pptx.Deciding The Topic of our Magazine.pptx.
Deciding The Topic of our Magazine.pptx.
 
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdfOracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
 
Breathing in New Life_ Part 3 05 22 2024.pptx
Breathing in New Life_ Part 3 05 22 2024.pptxBreathing in New Life_ Part 3 05 22 2024.pptx
Breathing in New Life_ Part 3 05 22 2024.pptx
 
ServiceNow CIS-Discovery Exam Dumps 2024
ServiceNow CIS-Discovery Exam Dumps 2024ServiceNow CIS-Discovery Exam Dumps 2024
ServiceNow CIS-Discovery Exam Dumps 2024
 
ACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdf
ACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdfACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdf
ACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdf
 
Microsoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdf
Microsoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdfMicrosoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdf
Microsoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdf
 
DAY 0 8 A Revelation 05-19-2024 PPT.pptx
DAY 0 8 A Revelation 05-19-2024 PPT.pptxDAY 0 8 A Revelation 05-19-2024 PPT.pptx
DAY 0 8 A Revelation 05-19-2024 PPT.pptx
 
OC Streetcar Final Presentation-Downtown Santa Ana
OC Streetcar Final Presentation-Downtown Santa AnaOC Streetcar Final Presentation-Downtown Santa Ana
OC Streetcar Final Presentation-Downtown Santa Ana
 
Understanding Poverty: A Community Questionnaire
Understanding Poverty: A Community QuestionnaireUnderstanding Poverty: A Community Questionnaire
Understanding Poverty: A Community Questionnaire
 
The Influence and Evolution of Mogul Press in Contemporary Public Relations.docx
The Influence and Evolution of Mogul Press in Contemporary Public Relations.docxThe Influence and Evolution of Mogul Press in Contemporary Public Relations.docx
The Influence and Evolution of Mogul Press in Contemporary Public Relations.docx
 

L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data

  • 1. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data Mattia G. Campana (IIT-CNR) Andrea Rovati (UniMi) Franca Delmastro (IIT-CNR) Elena Pagani (UniMi) IEEE SMARTCOMP 2022, June 20-24 Aalto University, Espoo, Finland
  • 2. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data AI response to the COVID-19 pandemic 2 IEEE SMARTCOMP 2022 Help the healthcare system • Machine Learning (ML) classifiers for blood test results • Deep Learning (DL) models to analyze chest X-ray and lungs Computed Tomography (CT) images Track behaviours in public places • Monitoring social distancing • Face mask detection systems
  • 3. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data m-health systems based on respiratory sounds 3 IEEE SMARTCOMP 2022 Diagnosis • Pervasive & low-cost solution for fast screening • Support the healthcare system in identifying new cases (prevention of new outbreaks) • Track the disease evolution
  • 4. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data COVID-19 Detection from respiratory sounds 4 IEEE SMARTCOMP 2022 Handcrafted acoustic features (HC) Main drawbacks • Dif fi cult to fi nd the best set of features • Typically outperformed by Deep Learning models Shallow 
 classifier Time domain Frequency domain Time-frequency representations • RMS Energy (how loud is the signal) • Zero crossing rate (how fast the signal changes) • Spectral centroid • Period (freq. with highest amplitude) • Spectrogram • Mel-Frequency Cepstral Coefficients (MFCC) Features Extraction COVID-19 
 positive/negative
  • 5. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data COVID-19 Detection from respiratory sounds 5 IEEE SMARTCOMP 2022 DL-based approach Representative work: E. A. Mohammed et al., “An ensemble learning approach to digital corona virus preliminary screening from cough sounds”, Scienti fi c Reports, 2021. Main drawback: Requires large-scale datasets, especially for complex models Graphical representation 
 (i.e., Spectrogram-like image) Convolutional Neural Network 
 (CNN) COVID-19 
 positive/negative Ensemble of CNN with different audio representations
  • 6. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data COVID-19 Detection from respiratory sounds 6 IEEE SMARTCOMP 2022 “Hybrid” approach Representative work: Brown, Chloë, et al. "Exploring automatic diagnosis of COVID-19 from crowdsourced respiratory sound data." In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020. HC features Deep audio embeddings + Shallow 
 classifier Pre-trained DL model COVID-19 
 positive/negative 477 HC features + VGGish (trained with AudioSet ~ 2 million samples)
  • 7. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data Improving the Hybrid approach 7 IEEE SMARTCOMP 2022 Investigation of an alternative embedding model: L3-Net HC features Deep audio embeddings + Shallow 
 classifier Pre-trained DL model COVID-19 
 positive/negative
  • 8. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data L3-Net: Look, Listen and Learn 8 IEEE SMARTCOMP 2022 Arandjelovic, Relja, and Andrew Zisserman. "Look, listen and learn." Proceedings of the IEEE International Conference on Computer Vision. 2017. Fusion layers (Fully-connected) Video embeddings Audio embeddings Mel-Spectrogram (1s window) Video frame image Image and audio come from the same video? Video sub-network Audio sub-network
  • 9. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data L3-Net for COVID-19 Detection 9 IEEE SMARTCOMP 2022 + Shallow Classifier COVID-19 positive/negative Cough/Breath audio sample Audio frames Mel-Spectrogram HC features Dimensionality reduction (PCA) Audio fi le embeddings Combination of the frames embeddings (Mean + std) Audio embeddings Audio sub-network Cramer, Jason, et al. "Look, listen, and learn more: Design choices for deep audio embeddings." ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019. OpenL3 model trained with AudioSet
  • 10. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data Experimental Evaluation: Goals IEEE SMARTCOMP 2022 1) Improve the classification performance with respect to: 
 








































 2)Can we perform the classification task directly on the mobile device? - Brown et al. (2020): same approach but different embedding model (i.e., VGGish vs L3-Net) - Mohammed et al. (2021): ensemble model (CNN trained from scratch vs pre-trained model) - Memory footprint evaluation 9
  • 11. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data Datasets 10 IEEE SMARTCOMP 2022 Cambridge crowdsourced breath and cough audio samples (data agreement) www.covid-19-sounds.org COSWARA crowdsourced cough samples coswara.iisc.ac.in https://github.com/iiscleap/Coswara-Data Virufy Cough samples collected in hospital; labels based on 
 COVID-19 PCR test results https://github.com/virufy/virufy-covid Cambridge COSWARA Virufy 62 860 282 7 2758 752 Healthy COVID-19
  • 12. Best model Dev set Test set Performances 
 AUC, Precision, Recall Training & Tuning Balanced Dataset 
 (Under-sampling) Train set Validation set L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data Evaluation protocol 11 IEEE SMARTCOMP 2022 5-fold nested Cross Validation 
 with stratified user-based splits PCA explained variance: [0.7, 0.8, 0.9, 0.95, 0.99] Shallow classifiers: Logistic Regression (LR), Support Vector Machines (SVM), AdaBoost (AB), Random Forest (RF) Features sets: F1: deep audio embeddings F2: embeddings + Period, Tempo, Duration F3: embeddings + HC features, except Δ-MFCC, Δ2-MFCC F4: embeddings + all HC feature (i.e, 477 HC)
  • 13. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data Classification Results vs Brown et al. (2020) 12 IEEE SMARTCOMP 2022 TABLE III: Classification results Task Method Modality Features Classifier PCA Mean (± std) AUC Precision Recall 1 baseline Cough + Breath F2 LR .95 .80 (.07) .72 (.06) .69 (.11) our (same) Cough + Breath F2 LR .95 .76 (.092) .69 (.095) .68 (.158) our (best) Cough + Breath F2 SVM .70 .80 (.068) .77 (.096) .68 (.139) 2 baseline Cough F2 SVM .90 .82 (.18) .80 (.16) .72 (.23) our (same) Cough F2 SVM .90 .69 (.227) .74 (.187) .61 (.276) our (best) Breath F1 LR .80 .84 (.168) .92 (.106) .60 (.237) 3 baseline Breath F3 SVM .70 .80 (.14) .69 (.20) .69 (.26) our (same) Breath F3 SVM .70 .64 (.254) .69 (.154) .66 (.269) our (best) Breath F1 AB .70 .88 (.066) .82 (.152) .79 (.192) rufy baseline Top 4 Ensemble CNN - .77 .80 .71 our F3 LR .99 .99 (.001) .99 (.006) .99 (.007) 1: COVID-positive vs COVID-negative 2: COVID-positive with cough vs COVID-negative 3: COVID-positive with cough vs COVID-negative with asthma and cough Dataset: Cambridge (cough & breath audio samples) 3 classification tasks Gain (%) Task 1 Task 2 Task 3 10 -12 -1 13 12 5 8 2 0 AUC Precision Recall TABLE III: Classification results Task Method Modality Features Classifier PCA Mean (± std) AUC Precision Recall 1 baseline Cough + Breath F2 LR .95 .80 (.07) .72 (.06) .69 (.11) our (same) Cough + Breath F2 LR .95 .76 (.092) .69 (.095) .68 (.158) our (best) Cough + Breath F2 SVM .70 .80 (.068) .77 (.096) .68 (.139) 2 baseline Cough F2 SVM .90 .82 (.18) .80 (.16) .72 (.23) our (same) Cough F2 SVM .90 .69 (.227) .74 (.187) .61 (.276) our (best) Breath F1 LR .80 .84 (.168) .92 (.106) .60 (.237) 3 baseline Breath F3 SVM .70 .80 (.14) .69 (.20) .69 (.26) our (same) Breath F3 SVM .70 .64 (.254) .69 (.154) .66 (.269) our (best) Breath F1 AB .70 .88 (.066) .82 (.152) .79 (.192) rufy baseline Top 4 Ensemble CNN - .77 .80 .71 our F3 LR .99 .99 (.001) .99 (.006) .99 
(.007) TABLE III: Classification results Task Method Modality Features Classifier PCA Mean (± std) AUC Precision Recall 1 baseline Cough + Breath F2 LR .95 .80 (.07) .72 (.06) .69 (.11) our (same) Cough + Breath F2 LR .95 .76 (.092) .69 (.095) .68 (.158) our (best) Cough + Breath F2 SVM .70 .80 (.068) .77 (.096) .68 (.139) 2 baseline Cough F2 SVM .90 .82 (.18) .80 (.16) .72 (.23) our (same) Cough F2 SVM .90 .69 (.227) .74 (.187) .61 (.276) our (best) Breath F1 LR .80 .84 (.168) .92 (.106) .60 (.237) 3 baseline Breath F3 SVM .70 .80 (.14) .69 (.20) .69 (.26) our (same) Breath F3 SVM .70 .64 (.254) .69 (.154) .66 (.269) our (best) Breath F1 AB .70 .88 (.066) .82 (.152) .79 (.192) rufy baseline Top 4 Ensemble CNN - .77 .80 .71 our F3 LR .99 .99 (.001) .99 (.006) .99 (.007)
  • 14. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data Classification Results vs Mohammed et al. (2021) 13 IEEE SMARTCOMP 2022 Dataset: COSWARA + Virufy (cough audio samples) TABLE III: Classification results Task Method Modality Features Classifier PCA Mean (± std) AUC Precision Recall 1 baseline Cough + Breath F2 LR .95 .80 (.07) .72 (.06) .69 (.11) our (same) Cough + Breath F2 LR .95 .76 (.092) .69 (.095) .68 (.158) our (best) Cough + Breath F2 SVM .70 .80 (.068) .77 (.096) .68 (.139) 2 baseline Cough F2 SVM .90 .82 (.18) .80 (.16) .72 (.23) our (same) Cough F2 SVM .90 .69 (.227) .74 (.187) .61 (.276) our (best) Breath F1 LR .80 .84 (.168) .92 (.106) .60 (.237) 3 baseline Breath F3 SVM .70 .80 (.14) .69 (.20) .69 (.26) our (same) Breath F3 SVM .70 .64 (.254) .69 (.154) .66 (.269) our (best) Breath F1 AB .70 .88 (.066) .82 (.152) .79 (.192) ufy baseline Top 4 Ensemble CNN - .77 .80 .71 our F3 LR .99 .99 (.001) .99 (.006) .99 (.007) TABLE III: Classification results Task Method Modality Features Classifier PCA Mean (± std) AUC Precision Recall 1 baseline Cough + Breath F2 LR .95 .80 (.07) .72 (.06) .69 (.11) our (same) Cough + Breath F2 LR .95 .76 (.092) .69 (.095) .68 (.158) our (best) Cough + Breath F2 SVM .70 .80 (.068) .77 (.096) .68 (.139) 2 baseline Cough F2 SVM .90 .82 (.18) .80 (.16) .72 (.23) our (same) Cough F2 SVM .90 .69 (.227) .74 (.187) .61 (.276) our (best) Breath F1 LR .80 .84 (.168) .92 (.106) .60 (.237) 3 baseline Breath F3 SVM .70 .80 (.14) .69 (.20) .69 (.26) our (same) Breath F3 SVM .70 .64 (.254) .69 (.154) .66 (.269) our (best) Breath F1 AB .70 .88 (.066) .82 (.152) .79 (.192) ufy baseline Top 4 Ensemble CNN - .77 .80 .71 our F3 LR .99 .99 (.001) .99 (.006) .99 (.007) 4 CNN with different inputs: - Power Spectrum - MFCC - Spectrogram - Mel-spectrogram SVM .99 (.001) .99 (.002) .98 (.01) RF .81 (.024) .79 (.031) .59 (.07) AB .85 (.011) .77 (.021) .75 (.03) Gain (%) 28 19 22 AUC Precision Recall Classification 
Task: COVID-positive vs COVID-negative
  • 15. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data Memory footprint 14 IEEE SMARTCOMP 2022 Cambridge Task 1 COSWARA + Virufy Cambridge Task 2 Cambridge Task 3 LR with PCA 99%: 7.19 KB AB with PCA 70%: 17 KB LR with PCA 80%: 1.03 KB SVM with PCA 70%: 48 KB Low memory impact in all the experiments
  • 16. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data Contributions 15 IEEE SMARTCOMP 2022 • We investigated the use of a pre-trained instance of L3-Net (OpenL3) to improve the COVID-19 detection from respiratory sound data • Evaluation: subject-independent experiments with 3 datasets • Results: +8% AUC vs VGGish, +22% AUC vs ensemble of end-to-end CNN • Low memory footprint: we can perform the whole task on resource-constrained devices
  • 17. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data Future Work 15 IEEE SMARTCOMP 2022 • Distinguish between COVID-19 and other respiratory diseases (e.g., asthma) Fixed CNN layers Train FC layers Diagnosis • Fine-tuning OpenL3, proposing a single model for both features extraction and classification • Extensive comparison of different audio embedding models
  • 18. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data Mattia G. Campana Ubiquitous Internet Research Unit 
 Institute of Informatics and Telematics 
 National Research Council of Italy mattiacampana.github.io mattia.campana@iit.cnr.it linkedin.com/in/mattiacampana
  • 19. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data Handcrafted acoustic Features 15 IEEE SMARTCOMP 2022 • The audio sample is re-sampled to a standard value for audio tasks (e.g., 16kHz or 22kHz) • Extraction of features related to both frame (i.e., audio chunks) and segment (whole sample) perspectives • We used the same 477 HC features (including statistics) considered by Brown et al. (2020) Feature Description Duration Total length (in seconds) of the audio sample Onset Number of pitch onset (i.e., “events”) in the audio signal Tempo Rate of beats that occur at regular intervals throughout the entire audio signal Period The frequency with the highest amplitude among those obtained from the Fast Fourier transform (FFT) RMS Energy Root-Mean-Square of the signal power (i.e., the magnitude of the short-time Fourier transform) Spectral Centroid The centroid value of the frame-wise magnitude spectrogram. Identifies percussive and sustained sounds. Roll-off Frequency The frequency under which the 85% of the total energy of the frame-wise spectrum is contained Zero-crossing rate The number of times the signal value crosses the zero axe, and it is computed for each frame MFCC Shape of the cosine transformation of the song logarithmic spectrum, expressed in Mel-bands Δ-MFCC and Δ2-MFCC The first and second order derivatives of MFCC along time
  • 20. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data L3-Net vs VGGish 15 IEEE SMARTCOMP 2022 # parameters: • L3-Net: 4.7M • VGGish: 62M Cramer, Jason, et al. "Look, listen, and learn more: Design choices for deep audio embeddings." ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.