Marriage of Computer Vision, Speech and Natural Language
- Yaman Kumar (MIDAS Lab-IIITD, SUNY at Buffalo)
- Rajiv Ratn Shah (MIDAS Lab-IIITD)
What is Speech?
Text Part of Speech Vision Part of Speech Aural Part of Speech
Why Speech?
Marriage of Speech & Language
Conversational speech
Information in the acoustic signal beyond the words
Interactive nature of conversations
Speech is ... more than spoken words
• Rich in ‘extra-linguistic’ information
• breathing noises
• lip-smacks
• Hand movements
• Facial Expressions
• Rich in ‘para-linguistic’ information
• Personality
• Attitude
• Emotion
• Individuality
Some Examples
• Disfluency
• I am uh uh very …. I am very excited to see you
• He is my em …… Yaman is my best friend
• Intonation and Stress
1. *This* is my laptop (and not that)
• This is *my* laptop (and not yours)
• This is my *laptop* (and not book)
2. He found it on the street?
• And in reply, He found it on the street
• No punctuation and very open grammar
• ASR errors
Speech is Adaptive
• to the listener
• a child (‘parentese’)
• a non-native person
• a hearing-impaired individual
• an animal
• a machine(!)
• to the cognitive load
• Interaction with other tasks
• stressful/emotional situations
• to the environment
• noise
• reverberation
• to the task
• Casual conversation
• Reading out loud
• Public speaking
Content
• Content in spoken medium is the "information or experiences directed towards end-users or an
audience".
Why is Content Important?
Whom do you prefer?
• A speaker with style, elegance and panache but with weak content (talking too much off-topic, not providing enough details about facts).
OR
• An average speaker but with good content (ideas stick to the main topic, provides interesting/required background information).
Content
What defines good content? (High relevance and high sufficiency)
Relevance
• Related to the topic
• Connected to the prompt in a bigger story
• No unwanted or off-topic information
Sufficiency
• Adequate details (which are also relevant)
• All points covered
• No missing parts
Prompt: You have to narrate to a career advisor one thing you accomplished which you are proud of, and how it was important for you.
Response 1: IVE ACCOMPLISHED UM MANY THINGS IN LIFE ONE OF
THEM IS IS BEING A PHILANTHROPIST IVE HELPED A LOT OF PEOPLE
MOST SPECIALLY CHILDREN I GO TO SOME UM POOR AREAS AND
WE TEACH LIKE THOSE CHILDREN SOME KNOWLEDGE THAT THEY
DONT KNOW YET LIKE FOR EXAMPLE IM GOING TO BE THEIR
TEACHER AND I I INFORM THEM ALL THE THINGS LIKE UM WHAT TO
WRITE HOW TO READ HOW TO DESCRIBE SOMETHING AND THIS IS
REALLY IMPORTANT IN MY LIFE BECAUSE BEING A TEACHER IS
REALLY GOOD FOR ME AND I THINK IT WILL REALLY HELP ME GROW
MY ABILITY TO HELP PEOPLE MOST SPECIALLY CHILDREN
Relevance: High
The speaker sticks to the things asked in the prompt (being a philanthropist or teacher as an accomplishment, and the importance of the same).
Sufficiency: High
The speaker explains in detail how he helped children as a teacher and why that was important.
Response 2: IT IS IMPORTANT TO CHOOSE WISELY FOR YOUR CAREER
AND ITS ALSO IMPORTANT THAT YOU CHOOSE THAT CAREER
BECAUSE UH THIS IS YOUR PASSION AND THIS IS YOUR REALLY ONE
JOB AND BECAUSE IF YOU DONT WANT THAT JOB OR CAR CAREER
BUT YOU CHOOSE IT UH YOU WILL AT THE END OF THE DAY YOU
WILL NOT BE UH MOTIVATED TO WORK WITH IT AND YOU WILL NOT
BE YOU ARE UH THERES A TENDENCY THAT YOU WILL NOT ACHIEVE
YOUR GOAL OR DESIRE IN YOUR IN THAT CAREER AND YOURE NOT
BE WILL BE SUCCESSFUL IN THAT CAREER IT IS IMPORTANT TO
CHOOSE WISELY YOUR CAREER AND UH CONSIDER THAT THIS IS
YOUR UH THIS IS WHAT YOU REALLY WANT AND THIS IS YOUR
PASSIONS AND ARE IT IS UH IF YOU CHOOSE YOUR CAREER BE SURE
YOU ARE ENJOYING IT NOT DOING IT
Relevance: Low
The speaker goes too far off topic from what is being asked (about car, being successful, what a good career is, instead of talking about accomplishments).
Sufficiency: Low
The speaker provides no information that addresses the points in the prompt.
D…Di….Disfluencies
• Interruptions in the smooth flow of speech
• These interruptions often occur in spoken communication. They usually help the speakers to buy more time while they express their thought process.
• Reparandum (RM) - Refers to the unintended and unnecessary part of the disfluency span. (This span can be deleted in order to obtain fluency.)
• Interregnum (IM) - Refers to the part that lies between RM and RR. (This span helps the speaker to fill the intermediate gap.)
• Repair (RR) - Refers to the corrected span of the RM. (This span should maintain the context of RM.)
D…Di….Disfluencies
• Examples
• Filled pauses : "This is a uhmm … good example"
• Discourse Markers : " It's really nice to .. you know .. play outside sometimes."
• Self-Correction : " So we will... we can go there."
• Repetitions : "The... the... the decision was not mine to make"
• Restart : "We would like to eat ... let’s go to the park"
• Why can't we recognize these disfluencies solely by looking at the words ? 🤔
• Consideration of the audio helps in understanding the intention of the speaker and hence in deciding whether there is a disfluency or not.
• Disfluencies can be confused with fluent, deliberate repetitions - "Superman is the most most most powerful superhero!"
• They can also be confused with various other interruptions, like non-verbal sounds and even silence!
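To make the point about text-only detection concrete, here is a minimal sketch of a purely lexical clean-up. The filled-pause list and the function name `text_only_cleanup` are illustrative assumptions, not something taken from the slides. The heuristic removes filled pauses and collapses immediate word repetitions, which handles the repetition example but also destroys the deliberate emphasis in the Superman sentence.

```python
import re

# Assumed, illustrative list of filled pauses; a real system would need more.
FILLED_PAUSES = {"uh", "um", "uhmm", "em", "er"}


def text_only_cleanup(utterance: str) -> str:
    """Drop filled pauses and collapse immediate word repetitions.

    This looks at words only, so it cannot tell a disfluent repetition
    from a deliberate, fluent one.
    """
    tokens = re.findall(r"[a-z']+", utterance.lower())
    cleaned = []
    for token in tokens:
        if token in FILLED_PAUSES:
            continue  # filled pause, e.g. "uhmm"
        if cleaned and cleaned[-1] == token:
            continue  # immediate repetition, e.g. "the the the"
        cleaned.append(token)
    return " ".join(cleaned)


disfluent = text_only_cleanup("The... the... the decision was not mine to make")
emphatic = text_only_cleanup("Superman is the most most most powerful superhero !")
```

Here `disfluent` keeps a single "the", as intended, while `emphatic` loses the repeated "most" even though the speaker was fluent. Telling the two cases apart is where the audio comes in.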
Pronunciation
/prəˌnʌnsɪˈeɪʃ(ə)n/
Mispronunciation Detection: The problem where the perceived pronunciation does not match the intended pronunciation, but we can still understand the meaning. Example: pronunciation of the word park.
• Phoneme Recognition Problem: State-of-the-art phoneme (sounds in a language) recognition systems have a phoneme error rate of 18% for native speech data.
• Non-native accent: Phonemes might be recognized correctly, but acoustic models (models used to detect phonemes) are often confused by non-native speech. Some phonemes (sounds) exist in the native language which do not have an alternative in the non-native language. E.g., the Je sound in French has no English mapping, which confuses the acoustic model into predicting wrong sequences of phonemes.
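The phoneme error rate mentioned above is commonly computed as the edit distance between the reference and recognized phoneme sequences, divided by the reference length. The sketch below assumes that definition; the phoneme strings for "park" are illustrative ARPAbet-style symbols, not data from the slides.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the number of reference phonemes."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative: the intended word "park" recognized with the R dropped.
intended = ["P", "AA", "R", "K"]
perceived = ["P", "AA", "K"]
per = phoneme_error_rate(intended, perceived)
```

One deleted phoneme out of four reference phonemes gives a rate of 0.25 for this utterance.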
Pronunciation
Intelligibility: There can be a large difference between the intended speech and the spoken speech.
Example: The word mEssage is pronounced incorrectly. A good ASR system will perceive it as mAssage and rate it as correctly pronounced. However, the user meant to say mEssage.
Discourse Coherence
• Discourse is a coherent combination of spoken (or written) utterances communicated between a speaker (or writer) and a listener (or reader).
• Discourse is a PRODUCT? ✍️ (linguistic perspective)
• Discourse is a PROCESS!! 🤔🤔 (cognitive perspective)
• Discourse coherence is the semantic relationship between propositions or communicative events in discourse.
• It is a feature of the perception 👀👂 of discourse rather than the content of discourse itself.
Discourse Coherence
Discourse as Product ✍
• A well written speech.
• How the discourse content is structured and organized by the speaker.
• Cohesion in text, use of discourse markers, connectives, etc.
• How readable is the text, how complex is the text, etc.
Discourse as Process 🤔
• A well delivered speech.
• How the discourse content is delivered efficiently to the listener.
• Prosodic variation, use of stress, intonation, pauses, etc.
• How intelligible is the speech, how focused is the listener, etc.
Prosody
• Prosodic features span...
• several speech segments
• several syllables
• whole utterances
• Such ‘suprasegmental’ behaviour includes ...
• lexical stress (Prominence of Syllables)
• lexical tone (Pitch pattern to distinguish words)
• rhythmic stress (Emphasis)
• intonation (Difference of Expressive meaning)
It’s not what you say, but how you say it.
The Two Ronnies - Four Candles vs Fork Handles
Speech is Ambiguous
Silent Speech is Even More Ambiguous
• Elephant Juice vs I Love You
• Million vs Billion
• Pet vs Bell vs Men
Speak Them To Yourself!
Your lip movements are exactly the same!
Exploring Semi-Supervised Learning for Predicting Listener Backchannels
Accepted at CHI’21!
Vidit Jain, Maitree Leekha, Jainendra Shukla, Rajiv Ratn Shah
Introduction
● Developing human-like conversational agents is important!
○ Applications in education and healthcare
● Challenge: how to make them seem natural?
○ Human conversations are complex!
● Listener backchannels: a crucial element of human conversation:
○ Listener’s “regular” feedback to the speaker, indicating presence
○ Verbal: e.g., short utterances
○ Non-verbal: e.g., head shake, nod, smile etc.
● We focus on modelling these backchannels as a step towards natural
Human Robot Interactions (HRIs).
Research Questions
Key Research Gaps:
● Prior works [1, 2 and more] relied on large amounts of manually
annotated data to train listener backchannel prediction (LBP) models
○ This is expensive in terms of man hours
● In addition, all previous works have focused on only English
conversations
Major Contributions:
● Validating the use of semi-supervised techniques for LBP
○ Models using only 25% of manual annotation performed at par!
● Unlike past works, we use Hindi conversations
[1] Park, Hae Won, et al. "Telling stories to robots: The effect of backchanneling on a child's storytelling." 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 2017.
[2] Goswami, Mononito, Minkush Manuja, and Maitree Leekha. "Towards Social & Engaging Peer Learning: Predicting Backchanneling
and Disengagement in Children." arXiv preprint arXiv:2007.11346 (2020).
Dataset
● We use the multimodal Hindi based Vyaktitv dataset [3]
○ 25 conversations, each ~16 min long
○ Video and audio feeds available for each participant (50 recordings)
● Annotations Done:
○ 3 annotators
○ Signal (kappa): Nod (0.7), Head-shake (0.6), Mouth (0.6), Eyebrow (0.5), Utterances (0.5)
● Features Extracted:
○ OpenFace - visual features: 18 facial action units (FAU), gaze velocities & accelerations, translational and rotational head velocities & accelerations, blink rate, pupil location, and smile ratio
○ pyAudioAnalysis - audio features: voice activity, MFCC, F0, energy
[3] Khan, Shahid Nawaz, et al. "Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset for Personality Assessment."
2020 IEEE Sixth International Conference on Multimedia Big Data (BigMM). IEEE, 2020.
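The slide reports a kappa value per signal but does not say which variant was used with the 3 annotators. As a minimal sketch, assuming Cohen's kappa between one pair of annotators, chance-corrected agreement can be computed as follows. The label sequences are made-up toy data (1 = nod in that window, 0 = no nod).

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Toy annotations for ten windows (hypothetical data).
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
```

The two toy annotators agree on 8 of 10 windows, but because 0.52 agreement is expected by chance, kappa comes out near 0.58, in the same range as the values reported for the dataset.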
System Architecture
Methodology: (i) Semi-supervised learning for identifying backchannels and type of
signals emitted using a subset of labeled data. (ii) Learning to predict these instances
and signals using the speaker's context.
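Step (i) can be illustrated with a generic self-training loop: fit on the small labeled seed, pseudo-label the unlabeled points the model is confident about, add them to the training set, and repeat. The sketch below is an assumption-laden toy: it uses a one-dimensional nearest-centroid classifier as a stand-in for the ResNet and random forest models named in the findings, and the margin threshold, feature values and function names are all hypothetical.

```python
def centroid_fit(xs, ys):
    """Return the mean feature value per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def centroid_predict(centroids, x):
    """Return (label, margin); the margin is a crude confidence score."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, float("inf")
    margin = abs(x - centroids[ranked[1]]) - abs(x - centroids[best])
    return best, margin


def self_train(labeled_x, labeled_y, unlabeled_x, min_margin=0.3, max_rounds=10):
    """Grow the training set with confident pseudo-labels."""
    xs, ys = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(max_rounds):
        centroids = centroid_fit(xs, ys)
        confident, remaining = [], []
        for x in pool:
            label, margin = centroid_predict(centroids, x)
            if margin >= min_margin:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new to add; stop early
        for x, label in confident:
            xs.append(x)
            ys.append(label)
        pool = remaining
    return centroid_fit(xs, ys), pool


# Toy data: 2 labeled points out of 8 (a 25% seed), 0 = no backchannel, 1 = backchannel.
centroids, leftover = self_train([0.1, 0.9], [0, 1], [0.0, 0.2, 0.3, 0.8, 1.0, 0.7])
```

The pseudo-labels produced this way play the role of the "true labels" for step (ii), where a separate model is trained on the speaker's context.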
Task Formulations
Identification
Given a listener’s audio and video feeds, identify when the listener backchannels.
These are the true labels in the prediction task.
We use semi-supervision here to generate these pseudo-labels (instance & type).
Prediction
Given a speaker’s context (~3-7 sec long), predict whether the listener will backchannel immediately after it.
Use only the speaker’s features to predict the instance & type of backchannel (verbal/visual).
Key Findings
● The semi-supervised process was able to identify backchannel instances and signal types very well
○ Respective accuracies: 0.90 (ResNet) & 0.85 (RF), with only 25% manual annotation as seed!
● Comparing prediction models trained using manually annotated vs semi-supervised pseudo labels:
○ Using semi-supervision, we reach ~94% of the baseline performance!
● Qualitative Study: The majority of participants could not distinguish between the two prediction models!
Demo
Our final system trained using semi-supervision
Lip Movement as Inputs for Information Retrieval
https://www.aaai.org/ojs/index.php/AAAI/article/view/5649
https://www.aaai.org/ojs/index.php/AAAI/article/view/5148
https://www.aaai.org/ojs/index.php/AAAI/article/view/4106
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3269.pdf
https://www.isca-speech.org/archive/Interspeech_2019/abstracts/3273.html
https://www.youtube.com/watch?v=3BqQQnTfnlE&list=PL9rvax0EIUA6PDoiDT2Wp462GsTnikrvY
Visual Speech Recognition
Let’s put your lip reading abilities to the test
(SHOW OF HANDS)
CONFIDENCE CONFERENCE
CONCERNS CONFLICT
MOBIVSR
Predictions:
•CONFERENCE 65%
•CONFLICT 20%
•OFFICERS 10%
•OFFICE 5%
SPECIAL
SPONGE
DESPERATION
SPEECH
MOBIVSR
Predictions:
•SPEECH 85%
•BRITISH 10%
•PRESSURE 2%
•INFLATION 1%
Let’s jump to MobiVSR difficulty level.
Your options:
ABOUT
ABSOLUTELY
ABUSE
ACCESS
ACCORDING
ACCUSED
ACROSS
ACTION
ACTUALLY
AFFAIRS
AFFECTED
AFRICA
AFTER
AFTERNOON
AGAIN
AGAINST
AGREE
AGREEMENT
AHEAD
ALLEGATIONS
ALLOW
ALLOWED
ALMOST
ALREADY
ALWAYS
AMERICA
AMERICAN
AMONG
AMOUNT
ANNOUNCED
ANOTHER
ANSWER
ANYTHING
AREAS
AROUND
ARRESTED
ASKED
ASKING
ATTACK
ATTACKS
AUTHORITIES
BANKS
BECAUSE
BECOME
BEFORE
BEHIND
BEING
BELIEVE
BENEFIT
BENEFITS
BETTER
BETWEEN
BIGGEST
BILLION
BLACK
BORDER
BRING
BRITAIN
BRITISH
BROUGHT
BUDGET
BUILD
BUILDING
BUSINESS
BUSINESSES
CALLED
CAMERON
CAMPAIGN
CANCER
CANNOT
CAPITAL
CASES
CENTRAL
CERTAINLY
CHALLENGE
CHANCE
CHANGE
CHANGES
CHARGE
CHARGES
CHIEF
CHILD
CHILDREN
CHINA
CLAIMS
CLEAR
CLOSE
CLOUD
COMES
COMING
COMMUNITY
COMPANIES
COMPANY
CONCERNS
CONFERENCE
CONFLICT
CONSERVATIVE
CONTINUE
CONTROL
COULD
COUNCIL
COUNTRIES
COUNTRY
COUPLE
COURSE
COURT
CRIME
CRISIS
CURRENT
CUSTOMERS
DAVID
DEATH
DEBATE
DECIDED
DECISION
DEFICIT
DEGREES
DESCRIBED
DESPITE
DETAILS
DIFFERENCE
DIFFERENT
DIFFICULT
DOING
DURING
EARLY
EASTERN
ECONOMIC
ECONOMY
EDITOR
EDUCATION
ELECTION
EMERGENCY
ENERGY
ENGLAND
ENOUGH
EUROPE
EUROPEAN
EVENING
EVENTS
EVERY
EVERYBODY
EVERYONE
EVERYTHING
EVIDENCE
EXACTLY
EXAMPLE
EXPECT
EXPECTED
EXTRA
FACING
FAMILIES
FAMILY
FIGHT
FIGHTING
FIGURES
FINAL
FINANCIAL
FIRST
FOCUS
FOLLOWING
FOOTBALL
FORCE
FORCES
FOREIGN
FORMER
FORWARD
FOUND
FRANCE
FRENCH
FRIDAY
FRONT
FURTHER
FUTURE
GAMES
GENERAL
GEORGE
GERMANY
GETTING
GIVEN
GIVING
GLOBAL
GOING
GOVERNMENT
GREAT
GREECE
GROUND
GROUP
GROWING
GROWTH
GUILTY
HAPPEN
HAPPENED
HAPPENING
HAVING
HEALTH
HEARD
HEART
HEAVY
HIGHER
HISTORY
HOMES
HOSPITAL
HOURS
HOUSE
HOUSING
HUMAN
HUNDREDS
IMMIGRATION
IMPACT
IMPORTANT
INCREASE
INDEPENDENT
INDUSTRY
INFLATION
INFORMATION
INQUIRY
INSIDE
INTEREST
INVESTMENT
INVOLVED
IRELAND
ISLAMIC
ISSUE
ISSUES
ITSELF
JAMES
JUDGE
JUSTICE
KILLED
KNOWN
LABOUR
LARGE
LATER
LATEST
LEADER
LEADERS
LEADERSHIP
LEAST
LEAVE
LEGAL
LEVEL
LEVELS
LIKELY
LITTLE
LIVES
LIVING
LOCAL
LONDON
LONGER
LOOKING
MAJOR
MAJORITY
MAKES
MAKING
MANCHESTER
MARKET
MASSIVE
MATTER
MAYBE
MEANS
MEASURES
MEDIA
MEDICAL
MEETING
MEMBER
MEMBERS
MESSAGE
MIDDLE
MIGHT
MIGRANTS
MILITARY
MILLION
MILLIONS
MINISTER
MINISTERS
MINUTES
MISSING
MOMENT
MONEY
MONTH
MONTHS
MORNING
MOVING
MURDER
NATIONAL
NEEDS
NEVER
NIGHT
NORTH
NORTHERN
NOTHING
NUMBER
NUMBERS
OBAMA
OFFICE
OFFICERS
OFFICIALS
OFTEN
OPERATION
OPPOSITION
ORDER
OTHER
OTHERS
OUTSIDE
PARENTS
PARLIAMENT
PARTIES
PARTS
PARTY
PATIENTS
PAYING
PEOPLE
PERHAPS
PERIOD
PERSON
PERSONAL
PHONE
PLACE
PLACES
PLANS
POINT
POLICE
POLICY
POLITICAL
POLITICIANS
POLITICS
POSITION
POSSIBLE
POTENTIAL
POWER
POWERS
PRESIDENT
PRESS
PRESSURE
PRETTY
PRICE
PRICES
PRIME
PRISON
PRIVATE
PROBABLY
PROBLEM
PROBLEMS
PROCESS
PROTECT
PROVIDE
PUBLIC
QUESTION
QUESTIONS
QUITE
RATES
RATHER
REALLY
REASON
RECENT
RECORD
REFERENDUM
REMEMBER
REPORT
REPORTS
RESPONSE
RESULT
RETURN
RIGHT
RIGHTS
RULES
RUNNING
RUSSIA
RUSSIAN
SAYING
SCHOOL
SCHOOLS
SCOTLAND
SCOTTISH
SECOND
SECRETARY
SECTOR
SECURITY
SEEMS
SENIOR
SENSE
SERIES
SERIOUS
SERVICE
SERVICES
SEVEN
SEVERAL
SHORT
SHOULD
SIDES
SIGNIFICANT
SIMPLY
SINCE
SINGLE
SITUATION
SMALL
SOCIAL
SOCIETY
SOMEONE
SOMETHING
SOUTH
SOUTHERN
SPEAKING
SPECIAL
SPEECH
SPEND
SPENDING
SPENT
STAFF
STAGE
STAND
START
STARTED
STATE
STATEMENT
STATES
STILL
STORY
STREET
STRONG
SUNDAY
SUNSHINE
SUPPORT
SYRIA
SYRIAN
SYSTEM
TAKEN
TAKING
TALKING
TALKS
TEMPERATURES
TERMS
THEIR
THEMSELVES
THERE
THESE
THING
THINGS
THINK
THIRD
THOSE
THOUGHT
THOUSANDS
THREAT
THREE
THROUGH
TIMES
TODAY
TOGETHER
TOMORROW
TONIGHT
TOWARDS
TRADE
TRIAL
TRUST
TRYING
UNDER
UNDERSTAND
UNION
UNITED
UNTIL
USING
VICTIMS
VIOLENCE
VOTERS
WAITING
WALES
WANTED
WANTS
WARNING
WATCHING
WATER
WEAPONS
WEATHER
WEEKEND
WEEKS
WELCOME
WELFARE
WESTERN
WESTMINSTER
WHERE
WHETHER
WHICH
WHILE
WHOLE
WINDS
WITHIN
WITHOUT
WOMEN
WORDS
WORKERS
WORKING
WORLD
WORST
WOULD
WRONG
YEARS
YESTERDAY
YOUNG
MOBIVSR
Predictions:
•DIFFICULT 40%
•GIVING 20%
•GIVEN 10%
•EVERYTHING 5%
Speech as Inputs for Information Retrieval
https://www.aaai.org/ojs/index.php/AAAI/article/view/4106
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3269.pdf
https://www.isca-speech.org/archive/Interspeech_2019/abstracts/3273.html
https://www.youtube.com/watch?v=3BqQQnTfnlE&list=PL9rvax0EIUA6PDoiDT2Wp462GsTnikrvY
https://www.aaai.org/ojs/index.php/AAAI/article/view/5649
https://www.aaai.org/ojs/index.php/AAAI/article/view/5148
Lip Movement Speech for Information Retrieval
Note: During speech reconstruction, the sex of the speaker is preserved
Demonstration: English Video to Speech Reconstruction
Demo: Chinese Video to Speech Reconstruction
Demonstration: Hindi Video to Speech Reconstruction
Example person with dysarthria
https://www.aaai.org/ojs/index.php/AAAI/article/view/4106
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3269.pdf
https://www.aaai.org/ojs/index.php/AAAI/article/view/5649
https://www.aaai.org/ojs/index.php/AAAI/article/view/5148
https://www.isca-speech.org/archive/Interspeech_2019/abstracts/3273.html
https://www.youtube.com/watch?v=3BqQQnTfnlE&list=PL9rvax0EIUA6PDoiDT2Wp462GsTnikrvY
Lip Movement Speech Video for Information Retrieval
Video Construction
GAN output for an English phrase
Viseme concatenation TC GAN Generated Output with Inter-Visemes
Output for an English phrase, Good Bye
GAN output for a Hindi phrase
Viseme concatenation
TC GAN Generated Output with Inter-Visemes
Output for a Hindi phrase, Aap Kaise hai (How are you)
LIFI: Towards Linguistically Informed Frame Interpolation
Aradhya Neeraj Mathur¹, Devansh Batra², Yaman Kumar¹, Rajiv Ratn Shah¹, Roger Zimmermann³
Indraprastha Institute of Information Technology Delhi, India¹
Netaji Subhas University of Technology, Delhi²
National University of Singapore (NUS)³
Motivation
• Speech videos are extremely common across the internet (lectures, YouTube videos and even video calling apps),
but no video interpolation methods pay heed to nuances of speech videos.
• Visual Modality of speech is complicated. While uttering a single sentence, our lips cycle through dozens of visemes.
• First 30 frames of a speaker speaking the sentence "I don't exactly walk around with a hundred and thirty five million dollars in my wallet". Notice the rich lip movement with opening and closing of the mouth.
Motivation
We try to reconstruct this speech video by interpolating the intermediate frames from the first and last frames using state-of-the-art models.
Expected: Original frames (with rich mouth movements)
Observed: Interpolated frames (with virtually no mouth movements)
Some Surprising metrics
L1 = 0.0498, MSE = 0.0088, SSIM = 0.9521, PSNR = 20.5415
These scores are surprisingly good, even though the interpolated frames show virtually no mouth movement!
This means that we need better evaluation criteria for interpolation or reconstruction of speech videos.
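One way to see why full-frame metrics can look good here is that the mouth occupies only a small part of each frame. The sketch below uses the standard definitions of L1, MSE and PSNR on frames flattened to lists of intensities, assuming values in [0, 1] (the slides do not state the pixel range). SSIM is left out to keep the example short. The 100-pixel "frame" is hypothetical toy data.

```python
import math


def l1_error(a, b):
    """Mean absolute difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mse(a, b):
    """Mean squared difference between two equally sized frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(mse_value, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given MSE."""
    if mse_value == 0:
        return float("inf")
    return 10 * math.log10(max_value ** 2 / mse_value)


# Toy frame of 100 pixels where only 4 "mouth" pixels differ by 0.5.
original = [0.5] * 100
interpolated = [0.5] * 96 + [0.0] * 4
frame_mse = mse(original, interpolated)
frame_psnr = psnr(frame_mse)
```

Even though the mouth region is completely wrong in this toy frame, the MSE is only about 0.01 and the PSNR about 20 dB. Under the same [0, 1] assumption, the reported MSE of 0.0088 corresponds to roughly 20.56 dB, close to the reported PSNR.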
Proposed Work
1. Challenge Datasets for Speech Video Reconstruction (based on LRS3-TED)
Guess the words spoken?
……………
"Well the short answer to the question is no, it's not the same thing"
• Random Frame Corruption (40%)
• Extreme Sparsity Corruption (75%)
• Prefix Corruption
• Suffix Corruption
Proposed Work
1. Challenge Datasets for Speech Video Reconstruction (based on LRS3-TED)
• Visemic Corruption (visemes of a particular type being corrupted and requiring regeneration)
• Intra Word Corruption (corruption of frames within the occurrence of a large word)
• Inter Word Corruption (corruption of frames across word boundaries)
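The frame-level corruption types (random, extreme sparsity, prefix, suffix) can be sketched as boolean masks over the frames of a clip, where True marks a frame that has to be regenerated. This is only an illustration: the function names are hypothetical, the prefix and suffix fractions are not given in the slides, and how a corrupted frame is actually represented (dropped, blanked, noised) is not specified either.

```python
import random


def random_frame_corruption(n_frames, fraction, seed=0):
    """Corrupt a random subset of frames, e.g. fraction=0.4 or 0.75."""
    rng = random.Random(seed)
    n_corrupt = round(n_frames * fraction)
    corrupted = set(rng.sample(range(n_frames), n_corrupt))
    return [i in corrupted for i in range(n_frames)]


def prefix_corruption(n_frames, fraction):
    """Corrupt a contiguous block at the start of the clip."""
    n_corrupt = round(n_frames * fraction)
    return [i < n_corrupt for i in range(n_frames)]


def suffix_corruption(n_frames, fraction):
    """Corrupt a contiguous block at the end of the clip."""
    n_corrupt = round(n_frames * fraction)
    return [i >= n_frames - n_corrupt for i in range(n_frames)]


random_mask = random_frame_corruption(30, 0.40)  # 12 of 30 frames corrupted
sparse_mask = random_frame_corruption(30, 0.75)  # extreme sparsity setting
prefix_mask = prefix_corruption(10, 0.3)         # assumed fraction
suffix_mask = suffix_corruption(10, 0.3)         # assumed fraction
```

Visemic, intra-word and inter-word corruption would additionally need viseme or word alignments for each frame, which is why they are not sketched here.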
Proposed Work
2. Visemic reconstruction with ROI Loss unit
A modified FCN3D with an ROI extraction unit to calculate the ROI loss.
Instead of training the reconstruction network with only the L1 loss between reconstructed and original images, we introduce an ROI Loss which measures the similarity between the visemic regions of interest of the observed and generated facial images.
To accomplish this, we develop an ROI unit as shown on the left.
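A minimal sketch of the idea: add a second term that compares only the mouth region, so that errors there are no longer diluted by the rest of the frame. The slides do not give the exact form of the ROI loss, so the additive combination, the use of L1 inside the region, the `roi_weight` parameter and the fixed bounding box are all assumptions made for illustration. Frames are plain nested lists of intensities.

```python
def l1(a, b):
    """Mean absolute difference between two flat lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def flatten(frame):
    return [px for row in frame for px in row]


def crop(frame, box):
    """Flattened pixels inside box = (top, left, height, width)."""
    top, left, height, width = box
    return [px for row in frame[top:top + height] for px in row[left:left + width]]


def reconstruction_loss(pred, target, mouth_box, roi_weight=1.0):
    """Full-frame L1 plus a weighted L1 restricted to the mouth region."""
    full_term = l1(flatten(pred), flatten(target))
    roi_term = l1(crop(pred, mouth_box), crop(target, mouth_box))
    return full_term + roi_weight * roi_term


# Toy 4x4 frames: the prediction is wrong only inside a 2x2 mouth box.
target = [[0.0] * 4 for _ in range(4)]
pred = [[0.0] * 4 for _ in range(4)]
for r in (2, 3):
    for c in (1, 2):
        pred[r][c] = 0.5
mouth_box = (2, 1, 2, 2)
loss = reconstruction_loss(pred, target, mouth_box)
```

In this toy case the full-frame L1 is 0.125 while the mouth-only L1 is 0.5, so the combined loss penalizes the mouth error far more strongly than L1 alone would.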
Proposed Work
Key Findings
We evaluate a Fully Convolutional Network (FCN3D), a convolutional bi-directional LSTM, and the original FCN3D network after the addition of the ROI unit and Visemic Loss during training.
We observe:
1. Different networks perform differently on different types of corruption.
2. While SuperSloMo performs very well on random frame corruption, it performs much worse on other types of corruption.
3. As expected, a sequential LSTM-based generator works much better than a fully convolutional network when there are corruptions in consecutive frames, as shown in prefix and suffix corruption.
4. Most importantly, the addition of an ROI loss also helps a network perform better on all forms of corruption and on non-ROI-based metrics, as shown by the results for FCN3D+ROI.
Performance of different models over datasets containing random corruptions, prefix corruptions and suffix corruptions
Performance of different models over datasets containing corruptions on different visemes
Touchless Typing Using Head Movement-based Gestures
Shivam Rustagi¹, Aakash Garg¹, Pranay Raj Anand², Rajesh Kumar³, Yaman Kumar², Rajiv Ratn Shah²
Delhi Technological University, India¹
Indraprastha Institute of Information Technology Delhi, India²
Haverford College, USA³
Motivation
Traditional Input Devices
Diseases which render these devices useless
● Upper limb paralysis
● Deformed limb
● Damaged fingers/hand
● Various other disabilities
Motivation
Related Work
[1] A. Nowosielski, “Two-letters-key keyboard for predictive touchless typing with head movements”
[2] J. Tu, H. Tao, and T. Huang, “Face as mouse through visual face tracking”
[3] M. Nabati and A. Behrad, “3d head pose estimation and camera mouse implementation using a monocular video camera”
Related Work
MID AIR TOUCHLESS TYPING TECHNIQUES
Using fingers: [4] A. Markussen, M. R. Jakobsen, and K. Hornbæk, “Vulture: A mid-air word-gesture keyboard”
Using head: [5] C. Yu, Y. Gu, Z. Yang, X. Yi, H. Luo, and Y. Shi, “Tap, dwell or gesture? exploring head-based text entry techniques for hmds”
Proposed Work
For the 10,000 most common English words there are 8,529 unique cluster sequences, with each sequence mapping to 1.17 different words on average. So once we predict the cluster sequence, it can be translated to 1-2 valid words on average.
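The mapping from words to cluster sequences can be sketched as below. The actual key-to-cluster assignment of the color-coded keyboard is not given in this text, so the eight clusters here are a hypothetical grouping chosen only to show how several words can share one sequence. Whether consecutive repeats of the same cluster are merged is also not stated, so they are kept as they are.

```python
from collections import defaultdict

# Hypothetical grouping of QWERTY keys into clusters (not from the slides).
CLUSTERS = {
    1: "qwe", 2: "rty", 3: "uiop", 4: "asd",
    5: "fgh", 6: "jkl", 7: "zxc", 8: "vbnm",
}
KEY_TO_CLUSTER = {key: cid for cid, keys in CLUSTERS.items() for key in keys}


def cluster_sequence(word):
    """Translate a word into the sequence of clusters its letters fall in."""
    return tuple(KEY_TO_CLUSTER[ch] for ch in word.lower())


def group_by_sequence(words):
    """Map each cluster sequence to all vocabulary words that produce it."""
    groups = defaultdict(list)
    for word in words:
        groups[cluster_sequence(word)].append(word)
    return dict(groups)


vocabulary = ["leg", "keg", "ice", "old"]
groups = group_by_sequence(vocabulary)
words_per_sequence = len(vocabulary) / len(groups)

# The figure quoted above follows from the two counts: 10,000 / 8,529.
average_from_slide = round(10000 / 8529, 2)
```

With this toy layout "leg" and "keg" collapse to the same sequence, so decoding a predicted sequence means choosing among a short candidate list, which matches the 1-2 valid words per sequence described above.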
Data Collection: Setup
Equipment: Monitor and keyboard
Configuration: 17-inch monitor and standard keyboard
Purpose:
● The color-coded QWERTY keyboard was displayed on the monitor.
● The keyboard was used to start and stop recording.
Equipment: Cameras (on tripods)
Configuration:
● 3 Samsung M10 mobile cameras which recorded videos at 30 fps, 1920 x 1080 resolution
● All 3 mobiles had the OpenCamera app installed
● 1 Samsung M10 mobile with the MUSE2 app
Purpose: The 3 cameras were kept at angles of -45, 0 and 45 degrees respectively to record the head movements.
Equipment: MUSE2 headband
Configuration: Sensors such as accelerometer and gyroscope
Purpose: The sensors recorded the acceleration and rotation of the head.
Equipment: Moderator’s laptop
Configuration: Standard
Purpose: The Python script on the laptop was responsible for starting and stopping the cameras simultaneously.
*Note: For our research we have used only the central view (Camera-2) recordings.
Data Collection: Description
❑ Total number of users who volunteered = 25 (16 male, 9 female; data from 3 users was discarded on manual inspection)
❑ Each user recorded 3 video samples for each of 35 entries (20 words, 10 phrases, 5 sentences; see Table 1)
❑ Total number of video samples = 2310 (22 x 35 x 3)
Table 1. List of 20 words, 10 phrases and 5 sentences that were typed by each user. Each of these was repeated 3 times.
Category: Text
Words: locate, single, family, would, place, large, work, take, live, box, method, listen, house, learn, come, some, ice, old, fly, leg
Phrases: hello, excuse me, i am sorry, thank you, good bye, see you, nice to meet you, you are welcome, how are you, have a good time
Sentences: i never gave up, best time to live, catch the trade winds, hear a voice within you, he will forget it
Data Collection: Procedure
Camera-1 Camera-2
Data Collection: Statistics
Category: Avg. number of letters per entry
Words: 4.33
Phrases: 10.6
Sentences: 18.6
❏ The words were selected to have proper cluster coverage.
❏ The phrases and sentences were selected from the OuluVS [6] and TIMIT [7] datasets respectively.
Fig. Coverage of each cluster across dataset
Fig. Avg Gesture per minute for each user ( avg = 49.26, std = 5.3 )
The proposed method is based on a CNN-RNN architecture. The feature extractor, as shown above, is based on the HopeNet architecture, which predicts the yaw, pitch and roll features for the input image. The network is trained using a multi-task classification scheme. We utilize the available model pretrained on large-pose face images from the 300W dataset.
Hopenet Architecture
HopeNet output visualized on a user. The three vectors are constructed from the Euler angles (features) predicted by the network.
Working of Hopenet
The features from the HopeNet are passed into a multi-layered BiGRU network, which is then trained using a CTC loss
function. During the inference phase we used beam search to decode the cluster sequence.
CNN-RNN architecture
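To illustrate how a CTC-trained network's per-frame outputs become a cluster sequence, the sketch below uses greedy (best-path) decoding: take the most likely label per frame, merge consecutive repeats, then drop blanks. This is a simpler stand-in for the beam search mentioned above, and the blank index of 0 and the toy probabilities are assumptions.

```python
def ctc_greedy_decode(frame_probs, blank=0):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded = []
    previous = None
    for label in best_path:
        # A label is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Toy per-frame distributions over [blank, cluster 1, cluster 2].
frame_probs = [
    [0.10, 0.80, 0.10],  # cluster 1
    [0.20, 0.70, 0.10],  # cluster 1 again (merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank separates two genuine occurrences
    [0.10, 0.80, 0.10],  # cluster 1
    [0.10, 0.20, 0.70],  # cluster 2
    [0.10, 0.10, 0.80],  # cluster 2 again (merged)
]
sequence = ctc_greedy_decode(frame_probs)
```

The blank frame is what allows the same cluster to appear twice in a row in the output, here giving the sequence 1, 1, 2. Beam search keeps several candidate paths instead of only the single best label per frame.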
Evaluation Metric: DTW
The method is evaluated on two scenarios:
● Inter-User: Training on user set S1, Testing on user set S2 such that S1 and S2 are mutually exclusive. Cluster
sequences are kept the same for training and testing.
● Intra-user: For every user, i.e set S = {S1 U S2}, we record 3 samples per sequence. For training, 2 samples were
taken and the testing is done on the 3rd sample
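The evaluation slide names DTW (dynamic time warping) as the metric but the text does not spell out how it is applied. As a minimal sketch, assuming it compares a predicted cluster sequence against the reference sequence with a 0/1 mismatch cost per aligned pair, the classic dynamic-programming form looks like this:

```python
def dtw_distance(a, b, dist=lambda x, y: 0 if x == y else 1):
    """Dynamic time warping cost between two sequences.

    `dist` is the per-element cost; the default 0/1 mismatch cost is an
    assumption suited to categorical cluster ids.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a[i-1] repeats b[j-1]
                                    cost[i][j - 1],      # b[j-1] repeats a[i-1]
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]


reference = [1, 2, 3]
stretched = [1, 2, 2, 3]  # same gesture order, one cluster held longer
missing = [1, 3]          # one cluster skipped entirely
```

Because DTW allows elements to be stretched in time, `stretched` aligns to `reference` at zero cost, while `missing` incurs a cost of 1 for the skipped cluster.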
Results
Conclusion and Future Work
Our work presents a meaningful way of mapping gestures to character (cluster) sequences, which could be beneficial for people with disabilities.
Also, our dataset is publicly available, which could help improve the current system.
In the future, the aim is to improve performance by:
1. Using more training data containing a variety of meaningful sequences, and
2. Combining video feeds from multiple cameras, brainwaves recorded via EEG sensors, and the acceleration and rotation of the user’s head recorded via accelerometer and gyroscope.
Future work could also integrate the interface with wearable devices and mobile computing. This would enable a newer set of applications, such as browsing from wearable glasses.
Information Retrieval through Soft Biometrics
https://arxiv.org/pdf/2001.09134.pdf
SeekSuspect: Retrieving Suspects from Criminal Datasets using Visual Memory
Aayush Jain*, Meet Shah*, Suraj Pandey*, Mansi Agarwal*, Rajiv Ratn Shah, Yifang Yin
● Police maintain a crime dossier system that entails information like photographs and physical details.
● Finding suspects by name is possible, but fails when we only have informant's visual memory.
● Law enforcement agencies used to hire sketch artists, but they are limited in number.
● We propose SeekSuspect, a fast interactive suspect retrieval system.
● SeekSuspect employs sophisticated deep learning and computer vision techniques
○ to modify the search space and
○ find the envisioned image effectively and efficiently
I do not exactly remember who she was
Is this the person you wish to search for?
Female, fair, black hair...
Relevant images
SeekSuspect
Similar images
SeekSuspect
https://midas.iiitd.edu.in/ https://facebook.com/midasiiitd/
https://twitter.com/midasiiitd/ https://linkedin.com/company/midasiiitd/
Team
• Director: Dr. Rajiv Ratn Shah
• PhD Students: Hitkul, Shivangi, Ritwik, Mohit, Yaman, Hemant, Kriti, Astha
• MTech Students: Abhishek, Suraj, Meet, Aayush, William, Subhani, etc.
• Research Assistants: Manraj, Pakhi, Karmanya, Mehar, Saket, Anuj, etc.
• BTech Students (both full-time and remote students):
• DTU: Maitree Leekha, Mansi Agarwal, Shivang Chopra, Rohan Mishra, Himanshu, etc.
• NSUT: Ramit Sahwney, Puneet Mathur, Avinash Swaminathan, Rohit Jain, Hritwik, etc.
• IIT: Pradyumn Gupta, Abhigyan Khaund, Palak Goenka, Amit Jindal, Prateek Manocha, etc.
• IIIT: Vedant Bhatia, Raj K Gupta, Shagun Uppal, Osheen Sachdev, Siddharth Dhawan, etc.
• Alumni (Placements, Internship, MS Admissions):
• Companies: Google, Microsoft, Amazon, Adobe, Tower Research, Walmart, Qualcomm, Goldman Sachs, Bloomberg, IBM Research, Wadhwani AI, Samsung Research, etc.
• Academia: CMU, Columbia University, University of Pennsylvania, University of Maryland, University of Southern California, Erasmus Mundus, University of Virginia, Georgia Tech, etc.
Collaborators
• Prof Roger Zimmermann, National University of Singapore, Singapore
• Prof Changyou Chen, State University of New York at Buffalo, USA
• Prof Mohan Kankanhalli, National University of Singapore, Singapore
• Prof Ponnurangam Kumaraguru (PK), IIIT Delhi, India
• Dr. Amanda Stent, Bloomberg, New York, USA
• Dr. Debanjan Mahata, Bloomberg, New York, USA
• Prof. Rada Mihalcea, University of Michigan, USA
• Prof. Shin'ichi Satoh, National Institute of Informatics, Japan
• Prof. Jessy Li, University of Texas at Austin, USA
• Prof. Huan Liu, Arizona State University, USA
• Prof. Naimul Khan, Ryerson University, Canada
• Prof. Diyi Yang, Georgia Institute of Technology, USA
• Prof Payman Vafaee, Columbia University, USA
• Prof Cornelia Caragea, University of Illinois at Chicago, USA
• Dr. Mika Hama, SLTI, USA, and many more...
Research (AI for Social Good)
• NLP and Multimedia based systems for society (education, healthcare, etc.)
• Automatic speech recognition (ASR) for different domains and accents (e.g., Indian, African)
• Visual speech recognition/reconstruction (VSR) such as lipreading and speech reconstruction
• Hate speech and malicious user detection in code-switched scenarios on social media
• Mental health problems such as suicidal ideation and depression detection on social media
• Building multimodal information retrieval and information extraction systems
• Knowledge graph construction for different domains, e.g., medical, e-commerce, defence, etc.
• Automated systems for number plate and damage detection, car insurance claim, e-challan, etc.
• Multimodal sentiment analysis and its applications in education, policy making, etc.
• Detecting, analyzing, and recommending advertisements in video streams
• Fake news detection and propagation, suspect detection, personality detection, etc.
• Publications (but are not limited to)
• AAAI, CIKM, ACL, EMNLP, WSDM, COLING, ACM Multimedia, ICDM, INTERSPEECH, WWW, ICASSP, WACV, BigMM, IEEE ISM, NAACL, ACM Hypertext, ACM SIGSPATIAL, Elsevier KBS, IEEE Intelligent Systems, IEEE MIPR, ACM MM Asia, AACL, Springer book chapters, etc.
Research (AI for Social Good)
• Awards (but are not limited to)
• Won the outstanding paper award at COLING 2020
• Got selected to Heidelberg Laureate Forum (HLF) in 2018, 2019, 2020
• Best student poster in AAAI 2019, Honolulu, Hawaii, USA
• Best poster and best industrial paper in IEEE BigMM 2019, Singapore
• Winner of the ACM INDIA Chapters Technology Solution Contest 2019 in Jaipur, India
• Won the honorable mention award in ICDM Knowledge Graph Contest 2019 in Beijing, China
• Won the best poster runner-up award at IEEE ISM 2018 conference in Taichung, Taiwan
• Skills, Tools, and Frameworks (but are not limited to)
• Natural Language Processing, Image Processing, Speech Processing
• Multimodal Computing
• Python, JavaScript, Java
• AI/ Machine Learning/ Deep Learning
• Tensorflow, PyTorch, Keras, etc.
Sponsors:
References
1. Conversational Systems and the Marriage of Speech & Language by Mari Ostendorf (University of Washington)
2. Speech 101 by Robert Moore The University of Sheffield
3. https://www.youtube.com/watch?v=PWGeUztTkRA&ab_channel=Mark_Mitton
4. The Two Ronnies Show
5. Preliminaries to a Theory of Speech Disfluencies (Elizabeth Shriberg, 1994)
6. A Short Analysis of Discourse Coherence (Wang and Guo, 2014)
7. A. Nowosielski, “Two-letters-key keyboard for predictive touchless typing with head movements,” 07 2017, pp. 68–79
8. J. Tu, H. Tao, and T. Huang, “Face as mouse through visual face tracking,” Comput. Vis. Image Underst., vol. 108, no. 1–2, pp. 35–40, Oct. 2007. [Online]. Available: https://doi.org/10.1016/j.cviu.2006.11.007
9. “3D head pose estimation and camera mouse implementation using a monocular video camera,” Signal, Image and Video Processing, vol. 9, 01 2012.
10. A. Markussen, M. R. Jakobsen, and K. Hornbæk, “Vulture: A mid-air word-gesture keyboard,” in CHI ’14, 2014.
11. C. Yu, Y. Gu, Z. Yang, X. Yi, H. Luo, and Y. Shi, “Tap, dwell or gesture? exploring head-based text entry techniques for hmds,” in CHI ’17, 2017.
12. Zhao G, Barnard M & Pietikäinen M (2009) Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia 11(7):1254-1265.
13. Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., Dahlgren, N. & Zue, V. (1992). TIMIT Acoustic-phonetic Continuous Speech Corpus. Linguistic Data Consortium.
1. Gandharv Mohan, MIDAS Lab IIITD, Btech 2021
2. Akash Sharma, MIDAS Lab IIITD, Btech 2021
3. Rajaswa Patil, MIDAS Lab IIITD, Btech 2021
4. Avyakt Gupta, MIDAS Lab IIITD, Btech 2021
5. Gaurav Aggarwal, MIDAS Lab IIITD, Btech 2021
6. Devansh Batra, MIDAS Lab IIITD, Btech 2021
7. Aradhya Neeraj Mathur, MIDAS Lab IIITD, PhD Student
8. Maitree Leekha, MIDAS Lab IIITD, Btech 2020
9. Jainendra Shukla, HMI Lab IIITD, Assistant Professor
10. Vidit Jain, MIDAS and HMI Lab, Btech 2021
11. Rajesh Kumar, Haverford College USA, Assistant Professor
12. Shivam, Akash, Mohit, Vishaal, Mansi, Aayush, Meet, Suraj, and many other MIDAS members
Acknowledgements

Marriage of speech, vision and natural language processing

  • 1. Marriage of Computer Vision, Speech and Natural Language - Yaman Kumar (MIDAS Lab-IIITD, SUNY at Buffalo) - Rajiv Ratn Shah (MIDAS Lab-IIITD)
  • 2. What is Speech? Text Part of Speech Vision Part of Speech Aural Part of Speech
  • 4. Marriage of Speech & Language Conversational speech Information in the acoustic signal beyond the words Interactive nature of conversations
  • 5. Speech is ... more than spoken words • Rich in ‘extra-linguistic’ information • breathing noises • lip-smacks • Hand movements • Facial Expressions • Rich in ‘para-linguistic’ information • Personality • Attitude • Emotion • Individuality
  • 6. Some Examples • Disfluency • I am uh uh very …. I am very excited to see you • He is my em …… Yaman is my best friend • Intonation and Stress 1. *This* is my laptop (and not that) • This is *my* laptop (and not yours) • This is my *laptop* (and not book) 2. He found it on the street? • And in reply, He found it on the street • No punctuation and very open grammar • ASR errors
  • 7. • to the listener • a child (‘parentese’) • a non-native person • a hearing-impaired individual • an animal • a machine(!) • to the cognitive load • Interaction with other tasks • stressful/emotional situations • to the environment • noise • reverberation Speech is Adaptive • to the task • Casual conversation • Reading out loud • Public speaking
  • 8. Content • Content in spoken medium is the "information or experiences directed towards end-users or an audience". Why is Content Important? Whom do you prefer? • A speaker with style, elegance, panache but with a weak content (talking too much off-topic, not providing enough details about facts). OR • An average speaker but with a good content (ideas stick to the main topic, provides interesting/required background information).
  • 9. Content What defines a Good Content? ( High Relevance and High Sufficiency ) Relevance • Related to the topic • Connected to the prompt in a bigger story. • No Unwanted information or off topic. Sufficiency • Adequate details (which are also relevant) • All points covered • No Missing parts
  • 10. Response: IVE ACCOMPLISHED UM MANY THINGS IN LIFE ONE OF THEM IS IS BEING A PHILANTHROPIST IVE HELPED A LOT OF PEOPLE MOST SPECIALLY CHILDREN I GO TO SOME UM POOR AREAS AND WE TEACH LIKE THOSE CHILDREN SOME KNOWLEDGE THAT THEY DONT KNOW YET LIKE FOR EXAMPLE IM GOING TO BE THEIR TEACHER AND I I INFORM THEM ALL THE THINGS LIKE UM WHAT TO WRITE HOW TO READ HOW TO DESCRIBE SOMETHING AND THIS IS REALLY IMPORTANT IN MY LIFE BECAUSE BEING A TEACHER IS REALLY GOOD FOR ME AND I THINK IT WILL REALLY HELP ME GROW MY ABILITY TO HELP PEOPLE MOST SPECIALLY CHILDREN Response: IT IS IMPORTANT TO CHOOSE WISELY FOR YOUR CAREER AND ITS ALSO IMPORTANT THAT YOU CHOOSE THAT CAREER BECAUSE UH THIS IS YOUR PASSION AND THIS IS YOUR REALLY ONE JOB AND BECAUSE IF YOU DONT WANT THAT JOB OR CAR CAREER BUT YOU CHOOSE IT UH YOU WILL AT THE END OF THE DAY YOU WILL NOT BE UH MOTIVATED TO WORK WITH IT AND YOU WILL NOT BE YOU ARE UH THERES A TENDENCY THAT YOU WILL NOT ACHIEVE YOUR GOAL OR DESIRE IN YOUR IN THAT CAREER AND YOURE NOT BE WILL BE SUCCESSFUL IN THAT CAREER IT IS IMPORTANT TO CHOOSE WISELY YOUR CAREER AND UH CONSIDER THAT THIS IS YOUR UH THIS IS WHAT YOU REALLY WANT AND THIS IS YOUR PASSIONS AND ARE IT IS UH IF YOU CHOOSE YOUR CAREER BE SURE YOU ARE ENJOYING IT NOT DOING IT Relevance: High Speaker sticks to the things asked in prompt. (Being philanthropist or teacher as accomplishments, importance of the same.) Sufficiency: High Explains in detail about how he helped children as a teacher, how did he help and importance of the same Relevance: Low Speaker goes too off topic from what is being asked. (About car, being successful, what good career is, instead of talking about accomplishments.) Sufficiency: Low Provides no information that addresses the points in the prompt. Prompt: You have to narrate to a career advisor 1 thing you accomplished which you are proud of and how it was important for you.
  • 11. D…Di….Disfluencies • Interruptions in the smooth flow of speech • These interruptions often occur in spoken communication. They usually help the speakers to buy more time while they express their thought process. • Reparandum (RM) - Refers to the unintended and unnecessary part of the disfluency span (This span can be deleted in order to obtain fluency) • Interregnum (IM) - Refers to the part that lies between RM and RR. (This span helps the speaker to fill the intermediate gap) • Repair (RR) - Refers to the corrected span of the RM. (This span should maintain the context of RM)
  • 12. D…Di….Disfluencies • Examples • Filled pauses : "This is a uhmm … good example" • Discourse Markers : " It's really nice to .. you know .. play outside sometimes." • Self-Correction : " So we will... we can go there." • Repetitions : "The... the... the decision was not mine to make" • Restart : "We would like to eat ... let’s go to the park" • Why can't we recognize these disfluencies solely by looking at the words ? 🤔 • Consideration of the audio helps in understanding the intention of speaker and hence deciding if there is a disfluency or not. • Can get confused with some fluently done repetitions - "Superman is the most most most powerful superhero ! " • Can also get confused from various other interruptions like non-verbal sounds and even silence !
  • 13.
  • 14. Pronunciation /prəˌnʌnsɪˈeɪʃ(ə)n/ Mispronunciation Detection: Problem where the perceived pronunciation doesn't match with intended pronunciation, but we can understand the meaning. Example. Pronunciation of word park. • Phoneme Recognition Problem: State of the art phoneme (sounds in a language) recognition systems have a phoneme error rate of 18% for native speech data. • Non-native accent: Phonemes might be recognized correctly but acoustic models (models used to detect phonemes) are often confused by non-native speech. Some phonemes (sounds) exist in the native language which do not have an alternative in the non-native language. E.g. Je sound in French has no English mapping which confuses the acoustic model to predict wrong sequences of phonemes.
  • 15. Pronunciation Intelligibility: There is a lot of difference between the intended speech and spoken speech. Example: Pronunciation of word mEssage is incorrect. A good ASR system will perceive it as mAssage and rate it correctly pronounced. However, the user meant to say mEssage.
  • 16. Discourse Coherence • Discourse is a coherent combination of spoken (or written) utterances communicated between a speaker (or writer) and a listener (or reader). • Discourse is a PRODUCT? ✍️ (linguistic perspective) • Discourse is a PROCESS!! 🤔🤔 (cognitive perspective) • Discourse coherence is the semantic relationship between propositions or communicative events in discourse. • It is a feature of the perception 👀👂 of discourse rather than the content of discourse itself.
  • 17. Discourse Coherence Discourse as Product ✍ • A well written speech. • How the discourse content is structured and organized by the speaker. • Cohesion in text, use of discourse markers, connectives, etc. • How readable is the text, how complex is the text, etc. Discourse as Process 🤔 • A well delivered speech. • How the discourse content is delivered efficiently to the listener. • Prosodic variation, use of stress, intonation, pauses, etc. • How intelligible is the speech, how focused is the listener, etc.
  • 18. Prosody • Prosodic features span... • several speech segments • several syllables • whole utterances • Such ‘suprasegmental’ behaviour includes ... • lexical stress (Prominence of Syllables) • lexical tone (Pitch pattern to distinguish words) • rhythmic stress (Emphasis) • intonation (Difference of Expressive meaning)
  • 19. It’s not what you say, but how you say it.
  • 20. The Two Ronnies - Four Candles vs Fork Handles Speech is Ambiguous
  • 21. Silent Speech is Even More Ambiguous • Elephant Juice vs I Love You • Million vs Billion • Pet vs Bell vs Men Speak Them To Yourself! Your lip movements are exactly the same!
  • 22.
  • 23. Exploring Semi-Supervised Learning for Predicting Listener Backchannels Accepted at CHI’21! Vidit Jain, Maitree Leekha, Jainendra Shukla, Rajiv Ratn Shah
  • 24. Introduction ● Developing human-like conversational agents is important! ○ Applications in education and healthcare ● Challenge: how to make them seem natural? ○ Human conversations are complex! ● Listener backchannels: a crucial element of human conversation: ○ Listener’s “regular” feedback to the speaker, indicating presence ○ Verbal: e.g., short utterances ○ Non-verbal: e.g., head shake, nod, smile etc. ● We focus on modelling these backchannels as a step towards natural Human Robot Interactions (HRIs).
  • 25. Research Questions Key Research Gaps: ● Prior works [1, 2 and more] relied on large amounts of manually annotated data to train listener backchannel prediction (LBP) models ○ This is expensive in terms of man hours ● In addition, all previous works have focused on only English conversations Major Contributions: ● Validating the use of semi-supervised techniques for LBP ○ Models using only 25% of manual annotation performed at par! ● Unlike past works, we use Hindi conversations [1] Park, Hae Won, et al. "Telling stories to robots: The effect of backchanneling on a child's storytelling." 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 2017. [2] Goswami, Mononito, Minkush Manuja, and Maitree Leekha. "Towards Social & Engaging Peer Learning: Predicting Backchanneling and Disengagement in Children." arXiv preprint arXiv:2007.11346 (2020).
  • 26. Dataset ● We use the multimodal Hindi based Vyaktitv dataset [3] ○ 25 conversations, each ~16 min long ○ Video and audio feeds available for each participant (50 recordings) ● Annotations Done: ○ 3 annotators ○ Signal (kappa): Nod (0.7), Head-shake (0.6), Mouth (0.6), Eyebrow (0.5), Utterances (0.5) ● Features Extracted: ○ OpenFace - visual features: 18 facial action units (FAU), gaze velocities & accelerations, translational and rotational head velocities & accelerations, blink rate, pupil location, and smile ratio ○ pyAudioAnalysis - audio features: voice activity, MFCC, F0, energy [3] Khan, Shahid Nawaz, et al. "Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset for Personality Assessment." 2020 IEEE Sixth International Conference on Multimedia Big Data (BigMM). IEEE, 2020.
  • 27. System Architecture Methodology: (i) Semi-supervised learning for identifying backchannels and type of signals emitted using a subset of labeled data. (ii) Learning to predict these instances and signals using the speaker's context.
  • 28. Task Formulations Identification Given a listener’s audio and video feeds, identify when he backchannels? These are the true labels in the prediction task We use semi-supervision here to generate these pseudo-labels (instance & type) Prediction Given a speaker’s context (~3-7 sec long), predict whether the listener will backchannel immediately after it. Use only speaker’s features to predict the instance & type of backchannel (verbal/visual)
  • 29. Key Findings ● The semi-supervised process was able to identify backchannel instances and signal types very well ○ Respective accuracies: 0.90 (ResNet) & 0.85 (RF), with only 25% of the manual annotations as seed! ● Comparing prediction models trained using manually annotated labels vs semi-supervised pseudo-labels: ○ Using semi-supervision, we reach ~94% of the baseline performance! ● Qualitative Study: The majority of participants could not distinguish between the two prediction models!
  • 30. Demo Our final system trained using semi-supervision
  • 31. Lip Movement as Inputs for Information Retrieval https://www.aaai.org/ojs/index.php/AAAI/article/view/5649 https://www.aaai.org/ojs/index.php/AAAI/article/view/5148 https://www.aaai.org/ojs/index.php/AAAI/article/view/4106 https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3269.pdf https://www.isca-speech.org/archive/Interspeech_2019/abstracts/3273.html https://www.youtube.com/watch?v=3BqQQnTfnlE&list=PL9rvax0EIUA6PDoiDT2Wp462GsTnikrvY
  • 33. Let’s put your lip reading abilities to test (SHOW OF HANDS)
  • 40. Let’s jump to MobiVSR difficulty level. Your options: ABOUT ABSOLUTELY ABUSE ACCESS ACCORDING ACCUSED ACROSS ACTION ACTUALLY AFFAIRS AFFECTED AFRICA AFTER AFTERNOON AGAIN AGAINST AGREE AGREEMENT AHEAD ALLEGATIONS ALLOW ALLOWED ALMOST ALREADY ALWAYS AMERICA AMERICAN AMONG AMOUNT ANNOUNCED ANOTHER ANSWER ANYTHING AREAS AROUND ARRESTED ASKED ASKING ATTACK ATTACKS AUTHORITIES BANKS BECAUSE BECOME BEFORE BEHIND BEING BELIEVE BENEFIT BENEFITS BETTER BETWEEN BIGGEST BILLION BLACK BORDER BRING BRITAIN BRITISH BROUGHT BUDGET BUILD BUILDING BUSINESS BUSINESSES CALLED CAMERON CAMPAIGN CANCER CANNOT CAPITAL CASES CENTRAL CERTAINLY CHALLENGE CHANCE CHANGE CHANGES CHARGE CHARGES CHIEF CHILD CHILDREN CHINA CLAIMS CLEAR CLOSE CLOUD COMES COMING COMMUNITY COMPANIES COMPANY CONCERNS CONFERENCE CONFLICT CONSERVATIVE CONTINUE CONTROL COULD COUNCIL COUNTRIES COUNTRY COUPLE COURSE COURT CRIME CRISIS CURRENT CUSTOMERS DAVID DEATH DEBATE DECIDED DECISION DEFICIT DEGREES DESCRIBED DESPITE DETAILS DIFFERENCE DIFFERENT DIFFICULT DOING DURING EARLY EASTERN ECONOMIC ECONOMY EDITOR EDUCATION ELECTION EMERGENCY ENERGY ENGLAND ENOUGH EUROPE EUROPEAN EVENING EVENTS EVERY EVERYBODY EVERYONE EVERYTHING EVIDENCE EXACTLY EXAMPLE EXPECT EXPECTED EXTRA FACING FAMILIES FAMILY FIGHT FIGHTING FIGURES FINAL FINANCIAL FIRST FOCUS FOLLOWING FOOTBALL FORCE FORCES FOREIGN FORMER FORWARD FOUND FRANCE FRENCH FRIDAY FRONT FURTHER FUTURE GAMES GENERAL GEORGE GERMANY GETTING GIVEN GIVING GLOBAL GOING GOVERNMENT GREAT GREECE GROUND GROUP GROWING GROWTH GUILTY HAPPEN HAPPENED HAPPENING HAVING HEALTH HEARD HEART HEAVY HIGHER HISTORY HOMES HOSPITAL HOURS HOUSE HOUSING HUMAN HUNDREDS IMMIGRATION IMPACT IMPORTANT INCREASE INDEPENDENT INDUSTRY INFLATION INFORMATION INQUIRY INSIDE INTEREST INVESTMENT INVOLVED IRELAND ISLAMIC ISSUE ISSUES ITSELF JAMES JUDGE JUSTICE KILLED KNOWN LABOUR LARGE LATER LATEST LEADER LEADERS LEADERSHIP LEAST LEAVE LEGAL LEVEL LEVELS LIKELY LITTLE LIVES LIVING LOCAL LONDON LONGER LOOKING
  • 41. MAJOR MAJORITY MAKES MAKING MANCHESTER MARKET MASSIVE MATTER MAYBE MEANS MEASURES MEDIA MEDICAL MEETING MEMBER MEMBERS MESSAGE MIDDLE MIGHT MIGRANTS MILITARY MILLION MILLIONS MINISTER MINISTERS MINUTES MISSING MOMENT MONEY MONTH MONTHS MORNING MOVING MURDER NATIONAL NEEDS NEVER NIGHT NORTH NORTHERN NOTHING NUMBER NUMBERS OBAMA OFFICE OFFICERS OFFICIALS OFTEN OPERATION OPPOSITION ORDER OTHER OTHERS OUTSIDE PARENTS PARLIAMENT PARTIES PARTS PARTY PATIENTS PAYING PEOPLE PERHAPS PERIOD PERSON PERSONAL PHONE PLACE PLACES PLANS POINT POLICE POLICY POLITICAL POLITICIANS POLITICS POSITION POSSIBLE POTENTIAL POWER POWERS PRESIDENT PRESS PRESSURE PRETTY PRICE PRICES PRIME PRISON PRIVATE PROBABLY PROBLEM PROBLEMS PROCESS PROTECT PROVIDE PUBLIC QUESTION QUESTIONS QUITE RATES RATHER REALLY REASON RECENT RECORD REFERENDUM REMEMBER REPORT REPORTS RESPONSE RESULT RETURN RIGHT RIGHTS RULES RUNNING RUSSIA RUSSIAN SAYING SCHOOL SCHOOLS SCOTLAND SCOTTISH SECOND SECRETARY SECTOR SECURITY SEEMS SENIOR SENSE SERIES SERIOUS SERVICE SERVICES SEVEN SEVERAL SHORT SHOULD SIDES SIGNIFICANT SIMPLY SINCE SINGLE SITUATION SMALL SOCIAL SOCIETY SOMEONE SOMETHING SOUTH SOUTHERN SPEAKING SPECIAL SPEECH SPEND SPENDING SPENT STAFF STAGE STAND START STARTED STATE STATEMENT STATES STILL STORY STREET STRONG SUNDAY SUNSHINE SUPPORT SYRIA SYRIAN SYSTEM TAKEN TAKING TALKING TALKS TEMPERATURES TERMS THEIR THEMSELVES THERE THESE THING THINGS THINK THIRD THOSE THOUGHT THOUSANDS THREAT THREE THROUGH TIMES TODAY TOGETHER TOMORROW TONIGHT TOWARDS TRADE TRIAL TRUST TRYING UNDER UNDERSTAND UNION UNITED UNTIL USING VICTIMS VIOLENCE VOTERS WAITING WALES WANTED WANTS WARNING WATCHING WATER WEAPONS WEATHER WEEKEND WEEKS WELCOME WELFARE WESTERN WESTMINSTER WHERE WHETHER WHICH WHILE WHOLE WINDS WITHIN WITHOUT WOMEN WORDS WORKERS WORKING WORLD WORST WOULD WRONG YEARS YESTERDAY YOUNG
  • 43. Speech as Inputs for Information Retrieval
  • 45. Note: During speech reconstruction, the sex of the speaker is preserved Demonstration: English Video to Speech Reconstruction
  • 46. Demo: Chinese Video to Speech Reconstruction
  • 51. GAN output for an English phrase
  • 52. Viseme concatenation TC GAN Generated Output with Inter-Visemes Output for an English phrase, Good Bye
  • 53. GAN output for a Hindi phrase
  • 54. Viseme concatenation TC GAN Generated Output with Inter-Visemes Output for a Hindi phrase, Aap Kaise hai (How are you)
  • 55. LIFI: Towards Linguistically Informed Frame Interpolation Aradhya Neeraj Mathur¹, Devansh Batra², Yaman Kumar¹, Rajiv Ratn Shah¹, Roger Zimmermann³ Indraprastha Institute of Information Technology Delhi, India¹ Netaji Subhas University of Technology, Delhi² National University of Singapore (NUS)³
  • 56. Motivation • Speech videos are extremely common across the internet (lectures, YouTube videos and even video calling apps), but no video interpolation method pays heed to the nuances of speech videos. • The visual modality of speech is complicated. While uttering a single sentence, our lips cycle through dozens of visemes. • Figure: first 30 frames of a speaker speaking the sentence "I don't exactly walk around with a hundred and thirty five million dollars in my wallet". Notice the rich lip movement with opening and closing of the mouth.
  • 57. Motivation We try to reconstruct this speech video by interpolating the intermediate frames from the first and last frames using state-of-the-art models. Expected: original frames (with rich mouth movements). Observed: interpolated frames (with virtually no mouth movements). Some surprising metrics: L1 = 0.0498, MSE = 0.0088, SSIM = 0.9521, PSNR = 20.5415. These scores are surprisingly good even though the mouth barely moves, which means that we need better evaluation criteria for interpolation or reconstruction of speech videos.
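To see why whole-frame metrics can look good while the mouth is wrong, here is a small sketch with toy numbers. It assumes pixel intensities in [0, 1] and a hypothetical 100-pixel frame in which only 5 "mouth" pixels differ; none of these values come from the paper. Under the same [0, 1] assumption, the reported MSE of 0.0088 corresponds to a PSNR of about 10·log10(1/0.0088) ≈ 20.56 dB, close to the reported 20.54.

```python
import math


def l1_error(a, b):
    """Mean absolute difference between two equally sized pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mse(a, b):
    """Mean squared difference between two equally sized pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB for intensities in [0, max_val]."""
    err = mse(a, b)
    return float("inf") if err == 0 else 10 * math.log10(max_val ** 2 / err)


# Toy frame: the interpolated frame misses the open mouth entirely,
# yet only 5% of the pixels are affected.
original = [0.0] * 100
for i in range(45, 50):          # hypothetical mouth region
    original[i] = 0.5
interpolated = [0.0] * 100

frame_l1 = l1_error(original, interpolated)
frame_mse = mse(original, interpolated)
frame_psnr = psnr(original, interpolated)
```

Because the mouth occupies a small fraction of the frame, errors there are diluted by the unchanged background, which is the motivation for a region-specific criterion.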
  • 58. Proposed Work 1. Challenge Datasets for Speech Video Reconstruction (based on LRS3-TED) Guess the words spoken? "Well the short answer to the question is no, it's not the same thing" Corruption types: Random Frame Corruption (40%); Extreme Sparsity Corruption (75%); Prefix Corruption; Suffix Corruption
  • 59. Proposed Work 1. Challenge Datasets for Speech Video Reconstruction (based on LRS3-TED) Visemic Corruption (visemes of a particular type are corrupted and require regeneration); Intra-Word Corruption (corruption of frames within the occurrence of a long word); Inter-Word Corruption (corruption of frames across word boundaries)
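The random, prefix, and suffix corruption types listed in the challenge datasets can be sketched as boolean frame masks, where `True` marks a frame that is dropped and must be regenerated. This is a minimal illustration, not the authors' data-generation code; the function names, the fixed seed, and the prefix/suffix fractions in the usage lines are assumptions (the slides give 40% for random and 75% for extreme sparsity only).

```python
import random


def random_mask(n_frames, frac, seed=0):
    """Corrupt a random subset of frames (e.g. frac=0.40 or 0.75)."""
    rng = random.Random(seed)
    k = round(n_frames * frac)
    chosen = set(rng.sample(range(n_frames), k))
    return [i in chosen for i in range(n_frames)]


def prefix_mask(n_frames, frac):
    """Corrupt a consecutive block at the start of the clip."""
    k = round(n_frames * frac)
    return [i < k for i in range(n_frames)]


def suffix_mask(n_frames, frac):
    """Corrupt a consecutive block at the end of the clip."""
    k = round(n_frames * frac)
    return [i >= n_frames - k for i in range(n_frames)]


mask_random = random_mask(30, 0.40)
mask_prefix = prefix_mask(10, 0.3)
mask_suffix = suffix_mask(10, 0.3)
```

Visemic, intra-word, and inter-word corruption would additionally need frame-level viseme or word alignments, which are not sketched here.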
  • 60. Proposed Work 2. Visemic reconstruction with ROI Loss unit: a modified FCN3D with an ROI extraction unit to calculate the ROI loss. Instead of training the reconstruction network with only the L1 loss between reconstructed and original images, we introduce an ROI loss which measures the similarity between the visemic regions of interest of the observed and generated facial images. To accomplish this, we develop an ROI unit as shown on the left.
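The idea of adding a region-specific term to the full-frame L1 loss can be sketched as follows. The slide does not give the exact form of the ROI loss, so this sketch assumes an L1 distance over a cropped mouth box with a hypothetical weight `roi_weight`; the box coordinates, the weight, and all function names are illustrative and not the paper's formulation.

```python
def l1_loss(a, b):
    """Mean absolute difference between two images given as 2D lists."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return sum(abs(x - y) for x, y in zip(flat_a, flat_b)) / len(flat_a)


def crop(img, box):
    """Cut out the region box = (top, left, height, width)."""
    top, left, height, width = box
    return [row[left:left + width] for row in img[top:top + height]]


def total_loss(pred, target, roi_box, roi_weight=1.0):
    """Full-frame L1 plus a weighted L1 restricted to the mouth region."""
    frame_term = l1_loss(pred, target)
    roi_term = l1_loss(crop(pred, roi_box), crop(target, roi_box))
    return frame_term + roi_weight * roi_term


# Toy 4x4 frame: the prediction is wrong only inside the 1x2 mouth box.
target = [[0.0] * 4 for _ in range(4)]
pred = [[0.0] * 4 for _ in range(4)]
pred[2][1] = pred[2][2] = 1.0
loss = total_loss(pred, target, roi_box=(2, 1, 1, 2))
```

In the toy case the frame term is only 0.125 while the ROI term is 1.0, so mouth errors are no longer diluted by the rest of the face.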
  • 61. Proposed Work Key Findings We evaluate a Fully Convolutional Network (FCN3D), a convolutional bidirectional LSTM, and the original FCN3D network after adding the ROI unit and visemic loss during training. We observe: 1. Different networks perform differently on different types of corruption. 2. While SuperSloMo performs very well on random frame corruption, it performs much worse on other types of corruption. 3. As expected, a sequential LSTM-based generator works much better than a fully convolutional network when there are corruptions in consecutive frames, as shown by prefix and suffix corruption. 4. Most importantly, adding an ROI loss also helps a network perform better on all forms of corruption and on non-ROI-based metrics, as shown by the results for FCN3D+ROI. Table captions: Performance of different models over datasets containing random, prefix and suffix corruptions; Performance of different models over datasets containing corruptions on different visemes.
  • 62. Touchless Typing Using Head Movement-based Gestures Shivam Rustagi¹, Aakash Garg¹, Pranay Raj Anand², Rajesh Kumar³, Yaman Kumar², Rajiv Ratn Shah² Delhi Technological University, India¹ Indraprastha Institute of Information Technology Delhi, India² Haverford College, USA³
  • 63. Motivation Traditional Input Devices. Diseases which render these devices useless: ● Upper limb paralysis ● Deformed limb ● Damaged fingers/hand ● Various other disabilities
  • 65. Related Work [1] A. Nowosielski, “Two-letters-key keyboard for predictive touchless typing with head movements” [2] J. Tu, H. Tao, and T. Huang, “Face as mouse through visual face tracking” [3] M. Nabati and A. Behrad, “3d head pose estimation and camera mouse implementation using a monocular video camera”
  • 66. Related Work MID AIR TOUCHLESS TYPING TECHNIQUES Using fingers: [4] A. Markussen, M. R. Jakobsen, and K. Hornbæk, “Vulture: A mid-air word-gesture keyboard” Using head: [5] C. Yu, Y. Gu, Z. Yang, X. Yi, H. Luo, and Y. Shi, “Tap, dwell or gesture? exploring head-based text entry techniques for hmds”
  • 67. Proposed Work For the 10,000 most common English words there are 8,529 unique cluster sequences, so each sequence maps to 1.17 different words on average. Once we predict the cluster sequence, it can therefore be translated to 1-2 valid words on average.
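The mapping from words to cluster sequences, and the resulting ambiguity, can be sketched as below. The actual key-to-cluster assignment of the color-coded keyboard is not given in this section, so the `CLUSTERS` layout here is purely hypothetical, as are the function names and the toy vocabulary. The slide's figure follows the same arithmetic: 10,000 words / 8,529 sequences ≈ 1.17 words per sequence.

```python
from collections import defaultdict

# Hypothetical grouping of QWERTY letters into 8 clusters (illustration only).
CLUSTERS = {
    1: "qwe", 2: "rty", 3: "uiop",
    4: "asd", 5: "fgh", 6: "jkl",
    7: "zxcv", 8: "bnm",
}
LETTER_TO_CLUSTER = {ch: cid for cid, letters in CLUSTERS.items() for ch in letters}


def cluster_sequence(word):
    """Translate a word into the sequence of clusters its letters fall in."""
    return tuple(LETTER_TO_CLUSTER[ch] for ch in word.lower())


def build_index(vocab):
    """Map each cluster sequence to every vocabulary word that produces it."""
    index = defaultdict(list)
    for word in vocab:
        index[cluster_sequence(word)].append(word)
    return index


vocab = ["hat", "fat", "work", "live"]
index = build_index(vocab)
words_per_sequence = len(vocab) / len(index)
candidates = index[cluster_sequence("hat")]
```

With this toy layout "hat" and "fat" share a sequence, so a predicted sequence yields a short candidate list that a language model or the user could disambiguate.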
  • 68. Data Collection: Setup (Equipment | Configuration | Purpose) ● Monitor and keyboard | 17-inch monitor and standard keyboard | The color-coded QWERTY keyboard was displayed on the monitor; the keyboard was used to start and stop recording. ● Cameras (on tripods) | 3 Samsung M10 mobile cameras recording video at 30 fps, 1920 x 1080 resolution; all 3 mobiles had the OpenCamera app installed; 1 Samsung M10 mobile with the MUSE2 app | The 3 cameras were kept at angles of -45, 0 and 45 degrees respectively to record the head movements. ● MUSE2 headband | Sensors such as accelerometer and gyroscope | The sensors recorded the acceleration and rotation of the head. ● Moderator’s laptop | Standard | The Python script on the laptop was responsible for starting and stopping the cameras simultaneously. *Note: For our research we have used only the central view (Camera-2) recordings.
  • 69. Data Collection: Description ❑ Total number of users who volunteered = 25 (16 male; 9 female; data from 3 users discarded on manual inspection) ❑ Each user recorded 3 video samples for each of 35 entries (words: 20, phrases: 10, sentences: 5, as per Table 1) ❑ Total number of video samples = 2310 (22 x 35 x 3) Table 1. List of 20 words, 10 phrases and 5 sentences that were typed by each user; each was repeated 3 times. Words: locate, single, family, would, place, large, work, take, live, box, method, listen, house, learn, come, some, ice, old, fly, leg. Phrases: hello, excuse me, i am sorry, thank you, good bye, see you, nice to meet you, you are welcome, how are you, have a good time. Sentences: i never gave up, best time to live, catch the trade winds, hear a voice within you, he will forget it.
  • 71. Data Collection: Statistics Avg. number of letters per entry by category: Words 4.33, Phrases 10.6, Sentences 18.6 ❏ The words were selected to have proper cluster coverage. ❏ The phrases and sentences were selected from the OuluVS[6] and TIMIT[7] datasets respectively. Fig. Coverage of each cluster across the dataset. Fig. Avg. gestures per minute for each user (avg = 49.26, std = 5.3)
  • 72. HopeNet Architecture The proposed method is based on a CNN-RNN architecture. The feature extractor part, as shown above, is based on the HopeNet architecture, which predicts the yaw, pitch and roll features for the input image. The network is trained using a multi-task classification scheme. We utilize the available model pretrained on large-pose face images from the 300W dataset.
  • 73. Working of HopeNet HopeNet output visualized on a user. The three vectors are constructed from the Euler angles (features) predicted by the network.
  • 74. CNN-RNN architecture The features from HopeNet are passed into a multi-layered BiGRU network, which is then trained using a CTC loss function. During the inference phase we used beam search to decode the cluster sequence.
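How a CTC-trained network's per-frame outputs turn into a cluster sequence can be illustrated with greedy (best-path) decoding: take the most likely class per frame, merge consecutive repeats, and drop blanks. Note that the slide states beam search was used at inference; greedy decoding is shown here only as a simpler stand-in. The blank index of 0 and the toy probabilities are assumptions.

```python
def ctc_greedy_decode(frame_probs, blank=0):
    """Best-path CTC decoding.

    frame_probs: list of per-frame probability lists over classes,
                 where index `blank` is the CTC blank symbol.
    Returns the decoded label sequence (repeats merged, blanks removed).
    """
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded = []
    previous = None
    for label in best_path:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Toy output over classes [blank, cluster 1, cluster 2] for 6 frames.
toy_probs = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
]
decoded = ctc_greedy_decode(toy_probs)
```

The blank between the two runs of cluster 1 is what lets the decoder output the same cluster twice in a row. Beam search keeps several candidate paths instead of one and can recover sequences that the greedy path misses.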
  • 76. Results The method is evaluated in two scenarios: ● Inter-user: Training on user set S1 and testing on user set S2, such that S1 and S2 are mutually exclusive. Cluster sequences are kept the same for training and testing. ● Intra-user: For every user, i.e. the set S = S1 ∪ S2, we record 3 samples per sequence. Two samples are used for training and testing is done on the 3rd sample.
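The two evaluation scenarios amount to two different ways of splitting the same recordings. A minimal sketch follows, assuming each recording is a dict with hypothetical `user`, `sequence`, and `take` fields; these field names and the toy data are not from the released dataset.

```python
def inter_user_split(samples, test_users):
    """Train and test on disjoint sets of users (same sequences in both)."""
    train = [s for s in samples if s["user"] not in test_users]
    test = [s for s in samples if s["user"] in test_users]
    return train, test


def intra_user_split(samples, test_take=3):
    """Use all users in both sets; hold out one of the 3 takes per sequence."""
    train = [s for s in samples if s["take"] != test_take]
    test = [s for s in samples if s["take"] == test_take]
    return train, test


# Toy data: 4 users x 2 sequences x 3 takes = 24 recordings.
samples = [
    {"user": user, "sequence": seq, "take": take}
    for user in range(1, 5)
    for seq in ("hello", "work")
    for take in (1, 2, 3)
]
inter_train, inter_test = inter_user_split(samples, test_users={4})
intra_train, intra_test = intra_user_split(samples)
```

The inter-user split measures generalization to unseen people, while the intra-user split measures how consistently the model handles repeated gestures from people it has already seen.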
  • 77. Conclusion and Future Work Our work presents a meaningful way of mapping gestures to character (cluster) sequences, which could be beneficial for people with disabilities. Also, our dataset is publicly available, which could help improve the current system. In the future, the aim is to improve performance by: 1. Using more training data containing a variety of meaningful sequences, and 2. Combining video feeds from multiple cameras, brainwaves recorded via EEG sensors, and the acceleration and rotation of the user’s head recorded via accelerometer and gyroscope. Future work could also integrate the interface with wearable devices and mobile computing. This would enable a newer set of applications, such as browsing from wearable glasses.
  • 78. Information Retrieval through Soft Biometrics https://arxiv.org/pdf/2001.09134.pdf
  • 79. SeekSuspect: Retrieving Suspects from Criminal Datasets using Visual Memory Aayush Jain*, Meet Shah*, Suraj Pandey*, Mansi Agarwal*, Rajiv Ratn Shah, Yifang Yin ● Police maintain a crime dossier system that contains information like photographs and physical details. ● Finding suspects by name is possible, but fails when we only have an informant's visual memory. ● Law enforcement agencies used to hire sketch artists, but they are limited in number. ● We propose SeekSuspect, a fast interactive suspect retrieval system. ● SeekSuspect employs sophisticated deep learning and computer vision techniques ○ to modify the search space and ○ find the envisioned image effectively and efficiently. Figure text: "I do not exactly remember who she was"; "Female, fair, black hair..."; "Is this the person you wish to search for?"; labels: Relevant images, SeekSuspect, Similar images
  • 82. Team • Director: Dr. Rajiv Ratn Shah • PhD Students: Hitkul, Shivangi, Ritwik, Mohit, Yaman, Hemant, Kriti, Astha • MTech Students: Abhishek, Suraj, Meet, Aayush, William, Subhani, etc. • Research Assistants: Manraj, Pakhi, Karmanya, Mehar, Saket, Anuj, etc. • BTech Students (both full-time and remote students): • DTU: Maitree Leekha, Mansi Agarwal, Shivang Chopra, Rohan Mishra, Himanshu, etc. • NSUT: Ramit Sawhney, Puneet Mathur, Avinash Swaminathan, Rohit Jain, Hritwik, etc. • IIT: Pradyumn Gupta, Abhigyan Khaund, Palak Goenka, Amit Jindal, Prateek Manocha, etc. • IIIT: Vedant Bhatia, Raj K Gupta, Shagun Uppal, Osheen Sachdev, Siddharth Dhawan, etc. • Alumni (Placements, Internships, MS Admissions): • Companies: Google, Microsoft, Amazon, Adobe, Tower Research, Walmart, Qualcomm, Goldman Sachs, Bloomberg, IBM Research, Wadhwani AI, Samsung Research, etc. • Academia: CMU, Columbia University, University of Pennsylvania, University of Maryland, University of Southern California, Erasmus Mundus, University of Virginia, Georgia Tech, etc.
  • 83. Collaborators • Prof Roger Zimmermann, National University of Singapore, Singapore • Prof Changyou Chen, State University of New York at Buffalo, USA • Prof Mohan Kankanhalli, National University of Singapore, Singapore • Prof Ponnurangam Kumaraguru (PK), IIIT Delhi, India • Dr. Amanda Stent, Bloomberg, New York, USA • Dr. Debanjan Mahata, Bloomberg, New York, USA • Prof. Rada Mihalcea, University of Michigan, USA • Prof. Shin'ichi Satoh, National Institute of Informatics, Japan • Prof. Jessy Li, University of Texas at Austin, USA • Prof. Huan Liu, Arizona State University, USA • Prof. Naimul Khan, Ryerson University, Canada • Prof. Diyi Yang, Georgia Institute of Technology, USA • Prof Payman Vafaee, Columbia University, USA • Prof Cornelia Caragea, University of Illinois at Chicago, USA • Dr. Mika Hama, SLTI, USA, and many more...
  • 84. Research (AI for Social Good) • NLP and Multimedia based systems for society (education, healthcare, etc.) • Automatic speech recognition (ASR) for different domains and accents (e.g., Indian, African) • Visual speech recognition/reconstruction (VSR) such as lipreading and speech reconstruction • Hate speech and malicious user detection in code-switched scenarios on social media • Mental health problems such as suicidal ideation and depression detection on social media • Building multimodal information retrieval and information extraction systems • Knowledge graph construction for different domains, e.g., medical, e-commerce, defence, etc. • Automated systems for number plate and damage detection, car insurance claim, e-challan, etc. • Multimodal sentiment analysis and its applications in education, policy making, etc. • Detecting, analyzing, and recommending advertisements in video streams • Fake news detection and propagation, suspect detection, personality detection, etc. • Publications (including but not limited to) • AAAI, CIKM, ACL, EMNLP, WSDM, COLING, ACM Multimedia, ICDM, INTERSPEECH, WWW, ICASSP, WACV, BigMM, IEEE ISM, NAACL, ACM Hypertext, ACM SIGSPATIAL, Elsevier KBS, IEEE Intelligent Systems, IEEE MIPR, ACM MM Asia, AACL, Springer book chapters, etc.
  • 85. Research (AI for Social Good) • Awards (including but not limited to) • Won the outstanding paper award at COLING 2020 • Got selected to Heidelberg Laureate Forum (HLF) in 2018, 2019, 2020 • Best student poster at AAAI 2019, Honolulu, Hawaii, USA • Best poster and best industrial paper at IEEE BigMM 2019, Singapore • Winner of the ACM INDIA Chapters Technology Solution Contest 2019 in Jaipur, India • Won the honorable mention award in ICDM Knowledge Graph Contest 2019 in Beijing, China • Won the best poster runner-up award at IEEE ISM 2018 conference in Taichung, Taiwan • Skills, Tools, and Frameworks (including but not limited to) • Natural Language Processing, Image Processing, Speech Processing • Multimodal Computing • Python, JavaScript, Java • AI/Machine Learning/Deep Learning • TensorFlow, PyTorch, Keras, etc.
  • 87. References 1. Conversational Systems and the Marriage of Speech & Language by Mari Ostendorf (University of Washington) 2. Speech 101 by Robert Moore (The University of Sheffield) 3. https://www.youtube.com/watch?v=PWGeUztTkRA&ab_channel=Mark_Mitton 4. The Two Ronnies Show 5. Preliminaries to a Theory of Speech Disfluencies (Elizabeth Shriberg, 1994) 6. A Short Analysis of Discourse Coherence (Wang and Guo, 2014) 7. A. Nowosielski, “Two-letters-key keyboard for predictive touchless typing with head movements,” 07 2017, pp. 68–79. 8. J. Tu, H. Tao, and T. Huang, “Face as mouse through visual face tracking,” Comput. Vis. Image Underst., vol. 108, no. 1–2, pp. 35–40, Oct. 2007. [Online]. Available: https://doi.org/10.1016/j.cviu.2006.11.007 9. M. Nabati and A. Behrad, “3d head pose estimation and camera mouse implementation using a monocular video camera,” Signal, Image and Video Processing, vol. 9, 01 2012. 10. A. Markussen, M. R. Jakobsen, and K. Hornbæk, “Vulture: A mid-air word-gesture keyboard,” in CHI ’14, 2014. 11. C. Yu, Y. Gu, Z. Yang, X. Yi, H. Luo, and Y. Shi, “Tap, dwell or gesture? exploring head-based text entry techniques for hmds,” in CHI ’17, 2017. 12. Zhao G, Barnard M & Pietikäinen M (2009) Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia 11(7):1254-1265. 13. Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., Dahlgren, N. & Zue, V. (1992). TIMIT Acoustic-phonetic Continuous Speech Corpus. Linguistic Data Consortium.
  • 88. Acknowledgements 1. Gandharv Mohan, MIDAS Lab IIITD, BTech 2021 2. Akash Sharma, MIDAS Lab IIITD, BTech 2021 3. Rajaswa Patil, MIDAS Lab IIITD, BTech 2021 4. Avyakt Gupta, MIDAS Lab IIITD, BTech 2021 5. Gaurav Aggarwal, MIDAS Lab IIITD, BTech 2021 6. Devansh Batra, MIDAS Lab IIITD, BTech 2021 7. Aradhya Neeraj Mathur, MIDAS Lab IIITD, PhD Student 8. Maitree Leekha, MIDAS Lab IIITD, BTech 2020 9. Jainendra Shukla, HMI Lab IIITD, Assistant Professor 10. Vidit Jain, MIDAS and HMI Lab, BTech 2021 11. Rajesh Kumar, Haverford College USA, Assistant Professor 12. Shivam, Akash, Mohit, Vishaal, Mansi, Aayush, Meet, Suraj, and many other MIDAS members