Speech is generally considered to have three parts: the aural, the visual, and the textual (a social construct). In recent years, although the field has been moving at a dramatic pace, progress has been made in silos. The primary reason for this is that speech is treated as "spoken text" by practitioners and researchers alike. Most open-source datasets, being far removed from real-world conditions, help spread this false impression. Under these conditions, it is not surprising that common and important features of speech, such as intonation and disfluency, do not get captured. This tutorial aims to provide an appreciation of the "full stack" of speech - the aural, visual, and textual parts - with a special emphasis on aspects that may have significance for current and future research.
5. Speech is ... more than spoken words
• Rich in ‘extra-linguistic’ information
  • breathing noises
  • lip-smacks
  • hand movements
  • facial expressions
• Rich in ‘para-linguistic’ information
  • personality
  • attitude
  • emotion
  • individuality
6. Some Examples
• Disfluency
  • "I am uh uh very …. I am very excited to see you"
  • "He is my em …… Yaman is my best friend"
• Intonation and Stress
  1. Stress placement changes the meaning:
    • *This* is my laptop (and not that one)
    • This is *my* laptop (and not yours)
    • This is my *laptop* (and not my book)
  2. Intonation distinguishes a question from a statement:
    • He found it on the street?
    • And in reply: He found it on the street.
• No punctuation and very open grammar
• ASR errors
7. Speech is Adaptive
• to the listener
  • a child (‘parentese’)
  • a non-native person
  • a hearing-impaired individual
  • an animal
  • a machine(!)
• to the cognitive load
  • interaction with other tasks
  • stressful/emotional situations
• to the environment
  • noise
  • reverberation
• to the task
  • casual conversation
  • reading out loud
  • public speaking
8. Content
• Content in the spoken medium is the "information or experiences directed towards end-users or an audience".
Why is Content Important? Whom do you prefer?
• A speaker with style, elegance, and panache, but with weak content (talking too much off-topic, not providing enough details about facts)?
OR
• An average speaker, but with good content (ideas stick to the main topic; provides interesting/required background information)?
9. Content
What defines good content? (High relevance and high sufficiency)
Relevance
• Related to the topic
• Connected to the prompt in a bigger story
• No unwanted or off-topic information
Sufficiency
• Adequate details (which are also relevant)
• All points covered
• No missing parts
10. Prompt: You have to narrate to a career advisor one thing you accomplished which you are proud of and how it was important for you.
Response 1: IVE ACCOMPLISHED UM MANY THINGS IN LIFE ONE OF THEM IS IS BEING A PHILANTHROPIST IVE HELPED A LOT OF PEOPLE MOST SPECIALLY CHILDREN I GO TO SOME UM POOR AREAS AND WE TEACH LIKE THOSE CHILDREN SOME KNOWLEDGE THAT THEY DONT KNOW YET LIKE FOR EXAMPLE IM GOING TO BE THEIR TEACHER AND I I INFORM THEM ALL THE THINGS LIKE UM WHAT TO WRITE HOW TO READ HOW TO DESCRIBE SOMETHING AND THIS IS REALLY IMPORTANT IN MY LIFE BECAUSE BEING A TEACHER IS REALLY GOOD FOR ME AND I THINK IT WILL REALLY HELP ME GROW MY ABILITY TO HELP PEOPLE MOST SPECIALLY CHILDREN
• Relevance: High. The speaker sticks to what is asked in the prompt (being a philanthropist or teacher as an accomplishment, and its importance).
• Sufficiency: High. The speaker explains in detail how he helped children as a teacher, how he helped, and why it was important.
Response 2: IT IS IMPORTANT TO CHOOSE WISELY FOR YOUR CAREER AND ITS ALSO IMPORTANT THAT YOU CHOOSE THAT CAREER BECAUSE UH THIS IS YOUR PASSION AND THIS IS YOUR REALLY ONE JOB AND BECAUSE IF YOU DONT WANT THAT JOB OR CAR CAREER BUT YOU CHOOSE IT UH YOU WILL AT THE END OF THE DAY YOU WILL NOT BE UH MOTIVATED TO WORK WITH IT AND YOU WILL NOT BE YOU ARE UH THERES A TENDENCY THAT YOU WILL NOT ACHIEVE YOUR GOAL OR DESIRE IN YOUR IN THAT CAREER AND YOURE NOT BE WILL BE SUCCESSFUL IN THAT CAREER IT IS IMPORTANT TO CHOOSE WISELY YOUR CAREER AND UH CONSIDER THAT THIS IS YOUR UH THIS IS WHAT YOU REALLY WANT AND THIS IS YOUR PASSIONS AND ARE IT IS UH IF YOU CHOOSE YOUR CAREER BE SURE YOU ARE ENJOYING IT NOT DOING IT
• Relevance: Low. The speaker goes too far off topic from what is asked (how to choose a career and what being successful means, instead of talking about an accomplishment).
• Sufficiency: Low. The response provides no information that addresses the points in the prompt.
11. D…Di….Disfluencies
• Interruptions in the smooth flow of speech.
• These interruptions often occur in spoken communication. They usually help speakers buy more time while they express their thought process.
• Reparandum (RM): the unintended and unnecessary part of the disfluency span. (This span can be deleted in order to obtain fluency.)
• Interregnum (IM): the part that lies between the RM and the RR. (This span helps the speaker fill the intermediate gap.)
• Repair (RR): the corrected span of the RM. (This span should maintain the context of the RM.)
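To make these definitions concrete, here is one possible annotation of the second example from slide 6 ("He is my em …… Yaman is my best friend"), written as a minimal Python sketch; the span labels are our own illustration, not taken from an annotated corpus.

```python
# One possible RM/IM/RR annotation of the example above (our own
# illustrative labelling, not from an annotated corpus):
#   "He is my em ...... Yaman is my best friend"
utterance = {
    "RM": "He is my",     # reparandum: the unintended, unnecessary span
    "IM": "em ......",    # interregnum: the filler bridging RM and RR
    "RR": "Yaman is my",  # repair: the corrected version of the RM
    "rest": "best friend",
}

# Deleting the RM and IM spans yields the fluent sentence,
# exactly as the definitions above suggest.
fluent = f"{utterance['RR']} {utterance['rest']}"
print(fluent)  # -> "Yaman is my best friend"
```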
12. D…Di….Disfluencies
• Examples
  • Filled pauses: "This is a uhmm … good example"
  • Discourse markers: "It's really nice to .. you know .. play outside sometimes."
  • Self-correction: "So we will... we can go there."
  • Repetitions: "The... the... the decision was not mine to make"
  • Restart: "We would like to eat ... let's go to the park"
• Why can't we recognize these disfluencies solely by looking at the words? 🤔
  • Considering the audio helps in understanding the speaker's intention, and hence in deciding whether there is a disfluency or not.
  • Words alone can be confused by some fluently produced repetitions: "Superman is the most most most powerful superhero!"
  • They can also be confused by various other interruptions, like non-verbal sounds and even silence!
14. Pronunciation
/prəˌnʌnsɪˈeɪʃ(ə)n/
• Mispronunciation Detection: the problem where the perceived pronunciation doesn't match the intended pronunciation, but we can still understand the meaning. Example: the pronunciation of the word park.
• Phoneme Recognition Problem: state-of-the-art phoneme recognition systems (phonemes are the sounds of a language) have a phoneme error rate of 18% even for native speech data.
• Non-native accent: phonemes might be recognized correctly, but acoustic models (the models used to detect phonemes) are often confused by non-native speech. Some phonemes (sounds) exist in the native language which have no counterpart in the non-native language. E.g., the "je" sound in French has no English mapping, which confuses the acoustic model into predicting wrong sequences of phonemes.
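For concreteness, a phoneme error rate like the 18% quoted above is typically computed as the Levenshtein (edit) distance between the predicted and reference phoneme sequences, normalized by the reference length. A minimal sketch, with illustrative phoneme strings:

```python
# Minimal sketch of phoneme error rate (PER): the Levenshtein distance
# between reference and hypothesis phoneme sequences, normalized by the
# reference length. The phoneme strings below are illustrative only.
def phoneme_error_rate(ref: list[str], hyp: list[str]) -> float:
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# "park" /p aa r k/ misrecognized as /p ao k/: one substitution + one
# deletion over four reference phonemes -> PER of 0.5.
print(phoneme_error_rate(["p", "aa", "r", "k"], ["p", "ao", "k"]))  # 0.5
```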
15. Pronunciation
Intelligibility: there is a large difference between the intended speech and the spoken speech.
Example: the pronunciation of the word mEssage is incorrect. A good ASR system will perceive it as mAssage and rate it as correctly pronounced. However, the user meant to say mEssage.
16. Discourse Coherence
• Discourse is a coherent combination of spoken (or written) utterances communicated between a speaker (or writer) and a listener (or reader).
• Discourse is a PRODUCT? ✍️ (linguistic perspective)
• Discourse is a PROCESS!! 🤔🤔 (cognitive perspective)
• Discourse coherence is the semantic relationship between propositions or communicative events in discourse.
• It is a feature of the perception 👀👂 of discourse rather than the content of discourse itself.
17. Discourse Coherence
Discourse as Product ✍
• A well written speech.
• How the discourse content is structured and organized by the speaker.
• Cohesion in text, use of discourse markers, connectives, etc.
• How readable is the text, how complex is the text, etc.
Discourse as Process 🤔
• A well delivered speech.
• How the discourse content is delivered efficiently to the listener.
• Prosodic variation, use of stress, intonation, pauses, etc.
• How intelligible is the speech, how focused is the listener, etc.
18. Prosody
• Prosodic features span...
  • several speech segments
  • several syllables
  • whole utterances
• Such ‘suprasegmental’ behaviour includes...
  • lexical stress (prominence of syllables)
  • lexical tone (pitch patterns that distinguish words)
  • rhythmic stress (emphasis)
  • intonation (differences in expressive meaning)
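Many of these suprasegmental cues are read off the F0 (pitch) contour. As a rough illustration, an intonation contour can be extracted with librosa's pyin tracker (a sketch assuming librosa is installed; "utterance.wav" is a placeholder path):

```python
# A minimal sketch of extracting one suprasegmental cue, the F0 (pitch)
# contour, with librosa's pyin tracker. "utterance.wav" is a placeholder.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
# f0 is a frame-level pitch track (NaN for unvoiced frames); its rises and
# falls over whole utterances are what intonation analyses operate on.
```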
21. Silent Speech is Even More Ambiguous
• Elephant Juice vs I Love You
• Million vs Billion
• Pet vs Bell vs Men
Speak Them To Yourself!
Your lip movements are exactly the same!
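The reason is that distinct phonemes collapse into shared visemes: /p/, /b/, and /m/ are all bilabial, so "Pet", "Bell", and "Men" begin with the same lip shape. A toy illustration (the viseme grouping below is a simplified, non-standard table):

```python
# Why "Pet", "Bell" and "Men" look alike on the lips: /p/, /b/ and /m/ are
# all bilabial and are commonly collapsed into a single viseme class.
# This toy phoneme-to-viseme table is an illustrative grouping only.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
}

for word, first_phoneme in {"pet": "p", "bell": "b", "men": "m"}.items():
    print(word, "->", PHONEME_TO_VISEME[first_phoneme])
# All three words start with the same "bilabial" viseme, so a silent-speech
# model cannot tell them apart from the lips alone.
```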
23. Exploring Semi-Supervised Learning for Predicting Listener Backchannels
Accepted at CHI’21!
Vidit Jain, Maitree Leekha, Jainendra Shukla, Rajiv Ratn Shah
24. Introduction
● Developing human-like conversational agents is important!
○ Applications in education and healthcare
● Challenge: how to make them seem natural?
○ Human conversations are complex!
● Listener backchannels: a crucial element of human conversation:
○ Listener’s “regular” feedback to the speaker, indicating presence
○ Verbal: e.g., short utterances
○ Non-verbal: e.g., head shake, nod, smile, etc.
● We focus on modelling these backchannels as a step towards natural Human-Robot Interactions (HRIs).
25. Research Questions
Key Research Gaps:
● Prior works [1, 2, and more] relied on large amounts of manually annotated data to train listener backchannel prediction (LBP) models
○ This is expensive in terms of man-hours
● In addition, all previous works have focused only on English conversations
Major Contributions:
● Validating the use of semi-supervised techniques for LBP
○ Models using only 25% of the manual annotation performed at par!
● Unlike past works, we use Hindi conversations
[1] Park, Hae Won, et al. "Telling stories to robots: The effect of backchanneling on a child's storytelling." 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 2017.
[2] Goswami, Mononito, Minkush Manuja, and Maitree Leekha. "Towards Social & Engaging Peer Learning: Predicting Backchanneling and Disengagement in Children." arXiv preprint arXiv:2007.11346 (2020).
26. Dataset
● We use the multimodal Hindi-based Vyaktitv dataset [3]
○ 25 conversations, each ~16 min long
○ Video and audio feeds available for each participant (50 recordings)
● Annotations Done:
○ 3 annotators
○ Signal (kappa): Nod (0.7), Head-shake (0.6), Mouth (0.6), Eyebrow (0.5), Utterances (0.5)
● Features Extracted:
○ OpenFace - visual features: 18 facial action units (FAU), gaze velocities & accelerations, translational and rotational head velocities & accelerations, blink rate, pupil location, and smile ratio
○ pyAudioAnalysis - audio features: voice activity, MFCC, F0, energy
[3] Khan, Shahid Nawaz, et al. "Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset for Personality Assessment." 2020 IEEE Sixth International Conference on Multimedia Big Data (BigMM). IEEE, 2020.
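As an illustration, the audio side of this feature extraction might look like the following sketch (the window and step sizes are our own placeholders, not the paper's settings, and "listener.wav" is a hypothetical file):

```python
# A sketch of short-term audio feature extraction with pyAudioAnalysis.
# The 50 ms window / 25 ms step are our own placeholder settings, and
# "listener.wav" is a hypothetical file.
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

fs, signal = audioBasicIO.read_audio_file("listener.wav")
features, names = ShortTermFeatures.feature_extraction(
    signal, fs, int(0.050 * fs), int(0.025 * fs))
# `features` is (n_features x n_frames); `names` includes energy and
# mfcc_1..mfcc_13, which the slide lists among the audio features used.
```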
27. System Architecture
Methodology: (i) semi-supervised learning to identify backchannels and the types of signals emitted, using a subset of labeled data; (ii) learning to predict these instances and signals from the speaker's context.
28. Task Formulations
Identification
Given a listener's audio and video feeds, identify when they backchannel.
These are the true labels in the prediction task.
We use semi-supervision here to generate these pseudo-labels (instance & type).
Prediction
Given a speaker's context (~3-7 sec long), predict whether the listener will backchannel immediately after it.
Use only the speaker's features to predict the instance & type of backchannel (verbal/visual).
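A minimal sketch of the semi-supervised identification step, approximated here with scikit-learn's SelfTrainingClassifier on random stand-in features; the paper's actual models, features, and thresholds may differ.

```python
# Semi-supervised identification, sketched: train on a small labelled seed,
# then iteratively pseudo-label the rest. Random arrays stand in for the
# OpenFace/pyAudioAnalysis features; the confidence threshold is our own.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))    # listener features per time window
y = rng.integers(0, 2, size=1000)  # 1 = backchannel, 0 = none

y_train = y.copy()
y_train[250:] = -1                 # keep only 25% labelled, as in the paper

model = SelfTrainingClassifier(RandomForestClassifier(), threshold=0.8)
model.fit(X, y_train)              # pseudo-labels the unlabelled 75%
print(model.predict(X[:5]))
```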
29. Key Findings
● The semi-supervised process identified backchannel instances and signal types very well
○ Respective accuracies: 0.90 (ResNet) & 0.85 (RF), with only 25% manual annotation as seed!
● Comparing prediction models trained on manually annotated vs. semi-supervised pseudo-labels:
○ Using semi-supervision, we reach ~94% of the baseline performance!
● Qualitative Study: a majority of participants could not distinguish between the two prediction models!
31. Lip Movement as Inputs for Information Retrieval
https://www.aaai.org/ojs/index.php/AAAI/article/view/5649
https://www.aaai.org/ojs/index.php/AAAI/article/view/5148
https://www.aaai.org/ojs/index.php/AAAI/article/view/4106
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3269.pdf
https://www.isca-speech.org/archive/Interspeech_2019/abstracts/3273.html
https://www.youtube.com/watch?v=3BqQQnTfnlE&list=PL9rvax0EIUA6PDoiDT2Wp462GsTnikrvY
40. Let’s jump to the MobiVSR difficulty level.
Your options:
ABOUT
ABSOLUTELY
ABUSE
ACCESS
ACCORDING
ACCUSED
ACROSS
ACTION
ACTUALLY
AFFAIRS
AFFECTED
AFRICA
AFTER
AFTERNOON
AGAIN
AGAINST
AGREE
AGREEMENT
AHEAD
ALLEGATIONS
ALLOW
ALLOWED
ALMOST
ALREADY
ALWAYS
AMERICA
AMERICAN
AMONG
AMOUNT
ANNOUNCED
ANOTHER
ANSWER
ANYTHING
AREAS
AROUND
ARRESTED
ASKED
ASKING
ATTACK
ATTACKS
AUTHORITIES
BANKS
BECAUSE
BECOME
BEFORE
BEHIND
BEING
BELIEVE
BENEFIT
BENEFITS
BETTER
BETWEEN
BIGGEST
BILLION
BLACK
BORDER
BRING
BRITAIN
BRITISH
BROUGHT
BUDGET
BUILD
BUILDING
BUSINESS
BUSINESSES
CALLED
CAMERON
CAMPAIGN
CANCER
CANNOT
CAPITAL
CASES
CENTRAL
CERTAINLY
CHALLENGE
CHANCE
CHANGE
CHANGES
CHARGE
CHARGES
CHIEF
CHILD
CHILDREN
CHINA
CLAIMS
CLEAR
CLOSE
CLOUD
COMES
COMING
COMMUNITY
COMPANIES
COMPANY
CONCERNS
CONFERENCE
CONFLICT
CONSERVATIVE
CONTINUE
CONTROL
COULD
COUNCIL
COUNTRIES
COUNTRY
COUPLE
COURSE
COURT
CRIME
CRISIS
CURRENT
CUSTOMERS
DAVID
DEATH
DEBATE
DECIDED
DECISION
DEFICIT
DEGREES
DESCRIBED
DESPITE
DETAILS
DIFFERENCE
DIFFERENT
DIFFICULT
DOING
DURING
EARLY
EASTERN
ECONOMIC
ECONOMY
EDITOR
EDUCATION
ELECTION
EMERGENCY
ENERGY
ENGLAND
ENOUGH
EUROPE
EUROPEAN
EVENING
EVENTS
EVERY
EVERYBODY
EVERYONE
EVERYTHING
EVIDENCE
EXACTLY
EXAMPLE
EXPECT
EXPECTED
EXTRA
FACING
FAMILIES
FAMILY
FIGHT
FIGHTING
FIGURES
FINAL
FINANCIAL
FIRST
FOCUS
FOLLOWING
FOOTBALL
FORCE
FORCES
FOREIGN
FORMER
FORWARD
FOUND
FRANCE
FRENCH
FRIDAY
FRONT
FURTHER
FUTURE
GAMES
GENERAL
GEORGE
GERMANY
GETTING
GIVEN
GIVING
GLOBAL
GOING
GOVERNMENT
GREAT
GREECE
GROUND
GROUP
GROWING
GROWTH
GUILTY
HAPPEN
HAPPENED
HAPPENING
HAVING
HEALTH
HEARD
HEART
HEAVY
HIGHER
HISTORY
HOMES
HOSPITAL
HOURS
HOUSE
HOUSING
HUMAN
HUNDREDS
IMMIGRATION
IMPACT
IMPORTANT
INCREASE
INDEPENDENT
INDUSTRY
INFLATION
INFORMATION
INQUIRY
INSIDE
INTEREST
INVESTMENT
INVOLVED
IRELAND
ISLAMIC
ISSUE
ISSUES
ITSELF
JAMES
JUDGE
JUSTICE
KILLED
KNOWN
LABOUR
LARGE
LATER
LATEST
LEADER
LEADERS
LEADERSHIP
LEAST
LEAVE
LEGAL
LEVEL
LEVELS
LIKELY
LITTLE
LIVES
LIVING
LOCAL
LONDON
LONGER
LOOKING
54. Viseme concatenation
TC-GAN generated output with inter-visemes
Output for a Hindi phrase, "Aap kaise hai" (How are you)
55. LIFI: Towards Linguistically Informed Frame Interpolation
Aradhya Neeraj Mathur¹, Devansh Batra², Yaman Kumar¹, Rajiv Ratn Shah¹, Roger Zimmermann³
Indraprastha Institute of Information Technology Delhi, India¹
Netaji Subhas University of Technology, Delhi²
National University of Singapore (NUS)³
56. Motivation
• Speech videos are extremely common across the internet (lectures, YouTube videos, and even video-calling apps), but no video interpolation methods pay heed to the nuances of speech videos.
• The visual modality of speech is complicated: while uttering a single sentence, our lips cycle through dozens of visemes.
• First 30 frames of a speaker saying the sentence "I don't exactly walk around with a hundred and thirty five million dollars in my wallet". Notice the rich lip movement, with opening and closing of the mouth.
57. Motivation
We try to reconstruct this speech video by interpolating the intermediate frames from the first and last frames using state-of-the-art models.
Expected: the original frames (with rich mouth movements)
Observed: the interpolated frames (with virtually no mouth movements)
Some surprising metrics: L1 = 0.0498, MSE = 0.0088, SSIM = 0.9521, PSNR = 20.5415. These are surprisingly good!?
This means that we need better evaluation criteria for the interpolation or reconstruction of speech videos.
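For reference, these four frame-level metrics can be computed as in the sketch below (using scikit-image on random stand-in frames, not the slide's actual data):

```python
# Sketch of how such frame-level metrics are computed, using scikit-image
# on two random stand-in frames; the slide's numbers come from real video.
import numpy as np
from skimage.metrics import (mean_squared_error, peak_signal_noise_ratio,
                             structural_similarity)

rng = np.random.default_rng(0)
original = rng.random((128, 128))  # stand-in for a ground-truth frame
predicted = np.clip(original + 0.05 * rng.random((128, 128)), 0.0, 1.0)

print("L1  :", np.abs(original - predicted).mean())
print("MSE :", mean_squared_error(original, predicted))
print("SSIM:", structural_similarity(original, predicted, data_range=1.0))
print("PSNR:", peak_signal_noise_ratio(original, predicted, data_range=1.0))
# A reconstruction that freezes the mouth can still score well on all four,
# which is why the slide argues for better evaluation criteria.
```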
58. Proposed Work
1. Challenge Datasets for Speech Video Reconstruction (based on LRS3-TED)
Guess the words spoken?
……………
"Well the short answer to the question is no, it's not the same thing"
Corruption types: Random Frame Corruption (40%), Extreme Sparsity Corruption (75%), Prefix Corruption, Suffix Corruption
59. Proposed Work
1. Challenge Datasets for Speech Video Reconstruction (based on LRS3-TED)
• Visemic Corruption: visemes of a particular type being corrupted and requiring regeneration
• Intra-Word Corruption: corruption of frames within the occurrence of a long word
• Inter-Word Corruption: corruption of frames across word boundaries
60. Proposed Work
2. Visemic reconstruction with an ROI Loss unit
A modified FCN3D with an ROI extraction unit to calculate the ROI loss.
Instead of training the reconstruction network with only the L1 loss between the reconstructed and original images, we introduce an ROI loss, which measures the similarity between the visemic regions of interest of the observed and generated facial images.
To accomplish this, we develop an ROI unit as shown on the left.
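A minimal PyTorch sketch of such a combined objective; the mouth-region mask and the `roi_weight` factor are our own placeholders, and the paper's actual ROI unit may differ.

```python
# Combined reconstruction objective, sketched: an L1 term on the full frame
# plus an L1 term restricted to the visemic region of interest. The mask
# and the weighting are illustrative placeholders.
import torch
import torch.nn.functional as F

def reconstruction_loss(generated: torch.Tensor,
                        original: torch.Tensor,
                        roi_mask: torch.Tensor,
                        roi_weight: float = 1.0) -> torch.Tensor:
    """generated/original: (B, C, H, W) frames; roi_mask: (B, 1, H, W),
    1 inside the visemic region of interest, 0 elsewhere."""
    full_l1 = F.l1_loss(generated, original)
    roi_l1 = F.l1_loss(generated * roi_mask, original * roi_mask)
    return full_l1 + roi_weight * roi_l1
```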
61. Proposed Work
Key Findings
We evaluate a fully convolutional network (FCN3D), a convolutional bi-directional LSTM, and the original FCN3D network after the addition of the ROI unit and visemic loss during training.
We observe:
1. Different networks perform differently on different types of corruption.
2. While SuperSloMo performs very well on random frame corruption, it performs much worse on the other types of corruption.
3. As expected, a sequential LSTM-based generator works much better than a fully convolutional network when there are corruptions in consecutive frames, as in prefix and suffix corruption.
4. Most importantly, the addition of an ROI loss also helps a network perform better on all forms of corruption and on non-ROI-based metrics, as shown by the results for FCN3D+ROI.
Performance of different models over datasets containing random, prefix, and suffix corruptions
Performance of different models over datasets containing corruptions on different visemes
62. Touchless Typing Using Head Movement-based Gestures
Shivam Rustagi¹, Aakash Garg¹, Pranay Raj Anand², Rajesh Kumar³, Yaman Kumar², Rajiv Ratn Shah²
Delhi Technological University, India¹
Indraprastha Institute of Information Technology Delhi, India²
Haverford College, USA³
65. Related Work
[1] A. Nowosielski, "Two-letters-key keyboard for predictive touchless typing with head movements"
[2] J. Tu, H. Tao, and T. Huang, "Face as mouse through visual face tracking"
[3] M. Nabati and A. Behrad, "3D head pose estimation and camera mouse implementation using a monocular video camera"
66. Related Work
MID-AIR TOUCHLESS TYPING TECHNIQUES
[4] A. Markussen, M. R. Jakobsen, and K. Hornbæk, "Vulture: A mid-air word-gesture keyboard" (using fingers)
[5] C. Yu, Y. Gu, Z. Yang, X. Yi, H. Luo, and Y. Shi, "Tap, dwell or gesture? Exploring head-based text entry techniques for HMDs" (using head)
67. Proposed Work
For the 10,000 most common English words there are 8,529 unique cluster sequences, with each sequence having on average 1.17 different words. So once we predict the cluster sequence, it can be translated to 1-2 valid words on average.
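The lookup this implies can be sketched as below; the letter-to-cluster grouping is a hypothetical stand-in for the paper's color-coded keyboard, and the example words come from Table 1 later in this deck.

```python
# Sketch of the cluster-sequence-to-word lookup implied above. The
# letter-to-cluster grouping is an illustrative placeholder; the paper's
# color-coded keyboard defines the real clusters.
from collections import defaultdict

# Hypothetical grouping: consecutive blocks of 7 letters share a cluster id.
LETTER_TO_CLUSTER = {ch: i // 7 for i, ch in
                     enumerate("abcdefghijklmnopqrstuvwxyz")}

def cluster_sequence(word: str) -> tuple[int, ...]:
    return tuple(LETTER_TO_CLUSTER[ch] for ch in word)

vocabulary = ["take", "live", "house", "learn", "come"]  # from Table 1
lookup = defaultdict(list)
for word in vocabulary:
    lookup[cluster_sequence(word)].append(word)

# A predicted cluster sequence resolves to the (usually 1-2) words that
# share it:
print(lookup[cluster_sequence("take")])
```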
68. Data Collection: Setup
Equipment / Configuration / Purpose:
• Monitor and keyboard: a 17-inch monitor and a standard keyboard. The color-coded QWERTY keyboard was displayed on the monitor; the keyboard was used to start and stop recording.
• Cameras (on tripods): 3 Samsung M10 mobile cameras recording video at 30 fps and 1920 x 1080 resolution, all with the OpenCamera app installed, plus 1 Samsung M10 mobile with the MUSE2 app. The 3 cameras were kept at angles of -45, 0, and 45 degrees respectively, to record the head movements.
• MUSE2 headband: sensors such as an accelerometer and a gyroscope, which recorded the acceleration and rotation of the head.
• Moderator's laptop: standard. The Python script on the laptop was responsible for starting and stopping the cameras simultaneously.
*Note: For our research we used only the central-view (Camera-2) recordings.
69. Data Collection: Description
❑ Total number of users who volunteered = 25 (16 male; 9 female; 3 users' data discarded on manual inspection)
❑ Each user recorded 3 video samples for each of 35 items (words: 20, phrases: 10, sentences: 5, as per Table 1)
❑ Total number of video samples = 2310 (22 x 35 x 3)
Words: locate, single, family, would, place, large, work, take, live, box, method, listen, house, learn, come, some, ice, old, fly, leg
Phrases: hello, excuse me, i am sorry, thank you, good bye, see you, nice to meet you, you are welcome, how are you, have a good time
Sentences: i never gave up, best time to live, catch the trade winds, hear a voice within you, he will forget it
Table 1. The list of 20 words, 10 phrases, and 5 sentences typed by each user. Each item was recorded 3 times.
71. Data Collection: Statistics
Avg. number of letters per entry: Words 4.33, Phrases 10.6, Sentences 18.6
❏ The words were selected to have proper cluster coverage.
❏ The phrases and sentences were selected from the OuluVS [6] and TIMIT [7] datasets, respectively.
Fig. Coverage of each cluster across the dataset
Fig. Average gestures per minute for each user (avg = 49.26, std = 5.3)
72. Hopenet Architecture
The proposed method is based on a CNN-RNN architecture. The feature-extractor part, shown above, is based on the HopeNet architecture, which predicts the yaw, pitch, and roll features for the input image. The network is trained using a multi-task classification scheme. We utilize the available model pretrained on large-pose face images from the 300W dataset.
73. Working of Hopenet
HopeNet output visualized on a user. The three vectors are constructed from the Euler angles (features) predicted by the network.
74. CNN-RNN architecture
The features from HopeNet are passed into a multi-layered BiGRU network, which is trained using a CTC loss function. During the inference phase we use beam search to decode the cluster sequence.
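A minimal PyTorch sketch of such a BiGRU + CTC setup, with made-up dimensions; the paper's layer sizes and cluster alphabet may differ.

```python
# BiGRU over per-frame head-pose features, trained with CTC. All sizes
# below (hidden units, cluster alphabet, sequence lengths) are made up.
import torch
import torch.nn as nn

num_clusters = 8  # hypothetical cluster alphabet size
gru = nn.GRU(input_size=3, hidden_size=64, num_layers=2,
             bidirectional=True, batch_first=True)
head = nn.Linear(2 * 64, num_clusters + 1)  # +1 for the CTC blank symbol
ctc = nn.CTCLoss(blank=num_clusters)

frames = torch.randn(4, 120, 3)  # (batch, time, yaw/pitch/roll)
hidden, _ = gru(frames)
log_probs = head(hidden).log_softmax(-1).transpose(0, 1)  # (T, B, C)

targets = torch.randint(0, num_clusters, (4, 10))  # cluster-id sequences
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 120),
           target_lengths=torch.full((4,), 10))
loss.backward()
```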
76. Results
The method is evaluated in two scenarios:
● Inter-user: training on user set S1 and testing on user set S2, such that S1 and S2 are mutually exclusive. Cluster sequences are kept the same for training and testing.
● Intra-user: for every user (i.e., the set S = S1 ∪ S2), we record 3 samples per sequence. 2 samples were used for training, and testing was done on the 3rd sample.
77. Conclusion and Future Work
Our work presents a meaningful way of mapping gestures to character (cluster) sequences, which could be beneficial for people with disabilities.
Also, our dataset is publicly available, which could help improve the current system.
In the future, the aim is to improve performance by:
1. Using more training data containing a variety of meaningful sequences, and
2. Combining video feeds from multiple cameras, brainwaves recorded via EEG sensors, and the acceleration and rotation of the user's head recorded via accelerometer and gyroscope.
Other future applications could also work in the direction of integrating the interface with wearable devices and mobile computing. This will bring together a newer set of applications, like browsing from wearable glasses.
79. SeekSuspect: Retrieving Suspects from Criminal Datasets using Visual Memory
Aayush Jain*, Meet Shah*, Suraj Pandey*, Mansi Agarwal*, Rajiv Ratn Shah, Yifang Yin
● Police maintain a crime dossier system that contains information such as photographs and physical details.
● Finding suspects by name is possible, but this fails when we only have an informant's visual memory.
● Law enforcement agencies used to hire sketch artists, but sketch artists are limited in number.
● We propose SeekSuspect, a fast, interactive suspect retrieval system.
● SeekSuspect employs sophisticated deep learning and computer vision techniques
  ○ to modify the search space and
  ○ find the envisioned image effectively and efficiently
Fig. SeekSuspect in action: an informant with only a vague visual memory ("I do not exactly remember who she was") gives a rough description ("Female, fair, black hair..."); the system iteratively retrieves relevant and similar images until it can ask, "Is this the person you wish to search for?"
82. Team
• Director: Dr. Rajiv Ratn Shah
• PhD Students: Hitkul, Shivangi, Ritwik, Mohit, Yaman, Hemant, Kriti, Astha
• MTech Students: Abhishek, Suraj, Meet, Aayush, William, Subhani, etc.
• Research Assistants: Manraj, Pakhi, Karmanya, Mehar, Saket, Anuj, etc.
• BTech Students (both full-time and remote students):
• DTU: Maitree Leekha, Mansi Agarwal, Shivang Chopra, Rohan Mishra, Himanshu, etc.
• NSUT: Ramit Sahwney, Puneet Mathur, Avinash Swaminathan, Rohit Jain, Hritwik, etc.
• IIT: Pradyumn Gupta, Abhigyan Khaund, Palak Goenka, Amit Jindal, Prateek Manocha, etc.
• IIIT: Vedant Bhatia, Raj K Gupta, Shagun Uppal, Osheen Sachdev, Siddharth Dhawan, etc.
• Alumni (Placements, Internships, MS Admissions):
• Companies: Google, Microsoft, Amazon, Adobe, Tower Research, Walmart, Qualcomm, Goldman Sachs, Bloomberg, IBM Research, Wadhwani AI, Samsung Research, etc.
• Academia: CMU, Columbia University, University of Pennsylvania, University of Maryland,
University of Southern California, Erasmus Mundus, University of Virginia, Georgia Tech, etc.
83. Collaborators
• Prof Roger Zimmermann, National University of Singapore, Singapore
• Prof Changyou Chen, State University of New York at Buffalo, USA
• Prof Mohan Kankanhalli, National University of Singapore, Singapore
• Prof Ponnurangam Kumaraguru (PK), IIIT Delhi, India
• Dr. Amanda Stent, Bloomberg, New York, USA
• Dr. Debanjan Mahata, Bloomberg, New York, USA
• Prof. Rada Mihalcea, University of Michigan, USA
• Prof. Shin'ichi Satoh, National Institute of Informatics, Japan
• Prof. Jessy Li, University of Texas at Austin, USA
• Prof. Huan Liu, Arizona State University, USA
• Prof. Naimul Khan, Ryerson University, Canada
• Prof. Diyi Yang, Georgia Institute of Technology, USA
• Prof Payman Vafaee, Columbia University, USA
• Prof Cornelia Caragea, University of Illinois at Chicago, USA
• Dr. Mika Hama, SLTI, USA, and many more...
84. Research (AI for Social Good)
• NLP and Multimedia based systems for society (education, healthcare, etc.)
• Automatic speech recognition (ASR) for different domains and accents (e.g., Indian, African)
• Visual speech recognition/reconstruction (VSR) such as lipreading and speech reconstruction
• Hate speech and malicious user detection in code-switched scenarios on social media
• Mental health problems such as suicidal ideation and depression detection on social media
• Building multimodal information retrieval and information extraction systems
• Knowledge graph construction for different domains, e.g., medical, e-commerce, defence, etc.
• Automated systems for number plate and damage detection, car insurance claim, e-challan, etc.
• Multimodal sentiment analysis and its applications in education, policy making, etc.
• Detecting, analyzing, and recommending advertisements in video streams
• Fake news detection and propagation, suspect detection, personality detection, etc.
• Publications (but not limited to)
• AAAI, CIKM, ACL, EMNLP, WSDM, COLING, ACM Multimedia, ICDM, INTERSPEECH, WWW, ICASSP, WACV,
BigMM, IEEE ISM, NAACL, ACM Hypertext, ACM SIGSPATIAL, Elsevier KBS, IEEE Intelligent Systems, IEEE MIPR,
ACM MM Asia, AACL, Springer book chapters, etc.
85. Research (AI for Social Good)
• Awards (but not limited to)
• Won the outstanding paper award at COLING 2020
• Got selected to the Heidelberg Laureate Forum (HLF) in 2018, 2019, 2020
• Best student poster at AAAI 2019, Honolulu, Hawaii, USA
• Best poster and best industrial paper in IEEE BigMM 2019, Singapore
• Winner of the ACM INDIA Chapters Technology Solution Contest 2019 in Jaipur, India
• Won the honorable mention award in ICDM Knowledge Graph Contest 2019 in Beijing, China
• Won the best poster runner-up award at IEEE ISM 2018 conference in Taichung, Taiwan
• Skills, Tools, and Frameworks (but not limited to)
• Natural Language Processing, Image Processing, Speech Processing
• Multimodal Computing
• Python, JavaScript, Java
• AI/ Machine Learning/ Deep Learning
• Tensorflow, PyTorch, Keras, etc.
87. References
1. Conversational Systems and the Marriage of Speech & Language by Mari Ostendorf (University of Washington)
2. Speech 101 by Robert Moore, The University of Sheffield
3. https://www.youtube.com/watch?v=PWGeUztTkRA&ab_channel=Mark_Mitton
4. The Two Ronnies Show
5. Preliminaries to a Theory of Speech Disfluencies (Elizabeth Shriberg, 1994)
6. A Short Analysis of Discourse Coherence (Wang and Guo, 2014)
7. A. Nowosielski, “Two-letters-key keyboard for predictive touchless typing with head movements,” 07 2017, pp. 68–79
8. J. Tu, H. Tao, and T. Huang, “Face as mouse through visual face tracking,” Comput. Vis. Image Underst., vol. 108, no. 1–2, p. 35–40, Oct. 2007.
[Online]. Available: https://doi.org/10.1016/j.cviu.2006.11.007
9. M. Nabati and A. Behrad, “3D head pose estimation and camera mouse implementation using a monocular video camera,” Signal, Image and Video Processing, vol. 9, 01 2012.
10. A. Markussen, M. R. Jakobsen, and K. Hornbæk, “Vulture: A mid-air word-gesture keyboard,” in CHI ’14, 2014.
11. C. Yu, Y. Gu, Z. Yang, X. Yi, H. Luo, and Y. Shi, “Tap, dwell or gesture? exploring head-based text entry techniques for hmds,” in CHI ’17, 2017.
12. Zhao G, Barnard M & Pietikäinen M (2009) Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia 11(7):1254-1265.
13. Garofolo, J. & Lamel, Lori & Fisher, W. & Fiscus, Jonathan & Pallett, D. & Dahlgren, N. & Zue, V.. (1992). TIMIT Acoustic-phonetic Continuous
Speech Corpus. Linguistic Data Consortium.