Speech is generally considered to have three parts: the aural, the visual, and the textual (a social construct). In recent years, although the field has been moving at a dramatic pace, progress has been made in silos. The primary reason for this is that speech is treated as "spoken text" by practitioners and researchers alike. Most open-source datasets, being far removed from real-world conditions, help spread this false impression. Under these conditions, it is not surprising that common and important features of speech, such as intonation and disfluency, do not get captured. This tutorial aims to provide an appreciation of the "full stack" of speech - the aural, visual, and textual parts - with a special emphasis on aspects that may have significance for current and future research.
5. Speech is ... more than spoken words
• Rich in ‘extra-linguistic’ information
  • breathing noises
  • lip-smacks
  • hand movements
  • facial expressions
• Rich in ‘para-linguistic’ information
  • personality
  • attitude
  • emotion
  • individuality
6. Some Examples
• Disfluency
  • "I am uh uh very …. I am very excited to see you"
  • "He is my em …… Yaman is my best friend"
• Intonation and Stress
  1. Stress placement changes the meaning:
    • *This* is my laptop (and not that one)
    • This is *my* laptop (and not yours)
    • This is my *laptop* (and not my book)
  2. Intonation distinguishes a question from a statement:
    • He found it on the street?
    • And in reply: He found it on the street.
• No punctuation and very open grammar
• ASR errors
7. Speech is Adaptive
• to the listener
  • a child (‘parentese’)
  • a non-native person
  • a hearing-impaired individual
  • an animal
  • a machine(!)
• to the cognitive load
  • interaction with other tasks
  • stressful/emotional situations
• to the environment
  • noise
  • reverberation
• to the task
  • casual conversation
  • reading out loud
  • public speaking
8. Content
• Content in the spoken medium is the "information or experiences directed towards end-users or an audience".
Why is Content Important? Whom do you prefer?
• A speaker with style, elegance, and panache, but with weak content (talking too much off-topic, not providing enough details about facts)?
OR
• An average speaker, but with good content (ideas stick to the main topic; provides interesting/required background information)?
9. Content
What defines good content? (High relevance and high sufficiency)
Relevance
• Related to the topic
• Connected to the prompt in a bigger story
• No unwanted or off-topic information
Sufficiency
• Adequate details (which are also relevant)
• All points covered
• No missing parts
10. Prompt: You have to narrate to a career advisor one thing you accomplished which you are proud of and how it was important for you.
Response 1: IVE ACCOMPLISHED UM MANY THINGS IN LIFE ONE OF THEM IS IS BEING A PHILANTHROPIST IVE HELPED A LOT OF PEOPLE MOST SPECIALLY CHILDREN I GO TO SOME UM POOR AREAS AND WE TEACH LIKE THOSE CHILDREN SOME KNOWLEDGE THAT THEY DONT KNOW YET LIKE FOR EXAMPLE IM GOING TO BE THEIR TEACHER AND I I INFORM THEM ALL THE THINGS LIKE UM WHAT TO WRITE HOW TO READ HOW TO DESCRIBE SOMETHING AND THIS IS REALLY IMPORTANT IN MY LIFE BECAUSE BEING A TEACHER IS REALLY GOOD FOR ME AND I THINK IT WILL REALLY HELP ME GROW MY ABILITY TO HELP PEOPLE MOST SPECIALLY CHILDREN
• Relevance: High. The speaker sticks to what is asked in the prompt (being a philanthropist or teacher as an accomplishment, and its importance).
• Sufficiency: High. The speaker explains in detail how he helped children as a teacher, how he helped, and why it was important.
Response 2: IT IS IMPORTANT TO CHOOSE WISELY FOR YOUR CAREER AND ITS ALSO IMPORTANT THAT YOU CHOOSE THAT CAREER BECAUSE UH THIS IS YOUR PASSION AND THIS IS YOUR REALLY ONE JOB AND BECAUSE IF YOU DONT WANT THAT JOB OR CAR CAREER BUT YOU CHOOSE IT UH YOU WILL AT THE END OF THE DAY YOU WILL NOT BE UH MOTIVATED TO WORK WITH IT AND YOU WILL NOT BE YOU ARE UH THERES A TENDENCY THAT YOU WILL NOT ACHIEVE YOUR GOAL OR DESIRE IN YOUR IN THAT CAREER AND YOURE NOT BE WILL BE SUCCESSFUL IN THAT CAREER IT IS IMPORTANT TO CHOOSE WISELY YOUR CAREER AND UH CONSIDER THAT THIS IS YOUR UH THIS IS WHAT YOU REALLY WANT AND THIS IS YOUR PASSIONS AND ARE IT IS UH IF YOU CHOOSE YOUR CAREER BE SURE YOU ARE ENJOYING IT NOT DOING IT
• Relevance: Low. The speaker goes too far off topic from what is asked (how to choose a career and what being successful means, instead of talking about an accomplishment).
• Sufficiency: Low. The response provides no information that addresses the points in the prompt.
11. D…Di….Disfluencies
• Interruptions in the smooth flow of speech.
• These interruptions often occur in spoken communication. They usually help speakers buy more time while they express their thought process.
• Reparandum (RM): the unintended and unnecessary part of the disfluency span. (This span can be deleted in order to obtain fluency.)
• Interregnum (IM): the part that lies between the RM and the RR. (This span helps the speaker fill the intermediate gap.)
• Repair (RR): the corrected span of the RM. (This span should maintain the context of the RM.)
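To make these definitions concrete, here is one possible annotation of the second example from slide 6 ("He is my em …… Yaman is my best friend"), written as a minimal Python sketch; the span labels are our own illustration, not taken from an annotated corpus.

```python
# One possible RM/IM/RR annotation of the example above (our own
# illustrative labelling, not from an annotated corpus):
#   "He is my em ...... Yaman is my best friend"
utterance = {
    "RM": "He is my",     # reparandum: the unintended, unnecessary span
    "IM": "em ......",    # interregnum: the filler bridging RM and RR
    "RR": "Yaman is my",  # repair: the corrected version of the RM
    "rest": "best friend",
}

# Deleting the RM and IM spans yields the fluent sentence,
# exactly as the definitions above suggest.
fluent = f"{utterance['RR']} {utterance['rest']}"
print(fluent)  # -> "Yaman is my best friend"
```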
12. D…Di….Disfluencies
• Examples
  • Filled pauses: "This is a uhmm … good example"
  • Discourse markers: "It's really nice to .. you know .. play outside sometimes."
  • Self-correction: "So we will... we can go there."
  • Repetitions: "The... the... the decision was not mine to make"
  • Restart: "We would like to eat ... let's go to the park"
• Why can't we recognize these disfluencies solely by looking at the words? 🤔
  • Considering the audio helps in understanding the speaker's intention, and hence in deciding whether there is a disfluency or not.
  • Words alone can be confused by some fluently produced repetitions: "Superman is the most most most powerful superhero!"
  • They can also be confused by various other interruptions, like non-verbal sounds and even silence!
14. Pronunciation
/prəˌnʌnsɪˈeɪʃ(ə)n/
• Mispronunciation Detection: the problem where the perceived pronunciation doesn't match the intended pronunciation, but we can still understand the meaning. Example: the pronunciation of the word park.
• Phoneme Recognition Problem: state-of-the-art phoneme recognition systems (phonemes are the sounds of a language) have a phoneme error rate of 18% even for native speech data.
• Non-native accent: phonemes might be recognized correctly, but acoustic models (the models used to detect phonemes) are often confused by non-native speech. Some phonemes (sounds) exist in the native language which have no counterpart in the non-native language. E.g., the "je" sound in French has no English mapping, which confuses the acoustic model into predicting wrong sequences of phonemes.
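For concreteness, a phoneme error rate like the 18% quoted above is typically computed as the Levenshtein (edit) distance between the predicted and reference phoneme sequences, normalized by the reference length. A minimal sketch, with illustrative phoneme strings:

```python
# Minimal sketch of phoneme error rate (PER): the Levenshtein distance
# between reference and hypothesis phoneme sequences, normalized by the
# reference length. The phoneme strings below are illustrative only.
def phoneme_error_rate(ref: list[str], hyp: list[str]) -> float:
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# "park" /p aa r k/ misrecognized as /p ao k/: one substitution + one
# deletion over four reference phonemes -> PER of 0.5.
print(phoneme_error_rate(["p", "aa", "r", "k"], ["p", "ao", "k"]))  # 0.5
```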
15. Pronunciation
Intelligibility: there is a large difference between the intended speech and the spoken speech.
Example: the pronunciation of the word mEssage is incorrect. A good ASR system will perceive it as mAssage and rate it as correctly pronounced. However, the user meant to say mEssage.
16. Discourse Coherence
• Discourse is a coherent combination of spoken (or written) utterances communicated between a speaker (or writer) and a listener (or reader).
• Discourse is a PRODUCT? ✍️ (linguistic perspective)
• Discourse is a PROCESS!! 🤔🤔 (cognitive perspective)
• Discourse coherence is the semantic relationship between propositions or communicative events in discourse.
• It is a feature of the perception 👀👂 of discourse rather than the content of discourse itself.
17. Discourse Coherence
Discourse as Product ✍
• A well written speech.
• How the discourse content is structured and organized by the speaker.
• Cohesion in text, use of discourse markers, connectives, etc.
• How readable is the text, how complex is the text, etc.
Discourse as Process 🤔
• A well delivered speech.
• How the discourse content is delivered efficiently to the listener.
• Prosodic variation, use of stress, intonation, pauses, etc.
• How intelligible is the speech, how focused is the listener, etc.
18. Prosody
• Prosodic features span...
  • several speech segments
  • several syllables
  • whole utterances
• Such ‘suprasegmental’ behaviour includes...
  • lexical stress (prominence of syllables)
  • lexical tone (pitch patterns that distinguish words)
  • rhythmic stress (emphasis)
  • intonation (differences in expressive meaning)
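Many of these suprasegmental cues are read off the F0 (pitch) contour. As a rough illustration, an intonation contour can be extracted with librosa's pyin tracker (a sketch assuming librosa is installed; "utterance.wav" is a placeholder path):

```python
# A minimal sketch of extracting one suprasegmental cue, the F0 (pitch)
# contour, with librosa's pyin tracker. "utterance.wav" is a placeholder.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
# f0 is a frame-level pitch track (NaN for unvoiced frames); its rises and
# falls over whole utterances are what intonation analyses operate on.
```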
21. Silent Speech is Even More Ambiguous
• Elephant Juice vs I Love You
• Million vs Billion
• Pet vs Bell vs Men
Speak Them To Yourself!
Your lip movements are exactly the same!
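The reason is that distinct phonemes collapse into shared visemes: /p/, /b/, and /m/ are all bilabial, so "Pet", "Bell", and "Men" begin with the same lip shape. A toy illustration (the viseme grouping below is a simplified, non-standard table):

```python
# Why "Pet", "Bell" and "Men" look alike on the lips: /p/, /b/ and /m/ are
# all bilabial and are commonly collapsed into a single viseme class.
# This toy phoneme-to-viseme table is an illustrative grouping only.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
}

for word, first_phoneme in {"pet": "p", "bell": "b", "men": "m"}.items():
    print(word, "->", PHONEME_TO_VISEME[first_phoneme])
# All three words start with the same "bilabial" viseme, so a silent-speech
# model cannot tell them apart from the lips alone.
```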
23. Exploring Semi-Supervised Learning for Predicting Listener Backchannels
Accepted at CHI’21!
Vidit Jain, Maitree Leekha, Jainendra Shukla, Rajiv Ratn Shah
24. Introduction
● Developing human-like conversational agents is important!
○ Applications in education and healthcare
● Challenge: how to make them seem natural?
○ Human conversations are complex!
● Listener backchannels: a crucial element of human conversation:
○ Listener’s “regular” feedback to the speaker, indicating presence
○ Verbal: e.g., short utterances
○ Non-verbal: e.g., head shake, nod, smile, etc.
● We focus on modelling these backchannels as a step towards natural Human-Robot Interactions (HRIs).
25. Research Questions
Key Research Gaps:
● Prior works [1, 2, and more] relied on large amounts of manually annotated data to train listener backchannel prediction (LBP) models
○ This is expensive in terms of man-hours
● In addition, all previous works have focused only on English conversations
Major Contributions:
● Validating the use of semi-supervised techniques for LBP
○ Models using only 25% of the manual annotation performed at par!
● Unlike past works, we use Hindi conversations
[1] Park, Hae Won, et al. "Telling stories to robots: The effect of backchanneling on a child's storytelling." 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 2017.
[2] Goswami, Mononito, Minkush Manuja, and Maitree Leekha. "Towards Social & Engaging Peer Learning: Predicting Backchanneling and Disengagement in Children." arXiv preprint arXiv:2007.11346 (2020).
26. Dataset
● We use the multimodal Hindi-based Vyaktitv dataset [3]
○ 25 conversations, each ~16 min long
○ Video and audio feeds available for each participant (50 recordings)
● Annotations Done:
○ 3 annotators
○ Signal (kappa): Nod (0.7), Head-shake (0.6), Mouth (0.6), Eyebrow (0.5), Utterances (0.5)
● Features Extracted:
○ OpenFace - visual features: 18 facial action units (FAU), gaze velocities & accelerations, translational and rotational head velocities & accelerations, blink rate, pupil location, and smile ratio
○ pyAudioAnalysis - audio features: voice activity, MFCC, F0, energy
[3] Khan, Shahid Nawaz, et al. "Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset for Personality Assessment." 2020 IEEE Sixth International Conference on Multimedia Big Data (BigMM). IEEE, 2020.
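As an illustration, the audio side of this feature extraction might look like the following sketch (the window and step sizes are our own placeholders, not the paper's settings, and "listener.wav" is a hypothetical file):

```python
# A sketch of short-term audio feature extraction with pyAudioAnalysis.
# The 50 ms window / 25 ms step are our own placeholder settings, and
# "listener.wav" is a hypothetical file.
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

fs, signal = audioBasicIO.read_audio_file("listener.wav")
features, names = ShortTermFeatures.feature_extraction(
    signal, fs, int(0.050 * fs), int(0.025 * fs))
# `features` is (n_features x n_frames); `names` includes energy and
# mfcc_1..mfcc_13, which the slide lists among the audio features used.
```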
27. System Architecture
Methodology: (i) semi-supervised learning to identify backchannels and the types of signals emitted, using a subset of labeled data; (ii) learning to predict these instances and signals from the speaker's context.
28. Task Formulations
Identification
Given a listener's audio and video feeds, identify when they backchannel.
These are the true labels in the prediction task.
We use semi-supervision here to generate these pseudo-labels (instance & type).
Prediction
Given a speaker's context (~3-7 sec long), predict whether the listener will backchannel immediately after it.
Use only the speaker's features to predict the instance & type of backchannel (verbal/visual).
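A minimal sketch of the semi-supervised identification step, approximated here with scikit-learn's SelfTrainingClassifier on random stand-in features; the paper's actual models, features, and thresholds may differ.

```python
# Semi-supervised identification, sketched: train on a small labelled seed,
# then iteratively pseudo-label the rest. Random arrays stand in for the
# OpenFace/pyAudioAnalysis features; the confidence threshold is our own.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))    # listener features per time window
y = rng.integers(0, 2, size=1000)  # 1 = backchannel, 0 = none

y_train = y.copy()
y_train[250:] = -1                 # keep only 25% labelled, as in the paper

model = SelfTrainingClassifier(RandomForestClassifier(), threshold=0.8)
model.fit(X, y_train)              # pseudo-labels the unlabelled 75%
print(model.predict(X[:5]))
```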
29. Key Findings
● The semi-supervised process identified backchannel instances and signal types very well
○ Respective accuracies: 0.90 (ResNet) & 0.85 (RF), with only 25% manual annotation as seed!
● Comparing prediction models trained on manually annotated vs. semi-supervised pseudo-labels:
○ Using semi-supervision, we reach ~94% of the baseline performance!
● Qualitative Study: a majority of participants could not distinguish between the two prediction models!
31. Lip Movement as Inputs for Information Retrieval
https://www.aaai.org/ojs/index.php/AAAI/article/view/5649
https://www.aaai.org/ojs/index.php/AAAI/article/view/5148
https://www.aaai.org/ojs/index.php/AAAI/article/view/4106
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3269.pdf
https://www.isca-speech.org/archive/Interspeech_2019/abstracts/3273.html
https://www.youtube.com/watch?v=3BqQQnTfnlE&list=PL9rvax0EIUA6PDoiDT2Wp462GsTnikrvY
40. Let’s jump to the MobiVSR difficulty level.
Your options:
ABOUT
ABSOLUTELY
ABUSE
ACCESS
ACCORDING
ACCUSED
ACROSS
ACTION
ACTUALLY
AFFAIRS
AFFECTED
AFRICA
AFTER
AFTERNOON
AGAIN
AGAINST
AGREE
AGREEMENT
AHEAD
ALLEGATIONS
ALLOW
ALLOWED
ALMOST
ALREADY
ALWAYS
AMERICA
AMERICAN
AMONG
AMOUNT
ANNOUNCED
ANOTHER
ANSWER
ANYTHING
AREAS
AROUND
ARRESTED
ASKED
ASKING
ATTACK
ATTACKS
AUTHORITIES
BANKS
BECAUSE
BECOME
BEFORE
BEHIND
BEING
BELIEVE
BENEFIT
BENEFITS
BETTER
BETWEEN
BIGGEST
BILLION
BLACK
BORDER
BRING
BRITAIN
BRITISH
BROUGHT
BUDGET
BUILD
BUILDING
BUSINESS
BUSINESSES
CALLED
CAMERON
CAMPAIGN
CANCER
CANNOT
CAPITAL
CASES
CENTRAL
CERTAINLY
CHALLENGE
CHANCE
CHANGE
CHANGES
CHARGE
CHARGES
CHIEF
CHILD
CHILDREN
CHINA
CLAIMS
CLEAR
CLOSE
CLOUD
COMES
COMING
COMMUNITY
COMPANIES
COMPANY
CONCERNS
CONFERENCE
CONFLICT
CONSERVATIVE
CONTINUE
CONTROL
COULD
COUNCIL
COUNTRIES
COUNTRY
COUPLE
COURSE
COURT
CRIME
CRISIS
CURRENT
CUSTOMERS
DAVID
DEATH
DEBATE
DECIDED
DECISION
DEFICIT
DEGREES
DESCRIBED
DESPITE
DETAILS
DIFFERENCE
DIFFERENT
DIFFICULT
DOING
DURING
EARLY
EASTERN
ECONOMIC
ECONOMY
EDITOR
EDUCATION
ELECTION
EMERGENCY
ENERGY
ENGLAND
ENOUGH
EUROPE
EUROPEAN
EVENING
EVENTS
EVERY
EVERYBODY
EVERYONE
EVERYTHING
EVIDENCE
EXACTLY
EXAMPLE
EXPECT
EXPECTED
EXTRA
FACING
FAMILIES
FAMILY
FIGHT
FIGHTING
FIGURES
FINAL
FINANCIAL
FIRST
FOCUS
FOLLOWING
FOOTBALL
FORCE
FORCES
FOREIGN
FORMER
FORWARD
FOUND
FRANCE
FRENCH
FRIDAY
FRONT
FURTHER
FUTURE
GAMES
GENERAL
GEORGE
GERMANY
GETTING
GIVEN
GIVING
GLOBAL
GOING
GOVERNMENT
GREAT
GREECE
GROUND
GROUP
GROWING
GROWTH
GUILTY
HAPPEN
HAPPENED
HAPPENING
HAVING
HEALTH
HEARD
HEART
HEAVY
HIGHER
HISTORY
HOMES
HOSPITAL
HOURS
HOUSE
HOUSING
HUMAN
HUNDREDS
IMMIGRATION
IMPACT
IMPORTANT
INCREASE
INDEPENDENT
INDUSTRY
INFLATION
INFORMATION
INQUIRY
INSIDE
INTEREST
INVESTMENT
INVOLVED
IRELAND
ISLAMIC
ISSUE
ISSUES
ITSELF
JAMES
JUDGE
JUSTICE
KILLED
KNOWN
LABOUR
LARGE
LATER
LATEST
LEADER
LEADERS
LEADERSHIP
LEAST
LEAVE
LEGAL
LEVEL
LEVELS
LIKELY
LITTLE
LIVES
LIVING
LOCAL
LONDON
LONGER
LOOKING
54. Viseme concatenation
TC-GAN generated output with inter-visemes
Output for a Hindi phrase, "Aap kaise hai" (How are you)
55. LIFI: Towards Linguistically Informed Frame Interpolation
Aradhya Neeraj Mathur¹, Devansh Batra², Yaman Kumar¹, Rajiv Ratn Shah¹, Roger Zimmermann³
Indraprastha Institute of Information Technology Delhi, India¹
Netaji Subhas University of Technology, Delhi²
National University of Singapore (NUS)³
56. Motivation
• Speech videos are extremely common across the internet (lectures, YouTube videos, and even video-calling apps), but no video interpolation methods pay heed to the nuances of speech videos.
• The visual modality of speech is complicated: while uttering a single sentence, our lips cycle through dozens of visemes.
• First 30 frames of a speaker saying the sentence "I don't exactly walk around with a hundred and thirty five million dollars in my wallet". Notice the rich lip movement, with opening and closing of the mouth.
57. Motivation
We try to reconstruct this speech video by interpolating the intermediate frames from the first and last frames using state-of-the-art models.
Expected: the original frames (with rich mouth movements)
Observed: the interpolated frames (with virtually no mouth movements)
Some surprising metrics: L1 = 0.0498, MSE = 0.0088, SSIM = 0.9521, PSNR = 20.5415. These are surprisingly good!?
This means that we need better evaluation criteria for the interpolation or reconstruction of speech videos.
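For reference, these four frame-level metrics can be computed as in the sketch below (using scikit-image on random stand-in frames, not the slide's actual data):

```python
# Sketch of how such frame-level metrics are computed, using scikit-image
# on two random stand-in frames; the slide's numbers come from real video.
import numpy as np
from skimage.metrics import (mean_squared_error, peak_signal_noise_ratio,
                             structural_similarity)

rng = np.random.default_rng(0)
original = rng.random((128, 128))  # stand-in for a ground-truth frame
predicted = np.clip(original + 0.05 * rng.random((128, 128)), 0.0, 1.0)

print("L1  :", np.abs(original - predicted).mean())
print("MSE :", mean_squared_error(original, predicted))
print("SSIM:", structural_similarity(original, predicted, data_range=1.0))
print("PSNR:", peak_signal_noise_ratio(original, predicted, data_range=1.0))
# A reconstruction that freezes the mouth can still score well on all four,
# which is why the slide argues for better evaluation criteria.
```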
58. Proposed Work
1. Challenge Datasets for Speech Video Reconstruction (based on LRS3-TED)
Guess the words spoken?
……………
"Well the short answer to the question is no, it's not the same thing"
Corruption types: Random Frame Corruption (40%), Extreme Sparsity Corruption (75%), Prefix Corruption, Suffix Corruption
59. Proposed Work
1. Challenge Datasets for Speech Video Reconstruction (based on LRS3-TED)
• Visemic Corruption: visemes of a particular type being corrupted and requiring regeneration
• Intra-Word Corruption: corruption of frames within the occurrence of a long word
• Inter-Word Corruption: corruption of frames across word boundaries
60. Proposed Work
2. Visemic reconstruction with an ROI Loss unit
A modified FCN3D with an ROI extraction unit to calculate the ROI loss.
Instead of training the reconstruction network with only the L1 loss between the reconstructed and original images, we introduce an ROI loss, which measures the similarity between the visemic regions of interest of the observed and generated facial images.
To accomplish this, we develop an ROI unit as shown on the left.
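A minimal PyTorch sketch of such a combined objective; the mouth-region mask and the `roi_weight` factor are our own placeholders, and the paper's actual ROI unit may differ.

```python
# Combined reconstruction objective, sketched: an L1 term on the full frame
# plus an L1 term restricted to the visemic region of interest. The mask
# and the weighting are illustrative placeholders.
import torch
import torch.nn.functional as F

def reconstruction_loss(generated: torch.Tensor,
                        original: torch.Tensor,
                        roi_mask: torch.Tensor,
                        roi_weight: float = 1.0) -> torch.Tensor:
    """generated/original: (B, C, H, W) frames; roi_mask: (B, 1, H, W),
    1 inside the visemic region of interest, 0 elsewhere."""
    full_l1 = F.l1_loss(generated, original)
    roi_l1 = F.l1_loss(generated * roi_mask, original * roi_mask)
    return full_l1 + roi_weight * roi_l1
```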
61. Proposed Work
Key Findings
We evaluate a fully convolutional network (FCN3D), a convolutional bi-directional LSTM, and the original FCN3D network after the addition of the ROI unit and visemic loss during training.
We observe:
1. Different networks perform differently on different types of corruption.
2. While SuperSloMo performs very well on random frame corruption, it performs much worse on the other types of corruption.
3. As expected, a sequential LSTM-based generator works much better than a fully convolutional network when there are corruptions in consecutive frames, as in prefix and suffix corruption.
4. Most importantly, the addition of an ROI loss also helps a network perform better on all forms of corruption and on non-ROI-based metrics, as shown by the results for FCN3D+ROI.
Performance of different models over datasets containing random, prefix, and suffix corruptions
Performance of different models over datasets containing corruptions on different visemes
62. Touchless Typing Using Head Movement-based Gestures
Shivam Rustagi¹, Aakash Garg¹, Pranay Raj Anand², Rajesh Kumar³, Yaman Kumar², Rajiv Ratn Shah²
Delhi Technological University, India¹
Indraprastha Institute of Information Technology Delhi, India²
Haverford College, USA³
65. Related Work
[1] A. Nowosielski, "Two-letters-key keyboard for predictive touchless typing with head movements"
[2] J. Tu, H. Tao, and T. Huang, "Face as mouse through visual face tracking"
[3] M. Nabati and A. Behrad, "3D head pose estimation and camera mouse implementation using a monocular video camera"
66. Related Work
MID-AIR TOUCHLESS TYPING TECHNIQUES
[4] A. Markussen, M. R. Jakobsen, and K. Hornbæk, "Vulture: A mid-air word-gesture keyboard" (using fingers)
[5] C. Yu, Y. Gu, Z. Yang, X. Yi, H. Luo, and Y. Shi, "Tap, dwell or gesture? Exploring head-based text entry techniques for HMDs" (using head)
67. Proposed Work
For the 10,000 most common English words there are 8,529 unique cluster sequences, with each sequence having on average 1.17 different words. So once we predict the cluster sequence, it can be translated to 1-2 valid words on average.
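The lookup this implies can be sketched as below; the letter-to-cluster grouping is a hypothetical stand-in for the paper's color-coded keyboard, and the example words come from Table 1 later in this deck.

```python
# Sketch of the cluster-sequence-to-word lookup implied above. The
# letter-to-cluster grouping is an illustrative placeholder; the paper's
# color-coded keyboard defines the real clusters.
from collections import defaultdict

# Hypothetical grouping: consecutive blocks of 7 letters share a cluster id.
LETTER_TO_CLUSTER = {ch: i // 7 for i, ch in
                     enumerate("abcdefghijklmnopqrstuvwxyz")}

def cluster_sequence(word: str) -> tuple[int, ...]:
    return tuple(LETTER_TO_CLUSTER[ch] for ch in word)

vocabulary = ["take", "live", "house", "learn", "come"]  # from Table 1
lookup = defaultdict(list)
for word in vocabulary:
    lookup[cluster_sequence(word)].append(word)

# A predicted cluster sequence resolves to the (usually 1-2) words that
# share it:
print(lookup[cluster_sequence("take")])
```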
68. Data Collection: Setup
Equipment / Configuration / Purpose:
• Monitor and keyboard: a 17-inch monitor and a standard keyboard. The color-coded QWERTY keyboard was displayed on the monitor; the keyboard was used to start and stop recording.
• Cameras (on tripods): 3 Samsung M10 mobile cameras recording video at 30 fps and 1920 x 1080 resolution, all with the OpenCamera app installed, plus 1 Samsung M10 mobile with the MUSE2 app. The 3 cameras were kept at angles of -45, 0, and 45 degrees respectively, to record the head movements.
• MUSE2 headband: sensors such as an accelerometer and a gyroscope, which recorded the acceleration and rotation of the head.
• Moderator's laptop: standard. The Python script on the laptop was responsible for starting and stopping the cameras simultaneously.
*Note: For our research we used only the central-view (Camera-2) recordings.
69. Data Collection: Description
❑ Total number of users who volunteered = 25 (16 male; 9 female; 3 users' data discarded on manual inspection)
❑ Each user recorded 3 video samples for each of 35 items (words: 20, phrases: 10, sentences: 5, as per Table 1)
❑ Total number of video samples = 2310 (22 x 35 x 3)
Words: locate, single, family, would, place, large, work, take, live, box, method, listen, house, learn, come, some, ice, old, fly, leg
Phrases: hello, excuse me, i am sorry, thank you, good bye, see you, nice to meet you, you are welcome, how are you, have a good time
Sentences: i never gave up, best time to live, catch the trade winds, hear a voice within you, he will forget it
Table 1. The list of 20 words, 10 phrases, and 5 sentences typed by each user. Each item was recorded 3 times.
71. Data Collection: Statistics
Avg. number of letters per entry: Words 4.33, Phrases 10.6, Sentences 18.6
❏ The words were selected to have proper cluster coverage.
❏ The phrases and sentences were selected from the OuluVS [6] and TIMIT [7] datasets, respectively.
Fig. Coverage of each cluster across the dataset
Fig. Average gestures per minute for each user (avg = 49.26, std = 5.3)
72. Hopenet Architecture
The proposed method is based on a CNN-RNN architecture. The feature-extractor part, shown above, is based on the HopeNet architecture, which predicts the yaw, pitch, and roll features for the input image. The network is trained using a multi-task classification scheme. We utilize the available model pretrained on large-pose face images from the 300W dataset.
73. Working of Hopenet
HopeNet output visualized on a user. The three vectors are constructed from the Euler angles (features) predicted by the network.
74. CNN-RNN architecture
The features from HopeNet are passed into a multi-layered BiGRU network, which is trained using a CTC loss function. During the inference phase we use beam search to decode the cluster sequence.
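A minimal PyTorch sketch of such a BiGRU + CTC setup, with made-up dimensions; the paper's layer sizes and cluster alphabet may differ.

```python
# BiGRU over per-frame head-pose features, trained with CTC. All sizes
# below (hidden units, cluster alphabet, sequence lengths) are made up.
import torch
import torch.nn as nn

num_clusters = 8  # hypothetical cluster alphabet size
gru = nn.GRU(input_size=3, hidden_size=64, num_layers=2,
             bidirectional=True, batch_first=True)
head = nn.Linear(2 * 64, num_clusters + 1)  # +1 for the CTC blank symbol
ctc = nn.CTCLoss(blank=num_clusters)

frames = torch.randn(4, 120, 3)  # (batch, time, yaw/pitch/roll)
hidden, _ = gru(frames)
log_probs = head(hidden).log_softmax(-1).transpose(0, 1)  # (T, B, C)

targets = torch.randint(0, num_clusters, (4, 10))  # cluster-id sequences
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 120),
           target_lengths=torch.full((4,), 10))
loss.backward()
```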
76. Results
The method is evaluated in two scenarios:
● Inter-user: training on user set S1 and testing on user set S2, such that S1 and S2 are mutually exclusive. Cluster sequences are kept the same for training and testing.
● Intra-user: for every user (i.e., the set S = S1 ∪ S2), we record 3 samples per sequence. 2 samples were used for training, and testing was done on the 3rd sample.
77. Conclusion and Future Work
Our work presents a meaningful way of mapping gestures to character (cluster) sequences, which could be beneficial for people with disabilities.
Also, our dataset is publicly available, which could help improve the current system.
In the future, the aim is to improve performance by:
1. Using more training data containing a variety of meaningful sequences, and
2. Combining video feeds from multiple cameras, brainwaves recorded via EEG sensors, and the acceleration and rotation of the user's head recorded via accelerometer and gyroscope.
Other future applications could also work in the direction of integrating the interface with wearable devices and mobile computing. This will bring together a newer set of applications, like browsing from wearable glasses.
79. SeekSuspect: Retrieving Suspects from Criminal Datasets using Visual Memory
Aayush Jain*, Meet Shah*, Suraj Pandey*, Mansi Agarwal*, Rajiv Ratn Shah, Yifang Yin
● Police maintain a crime dossier system that contains information such as photographs and physical details.
● Finding suspects by name is possible, but this fails when we only have an informant's visual memory.
● Law enforcement agencies used to hire sketch artists, but sketch artists are limited in number.
● We propose SeekSuspect, a fast, interactive suspect retrieval system.
● SeekSuspect employs sophisticated deep learning and computer vision techniques
  ○ to modify the search space and
  ○ find the envisioned image effectively and efficiently
Fig. SeekSuspect in action: an informant with only a vague visual memory ("I do not exactly remember who she was") gives a rough description ("Female, fair, black hair..."); the system iteratively retrieves relevant and similar images until it can ask, "Is this the person you wish to search for?"
82. Team
• Director: Dr. Rajiv Ratn Shah
• PhD Students: Hitkul, Shivangi, Ritwik, Mohit, Yaman, Hemant, Kriti, Astha
• MTech Students: Abhishek, Suraj, Meet, Aayush, William, Subhani, etc.
• Research Assistants: Manraj, Pakhi, Karmanya, Mehar, Saket, Anuj, etc.
• BTech Students (both full-time and remote students):
• DTU: Maitree Leekha, Mansi Agarwal, Shivang Chopra, Rohan Mishra, Himanshu, etc.
• NSUT: Ramit Sahwney, Puneet Mathur, Avinash Swaminathan, Rohit Jain, Hritwik, etc.
• IIT: Pradyumn Gupta, Abhigyan Khaund, Palak Goenka, Amit Jindal, Prateek Manocha, etc.
• IIIT: Vedant Bhatia, Raj K Gupta, Shagun Uppal, Osheen Sachdev, Siddharth Dhawan, etc.
• Alumni (Placements, Internships, MS Admissions):
• Companies: Google, Microsoft, Amazon, Adobe, Tower Research, Walmart, Qualcomm, Goldman Sachs, Bloomberg, IBM Research, Wadhwani AI, Samsung Research, etc.
• Academia: CMU, Columbia University, University of Pennsylvania, University of Maryland,
University of Southern California, Erasmus Mundus, University of Virginia, Georgia Tech, etc.
83. Collaborators
• Prof Roger Zimmermann, National University of Singapore, Singapore
• Prof Changyou Chen, State University of New York at Buffalo, USA
• Prof Mohan Kankanhalli, National University of Singapore, Singapore
• Prof Ponnurangam Kumaraguru (PK), IIIT Delhi, India
• Dr. Amanda Stent, Bloomberg, New York, USA
• Dr. Debanjan Mahata, Bloomberg, New York, USA
• Prof. Rada Mihalcea, University of Michigan, USA
• Prof. Shin'ichi Satoh, National Institute of Informatics, Japan
• Prof. Jessy Li, University of Texas at Austin, USA
• Prof. Huan Liu, Arizona State University, USA
• Prof. Naimul Khan, Ryerson University, Canada
• Prof. Diyi Yang, Georgia Institute of Technology, USA
• Prof Payman Vafaee, Columbia University, USA
• Prof Cornelia Caragea, University of Illinois at Chicago, USA
• Dr. Mika Hama, SLTI, USA, and many more...
84. Research (AI for Social Good)
• NLP and Multimedia based systems for society (education, healthcare, etc.)
• Automatic speech recognition (ASR) for different domains and accents (e.g., Indian, African)
• Visual speech recognition/reconstruction (VSR) such as lipreading and speech reconstruction
• Hate speech and malicious user detection in code-switched scenarios on social media
• Mental health problems such as suicidal ideation and depression detection on social media
• Building multimodal information retrieval and information extraction systems
• Knowledge graph construction for different domains, e.g., medical, e-commerce, defence, etc.
• Automated systems for number plate and damage detection, car insurance claim, e-challan, etc.
• Multimodal sentiment analysis and its applications in education, policy making, etc.
• Detecting, analyzing, and recommending advertisements in video streams
• Fake news detection and propagation, suspect detection, personality detection, etc.
• Publications (but not limited to)
• AAAI, CIKM, ACL, EMNLP, WSDM, COLING, ACM Multimedia, ICDM, INTERSPEECH, WWW, ICASSP, WACV,
BigMM, IEEE ISM, NAACL, ACM Hypertext, ACM SIGSPATIAL, Elsevier KBS, IEEE Intelligent Systems, IEEE MIPR,
ACM MM Asia, AACL, Springer book chapters, etc.
85. Research (AI for Social Good)
• Awards (but not limited to)
• Won the outstanding paper award at COLING 2020
• Got selected to the Heidelberg Laureate Forum (HLF) in 2018, 2019, 2020
• Best student poster at AAAI 2019, Honolulu, Hawaii, USA
• Best poster and best industrial paper in IEEE BigMM 2019, Singapore
• Winner of the ACM INDIA Chapters Technology Solution Contest 2019 in Jaipur, India
• Won the honorable mention award in ICDM Knowledge Graph Contest 2019 in Beijing, China
• Won the best poster runner-up award at IEEE ISM 2018 conference in Taichung, Taiwan
• Skills, Tools, and Frameworks (but not limited to)
• Natural Language Processing, Image Processing, Speech Processing
• Multimodal Computing
• Python, JavaScript, Java
• AI/ Machine Learning/ Deep Learning
• Tensorflow, PyTorch, Keras, etc.
87. References
1. Conversational Systems and the Marriage of Speech & Language by Mari Ostendorf (University of Washington)
2. Speech 101 by Robert Moore, The University of Sheffield
3. https://www.youtube.com/watch?v=PWGeUztTkRA&ab_channel=Mark_Mitton
4. The Two Ronnies Show
5. Preliminaries to a Theory of Speech Disfluencies (Elizabeth Shriberg, 1994)
6. A Short Analysis of Discourse Coherence (Wang and Guo, 2014)
7. A. Nowosielski, “Two-letters-key keyboard for predictive touchless typing with head movements,” 07 2017, pp. 68–79
8. J. Tu, H. Tao, and T. Huang, “Face as mouse through visual face tracking,” Comput. Vis. Image Underst., vol. 108, no. 1–2, p. 35–40, Oct. 2007.
[Online]. Available: https://doi.org/10.1016/j.cviu.2006.11.007
9. M. Nabati and A. Behrad, “3D head pose estimation and camera mouse implementation using a monocular video camera,” Signal, Image and Video Processing, vol. 9, 01 2012.
10. A. Markussen, M. R. Jakobsen, and K. Hornbæk, “Vulture: A mid-air word-gesture keyboard,” in CHI ’14, 2014.
11. C. Yu, Y. Gu, Z. Yang, X. Yi, H. Luo, and Y. Shi, “Tap, dwell or gesture? exploring head-based text entry techniques for hmds,” in CHI ’17, 2017.
12. Zhao G, Barnard M & Pietikäinen M (2009) Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia 11(7):1254-1265.
13. Garofolo, J. & Lamel, Lori & Fisher, W. & Fiscus, Jonathan & Pallett, D. & Dahlgren, N. & Zue, V.. (1992). TIMIT Acoustic-phonetic Continuous
Speech Corpus. Linguistic Data Consortium.