Machine translation and computer vision have greatly benefited from the advances in deep learning. Large and diverse amounts of textual and visual data have been used to train neural networks, in either a supervised or a self-supervised manner. Nevertheless, the convergence of the two fields in sign language translation and production still poses multiple open challenges, such as the scarcity of video resources, limitations in hand pose estimation, and 3D spatial grounding from poses. This talk will present these challenges and the How2✌️Sign dataset (https://how2sign.github.io), recorded at CMU in collaboration with UPC, BSC, Gallaudet University and Facebook.
https://imatge.upc.edu/web/publications/sign-language-translation-and-production-multimedia-and-multimodal-challenges-all
2. Current & former students
Benet Oriol
Jordi Aguilar
Cayetana López
Lucas Ventura
Amanda Duarte
Laia Tarrés
Andrea Iturralde
Maram A. Mohamed
Álvaro Budria
Sandra Roca
Daniel Moreno
Janna Escur
Mireia Hernández
Peter Muschick
Pol Pérez
Görkem Camli
Jordi López
Gerard Gállego
5. Classic Motivation: Accessibility
“World Report on Hearing”. World Health Organization 2021.
Number of people and percentage prevalence according to grades of hearing loss.
7. Classic Motivation: Accessibility to basic services
“World Report on Hearing”. World Health Organization 2021.
● Sign language interpretation improves access to education and health services.
○ A survey conducted in 2009 by the World Federation of the Deaf revealed that 68% of the 93 responding countries did not have access to professional sign language interpretation.
○ Professional sign language interpreters are even more scarce in developing countries.
8. Classic Motivation: Accessibility
● New challenges for the deaf community because of the COVID-19 pandemic.
https://whereistheinterpreter.com/
#whereistheinterpreter
“Due to the pandemic, more and more medical professionals are treating COVID-19 patients from behind a barrier, using masks that impede lip-reading, and not allowing in-person interpreters,” says the National Association of the Deaf.
Summer Epps, “COVID’s Forgotten Victims: The Deaf Community”. WebMD 2021.
15. A crash course on Sign Languages (SL)
Cultural diversity of sign languages, similar to spoken languages
○ American (ASL), British (BSL), German (GSL), Chinese (CSL)… sign languages.
Irish Sign Language (ISL) Catalan Sign Language (LSC)
16. A crash course on Sign Languages (SL)
Sign languages are NOT a one-to-one mapping from spoken languages.
Look-Up Table
Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
Sign Language (video)
Spoken Language (transcription)
17. A crash course on Sign Languages (SL)
There exists a textual transcription method based on “glosses”.
Spoken Language (transcription): Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
Sign Language (transcription): HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR
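A gloss line like the one above can be handled as a plain token sequence. The following is a minimal sketch, assuming that tokens are separated by whitespace or commas and that the `FS-` prefix marks a fingerspelled word, as in `FS-AMELIA`. The names `GlossToken` and `parse_gloss` are hypothetical and are not part of any annotation toolkit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GlossToken:
    text: str            # the gloss label, e.g. "HI" or "AMELIA"
    fingerspelled: bool  # True when the token carried the FS- prefix


def parse_gloss(line: str, fs_prefix: str = "FS-") -> List[GlossToken]:
    """Split a gloss transcription into tokens and flag fingerspelled ones.

    Assumption: commas are only separators and carry no meaning.
    """
    tokens = []
    for raw in line.replace(",", " ").split():
        word = raw.upper()
        if word.startswith(fs_prefix):
            tokens.append(GlossToken(word[len(fs_prefix):], True))
        else:
            tokens.append(GlossToken(word, False))
    return tokens


tokens = parse_gloss("HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR")
print([t.text for t in tokens if t.fingerspelled])
```

Keeping the fingerspelling flag separate from the label lets a downstream model treat spelled names differently from lexical signs, which is one possible design choice among several.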
18. A crash course on Sign Languages (SL)
● Manual features:
○ Handshape
○ Palm orientation
○ Movement
○ Location
● Non-manual features:
○ Head (nod / shake / tilt)
○ Mouth
○ Eyebrows
○ Cheeks
○ Facial grammar (or expressions)
○ Body position
Stokoe Jr, William C. "Sign language structure: An outline of the visual communication systems of the American deaf." Journal of
deaf studies and deaf education (2005).
Figure: Arizona State University
19. A crash course on Sign Languages (SL)
SLs use persistent spatial grounding (e.g. by pointing & placing)!
Liddell, Scott K. "Spatial representations in discourse: Comparing spoken and signed language." Lingua (1996).
“Right along here…” ...immobile entity is
located here,
20. A crash course on Sign Languages (SL)
SLs use persistent spatial grounding (e.g. by pointing & placing)!
Liddell, Scott K. "Spatial representations in discourse: Comparing spoken and signed language." Lingua (1996).
“Not far and to the
right of,
...tall, vertical entity at this place.
22. Sign-to-Spoken Language Tasks
SL Translation: Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
Continuous SL Recognition: HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR
Isolated SL Recognition: “I”
Finger-spelling: A, B, C, D...
GIPHY/SIGN WITH ROBERT
24. Sign-Spoken Language Tasks
SL Production
SL Translation
Sign Language
(video)
Spoken Language
(transcription)
Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
25. Neural Machine Translation
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NeurIPS 2014.
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase
representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
Encoder Decoder
Representation
Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
Dia duit, is mise Amelia agus beidh mé ag caint leat faoi conas guma a bhaint de ghruaig.
26. Automatic Speech Recognition (ASR)
Encoder Decoder
Representation
Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." ICML 2014.
#LAS Chan, William, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. "Listen, attend and spell: A neural network for large vocabulary
conversational speech recognition." ICASSP 2016.
27. Image Captioning
Encoder Decoder
Representation
A group of people shopping at an outdoor market.
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
28. Neural Sign Language Translation
Encoder Decoder
Representation
Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
29. Neural Sign Language Translation
Camgoz, Necati Cihan, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden.
"Neural sign language translation." CVPR 2018.
30. Neural Sign Language Translation
Camgoz, Necati Cihan, Oscar Koller, Simon Hadfield, and Richard Bowden. "Sign language
transformers: Joint end-to-end sign language recognition and translation." CVPR 2020.
31. Neural Sign Language Production
Encoder Decoder
Representation
Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
32. Neural Sign Language Production
Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Mixed SIGNals: Sign Language Production via
a Mixture of Motion Primitives." ICCV 2021.
33. Neural Sign Language Production
Encoder Decoder
Representation
Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
34. Neural Sign Language Production
Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Progressive transformers for end-to-end
sign language production." ECCV 2020.
35. Neural Sign Language Production
Stoll, Stephanie, Necati Cihan Camgoz, Simon Hadfield, and Richard Bowden. "Text2Sign: Towards sign
language production using neural machine translation and generative adversarial networks." IJCV 2020.
36. Neural Sign Language Production
Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Everybody sign now: Translating spoken
language to photo realistic sign language video." arXiv 2020.
39. Challenges in Computer Vision
Off-the-shelf pose detectors and generators struggle with hands.
40. Challenges in Computer Vision
Zhou, Yuxiao, Marc Habermann, Weipeng Xu, Ikhsanul Habibie, Christian Theobalt, and Feng Xu. "Monocular real-time
hand shape and motion capture using multi-modal data." CVPR 2020.
41. Challenges in Computer Vision
Weinzaepfel, Philippe, Romain Brégier, Hadrien Combaluzier, Vincent Leroy, and Grégory Rogez. "Dope: Distillation of
part experts for whole-body 3d pose estimation in the wild." ECCV 2020.
42. Challenges in Computer Vision
Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Progressive transformers for end-to-end sign language
production." ECCV 2020.
43. Challenges in Computer Vision
Ng, Evonne, Shiry Ginosar, Trevor Darrell, and Hanbyul Joo. "Body2hands: Learning to infer 3d hands from
conversational gesture body dynamics." CVPR 2021.
45. Challenges in NLP
Sign Languages are:
🤔
(Very) low-resource languages…
...in a (very) high dimensional space (video).
46. Challenges in NLP
Figure: TensorFlow tutorial
Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. "A neural probabilistic language model." Journal of machine learning
research 3, no. Feb (2003): 1137-1155.
🤔
What are “language models” in sign language?
47. Challenges in NLP
How to transfer from large pre-trained (“foundation”) models?
#GPT-3 Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Agarwal, S. Language models
are few-shot learners. NeurIPS 2020 (best paper award).
Source: [OpenAI API]
English: My name is Barbara.
ASL: ME NAME fs-B-A-R-B-A-R-A.
English: Is he a teacher?
ASL: HE TEACHER HE
English: Amir is tall.
ASL: fs-A-M-I-R, HE TALL HE
English: I’m not sad.
ASL: ME SAD ME 🤔
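The English/ASL lines above follow the usual few-shot prompting layout: a handful of solved pairs followed by an unsolved query. The sketch below only assembles such a prompt as text and calls no model or API. Reading the first three pairs as demonstrations and the fourth as the query is an interpretation of the slide, and `build_few_shot_prompt` is a hypothetical helper name.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Format (English, ASL gloss) pairs followed by an open query line."""
    lines = []
    for english, gloss in examples:
        lines.append(f"English: {english}")
        lines.append(f"ASL: {gloss}")
    # The final "ASL:" is left empty for the language model to complete.
    lines.append(f"English: {query}")
    lines.append("ASL:")
    return "\n".join(lines)


examples = [
    ("My name is Barbara.", "ME NAME fs-B-A-R-B-A-R-A."),
    ("Is he a teacher?", "HE TEACHER HE"),
    ("Amir is tall.", "fs-A-M-I-R, HE TALL HE"),
]
prompt = build_few_shot_prompt(examples, "I'm not sad.")
print(prompt)
```

The completion shown on the slide for the last query drops the negation, which illustrates why transferring from text-only foundation models to glosses is listed as an open challenge.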
49. Challenges in Speech Translation
Jia, Ye, Michelle Tadmor Ramanovich, Tal Remez, and Roi Pomerantz. "Translatotron 2: Robust direct speech-to-speech
translation." arXiv preprint arXiv:2107.08661 (2021).
Speech Video
Speech Speech
End-to-end End-to-end
🤔
51. Challenges in Training Data
Damen, Dima, and Michael Wray. "Supervision Levels Scale (SLS)." arXiv (2020). [tweet]
Data(X)
Labels(y)
52. Challenges in Training Data
Damen, Dima, and Michael Wray. "Supervision Levels Scale (SLS)." arXiv (2020). [tweet]
X
53. Parallel corpus
Fully supervised learning requires a large dataset of pairs of sentences in the two
languages to translate.
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning
phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
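As a rough illustration of what a parallel corpus looks like in code, the sketch below pairs two line-aligned lists of sentences and drops pairs where either side is empty. It uses a gloss string as a text stand-in for the sign language side, which is an assumption made for brevity. `SentencePair` and `build_parallel_corpus` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SentencePair:
    source: str  # sentence in the source language
    target: str  # its translation in the target language


def build_parallel_corpus(
    source_lines: Sequence[str], target_lines: Sequence[str]
) -> List[SentencePair]:
    """Pair line-aligned sentences, skipping pairs with an empty side."""
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of lines")
    pairs = []
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            pairs.append(SentencePair(src, tgt))
    return pairs


english = [
    "Hi, I'm Amelia and I'm going to talk to you about how to remove gum from hair.",
    "   ",  # an empty line, dropped below
]
gloss = [
    "HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR",
    "",
]
corpus = build_parallel_corpus(english, gloss)
print(len(corpus))
```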
55. The How2Sign dataset
Multi-view RGB videos RGB-D videos
Body-face-hands keypoints
2D keypoints estimation from OpenPose [2]
How2 dataset [1]
Speech Signal
English Transcription
Hi, I’m Amelia and I’m going
to talk to you about how to
remove gum from hair.
Instructional videos
Multi-view VGA and HD videos [3]
Multi-view recordings (only for a subset)
3D keypoints
estimation
Gloss Annotation
HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR
Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F., ... & Giro-i-Nieto, X.
How2Sign: a large-scale multimodal dataset for continuous American sign language. CVPR 2021.
56. Continuous Sign Language Datasets
Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F., ... & Giro-i-Nieto, X.
How2Sign: a large-scale multimodal dataset for continuous American sign language. CVPR 2021.
57. The How2Sign dataset: Recorded at CMU
Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F., ... & Giro-i-Nieto, X.
How2Sign: a large-scale multimodal dataset for continuous American sign language. CVPR 2021.
58. The largest dataset in ASL
Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F., ... & Giro-i-Nieto, X.
How2Sign: a large-scale multimodal dataset for continuous American sign language. CVPR 2021.
59. Built on top of How2
Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F., ... & Giro-i-Nieto, X.
How2Sign: a large-scale multimodal dataset for continuous American sign language. CVPR 2021.
60. Built on top of How2
Spoken Language
(speech)
SL Production
SL Translation
Sign Language
(video)
Spoken Language
(transcription)
Hi, I’m Amelia and I’m going to
talk to you about how to
remove gum from hair.
Synthesis
ASR
#How2 Sanabria, Ramon, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. "How2: a large-scale dataset for
multimodal language understanding." arXiv 2018.
61. Built on top of How2
How2 dataset [1]
Speech Signal
English Transcription
Hi, I’m Amelia and I’m going
to talk to you about how to
remove gum from hair.
Instructional videos
[1] Sanabria, Ramon, et al. "How2: a large-scale dataset for multimodal language understanding." arXiv preprint arXiv:1811.00347 (2018).
English Speech
Speech track available for end-to-end English to ASL.
English Transcriptions
Automatically generated subtitles aligned at the
sentence level.
English to Brazilian Portuguese Translations
Allows multilingual research.
64. Green Studio
Multi-view RGB videos
RGB-D videos
Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., Kanade, T., Nobuhara, S.,
Sheikh, Y.: Panoptic studio: A massively multiview system for social motion capture. In:
ICCV, 2015.
Panoptic Studio
Multi-view recordings (only for a subset)
Multi-view VGA and HD videos
65. 2D & 3D pose estimation
Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F., ... & Giro-i-Nieto, X.
How2Sign: a large-scale multimodal dataset for continuous American sign language. CVPR 2021.
66. 2D & 3D pose estimation
Multi-view RGB videos
Body-face-hands keypoints
2D keypoints estimation from OpenPose [1]
Multi-view recordings (only for a subset)
3D keypoints estimation [2]
[1] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei and Y. A. Sheikh, "OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields" in TPAMI, 2019.
[2] Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., Kanade, T., Nobuhara, S., Sheikh, Y.: Panoptic studio: A massively multiview system for social motion capture. In: ICCV, 2015.
Multi-view VGA and HD videos
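For readers who want to work with body-face-hands keypoints, the sketch below reads one frame in the per-frame JSON layout that OpenPose writes, where each body part is a flat list of `x, y, confidence` values. Whether the released How2Sign files use exactly this layout is an assumption here, and `parse_frame` and `to_triplets` are hypothetical helper names.

```python
import json
from typing import Dict, List, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence)

# Keys used by OpenPose's JSON output for each body part.
PARTS = {
    "body": "pose_keypoints_2d",
    "face": "face_keypoints_2d",
    "left_hand": "hand_left_keypoints_2d",
    "right_hand": "hand_right_keypoints_2d",
}


def to_triplets(flat: List[float]) -> List[Keypoint]:
    """Group a flat [x0, y0, c0, x1, y1, c1, ...] list into triplets."""
    if len(flat) % 3 != 0:
        raise ValueError("keypoint list length must be a multiple of 3")
    return [(flat[i], flat[i + 1], flat[i + 2]) for i in range(0, len(flat), 3)]


def parse_frame(frame_json: str, person_index: int = 0) -> Dict[str, List[Keypoint]]:
    """Return the keypoints of one person, with empty lists for missing parts."""
    people = json.loads(frame_json).get("people", [])
    if person_index >= len(people):
        return {name: [] for name in PARTS}
    person = people[person_index]
    return {name: to_triplets(person.get(key, [])) for name, key in PARTS.items()}


# A tiny hand-written frame for illustration, not real dataset content.
sample = json.dumps({
    "people": [{
        "pose_keypoints_2d": [10.0, 20.0, 0.9, 30.0, 40.0, 0.8],
        "hand_left_keypoints_2d": [1.0, 2.0, 0.5],
    }]
})
frame = parse_frame(sample)
print(frame["body"])
```

Keeping the confidence value next to each coordinate makes it easy to mask low-confidence hand keypoints, which matter here because off-the-shelf detectors struggle with hands.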
71. Application: Human motion transfer
Ventura, Lucas, Amanda Duarte, and Xavier Giró-i-Nieto. "Can everybody sign now? Exploring sign
language video generation from 2D poses." ECCV 2020 SLRTP Workshop.
72. Application: Human motion transfer
2D Pose estimation [OpenPose]
GAN-generated [Everybody dance now]
73. Application: Human motion transfer
Ventura, Lucas, Amanda Duarte, and Xavier Giró-i-Nieto. "Can everybody sign now? Exploring sign
language video generation from 2D poses." ECCV 2020 SLRTP Workshop.
74. Can ASL signers understand our generated videos?
“Choose one category”
Skeleton
GAN-generated
Classification accuracy
75. Can ASL signers understand our generated videos?
“How well could you understand the video?”
Skeleton
GAN-generated
Mean Opinion Score
76. Can ASL signers understand our generated videos?
“Translate the ASL signs into written English.”
Skeleton
GAN-generated
77. Challenges in Training Data
Damen, Dima, and Michael Wray. "Supervision Levels Scale (SLS)." arXiv (2020). [tweet]
X
78. Challenges in Training Data
Yin, Kayo, and Jesse Read. "Better Sign Language Translation with
STMC-Transformer." COLING 2020. [talk]
Moryossef, Amit, Kayo Yin, Graham Neubig, and Yoav Goldberg. "Data
Augmentation for Sign Language Gloss Translation." arXiv 2021.
Generation of gloss pseudo-labels by training a transformer.
Moreno D, Duarte A, Costa-jussà MR, Giró-i-Nieto X.
English to ASL Translator for Speech2Signs. UPC 2018.
79. Challenges in Training Data
Renz, Katrin, Nicolaj C. Stache, Samuel Albanie, and Gül Varol. "Sign language segmentation with temporal convolutional
networks." ICASSP 2021.
Sign segmentation in continuous sign language videos.
80. Challenges in Training Data
Bull, Hannah, Triantafyllos Afouras, Gül Varol, Samuel Albanie, Liliane Momeni, and Andrew Zisserman. "Aligning Subtitles in Sign
Language Videos." ICCV 2021.
Temporal alignment of automatic ASR subtitles with on-screen sign language video
82. Conclusion: Speech2Signs (and Signs2Speech)
End-to-end translation & production
Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR
Speech Language Gloss [1] Sign transcription [2] Video
3D Poses 2D Poses Segments [3]
Multiple vision, natural language & speech challenges for a societally impactful task.
[1] Yin, Kayo, and Jesse Read. "Better Sign Language Translation with STMC-Transformer." COLING 2020.
[2] Hanke, Thomas. "HamNoSys-representing sign language data in language resources and language processing contexts." In LREC, vol. 4, pp. 1-6. 2004.
[3] Renz, Katrin, Nicolaj C. Stache, Samuel Albanie, and Gül Varol. "Sign language segmentation with temporal convolutional networks." ICASSP 2021.
83. Supported by
Facebook AI
Interested in working with us on SL?
● @DocXavi
● xavier.giro@upc.edu
● Full list of publications & tech reports.
Thank you
These slides &
talk
https://how2sign.github.io/