Open challenges in sign language translation and production

Xavier Giro-i-Nieto
@DocXavi
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Open Challenges in
Sign Language Translation & Production
UPC Intelligent Data Science
and Artificial Intelligence
VASC Seminar
September 8, 2021

Current & former students
2
Benet
Oriol
Jordi
Aguilar
Cayetana
López
Lucas
Ventura
Amanda
Duarte
Laia
Tarrés
Andrea
Iturralde
Maram A.
Mohamed
Álvaro
Budria
Sandra
Roca
Daniel
Moreno
Janna
Escur
Mireia
Hernández
Peter
Muschick
Pol
Pérez
Görkem
Camli
Jordi
López
Gerard
Gállego

Acknowledgements
3
Shruti
Palaskar
Deepti
Ghadiyaram
Kenneth
DeHaan
Florian
Metze
Francesc
Moreno
Jordi
Torres
Marta R.
Costa-jussà
Kevin
McGuinness

Outline
4
Motivation
A crash course on sign languages (SL)
State of the art
Challenges
Conclusion

Classic Motivation: Accessibility
5
“World Report on Hearing”. World Health Organization 2021.
Number of people and
percentage prevalence
according to grades of
hearing loss.

6
Shelly Chadha, “Launch of the World Report on Hearing”. World Health Organization 2021.

Classic Motivation: Accessibility to basic services
7
“World Report on Hearing”. World Health Organization 2021.
● Sign language interpretation improves
access to education and health services.
○ A survey conducted in 2009 by the World
Federation of the Deaf revealed that 68% of the
93 responding countries did not have access to
professional sign language interpreters.
○ Professional sign language interpreters are even
more scarce in developing countries

8
● New challenges for the deaf community
because of the COVID-19 pandemic.
https://whereistheinterpreter.com/
#whereistheinterpreter
“Due to the pandemic, more and more medical
professionals are treating COVID-19 patients
from behind a barrier, using masks that impede
lip-reading, and not allowing in-person
interpreters,” says the National Association of
the Deaf.
Summer Epps, “COVID’s Forgotten Victims: The Deaf Community”. WebMD 2021.

9
Amit Moryossef, “Google Translate for Sign Language”. 2021. [talk] [code]

10
Google Home Max Amazon Echo Show 10
Facebook Portal

Novel Motivation: Human-Computer Interaction
11
Samsung, How to use the Gesture Control on Smart TV? (2020)

12

13
Computer Human
Teaching
that scales
Interaction
Interaction
Human

Outline
14
Motivation
State of the art
Challenges
Conclusion

A crash course on Sign Languages (SL)
Cultural diversity of sign languages, similar to spoken languages
○ American (ASL), British (BSL), German (DGS), Chinese (CSL)… sign languages.
15
Irish Sign Language (ISL) Catalan Sign Language (LSC)

Sign languages are NOT a one-to-one mapping from spoken languages.
16
Look-Up
Table
Hi, I’m Amelia and I’m
going to talk to you
about how to remove
gum from hair.
Sign Language
(video)
Spoken Language
(transcription)

There exists a textual transcription method named “glosses”.
17
HI, ME FS-AMELIA WILL
EXPLAIN HOW REMOVE
GUM FROM YOUR HAIR
Hi, I’m Amelia and I’m
going to talk to you about
how to remove gum from
hair.
Spoken Language
(transcription)
Sign Language
(transcription)
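The gloss line on this slide can be handled as a plain token sequence. The following is a minimal sketch, assuming that an "FS-" prefix marks a fingerspelled word (as in FS-AMELIA) and that commas only separate phrases; the function name `parse_gloss` and the output dictionary layout are hypothetical and not part of any annotation standard.

```python
def parse_gloss(gloss_line):
    """Split a gloss transcription into tokens and flag fingerspelled ones.

    Assumption: a token starting with "FS-" (any case) is fingerspelled,
    and hyphens inside it only separate the spelled letters.
    """
    tokens = []
    for raw in gloss_line.replace(",", " ").split():
        upper = raw.upper()
        if upper.startswith("FS-"):
            # Drop the prefix and any letter-separating hyphens.
            tokens.append({"gloss": upper[3:].replace("-", ""), "fingerspelled": True})
        else:
            tokens.append({"gloss": upper, "fingerspelled": False})
    return tokens


example = "HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR"
tokens = parse_gloss(example)
print([t["gloss"] for t in tokens if t["fingerspelled"]])
```

Keeping the fingerspelling flag separate from the gloss text lets a downstream model treat spelled names differently from lexical signs.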

● Manual features:
○ Handshape
○ Palm orientation, movement, location
● Non-manual features:
○ Head (nod / shake / tilt)
○ Mouth
○ Eyebrows
○ Cheeks
○ Facial grammar (or expressions)
○ Body position
18
Stokoe Jr, William C. "Sign language structure: An outline of the visual communication systems of the American deaf." Journal of
deaf studies and deaf education (2005).
Figure: Arizona State University

SLs use persistent spatial grounding (e.g., by pointing & placing)!
19
Liddell, Scott K. "Spatial representations in discourse: Comparing spoken and signed language." Lingua (1996).
“Right along here…” ...immobile entity is
located here,

SLs use persistent spatial grounding (e.g., by pointing & placing)!
20
Liddell, Scott K. "Spatial representations in discourse: Comparing spoken and signed language." Lingua (1996).
“Not far and to the
right of,
...tall, vertical entity at this place.

Outline
21
Motivation
State of the art
Challenges
Conclusion

Sign-to-Spoken Language Tasks
22
SL Translation Hi, I’m Amelia and I’m going to talk to you
about how to remove gum from hair.
GIPHY/SIGN WITH ROBERT
Isolated SL Recognition
Continuous SL Recognition
Finger-spelling
HI, ME FS-AMELIA WILL EXPLAIN
HOW REMOVE GUM FROM YOUR
HAIR
“I”
A, B, C, D...

Sign-to-Spoken Language Tasks
23
SL Translation Hi, I’m Amelia and I’m going to talk to you
about how to remove gum from hair.

Sign-Spoken Language Tasks
SL Production
SL Translation
Sign Language
(video)
24
Spoken Language
(transcription)
Hi, I’m Amelia and
I’m going to talk
to you about how
to remove gum
from hair.

Neural Machine Translation
25
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NeurIPS 2014.
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase
representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
Encoder Decoder
Representation
I’m going to talk to
you about how to
remove gum from
hair.
Dia duit, is mise
Amelia agus beidh
mé ag caint leat faoi
conas guma a bhaint
de ghruaig.

Automatic Speech Recognition (ASR)
26
Encoder Decoder
Representation
you about how to
remove gum from
hair.
Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." ICML 2014.
#LAS Chan, William, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. "Listen, attend and spell: A neural network for large vocabulary
conversational speech recognition." ICASSP 2016.

Image Captioning
27
Encoder Decoder
Representation
A group of people
shopping at an
outdoor market.
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

Neural Sign Language Translation
28
Encoder Decoder
Representation
you about how to
remove gum from
hair.

29
Camgoz, Necati Cihan, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden.
"Neural sign language translation." CVPR 2018.

30
Camgoz, Necati Cihan, Oscar Koller, Simon Hadfield, and Richard Bowden. "Sign language
transformers: Joint end-to-end sign language recognition and translation." CVPR 2020.

Neural Sign Language Production
31
Encoder Decoder
Representation
you about how to
remove gum from
hair.

32
Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Mixed SIGNals: Sign Language Production via
a Mixture of Motion Primitives." ICCV 2021.

33
Encoder Decoder
Representation
you about how to
remove gum from
hair.

34
Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Progressive transformers for end-to-end
sign language production." ECCV 2020.

35
Stoll, Stephanie, Necati Cihan Camgoz, Simon Hadfield, and Richard Bowden. "Text2Sign: Towards sign
language production using neural machine translation and generative adversarial networks." IJCV 2020.

36
Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Everybody sign now: Translating spoken
language to photo realistic sign language video." arXiv 2020.

Outline
37
Motivation
State of the art
Challenges
Conclusion

Challenges
38
Computer Vision
Speech
NLP
Training Data

Challenges in Computer Vision
39
Off-the-shelf pose detectors and generators struggle with hands.

40
Zhou, Yuxiao, Marc Habermann, Weipeng Xu, Ikhsanul Habibie, Christian Theobalt, and Feng Xu. "Monocular real-time
hand shape and motion capture using multi-modal data." CVPR 2020.

41
Weinzaepfel, Philippe, Romain Brégier, Hadrien Combaluzier, Vincent Leroy, and Grégory Rogez. "Dope: Distillation of
part experts for whole-body 3d pose estimation in the wild." ECCV 2020.

42
Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Progressive transformers for end-to-end sign language
production." ECCV 2020.

43
Ng, Evonne, Shiry Ginosar, Trevor Darrell, and Hanbyul Joo. "Body2hands: Learning to infer 3d hands from
conversational gesture body dynamics." CVPR 2021.

Challenges
44
Computer Vision
Speech
NLP
Training Data

Challenges in NLP
Sign Languages are:
45
🤔
(Very) low-resource
languages…
...in a (very) high
dimensional space (video).

Challenges in NLP
46
Figure: TensorFlow tutorial
Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. "A neural probabilistic language model." Journal of machine learning
research 3, no. Feb (2003): 1137-1155.
🤔
What are “language
models” in sign
language?

Challenges in NLP
47
How to transfer from
large pre-trained
(“foundation”) models?
#GPT-3 Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Agarwal, S. Language models
are few-shot learners. NeurIPS 2020 (best paper award).
Source: [OpenAI API]
English: My name is Barbara.
ASL: ME NAME fs-B-A-R-B-A-R-A.
English: Is he a teacher?
ASL: HE TEACHER HE
English: Amir is tall.
ASL: fs-A-M-I-R, HE TALL HE
English: I’m not sad.
ASL: ME SAD ME 🤔
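The English/ASL pairs on this slide follow the usual few-shot prompting layout. Below is a minimal sketch of how such a prompt string could be assembled, assuming the first three pairs are the in-context examples and the last English sentence is the query left for the model to complete. The helper `build_few_shot_prompt` is hypothetical, and no model or API is called here.

```python
def build_few_shot_prompt(examples, query):
    """Format (English, gloss) pairs followed by an open query line."""
    lines = []
    for english, gloss in examples:
        lines.append(f"English: {english}")
        lines.append(f"ASL: {gloss}")
    # The final "ASL:" is left empty so a language model can complete it.
    lines.append(f"English: {query}")
    lines.append("ASL:")
    return "\n".join(lines)


# Example pairs copied from the slide.
examples = [
    ("My name is Barbara.", "ME NAME fs-B-A-R-B-A-R-A."),
    ("Is he a teacher?", "HE TEACHER HE"),
    ("Amir is tall.", "fs-A-M-I-R, HE TALL HE"),
]
prompt = build_few_shot_prompt(examples, "I'm not sad.")
print(prompt)
```

With so few examples, nothing in the prompt demonstrates how negation is glossed, which may be one reason a completion such as "ME SAD ME" loses the "not".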

Challenges
48
Computer Vision
Speech
NLP
Training Data

Challenges in Speech Translation
49
Jia, Ye, Michelle Tadmor Ramanovich, Tal Remez, and Roi Pomerantz. "Translatotron 2: Robust direct speech-to-speech
translation." arXiv preprint arXiv:2107.08661 (2021).
Speech → Speech: end-to-end
Speech → Video: end-to-end 🤔

Challenges
50
Computer Vision
Speech
NLP
Training Data

Challenges in Training Data
51
Damen, Dima, and Michael Wray. "Supervision Levels Scale (SLS)." arXiv (2020). [tweet]
Data(X)
Labels(y)

52
X

Parallel corpus
53
Fully supervised learning requires a large dataset of pairs of sentences in the two
languages to translate.
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning
phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.

Continuous Sign Language Datasets
54

The How2Sign dataset
55
Multi-view RGB videos RGB-D videos
Body-face-hands keypoints
2D keypoints estimation from OpenPose [2]
How2 dataset [1]
Speech Signal
English Transcription
Hi, I’m Amelia and I’m going
to talk to you about how to
remove gum from hair.
Instructional videos
Multi-view VGA and HD videos [3]
Multi-view recordings (only for a subset)
3D keypoints
estimation
Gloss Annotation
HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR
Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F., ... & Giro-i-Nieto, X.
How2Sign: a large-scale multimodal dataset for continuous American sign language. CVPR 2021.

Continuous Sign Language Datasets
56

The How2Sign dataset: Recorded at CMU
57

The largest dataset in ASL
58

59
Built on top of How2

Spoken Language
(speech)
SL Production
SL Translation
Sign Language
(video)
60
Spoken Language
(transcription)
Hi, I’m Amelia and I’m going to
talk to you about how to
Synthesis
ASR
#How2 Sanabria, Ramon, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. "How2: a large-scale dataset for
multimodal language understanding." arXiv 2018.

How2 dataset [1]
Speech Signal
English Transcription
Instructional videos
[1] Sanabria, Ramon, et al. "How2: a large-scale dataset for multimodal language understanding." arXiv preprint arXiv:1811.00347 (2018).
English Speech
Speech track available for end-to-end English to ASL.
English Transcriptions
Automatically generated subtitles aligned at the
sentence level.
English to Brazilian Portuguese Translations
Allows multilingual research.
61

Front+side RGB, Front Depth & Multi-view RGB
63

Green Studio
Multi-view RGB videos
RGB-D videos
Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., Kanade, T., Nobuhara, S.,
Sheikh, Y.: Panoptic studio: A massively multiview system for social motion capture. In:
ICCV, 2015.
Panoptic Studio
Multi-view VGA and HD videos
64

2D & 3D pose estimation
65

2D & 3D pose estimation
Multi-view RGB videos
Body-face-hands keypoints
2D keypoints estimation from OpenPose [1]
3D keypoints estimation [2]
[1] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei and Y. A. Sheikh, "OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields" in TPAMI, 2019.
[2] Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., Kanade, T., Nobuhara, S., Sheikh, Y.: Panoptic studio: A massively multiview system for social motion capture. In: ICCV, 2015.
Multi-view VGA and HD videos
66
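The 2D body-face-hands keypoints can be read with a few lines of code. This is a minimal sketch, assuming the per-frame JSON layout that OpenPose writes with its --write_json option (flat lists of x, y, confidence values per body part); whether a given dataset release uses exactly this layout is an assumption. The helper names and the all-zero synthetic frame are hypothetical.

```python
import json

# Keys used in OpenPose's per-frame JSON output (one flat list per body part).
PARTS = {
    "body": "pose_keypoints_2d",
    "face": "face_keypoints_2d",
    "left_hand": "hand_left_keypoints_2d",
    "right_hand": "hand_right_keypoints_2d",
}


def to_triples(flat):
    """Group [x0, y0, c0, x1, y1, c1, ...] into (x, y, confidence) tuples."""
    usable = len(flat) - len(flat) % 3  # ignore a trailing incomplete triple
    return [tuple(flat[i:i + 3]) for i in range(0, usable, 3)]


def load_keypoints(frame_json, person=0):
    """Return {part: [(x, y, c), ...]} for one detected person in one frame."""
    people = json.loads(frame_json)["people"]
    if person >= len(people):
        return {part: [] for part in PARTS}  # nobody detected in this frame
    detected = people[person]
    return {part: to_triples(detected.get(key, [])) for part, key in PARTS.items()}


# Synthetic frame (all zeros), only to exercise the parser.
# Sizes assume the BODY_25 model: 25 body, 70 face and 21 keypoints per hand.
fake_frame = json.dumps({"people": [{
    "pose_keypoints_2d": [0.0] * 75,
    "face_keypoints_2d": [0.0] * 210,
    "hand_left_keypoints_2d": [0.0] * 63,
    "hand_right_keypoints_2d": [0.0] * 63,
}]})
keypoints = load_keypoints(fake_frame)
print({part: len(points) for part, points in keypoints.items()})
```

The third value of each triple is the detector's confidence, which is worth keeping: hand keypoints are often low-confidence, and filtering on it is a simple way to spot unreliable frames.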

Dataset hierarchy
68
Camera view
Recording
Video
Clip
Frame
Green studio: Frontal or side
Panoptic: Multi-view
ASL Gloss
English transcription
RGB, Depth
OpenPose
Category
Signer
Studio
Green studio
Panoptic (multi-view)

Dataset statistics
Clips length Sentences length
70

Application: Human motion transfer
71
Ventura, Lucas, Amanda Duarte, and Xavier Giró-i-Nieto. "Can everybody sign now? Exploring sign
language video generation from 2D poses." ECCV 2020 SLRTP Workshop.

72
2D Pose
estimation
[OpenPose]
GAN-
generated
[Everybody
dance now]

73
Ventura, Lucas, Amanda Duarte, and Xavier Giró-i-Nieto. "Can everybody sign now? Exploring sign
language video generation from 2D poses." ECCV 2020 SLRTP Workshop.

74
“Choose one category”
Can ASL signers understand our generated videos?
Skeleton
GAN-generated
Classification
accuracy

75
Skeleton
GAN-generated
Mean Opinion
Score
“How well could you understand the video?”
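Both user-study measures named on these slides, classification accuracy and Mean Opinion Score, reduce to simple averages over the signers' answers. A minimal sketch follows, assuming one chosen category per video and numeric ratings on a 1 to 5 scale (the scale is not stated in the slides). The function names and the responses below are invented purely to exercise the code and are not study results.

```python
from statistics import mean


def classification_accuracy(chosen, truth):
    """Fraction of videos for which the chosen category matches the true one."""
    if len(chosen) != len(truth) or not truth:
        raise ValueError("need two non-empty answer lists of equal length")
    return sum(c == t for c, t in zip(chosen, truth)) / len(truth)


def mean_opinion_score(ratings):
    """Average of the per-video understanding ratings."""
    return mean(ratings)


# Invented answers, NOT data from the study.
chosen_categories = ["hair", "food", "cars", "food"]
true_categories = ["hair", "food", "food", "food"]
ratings = [1, 2, 3, 4, 5]

print(classification_accuracy(chosen_categories, true_categories))  # 0.75
print(mean_opinion_score(ratings))  # 3
```

Computing both measures separately for the skeleton and the GAN-generated videos gives the side-by-side comparison shown on the slides.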

76
“Translate the ASL signs into written English.”
Skeleton
GAN-generated

77
X

78
Yin, Kayo, and Jesse Read. "Better Sign Language Translation with
STMC-Transformer." COLING 2020. [talk]
Moryossef, Amit, Kayo Yin, Graham Neubig, and Yoav Goldberg. "Data
Augmentation for Sign Language Gloss Translation." arXiv 2021.
Generation of gloss pseudo-labels by training a transformer.
Moreno D, Duarte A, Costa-jussà MR, Giró-i-Nieto X.
English to ASL Translator for Speech2Signs. UPC 2018.

79
Renz, Katrin, Nicolaj C. Stache, Samuel Albanie, and Gül Varol. "Sign language segmentation with temporal convolutional
networks." ICASSP 2021.
Sign segmentation in continuous sign language videos.

80
Bull, Hannah, Triantafyllos Afouras, Gül Varol, Samuel Albanie, Liliane Momeni, and Andrew Zisserman. "Aligning Subtitles in Sign
Language Videos." ICCV 2021.
Temporal alignment of automatic ASR subtitles with on-screen sign language video

Outline
81
Motivation
State of the art
Challenges
Conclusion

82
Conclusion: Speech2Signs (and Signs2Speech)
End-to-end translation & production
HI, ME FS-AMELIA WILL
EXPLAIN HOW REMOVE
GUM FROM YOUR HAIR
Speech Language Gloss [1] Sign transcription [2] Video
3D Poses 2D Poses Segments [3]
Multiple vision, natural language & speech challenges for a societally impactful task.
[1] Yin, Kayo, and Jesse Read. "Better Sign Language Translation with STMC-Transformer." COLING 2020.
[2] Hanke, Thomas. "HamNoSys-representing sign language data in language resources and language processing contexts." In LREC, vol. 4, pp. 1-6. 2004.
[3] Renz, Katrin, Nicolaj C. Stache, Samuel Albanie, and Gül Varol. "Sign language segmentation with temporal convolutional networks." ICASSP 2021.

Supported by
Facebook AI
Interested in working with us on SL?
● @DocXavi
● xavier.giro@upc.edu
● Full list of publications & tech reports.
Thank you
These slides &
talk
https://how2sign.github.io/
