Presentation given at AVI 2012, International Working Conference on Advanced Visual Interfaces, Capri Island, Italy, May 2012
ABSTRACT: We present SpeeG, a multimodal speech- and body gesture-based text input system targeting media centres, set-top boxes and game consoles. Our controller-free zoomable user interface combines speech input with a gesture-based real-time correction of the recognised voice input. While the open source CMU Sphinx voice recogniser transforms speech input into written text, Microsoft’s Kinect sensor is used for the hand gesture tracking. A modified version of the zoomable Dasher interface combines the input from Sphinx and the Kinect sensor. In contrast to existing speech error correction solutions with a clear distinction between a detection and correction phase, our innovative SpeeG text input system enables continuous real-time error correction. An evaluation of the SpeeG prototype has revealed that low error rates for a text input speed of about six words per minute can be achieved after a minimal learning phase. Moreover, in a user study SpeeG has been perceived as the fastest of all evaluated user interfaces and therefore represents a promising candidate for future controller-free text input.
Paper: http://vub.academia.edu/BeatSigner/Papers/1484787/SpeeG_A_Multimodal_Speech-_and_Gesture-based_Text_Input_Solution
19. Evaluation
7 (male) users: 23-31y
Sentences 1-3 (DARPA’s TIMIT):
- “this was easy for us”
- “he will allow a rare lie”
- “did you eat yet”
Sentences 4-6 (MacKenzie and Soukoreff):
- “my watch fell in the water”
- “the world is a stage”
- “peek out the window”
Performed a quantitative (words per minute and number of errors) and qualitative (feedback and preference) evaluation
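The two quantitative measures can be sketched as follows. This is a minimal illustration, not the scoring procedure used in the study: it assumes the common text-entry convention of five characters per word for WPM, and it assumes errors are counted as a word-level edit distance against the reference sentence. The function names `words_per_minute` and `word_errors` are hypothetical.

```python
def words_per_minute(transcribed: str, seconds: float, chars_per_word: int = 5) -> float:
    """Text-entry speed, assuming the common convention that one word
    equals `chars_per_word` characters and that the first character
    carries no timing information (hence the -1)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (len(transcribed) - 1) * 60.0 / (seconds * chars_per_word)


def word_errors(reference: str, transcribed: str) -> int:
    """Assumed error measure: word-level Levenshtein distance, i.e. the
    minimum number of word insertions, deletions and substitutions
    needed to turn the transcription into the reference."""
    ref = reference.split()
    hyp = transcribed.split()
    # prev[j] holds the distance between the processed reference prefix
    # and the first j words of the transcription.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing in transcription
                            curr[j - 1] + 1,      # extra word in transcription
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Sentence 4 entered in 60 seconds (the timing is made up for illustration).
    print(words_per_minute("my watch fell in the water", 60.0))
    # Sentence 5 with one substituted and one missing word.
    print(word_errors("the world is a stage", "the word is stage"))
```

Counting errors at word level is a design choice made here for simplicity; a character-level distance would work the same way with `list(reference)` in place of `reference.split()`.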
Vrije Universiteit Brussel SpeeG - Lode Hoste 19
20. Virtual keyboard
6.3 WPM
[Figure: WPM (y-axis, 0-10) per sentence (x-axis, S1-S6) for Users 1-7]
21. Kinect Keyboard
1.83 WPM
[Figure: WPM (y-axis, 0.00-3.50) per sentence (x-axis, S1-S6) for Users 1-7; User 7 is marked with an asterisk]
22. Speech-only
11 WPM
[Figure: WPM (y-axis, 0-40) per sentence (x-axis, S1-S6) for Users 1-7; speech recogniser: CMU Sphinx 4]
23. SpeeG
5.8 WPM
[Figure: WPM (y-axis, 0-10) per sentence (x-axis, S1-S6) for Users 1-7]
24. SpeeG
2.6 7.8 WPM
[Figure: WPM (y-axis, 0-10) per sentence (x-axis, S1-S6) for Users 1-7]
25. Mean WPM per sentence and input device
Input devices: Virtual Keyboard for Xbox, 1D Keyboard for Xbox, Speech-only, SpeeG
SpeeG components: User, GUI (JDasher), Speech Recogniser (CMU Sphinx 4), Hand Tracking (Microsoft Kinect and NITE)
[Figure: mean WPM (y-axis, 0-25) per sentence (x-axis, S1-S6); legend: Controller, Speech only, Kinect only, SpeeG]
26. Errors per sentence and input device
Input devices: Virtual Keyboard for Xbox, 1D Keyboard for Xbox, Speech-only, SpeeG
SpeeG components: User, GUI (JDasher), Speech Recogniser (CMU Sphinx 4), Hand Tracking (Microsoft Kinect and NITE)
[Figure: mean number of errors (y-axis, 0-10) per sentence (x-axis, S1-S6); legend: Controller, Speech only, Kinect only, SpeeG]
30. SpeeG: A Multimodal Speech- and Gesture-based Text Input Solution
Lode Hoste, Bruno Dumas, Beat Signer
Kinect
- Controller-free text input
- Real-time correction
- Dasher, zoomable interface
- probabilities
- alphabetic order
- character-level
Speech
- Non-native speakers
- Untrained voice recogniser
- 6-12 WPM
- Perceived fastest
- Game-like character
- Novice and experts
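The Dasher keywords above (probabilities, alphabetic order, character-level) can be illustrated with a small sketch. It assumes the general Dasher principle that each candidate next character gets a slice of the available space proportional to its probability, with slices stacked in alphabetical order. It is not taken from the SpeeG or JDasher code; the function name `dasher_intervals` and the example probabilities are hypothetical.

```python
from typing import Dict, Tuple


def dasher_intervals(probabilities: Dict[str, float]) -> Dict[str, Tuple[float, float]]:
    """Split the unit interval [0, 1) among candidate next characters.

    Characters are laid out in alphabetical order and each one receives
    a slice whose height is proportional to its probability, so likely
    characters become large, easy-to-hit targets.
    """
    total = sum(probabilities.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    intervals: Dict[str, Tuple[float, float]] = {}
    lower = 0.0
    for symbol in sorted(probabilities):  # alphabetical order
        upper = lower + probabilities[symbol] / total
        intervals[symbol] = (lower, upper)
        lower = upper
    return intervals


if __name__ == "__main__":
    # Hypothetical character-level probabilities, e.g. derived from
    # competing speech hypotheses for the next letter.
    for symbol, (low, high) in dasher_intervals({"t": 0.5, "a": 0.25, "w": 0.25}).items():
        print(symbol, low, high)
```

Zooming into one slice and repeating the same split for the following character gives the character-level, zoomable navigation the slide refers to.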
Special thanks to Jorn De Baerdemaeker and Keith Vertanen