
Sign language recognition with RNN and Mediapipe


Gesture recognition with deep learning

Published in: Engineering


  1. Sign Language Translator: gesture recognition using deep learning. Team Argo: Jongwook Kim, JiHyun Kim.
  2. 01. Data preprocess: convert gesture video to training data.
  3. How can we recognize a hand in video? Existing approach (silhouette with OpenCV): (1) use OpenCV to get the hand silhouette for each frame; (2) retrain a CNN on the silhouette images; (3) train an RNN on the sequence of processed image data for each word. Limits: too blunt, and a 2D approach.
  4. Google released Hand Tracking, a real-time hand skeleton tracking framework that works well on mobile. Hand Tracking is based on Mediapipe, a graph-based framework for building multimodal applied ML pipelines.
  5. What is Hand Tracking? Google's Hand Tracking automatically finds the hand and draws its skeleton on the screen in real time. Can we use Hand Tracking for our ML model?
  6. Making training video data, step 1: record videos for each sign language word. Example: word <Sorry> (0:01~0:02) x 50 videos; word <Yes> (0:01~0:02) x 50 videos.
  7. Making training video data, step 2: save the videos for each word in one folder named after the word. Example: 50 videos per word x 8 words = 400 videos.
  8. Let's try the OpenCV way using a CNN.
  9. Our new approach: Mediapipe + CNN. We extracted the skeleton from bare hands. Advantages: no additional device is required (e.g. colored gloves or background), so it is convenient; hand features are more detailed, so granularity is finer.
  10. But... RNN training results: we retrained Inception v3 on these picture frames and used the result as RNN input. About 12 words took more than 8 hours to train. Conclusion: using every picture frame as RNN input is still too heavy. Is there a better way?
  11. Let's use landmarks instead of the skeleton silhouette.
  12. Where are the 21 landmarks? Let's extract them into a txt file.
  13. What we need to do (customize): (1) Extract 42 landmark values (21 * 2) for each frame and combine them into one txt file per video: find the corresponding file and modify the Mediapipe code. (2) Use video input instead of the webcam to make our data (Mediapipe is optimized for real-time detection, but we wanted to use it to create a data set from video input): add input_video_path and output_video_path for the bazel build on the command line. (3) Make the output data ready to load into the RNN model: (a) extract a txt file for each video (one word) x number of videos; (b) combine the txt files and labels for every word into one .pkl file. A Python shell script does this extraction automatically.
  14. Use this way, step 1: extract the 42 landmarks in Mediapipe. Mediapipe does not provide a file that automatically extracts landmarks; landmarks are only used as intermediate values inside the graph pipeline. Modified code >> download uitil/
  15. Use this way, step 2: input your own training video data and build. The default input for Mediapipe is the webcam, so use our Python shell script to automatically extract the processed mp4 videos and txt data files.
  16. Thank you, Google's Mediapipe team.
  17. Use this way, step 3: make the .pkl file for RNN input. Only the txt file data is needed, so pass the txt data path as [INPUT_PATH] and this will create the pickle file. A pickle file (a saved Python object file) is used to process a lot of data at once. The output file train_data.pkl will be used as RNN input.
  18. Open source / library / platform. Building Mediapipe: Bazel. LSTM model: Keras, TensorFlow. Data visualization: pyplot, Excel. Data augmentation: iMovie.
  19. 02. LSTM model architecture: build the LSTM model and test.
  20. Training data set: build a data set for the machine learning model. Sign language word categories: Apple, Bird, Blue, Cents, Child, Cow, Drink, Green, Hello, Like, Me too, No, Orange, Pig, Sorry, Thank you, Where, Who, Yes, You. Each word needs at least 50 videos (20 categories * 50 = 1000 recorded videos). First 100 ASL words in one hand.
  21. Run Mediapipe with bazel build for each word directory. Make a Python script that automatically reads the directory files and runs Mediapipe. Directory name = sign language word name.
  22. Convert video into mp4 and txt files with Mediapipe: the video frame images become mp4 files with hand tracking (default) and txt files with extracted features (modified). Open source used: Bazel (software build and test automation).
  23. How to convert these numbers into LSTM input?
  24. LSTM model input shape: input_dim = 42 → landmarks; timesteps = 100 → preset frame count; batch_size = 32. Absolute position; normalization.
  25. Frame count distribution in word videos. Timestep decision (for word videos): frame counts do not exceed 100 → preset value; shorter videos get zero padding. [Figure: statistical graph of the frame distribution]
  26. Label the training set: each word sample gets a one-hot label, [0 1 0 0 0 0 ...], [0 0 1 0 0 0 ...], [0 0 0 1 0 0 ...], ..., [0 0 ... 0 0 1]. C = 20 → one-hot encoding.
  27. Dataset train / validation / test split: train (50%), validation (30%), test (20%).
  28. LSTM model: the input is the absolute position change in the word video (a sequence of hand landmarks), which the LSTM turns into a prediction. Loss function: categorical cross entropy. Optimizer: RMSprop.
  29. LSTM model: LSTM (1) → LSTM (2) → LSTM (3) → softmax function. Input shape (100, 42); batch size 32. Loss function: categorical cross entropy. Optimizer: RMSprop. Value shown on the slide: 0.85632.
  30. LSTM model.
  31. Accuracy (current).
  32. Segmentation: how to segment a gesture sequence?
  33. Segmentation reference: Kim Jung Bae, "Study on the continuous hand gesture recognition system for the Korean sign language".
  34. Segmentation: how to segment a gesture sequence?
  35. Segmentation: how to segment a gesture sequence?
  36. GitHub repository link.
  37. What is next? Improve our RNN model and import it into iOS.
  38. Thanks. Does anyone have any questions? Team Argo: Jongwook Kim, Anna Kim.
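
Slides 13 and 17 describe combining the per-video txt files into one train_data.pkl. A minimal sketch of that step could look like the following. It assumes one folder per word, one txt file per video, and one line per frame holding whitespace-separated numbers. The function names and the dict layout are hypothetical; the team's actual script may differ.

```python
import pickle
from pathlib import Path


def read_landmark_file(path):
    """Parse one per-video text file into a list of frames (lists of floats).

    Assumption: each non-blank line is one frame of whitespace-separated values.
    """
    frames = []
    for line in Path(path).read_text().splitlines():
        values = [float(v) for v in line.split()]
        if values:  # skip blank lines
            frames.append(values)
    return frames


def build_pickle(input_path, output_file="train_data.pkl"):
    """Collect every <word>/<video>.txt under input_path into one pickle file.

    The directory name is used as the label, as on slide 21.
    """
    samples = []
    word_dirs = sorted(p for p in Path(input_path).iterdir() if p.is_dir())
    for word_dir in word_dirs:
        for txt in sorted(word_dir.glob("*.txt")):
            samples.append({"word": word_dir.name,
                            "frames": read_landmark_file(txt)})
    with open(output_file, "wb") as fh:
        pickle.dump(samples, fh)
    return samples
```

Calling `build_pickle("[INPUT_PATH]")` with the real txt folder would then write the pickle file next to the script.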
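
Slides 24 and 25 fix the LSTM input at 100 timesteps of 42 values and fill shorter videos with zeros. The sketch below shows one way to do that in plain Python. Padding at the end of the sequence and cutting off any frames past 100 are assumptions; the slides only say that frame counts do not exceed 100 and that zero padding is used.

```python
TIMESTEPS = 100  # preset frame count from slide 24
INPUT_DIM = 42   # landmark values per frame from slide 24


def pad_frames(frames, timesteps=TIMESTEPS, input_dim=INPUT_DIM):
    """Return exactly `timesteps` rows of `input_dim` floats.

    Assumptions: zero rows are appended at the end, and any frames beyond
    `timesteps` are dropped.
    """
    for row in frames:
        if len(row) != input_dim:
            raise ValueError(f"expected {input_dim} values, got {len(row)}")
    kept = [list(row) for row in frames[:timesteps]]
    padding = [[0.0] * input_dim for _ in range(timesteps - len(kept))]
    return kept + padding
```

Every padded video then has the same (100, 42) shape, so a batch of 32 videos stacks into a (32, 100, 42) input.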
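
Slides 26 and 27 label each sample with a one-hot vector over the 20 words and split the data 50% / 30% / 20%. A rough sketch of both steps follows. The class order (the word list as printed on slide 20), the shuffling, and the seed are assumptions that the slides do not specify.

```python
import random

# The 20 word categories listed on slide 20; their order here is an assumption.
WORDS = ["Apple", "Bird", "Blue", "Cents", "Child", "Cow", "Drink", "Green",
         "Hello", "Like", "Me too", "No", "Orange", "Pig", "Sorry",
         "Thank you", "Where", "Who", "Yes", "You"]


def one_hot(word, classes=WORDS):
    """Return a list with a single 1 at the index of `word`."""
    vector = [0] * len(classes)
    vector[classes.index(word)] = 1
    return vector


def split_dataset(samples, train_pct=50, val_pct=30, seed=0):
    """Shuffle, then split into train / validation / test parts.

    Whatever remains after train and validation becomes the test set
    (20% with the default percentages).
    """
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

With the 1000 recorded videos this gives 500 training, 300 validation, and 200 test samples.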
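
Slides 28 and 29 describe three stacked LSTM layers followed by a softmax, trained with categorical cross entropy and RMSprop on input of shape (100, 42). One way to write that down in Keras is sketched here. The layer width of 64 units is an assumption, since the slides give no layer sizes, and this is not the team's actual model code.

```python
INPUT_SHAPE = (100, 42)  # (timesteps, input_dim) from slide 29
NUM_CLASSES = 20         # C = 20 from slide 26
BATCH_SIZE = 32          # from slide 29
LSTM_UNITS = 64          # assumption: the slides do not state layer widths


def build_model(units=LSTM_UNITS):
    """Build three stacked LSTM layers with a softmax classifier on top."""
    # Imported here so the constants above are usable without TensorFlow.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=INPUT_SHAPE),
        # The first two layers return full sequences so the next LSTM
        # receives one vector per timestep.
        layers.LSTM(units, return_sequences=True),
        layers.LSTM(units, return_sequences=True),
        layers.LSTM(units),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy",
                  optimizer="rmsprop",
                  metrics=["accuracy"])
    return model
```

Training would then pass the padded landmark sequences and one-hot labels to `model.fit` with `batch_size=BATCH_SIZE`.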