AI Sign Language Interpreter

Chad Harthan
7 min read · Dec 10, 2020


Project Overview

This project was worked on by Chad Harthan, Reed Hopkins, Jacob Medeiros, Shania Paul, Josh Kall, and Chris Erattuparambil. The code (linked at the bottom) is a sign language letter interpreter that lets someone who doesn't understand sign language communicate with someone who uses sign language, without using a keyboard.

The interpreter prompts a person to speak into their computer and uses a Google NLP library to convert the speech into a list of characters, as shown in Figure 1. The other part of the project captures pictures from live video and uses a residual neural network (ResNet) to predict which sign language letter each image corresponds to. Figure 2 shows an example of using the camera to form a sign language letter. The translation bot sits between a deaf person who signs and a hearing person who does not understand sign language but can speak: it takes either speech or signed letters and converts them into text, which both parties can understand.

Figure 1: Example Output of Speech to Text
Figure 2: Using the Camera to Create the Letter G in Sign Language
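As a rough illustration of the speech side, the snippet below uses the SpeechRecognition package with Google's recognizer to turn a spoken phrase into a list of characters. This is only a minimal sketch of the idea; the project's actual speech-to-text code may use a different Google library or API.

```python
# Minimal sketch of the speech-to-text step, assuming the
# SpeechRecognition package wrapping Google's recognizer.
import speech_recognition as sr

def speech_to_characters():
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        print("Speak now...")
        audio = recognizer.listen(source)
    # Transcribe the phrase, then split it into characters,
    # mirroring the output shown in Figure 1.
    text = recognizer.recognize_google(audio)
    return list(text.upper())

if __name__ == "__main__":
    print(speech_to_characters())
```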

The Data

We used two datasets to train and test our ResNet. The first is this dataset found on Kaggle, which contains a train folder and a test folder of JPG images of sign language letters. The train folder, shown in Figure 3, contains 29 subfolders with 3,000 images each (26 folders correspond to the letters of the alphabet, and one each to the signed "space," "delete," and "nothing" classes). The test folder contains 28 images of different letters. We chose this dataset for two reasons: the number of images to train on and the variation among them. In total there are 87,000 training images, and the images for each letter have different lighting and hand positions, which makes the ResNet more robust.

The other dataset consisted of images of the 29 classes that we collected with our own camera. We did this to give a more accurate representation of what the model could expect in a real-world testing scenario. We captured 20 images of each class and stored them in a Google Drive folder that we used to train the model. This data helped considerably, and we expect that more data of this kind could help the model distinguish between relatively similar signs.

Figure 3: Structure of the Kaggle Training Images Folder
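For reference, a training folder laid out like Figure 3 (one subfolder per class) can be loaded with a standard Keras image pipeline. The sketch below is an illustration under that assumption, not the project's exact loading code; the folder path, batch size, and validation split are placeholders.

```python
# Sketch of loading the 29-class training folder with Keras;
# the directory name and split are placeholders.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.1)

train_flow = datagen.flow_from_directory(
    "asl_alphabet_train",     # folder containing A-Z, space, delete, nothing
    target_size=(200, 200),   # the Kaggle images are 200x200 pixels
    batch_size=32,
    class_mode="categorical",
    subset="training",
)
val_flow = datagen.flow_from_directory(
    "asl_alphabet_train",
    target_size=(200, 200),
    batch_size=32,
    class_mode="categorical",
    subset="validation",
)
```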

Code Overview

The architecture of the ResNet is built off the following blog on TowardsDataScience.com. The network went through several iterations. One of the challenges we faced was hardware limitations during training: because the training dataset is so large and each picture is fairly high resolution (200x200 pixels), we quickly ran out of RAM. We first tried to fix this by training on one letter at a time, but this created a major bias toward the letters trained on last. We resolved the issue by training the ResNet in batches that included every letter and saving our progress between sessions. We also created different iterations of the ResNet by varying the way we trained it. Before we could get the Final Bot to work, we had to get the ResNet to accurately predict letters from captured live images. Below is the progression of the model; a sketch of the kind of residual block such a network uses follows the list.

Iteration 1: Predicting Z for each image

Iteration 2: Correctly predicting each image from initial testing set

Iteration 3: Training on our own signs using noisy backgrounds

Iteration 4: Overfitting model to non-noisy training images

Iteration 5: Decreasing training time and increasing training set size
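The sketch below shows the kind of basic residual block a ResNet like ours is built from, written in Keras. It follows the general pattern from the TowardsDataScience tutorial rather than our exact architecture, so the filter counts and layer choices are illustrative.

```python
# Illustrative residual block in Keras; layer sizes are placeholders,
# not our exact architecture.
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x
    # If the channel count changes, project the shortcut with a 1x1 conv.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # The skip connection adds the block's input back to its output.
    y = layers.Add()([shortcut, y])
    return layers.Activation("relu")(y)
```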

Predicting Z for each image

Our first approach to batching was to train on 3,000 images at a time, but all of those images came from the same letter folder. We think this produced a model that was extremely biased toward the letters it trained on last, which is why it was only predicting Z. We then switched to taking 103 images from each of the 29 categories (A-Z, nothing, space, delete) in each training session, which is almost 3,000 images in total. Once we also saved the model and its weights properly, the model started to improve.
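A rough sketch of that per-session scheme is below: sample about 103 images from each of the 29 class folders, train on them, and save the weights so the next session picks up where the last one left off. The paths, folder names, and training calls are placeholders rather than our exact code.

```python
# Per-session sampling sketch: ~103 images from each of the 29 folders.
import os
import random
import numpy as np
from PIL import Image

CLASSES = [chr(c) for c in range(ord("A"), ord("Z") + 1)] + ["space", "del", "nothing"]
PER_CLASS = 103

def load_image(path):
    # Resize to the 200x200 input size of the Kaggle images.
    return np.asarray(Image.open(path).resize((200, 200))) / 255.0

def sample_session(train_dir):
    images, labels = [], []
    for idx, name in enumerate(CLASSES):
        class_dir = os.path.join(train_dir, name)
        for fname in random.sample(os.listdir(class_dir), PER_CLASS):
            images.append(load_image(os.path.join(class_dir, fname)))
            labels.append(idx)
    return np.array(images), np.array(labels)

# Illustrative training session (model is a compiled Keras ResNet):
# model.load_weights("resnet_weights.h5")   # resume previous progress
# x, y = sample_session("asl_alphabet_train")
# model.fit(x, y, batch_size=32, epochs=1)
# model.save_weights("resnet_weights.h5")   # save for the next session
```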

Correctly predicting each image from initial testing set

To begin with, we used a large dataset we found online from people who had done a project similar to ours, which came with its own testing set. Once we fixed the problems from the previous iteration, our model achieved 100% accuracy on that provided testing set. However, when we tried to use the model to predict our own signs, it performed extremely poorly, so we decided to start building our own training set.

Training on our own signs using noisy backgrounds

At this point, we each took about 20–30 images of our own signs for each letter to create a new training set. Something we didn't pay attention to, however, was our image backgrounds. It turns out that complex backgrounds add a lot of noise to the images, making it harder for the model to learn the underlying structure of the ASL alphabet. To fix this, we standardized on a light, single-color background.

Overfitting model to non-noisy training images

Here, we each took another 20 images for each letter, this time against a simple background, and saw significant improvements in both training and testing. During training, the model reached a higher accuracy after fewer epochs, which seemed like a good sign at the time (although, as the next iteration shows, it hinted at overfitting). During testing, accuracy increased by roughly 1–10%, but still not enough to call the project a success. We thought the issue was simply a lack of training data, so we submitted more images. When the problem didn't go away, we decided to pivot our approach.

Decreasing training time and increasing training set size

After researching what the different hyperparameters control during neural network training, we discovered that training for too many epochs on an insufficient amount of data can easily lead to overfitting. To fix this, we started a new model from scratch, trained on all the data we had, and decreased the number of epochs by a factor of 10. Additionally, Josh took 100 pictures for each letter to create a much larger training set. With these changes, our model finally achieved a high testing accuracy. The extra images from Josh also explain why the model predicts his signs much better than anyone else's.

Sign Language Image Capture

We first worked in Jupyter Notebooks to develop our image capture code for the sign language letters. We decided to capture only individual letters because words in sign language are motions, and we did not want to push video through a neural network. We started from a pre-built example that took a picture every frame once turned on, and modified it to take pictures at a longer interval so the user has time to change letters. We also drew a white box on the video feed for the user to center their hand in, making the images easier for the bot to interpret. Finally, we added a countdown to the video stream so the user knows when the next photo will be taken.
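A simplified version of that capture loop, assuming OpenCV, is sketched below: a white guide box and a countdown are drawn on every frame, and a snapshot is cropped from the box every few seconds so the user can switch letters. The interval, box coordinates, and quit key are illustrative.

```python
# Sketch of the timed webcam capture loop with a guide box and countdown.
import time
import cv2

CAPTURE_INTERVAL = 5  # seconds between snapshots (illustrative value)

def capture_letters():
    cap = cv2.VideoCapture(0)
    last_capture = time.time()
    snapshots = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # White box where the user should center their hand.
        cv2.rectangle(frame, (100, 100), (300, 300), (255, 255, 255), 2)
        # Countdown until the next snapshot is taken.
        remaining = max(0, int(CAPTURE_INTERVAL - (time.time() - last_capture)))
        cv2.putText(frame, str(remaining), (30, 60),
                    cv2.FONT_HERSHEY_SIMPLEX, 2, (255, 255, 255), 3)
        cv2.imshow("sign capture", frame)
        if time.time() - last_capture >= CAPTURE_INTERVAL:
            snapshots.append(frame[100:300, 100:300])  # crop to the guide box
            last_capture = time.time()
        if cv2.waitKey(1) & 0xFF == ord("q"):          # press q to stop
            break
    cap.release()
    cv2.destroyAllWindows()
    return snapshots
```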

When moving our final solution to Google Colab, we could not implement this exact approach: Colab streams video differently, so we were unable to overlay our own graphics on the feed. We also changed the code so that, instead of taking pictures automatically, the user presses a key to capture an image, because the video feed was continuous and we could not isolate it frame by frame.

Final Bot

We created the bot with a simple driver menu:

1. Convert Sign Language to Text
2. Convert Speech to Text
3. Quit Application

By pressing 1 on the keyboard, the image capture code starts running and the deaf user can input their sign language letters. As each letter is captured, it is fed into our residual neural network, and the full phrase is then printed to the screen for the other person to read.

After this, the second person can press 2 on the keyboard and speak. Their speech is translated into text so the deaf individual can read what was said. Conversation continues in this fashion until one user presses 3, quitting the application.
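The driver is essentially a loop over that menu. The sketch below passes the two pipelines in as callables (for example, the speech and capture sketches earlier in the post); the function names are placeholders, not our exact code.

```python
# Simple driver loop over the three menu options; sign_to_text and
# speech_to_text stand in for the capture+ResNet and speech pipelines.
def run_bot(sign_to_text, speech_to_text):
    while True:
        choice = input(
            "1. Convert Sign Language to Text\n"
            "2. Convert Speech to Text\n"
            "3. Quit Application\n> "
        )
        if choice == "1":
            print(sign_to_text())    # capture signs and classify each letter
        elif choice == "2":
            print(speech_to_text())  # record speech and transcribe it
        elif choice == "3":
            break
```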

Future Improvements

At the moment, our bot works reliably only for the person who contributed the most additional training images, which is Josh. We would like to gather many more real-world training samples so that the bot can predict anybody's sign language letters with ease.

In addition, the translation system is currently quite slow, which makes it difficult to use in a real application. We believe we could fix this by taking snapshots of the video feed in quick succession and only registering a letter when the predicted probability of that letter exceeds a specified threshold. This would make the bot feel more like a real-time rendering of sign language.
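One way this thresholding could look, as a rough sketch: run each snapshot through the model and only accept a letter when the top softmax probability clears a cutoff. The threshold value and function names here are hypothetical.

```python
# Sketch of the proposed confidence threshold for near-real-time capture.
import numpy as np

THRESHOLD = 0.9  # hypothetical cutoff

def maybe_register(model, frame, classes):
    # model is a trained Keras classifier; frame is a preprocessed image.
    probs = model.predict(frame[np.newaxis, ...])[0]
    best = int(np.argmax(probs))
    if probs[best] >= THRESHOLD:
        return classes[best]  # confident enough to register this letter
    return None               # otherwise wait for the next snapshot
```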

Another way to make this project more applicable to the real world would be to use video as data so we could capture whole words in sign language. This is much harder because video is more difficult to process, and it would require a different type of neural network.

Links to Code

Colab: https://colab.research.google.com/drive/19gbzbiffEYESSz0wsv4-XR1mD4PfptqR?usp=sharing

Github: https://github.com/j-ka11/ASL_translator_bot.git
