Although HamNoSys uses more general handshapes to describe the DEZ parameter, we were interested in recognising the handshapes of a particular sign language, Danish Sign Language, for which a public dataset is available. The dataset consists of isolated videos of people signing one sign per video. In addition, the dataset includes an XML file that provides, among other information, the handshapes in each video. The annotation is therefore weak: the XML file specifies a single overall handshape for the whole video when a sign has only one handshape. Unfortunately, the annotation does not specify when a specific handshape begins and ends within a video.
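As a first step, the video-level handshape labels are read from the annotation file. The sketch below is a minimal example; the tag and attribute names (`video`, `file`, `handshape`) are assumptions about the XML layout, not the dataset's actual schema.

```python
import xml.etree.ElementTree as ET

def load_annotations(xml_path):
    """Map each video file name to its annotated handshape label.

    Assumes a hypothetical layout with one <video> element per clip,
    carrying "file" and "handshape" attributes; adapt the tag and
    attribute names to the dataset's actual schema.
    """
    labels = {}
    for video in ET.parse(xml_path).getroot().iter("video"):
        labels[video.get("file")] = video.get("handshape")
    return labels
```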
We were interested in finding out which visual features of the dataset and which machine learning algorithm produce the best handshape recognition results. The figure below shows the four datasets that were generated: 1) raw images, cropped around the handshape, 2) human skeleton features returned by the OpenPose library, 3) distances between the human skeleton features, and 4) black-and-white images of the handshape skeleton. A sketch of how the distance features can be derived from the OpenPose output follows.
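The sketch below shows one plausible way to turn OpenPose hand keypoints into the distance features of dataset 3; the pairwise-distance formulation is an assumption about how "distances between the skeleton features" were computed.

```python
import itertools
import numpy as np

def keypoint_distances(hand_keypoints):
    """Pairwise Euclidean distances between OpenPose hand keypoints.

    hand_keypoints: array of shape (21, 3) with (x, y, confidence)
    triples, the format produced by OpenPose's hand detector.
    Returns a vector of 21 * 20 / 2 = 210 distances per frame.
    """
    xy = np.asarray(hand_keypoints)[:, :2]  # drop the confidence column
    pairs = itertools.combinations(range(len(xy)), 2)
    return np.array([np.linalg.norm(xy[i] - xy[j]) for i, j in pairs])
```

Pairwise distances are invariant to where the hand sits in the frame, which is one reason to prefer them over raw keypoint coordinates.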
Since the dataset is weakly labelled, we had to filter out irrelevant frames: for example, frames in which the hand was outside the camera view, and frames in which the hand was moving into position to execute the sign (known as movement epenthesis). Our approach to filtering out the epenthesis followed the method in [1]. As a result, the method produced segmentation points for every video, and only the frames between the segmentation points were taken to contain the annotated handshape for that video.
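A minimal sketch of this frame selection, under stated assumptions: the segmentation points come from the epenthesis-filtering step above, and the `min_conf` threshold for dropping frames where OpenPose could not find the hand is an assumed value, not one taken from [1].

```python
import numpy as np

def select_labelled_frames(frames, keypoints, start, end, min_conf=0.2):
    """Keep frames between the segmentation points whose hand keypoints
    were actually detected; when the hand is out of view, OpenPose
    returns near-zero confidences, so those frames are dropped.
    """
    kept = []
    for i in range(start, end):
        if np.asarray(keypoints[i])[:, 2].mean() >= min_conf:
            kept.append(frames[i])
    return kept
```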
The machine learning methods used to recognise the data in the generated datasets were: 1) InceptionV3 pre-trained on ImageNet, 2) Decision Tree, Random Forest, MLP, and kNN, 3) the same four classifiers (Decision Tree, Random Forest, MLP, and kNN), and 4) 2- and 3-layer CNNs, as well as the same architectures pre-trained on the MNIST dataset. A training sketch for the skeleton-feature classifiers follows.
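The sketch below shows how the classical classifiers of datasets 2 and 3 can be trained with scikit-learn, using the Random Forest as an example; the 80/20 split, `n_estimators=100`, and the flattened-keypoint input format are assumptions, not hyperparameters reported for this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def train_handshape_classifier(X, y):
    """X: one row of flattened OpenPose keypoints (or keypoint
    distances) per frame; y: the handshape label each frame inherits
    from its video-level annotation. Returns the fitted model and
    its accuracy on a held-out test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, np.asarray(y), test_size=0.2, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    return clf, accuracy_score(y_test, clf.predict(X_test))
```

Swapping in `DecisionTreeClassifier`, `MLPClassifier`, or `KNeighborsClassifier` reproduces the rest of the classical-classifier comparison.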
The results in the image below show that the most effective algorithm for recognising the handshapes is the Random Forest on the raw OpenPose features, giving ~90% accuracy on the test set, which comprises 13 handshapes as described in [2].
https://github.com/mocialov/HandshapeRecogniser
[1] Mocialov, B., et al. "Towards Continuous Sign Language Recognition with Deep Learning."
[2] Koller, O., et al. "Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data Is Continuous and Weakly Labelled."