Sign Language Recognition Performance

Isolated Signs:

Data Set | Features | Method | Accuracy (%)
MS-ASL | Raw Frames | Re-Sign | 14.69
DEVISIGN | Sparse Coding | SVM | 64
SMILE | Hand Movement + Hand Shape | HMM | 66.8
DEVISIGN | 3D Coordinates | iRDML | 56.85
CSL Cyberglove | Phonological Parameters | Fuzzy Decision Tree | 91.6
ASLLVD | Raw Frames | CNN + RNN | 91
A3LIS-147 | Hand Centroids, Non-Manual Features, Hand Orientation | HMM | 48.06
NGT | Pose Key-points, Non-Manual Features | Stacked LSTMs | 80

Continuous Signing:

Data Set | Features | Method | WER (%)
SIGNUM | Raw Frames | CNN+HMM | 7.4
ATIS | Raw Frames | Search + Rescoring | 45.1
RWTH-BOSTON-104 | Raw Frames | HMM + GMM + Covariance Matrix + LM | 17
RWTH-PHOENIX-Weather 2014 | Raw Frames | CNN-LSTM-HMM | 26

Various Sign Language Datasets

L1 - native signers, L2 - second-language signers
I - isolated, C - continuous
ID-Gloss - lemmatised gloss
Name | Country | L1/L2 | Lexicon | I/C | Data Type | Annotation
Ordbog over Dansk Tegnsprog | Denmark | L1 | 1600 | I | Video | Gloss, Phonological Parameters
POLYTROPON | Greece | - | 2000 | C | RGB-D | Gloss, Translation, Phonological Parameters
ASL Signbank | USA | L1 | 1000 | I | Video | ID-Glosses, Phonological Parameters, Relationships
LESCO Corpus | Costa Rica | - | 960 | C | Video | ID-Gloss, Translation, Phonological Parameters
Cologne Corpus | Germany | L2 | 281 | C | Video | ID-Glosses, Translation, Phonological Parameters
SSLC | Sweden | L1 | - | C | Video | ID-Glosses
NGT | Netherlands | L1 | 3900 | C | Video | ID-Gloss, Translation
BSLCP | UK | L1 | 1800 | I&C | Video | ID-Gloss, Translation
RWTH-PHOENIX-Weather 2014 | Germany | L1 | 1558 | C | Video | Gloss
MS-ASL | USA | - | 1000 | I | Video | Gloss
DEVISIGN | China | - | 4414 | I | Video | Gloss
SMILE | Switzerland | mix | 100 | I | RGB-D | Gloss
CSL Cyberglove | China | - | 5113 | I | Glove Data | Gloss, Phonological Parameters
SIGNUM | Germany | L1 | 450 | I&C | Video | Gloss
ATIS | mix | - | 400 | C | Video | Gloss
RWTH-BOSTON-104 | USA | - | 104 | C | Video | Gloss
ASLLVD | USA | L1 | 3300 | I | Video | Gloss, Phonological Parameters, Epentheses
A3LIS-147 | Italy | - | 147 | I | Video | Gloss
WLASL | USA | - | 2000 | I | Video | Gloss, start/end frames per sign, signer bounding box

Comparing Sign Languages using Phonological Parameters

Comparing sign languages can tell us whether different languages have commonalities (language universals). The tables below show three songs (by Bieber, Beyoncé, and Rihanna), interpreted in two sign languages (ASL and Libras), and the relative frequencies of the location and orientation combinations for each video and each sign language. We can observe that:
  • Libras
    • has less abdomen activity than ASL (indicated in light blue)
    • has more neck and ear activity than ASL (dark blue and green respectively)
  • Both sign languages show more of the upward-pointing hand direction than the other possible directions

Sign Language Parameters Correlation

Parameters:
  • Handedness
  • Hand orientation (ORI)
  • Hand location (TAB)

Distributions:


Overall ORI/TAB Correlation
right hand: 0.30, left hand: 0.35

Question:
Which ORI/TAB combinations have significant correlations?



https://colab.research.google.com/drive/1gsi8CVA-UJdNrttVjpXaMjs03vOLQIfh
https://stats.stackexchange.com/questions/429685/rxc-contingency-table-to-2x2-tables-for-local-correlation-analysis/430755#430755
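A minimal sketch of the kind of analysis behind the links above (the frame-level data and the category names are hypothetical): the overall ORI/TAB association can be measured with Cramér's V over the full contingency table, and local associations for individual ORI/TAB combinations can be obtained by collapsing the table into 2x2 sub-tables, as discussed in the linked Stack Exchange answer.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical input: one row per annotated frame with the hand
# orientation (ORI) and hand location (TAB) categories.
df = pd.DataFrame({
    "ORI": ["up", "up", "down", "towards_body", "up", "down"],
    "TAB": ["neck", "chest", "abdomen", "chest", "neck", "abdomen"],
})

# Global association: Cramér's V over the full r x c contingency table.
table = pd.crosstab(df["ORI"], df["TAB"])
chi2, p, _, _ = chi2_contingency(table)
n = table.values.sum()
r, c = table.shape
cramers_v = np.sqrt(chi2 / (n * (min(r, c) - 1)))
print(f"Cramér's V = {cramers_v:.2f} (p = {p:.3f})")

# Local association: collapse the table into a 2x2 table for every
# ORI/TAB combination and compute the phi coefficient for each.
for ori in table.index:
    for tab in table.columns:
        a = table.loc[ori, tab]            # ORI and TAB
        b = table.loc[ori].sum() - a       # ORI, not TAB
        c_ = table[tab].sum() - a          # not ORI, TAB
        d = n - a - b - c_                 # neither
        denom = np.sqrt((a + b) * (c_ + d) * (a + c_) * (b + d))
        phi = (a * d - b * c_) / denom if denom else 0.0
        print(f"{ori:>14} / {tab:<8} phi = {phi:+.2f}")
```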

Extending FRCNN with Nested Classes

The Faster R-CNN (FRCNN) Keras model performs both object localisation and classification given a raw image [1].
There are situations where we would like not only multi-object localisation and classification on an image, but also nested classification.
For example, having found a dog in an image, we would like to know whether the dog is facing left or right without introducing extra top-level labels such as "dog facing right" and "dog facing left". This reduces the number of labels. If we have dog and cat images and three directions the animals can face, then instead of the six labels
  • dog left
  • dog right
  • dog up
  • cat left
  • cat right
  • cat up
we would have the five labels
  • dog
  • cat
  • left
  • right
  • up
As a rule of thumb, the fewer labels and classes the model has to distinguish, the better it performs.

The current FRCNN implementation reuses the convolutional feature maps returned by the pre-trained base network (e.g. VGG) after bounding boxes have been identified and classified in the region proposal part of the network. We can exploit this by adding nested labels in the region-based convolution part of the network.
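A hedged sketch of this idea (not the actual keras-frcnn code; the RoI feature shape, label sets, and layer sizes are assumptions): the RoI-pooled features are shared between the usual object-class head and an additional nested-label head, so each detected region receives both a top-level class and a nested label such as the facing direction. During training, each ground-truth box simply carries two labels (e.g. dog + left), so no extra proposals are needed for the nested head; the bounding-box regression head is omitted for brevity.

```python
from tensorflow.keras import layers, Model

NUM_ROIS = 32      # proposals per image (assumption)
POOL_SIZE = 7      # RoI pooling output size (assumption)
CHANNELS = 512     # VGG feature-map depth
N_CLASSES = 3      # e.g. dog, cat, background
N_NESTED = 3       # e.g. left, right, up

# RoI-pooled features coming out of the region proposal stage.
roi_features = layers.Input(shape=(NUM_ROIS, POOL_SIZE, POOL_SIZE, CHANNELS))

# Shared fully connected layers applied to every RoI.
x = layers.TimeDistributed(layers.Flatten())(roi_features)
x = layers.TimeDistributed(layers.Dense(4096, activation="relu"))(x)
x = layers.TimeDistributed(layers.Dense(4096, activation="relu"))(x)

# Head 1: the usual top-level object class (dog / cat / background).
cls_out = layers.TimeDistributed(
    layers.Dense(N_CLASSES, activation="softmax"), name="class_head")(x)

# Head 2: the nested label (left / right / up), reusing the same features.
nested_out = layers.TimeDistributed(
    layers.Dense(N_NESTED, activation="softmax"), name="nested_head")(x)

model = Model(inputs=roi_features, outputs=[cls_out, nested_out])
model.compile(optimizer="adam",
              loss={"class_head": "categorical_crossentropy",
                    "nested_head": "categorical_crossentropy"})
model.summary()
```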

How is this useful for automated sign language processing?
Such end-to-end networks can be used to localise the hands in an image and then detect the handshape, orientation, movement, and location classes for each hand in a single pass of the image through the network.

https://github.com/mocialov/FRCNN_multiclass

Building InMoov Robot for Sign Language Production

While many commercial humanoid robots are available, the majority would not satisfy Objective 1: they either have simpler hands with fixed fingers or a different type of gripper (e.g. Atlas, Enon), or fingers that all move at the same time (e.g. Nao, Pepper). There are robots with dexterous hands (e.g. iCub), but they are expensive and have capabilities beyond the scope of this project.
We propose to build a custom robot, InMoov, shown in the figure below, with capabilities that are sufficient for signing. The robot is an open-source project, with all models available for modification.

The ability of the robot to assume specific handshapes depends on whether its hands can take configurations that resemble all or most of the handshapes of the sign language it will be signing in. InMoov should be capable of executing about half of the 63 available handshapes in British Sign Language (handshapes shown below). In particular, the InMoov hand will not be able to assume handshapes where the fingers are held close together (i.e. showing the palm with the fingers not spread in different directions), because the InMoov hand does not have degrees of freedom for moving the `metacarpal' bones (links) of the fingers sideways. The same applies to the handshapes described in the HamNoSys documentation [1].

Linguistic Visual Feature Vector

Signs in sign languages can be described using parameters such as 1) DEZ - shape of the hand, 2) ORI - orientation of the hand, 3) SIG - movement of the hand, 4) TAB - location of the hand, and 5) NMS - non-manual signs, such as facial expressions and body posture. Notation systems such as HamNoSys have been developed to capture these parameters in their phonetic transcription systems.
DEZ
ORI
  • Extended finger direction
  • Extended finger direction in relation to the body (away/towards the body/etc)
  • Palm orientation
  • Palm orientation in relation to the body (palm up/down/towards body/etc)
SIG
  • Straight (can be targeted)
  • Curved with the direction of the curve (can be targeted)
  • Zigzag, wavy
  • Circular
  • In-place movement (replacement of the handshape, orientation)
TAB
  • Proximity of the hand to a body part
  • Side of the body part
NMS
  • Will be handled in the future


Example

  • Speeds graph for a closed interval from frame N to frame M
  • Red point signifies a major change in the speed of both hands' movement
  • Green point signifies when the change in the movement started (found by backtracking)
ORI
  • Yellow line shows the palm orientation
  • Black line shows the direction of the extended finger
SIG
  • Yellow lines show the hands' movement trajectory on the closed interval from frame N to frame M
TAB
  • Dots on the black background show skeleton keypoints, identified with the OpenPose library
  • Orange dots are the tracked centroids of the palms
  • White dots are the triggered body parts in proximity of the centroids of the palms (according to the proximity matrix heatmap; see the sketch below)
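A minimal sketch of the TAB proximity step above (the keypoint indices, the threshold, and the OpenPose BODY_25 ordering are assumptions): the tracked palm centroid is compared against each detected body keypoint, and body parts within a distance threshold are treated as "triggered".

```python
import numpy as np

# Assumed subset of OpenPose BODY_25 keypoint indices.
BODY_PARTS = {"nose": 0, "neck": 1, "right_shoulder": 2,
              "left_shoulder": 5, "mid_hip": 8}

def triggered_body_parts(palm_centroid, body_keypoints, threshold=0.15):
    """Return the body parts whose keypoint lies within `threshold`
    (normalised image coordinates) of the tracked palm centroid."""
    palm = np.asarray(palm_centroid)
    triggered = []
    for name, idx in BODY_PARTS.items():
        x, y, confidence = body_keypoints[idx]
        if confidence == 0:          # keypoint not detected in this frame
            continue
        if np.linalg.norm(palm - np.array([x, y])) < threshold:
            triggered.append(name)
    return triggered

# Hypothetical frame: palm centroid near the nose and neck keypoints.
keypoints = np.zeros((25, 3))
keypoints[0] = [0.50, 0.20, 0.9]     # nose (x, y, confidence)
keypoints[1] = [0.50, 0.30, 0.9]     # neck
print(triggered_body_parts([0.52, 0.31], keypoints))   # -> ['nose', 'neck']
```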


Live Demo


https://github.com/mocialov/LinguisticVisualFeatureVector

Handshape Recognition

Although HamNoSys uses more general handshapes to describe the DEZ parameter, we were interested in recognising the handshapes of a particular sign language, Danish Sign Language, for which a public dataset is available. The dataset consists of isolated videos of people signing one sign per video. In addition, the dataset has an XML file that provides, among other information, the handshapes in each video. The annotation is therefore weak, as the XML file specifies an overall handshape for the whole video when a sign has only one handshape. Unfortunately, the annotation does not specify when a specific handshape begins and ends within a video.

We were interested in finding out which visual features of the dataset and which machine learning algorithm produce the best handshape recognition results. The figure below shows the four datasets that were generated: 1) raw images, cropped around the handshape, 2) human skeleton features returned by the OpenPose library, 3) distances between the human skeleton features, and 4) black-and-white images of the handshape skeleton.

Since the dataset is weakly labelled, we had to filter out irrelevant frames: for example, frames where the hand was outside the image and frames where the hand was moving into position to execute the sign (known as motion epenthesis). Our approach to filtering out the epenthesis followed the method in [1]. The method produced segmentation points for every video, and only the frames between the segmentation points were taken to contain the annotated handshape for that video.

The machine learning methods applied to the generated datasets were 1) InceptionV3 pre-trained on ImageNet, 2) Decision Tree, Random Forest, MLP, and kNN, 3) Decision Tree, Random Forest, MLP, and kNN, and 4) 2- and 3-layer CNNs, as well as the same architecture pre-trained on the MNIST dataset.
The results in the image below show that the most effective algorithm for recognising the handshapes is the Random Forest on raw OpenPose features, giving ~90% recognition accuracy on the test set, which is composed of 13 handshapes as described in [2].
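A minimal scikit-learn sketch of the best-performing configuration (the feature layout and the random stand-in data are assumptions, not the real dataset): a Random Forest trained on flattened OpenPose hand keypoints to predict one of the 13 handshape classes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical data: one row per frame, 21 hand keypoints x (x, y, confidence)
# flattened into 63 values, plus an integer handshape label (0..12).
rng = np.random.default_rng(0)
X = rng.random((2000, 63))              # stand-in for real OpenPose features
y = rng.integers(0, 13, size=2000)      # stand-in for the 13 handshape labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```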

https://github.com/mocialov/HandshapeRecogniser

Sign Language Modelling

Despite the availability of many alternatives for language modelling, such as count-based n-grams and their variations [1-5], hidden Markov models [6-7], decision trees and decision forests [8], and neural networks [9-10], research in sign language modelling predominantly employs simple n-gram models, as in [11-13].
The reason for the widespread use of n-grams in sign language modelling is the simplicity of the method. There is an obvious disconnect between n-grams and sign language: sign language is perceived visually, while n-grams are commonly applied to text sequence modelling. For this reason, the authors in [6], [13-16] model glosses, such as the ones shown in Figure 2, which are obtained from transcribed sign languages.
Glosses capture the meaning of a sign in a written language, but not its execution. Therefore, the true meaning of what was signed may be lost when working with the higher-level glosses. To overcome this issue and to incorporate valuable information into sign language modelling, additional features are added, such as non-manual features (e.g. facial expressions) [13-15], [17].

For our monolingual dataset, we extracted 810 sentences from the BSL corpus, with an average sentence length of 4.31 words and minimum and maximum lengths of 1 and 13 words respectively.

We explore transfer learning methods, whereby a model developed for one language, such as the pre-processed Penn Treebank (PTB) dataset, is reused as the starting point for a model of a second, less-resourced language, such as British Sign Language (BSL). We examine two transfer learning techniques, fine-tuning and layer substitution, for language modelling of BSL.

The results show an improvement in perplexity when using transfer learning with standard stacked LSTM models, trained initially on the large Penn Treebank corpus of standard written English.
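A minimal Keras sketch of the transfer-learning setup (vocabulary sizes, layer sizes, and sequence length are assumptions, not the exact experimental configuration): a stacked-LSTM language model is pre-trained on PTB, the embedding and output layers are substituted to match the BSL gloss vocabulary, and the shared LSTM layers are then fine-tuned on the BSL sentences.

```python
from tensorflow.keras import layers, models

PTB_VOCAB, BSL_VOCAB, EMB, HIDDEN, SEQ_LEN = 10000, 1800, 128, 256, 12

def build_lm(vocab_size, lstm_layers=None):
    """Stacked-LSTM language model; optionally reuse existing LSTM layers."""
    if lstm_layers is None:
        lstm_layers = [layers.LSTM(HIDDEN, return_sequences=True) for _ in range(2)]
    inp = layers.Input(shape=(SEQ_LEN,))
    x = layers.Embedding(vocab_size, EMB)(inp)
    for lstm in lstm_layers:
        x = lstm(x)
    out = layers.Dense(vocab_size, activation="softmax")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model, lstm_layers

# 1) Pre-train on the large written-English corpus (PTB).
ptb_model, shared_lstms = build_lm(PTB_VOCAB)
# ptb_model.fit(ptb_inputs, ptb_targets, ...)

# 2) Layer substitution: new embedding/output layers for the BSL gloss
#    vocabulary, LSTM layers carried over from the PTB model.
bsl_model, _ = build_lm(BSL_VOCAB, lstm_layers=shared_lstms)

# 3) Fine-tune the whole model on the 810 BSL gloss sentences.
# bsl_model.fit(bsl_inputs, bsl_targets, ...)
# Perplexity is exp(mean cross-entropy) on held-out BSL sentences.
```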

Methods to Epenthesis Modelling

Motion epenthesis (ME) is the arbitrary hand and body movement between signs, usually made to bring the hands into position for the next sign.

Previously, modelling and recognition of ME received very little attention in research on continuous sign language recognition. Instead, research focused on identifying individual signs in continuous signing with techniques such as Dynamic Time Warping, as in [3], or recurrent neural networks for recognising isolated signs, as in [5], thus modelling ME only implicitly.
One of the first works to consider ME in continuous sign language recognition trained parallel HMMs with explicitly trained HMMs for the ME [2]. Later, [4] learned transition-movement models, where the system was iteratively trained both on isolated signs and on the transitions between signs. More recent research explicitly classifies frames in continuous signing videos as either ME or part of a sign by generating a Laplacian matrix of relationships between body joint positions and training a random forest classifier [6].
Apart from machine learning methods for modelling ME for the recognition of continuous signing, some work has taken heuristic approaches to continuous sign language recognition. In particular, [1] used the observation that hand motion during ME is usually faster than hand motion during signing.
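A minimal sketch of that speed heuristic (the threshold value and the normalised-coordinate input are assumptions): frames where the palm centroid moves faster than a threshold are marked as likely epenthesis, the rest as part of a sign.

```python
import numpy as np

def label_epenthesis(palm_centroids, speed_threshold=0.04):
    """palm_centroids: (n_frames, 2) array of tracked palm positions in
    normalised image coordinates. Returns a boolean array where True marks
    frames whose motion is fast enough to be treated as epenthesis."""
    speeds = np.linalg.norm(np.diff(palm_centroids, axis=0), axis=1)
    speeds = np.concatenate([[0.0], speeds])    # pad so there is one label per frame
    return speeds > speed_threshold

# Hypothetical trajectory: slow hold, fast transition, slow hold.
track = np.array([[0.30, 0.50], [0.30, 0.51], [0.40, 0.60],
                  [0.50, 0.70], [0.50, 0.70], [0.50, 0.71]])
print(label_epenthesis(track))   # -> [False False  True  True False False]
```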

Making Sense of the Data (in progress..)

Topology learning and associative memory algorithms.
Topology learning can be used for clustering. Some of the methods allow continuous clustering with very few parameters, which makes them suitable for streaming data, where the number of clusters is unknown a priori.
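As an illustration of the incremental topology-learning family summarised in the table below, here is a compact Growing Neural Gas sketch (textbook GNG rather than any specific variant from the table; all parameter values are assumptions).

```python
import numpy as np

class GrowingNeuralGas:
    """Minimal Growing Neural Gas (Fritzke, 1995) for streaming data."""

    def __init__(self, dim, eps_b=0.05, eps_n=0.006, age_max=50,
                 insert_every=100, alpha=0.5, d=0.995, max_nodes=200):
        rng = np.random.default_rng(0)
        self.nodes = [rng.random(dim), rng.random(dim)]   # node positions
        self.errors = [0.0, 0.0]                          # accumulated error
        self.edges = {}                                   # (i, j) -> age
        self.eps_b, self.eps_n = eps_b, eps_n
        self.age_max, self.insert_every = age_max, insert_every
        self.alpha, self.d, self.max_nodes = alpha, d, max_nodes
        self.t = 0

    def _neighbours(self, i):
        return [j for (a, b) in self.edges for j in (a, b)
                if i in (a, b) and j != i]

    def fit_one(self, x):
        self.t += 1
        x = np.asarray(x, dtype=float)
        dists = [np.linalg.norm(x - w) for w in self.nodes]
        s1, s2 = np.argsort(dists)[:2]

        # Age the winner's edges, accumulate its error, and move it
        # (and its topological neighbours) towards the input.
        for e in list(self.edges):
            if s1 in e:
                self.edges[e] += 1
        self.errors[s1] += dists[s1] ** 2
        self.nodes[s1] += self.eps_b * (x - self.nodes[s1])
        for j in self._neighbours(s1):
            self.nodes[j] += self.eps_n * (x - self.nodes[j])

        # Connect winner and runner-up; prune old edges
        # (isolated nodes are kept here for brevity).
        self.edges[tuple(sorted((int(s1), int(s2))))] = 0
        self.edges = {e: a for e, a in self.edges.items() if a <= self.age_max}

        # Periodically insert a node between the highest-error node
        # and its highest-error neighbour.
        if self.t % self.insert_every == 0 and len(self.nodes) < self.max_nodes:
            q = int(np.argmax(self.errors))
            nbrs = self._neighbours(q)
            if nbrs:
                f = max(nbrs, key=lambda j: self.errors[j])
                r = len(self.nodes)
                self.nodes.append(0.5 * (self.nodes[q] + self.nodes[f]))
                self.edges.pop(tuple(sorted((q, f))), None)
                self.edges[tuple(sorted((q, r)))] = 0
                self.edges[tuple(sorted((f, r)))] = 0
                self.errors[q] *= self.alpha
                self.errors[f] *= self.alpha
                self.errors.append(self.errors[q])

        # Global error decay.
        self.errors = [e * self.d for e in self.errors]

# Stream two Gaussian blobs; the network grows to cover both clusters.
rng = np.random.default_rng(1)
gng = GrowingNeuralGas(dim=2)
for _ in range(5000):
    centre = rng.choice([0.2, 0.8])
    gng.fit_one(rng.normal(centre, 0.05, size=2))
print(len(gng.nodes), "nodes,", len(gng.edges), "edges")
```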


  
Column categories from the original table: Neural Networks, Competitive Learning, Incremental Topology Learning, Associative Memory.

Method | Features | Variations
SOM (Kohonen's Feature Maps) | Able to project highly dimensional data onto a lower (2D) dimensional space. Must specify the number of nodes in advance. Decaying parameters over time. Number of nodes is not pre-defined. Pre-specified and fixed topology that matches the data. Maximum number of nodes must be defined. Has two layers: input and map layer. Forms spatial clusters. Performs vector quantization. May be initialised as a single line of neurons, a 2D grid, or any other structure. | Hierarchical SOM, Parallel SOM, eSOM, Self-Growing SOM
BAM | Forgets previously learned data. |
Hopfield | Forgets previously learned data, even when learning in batch. |
SOIAM | Does not cope with temporal sequences. Based on SOINN. Associative memory. |
SOINN | Insertion of nodes is stopped and restarted with new input patterns. Avoids indefinite increase of nodes. Copes with temporal sequences. Difficult to choose when to stop first-layer training and start second-layer training. | M-SOINN, E-SOINN
GAM | 3-layer input / symbol-memorisation / symbol-association (grounding) architecture. |
GNG | Number of nodes is not predefined. Maximum number of nodes must be defined. Competitive Hebbian Learning (CHL) is not optional. Terminates when the network reaches a user-defined size. Has issues adapting to rapidly changing distributions. Nodes are added after a certain number of iterations. The structure of the network is not constrained. Can map inputs onto different dimensionalities within the same network. | GNGU
eSOM | Faster than SOM. Has two layers: input and map layer. Forms spatial clusters. Does not perform vector quantization. No topological constraints. Nodes are not organised into 1D or 2D. |
E-SOINN | Works better on datasets and uses fewer parameters than SOINN. Single-layered network. Result depends on the sequence of the input data. Uses Euclidean distance for finding the nearest node, which may not scale to higher dimensions. |
GNGU | Can adapt to rapidly changing distributions by relocating less useful nodes: it removes nodes that contribute little to the reduction of the error and inserts nodes where they would contribute most. Nodes with low utility are removed. |
M-SOINN | Allows setting similarity thresholds for all nodes. Moves nodes and their neighbours closer to the input. Prunes clusters with only a few nodes. |
LVQ | Number of neurons is pre-defined. Not suitable for incremental learning. |
GCS | Based on SOM. Number of neurons is not predefined. Maximum number of nodes must be defined. Nodes are inserted after a certain number of iterations. Topology-preserving. The GCS network structure is constrained. |
MAM | Number of associations must be pre-defined. With 3 layers it can deal with 3-3 associations, but not 4-4 associations. |
K-means | Must specify the number of clusters. |
KFMAM | |
KFMAM-FW | Fixed weights. May enter an infinite loop with edges between nodes if the maximum number of nodes is not given. |
ANG | Enables incremental learning. Treats two overlapping clusters as one. Can be used for clustering. |
Self-Growing SOM | Does not need a specified number of nodes. Every specified number of iterations, neurons are added into the map space; instead of one neuron and learned connections, a row or a column of neurons is added, to maintain the structure of the SOM. |
SOINN-AM | Given a pattern, reconstructs it. |
SOM-AM | Associative memory for temporal sequences. Initial weights are crucial. |
NG | Parameters decay over time. CHL is optional. Can map inputs onto different dimensionalities within the same network. |
PSOM | Interpolation approach to self-organisation. |
LB-SOINN | |
GTM | Alternative to SOM. Does not require decaying parameters. Maps highly dimensional data onto lower dimensional data and adds noise. Uses RBFs for nonlinear mapping from input space to output space. Good for representation of the data. |
TRN | |
GGG | |
GM? | |
CCLA | Growing network. Supervised. Nodes are added to the hidden layer; new nodes act as feature detectors. |
Incremental Growing Grid | Nodes are added to the perimeter of the grid, which grows to cover the input space. |
RCE | Uses prototype vectors to describe particular classes. If none of the vectors is close to the input, a new class is generated. Prototypes cannot move once they have been placed. |
ART | More complex example of RCE. Adds new categories when a mismatch is found between the input and existing categories. |
CLAM | A few nodes participate in classification, rather than a winner-takes-all approach. |
GWR | Does not connect the winning and the second-winning node. New nodes can be added at any time, not only after a certain number of iterations. Can be used as a novelty detector (if the node that fires has not fired before or fires infrequently, then the input is novel). |
XOM | ? |

Sign Language Motion Capture Using Optical Marker-Based Systems

Sign Language (SL) data (usually motion) capture hardware high-level overview:

1. Optical cameras - collect light
    a) RGB / RGB-D - only collects light; sensitive to variations
        i) Marker-based - calculates position and orientation using markers; requires marker identification
        ii) Marker-less - has to extract silhouettes/edges to calculate position and orientation
    b) Spectral - emits and collects light; not too sensitive to variations (e.g. infra-red)
        i) Marker-based - calculates position and orientation using light coming from the markers
            1) Active markers - markers are identified distinctly by capturing the frequency of the light that they emit
            2) Passive markers - capture the reflection from markers; markers need to be assigned labels manually
                a) Concave - reflect light in any direction
                b) Flat - poor reflection under affine transformations
        ii) Marker-less - has to extract silhouettes/edges to calculate position and orientation
2. Gyroscopes - measure rotation
3. Accelerometers - measure acceleration (e.g. IMU sensors)
4. Flex sensors - measure degree of deformation / bending (e.g. data-glove)

RGB marker-less: region-of-interest detection, feature extraction, feature tracking
Spectral passive marker-based: raw, labelled data
Examples

The choice of hardware is a balance between the precision of the data and the degree to which free movement is restricted. The best option is to choose based on the context/goal. Acquiring various expensive, heterogeneous pieces of hardware is inevitable if one wants to gather accurate data while minimally restricting the signer.

Sign Language (SL) data (usually motion) capture steps:
  1. Prepare environment
    • Set up the hardware (location, orientation, settings). Think about whether you need a large or a small capture volume; this will affect the captured data
    • Determine the location of the markers (if any). Try to simplify joints (e.g. do not place markers on every joint)
      • Place markers on the signer (avoid moving markers between recordings / signers)
    • Cover all reflecting surfaces
    • Adjust light / keep light constant
  2. If there is no model, create one (best to have a signer-specific model)
    • Record Data
    • Post-Processing
      • Label captured markers
      • Handle overlapping markers (avoid reversing markers' identifiers)
      • Fill gaps (linear, cubic, or using neighbouring markers)
      • Remove noise (e.g. reflections)
      • Generate model
  3. If a model exists, use it
    • Record Data
    • Apply Model
    • Post-Processing
      • Label the unlabelled markers
      • Handle overlapping markers (avoid reversing markers' identifiers)
      • Fill gaps (linear, cubic, or using neighbouring markers) - see the sketch after this list
      • Remove noise (e.g. reflections)
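A minimal sketch of the gap-filling step referenced above (the marker format and the NaN convention for missing frames are assumptions): occluded frames of a marker trajectory are filled by linear or cubic interpolation over the frames where the marker was seen.

```python
import numpy as np
from scipy.interpolate import interp1d

def fill_gaps(trajectory, kind="cubic"):
    """trajectory: (n_frames, 3) array of one marker's x/y/z positions,
    with NaN rows where the marker was occluded or not reconstructed.
    Returns a copy with the gaps interpolated ('linear' or 'cubic')."""
    filled = trajectory.copy()
    frames = np.arange(len(trajectory))
    valid = ~np.isnan(trajectory).any(axis=1)
    for axis in range(trajectory.shape[1]):
        f = interp1d(frames[valid], trajectory[valid, axis],
                     kind=kind, fill_value="extrapolate")
        filled[~valid, axis] = f(frames[~valid])
    return filled

# Hypothetical marker track with a two-frame gap.
track = np.array([[0.0, 0.0, 1.0], [1.0, 0.1, 1.0], [np.nan] * 3,
                  [np.nan] * 3, [4.0, 0.4, 1.0], [5.0, 0.5, 1.0]])
print(fill_gaps(track, kind="linear")[2:4])   # approximately [[2, 0.2, 1], [3, 0.3, 1]]
```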

Sign Language Analysis Surveys

Definitions:
  1. SL - sign language
  2. SLR - sign language recognition (analysis)
  3. NMS - non-manual sign(s)
Common main points from survey papers on SLR:
  1. Generally, SLR is about classifying and understanding the meaning of signs. In more detail, understanding SL involves recognition (classification) of facial expressions and gestures, tracking (if a tracking method is used), and motion analysis
  2. SL is multimodal and is performed in parallel (only NMS can be performed in parallel). Therefore, SLR requires simultaneous observations of separate body articulations. The authors propose parallel HMMs instead of the classic HMM
  3. Research focusses on feature extraction, classification, and scaling to large vocabularies
  4. Signs are affected by: a) showing an action that is performed in time, b) inter-personal signing variation, c) emphasis of something, d) one sign influencing another, e) transitions from one sign to the next
  5. Distinguish tracking and non-tracking methods
  6. Identify importance of NMS
  7. Very small corpora/datasets (especially for the Kinect)
  8. Two approaches for classification: a) single classification b) classifying (simultaneous) components (must integrate/combine extracted features to describe a sign)
  9. Distinguish techniques based on what body part is being detected/classified/recognised: hand, fingers, NMS (temporal, appearance, encoding, positional dimensions), body, and head
  10. Data acquisition with data gloves (advantages: precision; disadvantages: price, cumbersome for the signer, resulting in unnatural signing), attached accelerometers (same as data gloves), cameras with or without coloured gloves (advantages: price, expressiveness; disadvantages: pre-processing, motion blur, no depth information), and Kinect (advantages: depth information; disadvantages: price)
Specific additional points from Ong et al on SLR:
  1. Many works incorrectly treat SL as gestures (which causes incorrect use of techniques and methods).
  2. Gestures are different from SL; although SL is more complicated (confined), many gesture recognition techniques are applicable to SL.
  3. Although SL is a natural language, just like speech, many speech techniques are not suitable for SLR.
  4. Kinect improves classification accuracy
  5. Tracking is hard because conversations are fast and images are blurred and occluded
  6. Research frontiers: continuous SLR through not only simple segmentation but epenthesis modelling, signer independence, fusion of multimodal data, use of linguistics theories to improve recognition, generalisation to complex corpora
Specific additional points from Cooper et al on SLR:
  1. Distinguishes simple, not-so-simple, and robust tracking methods
  2. Simple methods for tracking and occlusion are not satisfactory without coloured or data gloves
  3. Advantage of separating feature-level and sign-level classification is that fewer classes (finite number) need to be distinguished at the feature level.
  4. Approaches that classify whole signs do not scale to larger datasets, while approaches that classify at the feature level do: feature-level classifiers do not need to be retrained when new signs are added
  5. The success of a classifier (either vision-based or glove-based) is measured by: a) classification accuracy, b) handling of continuous signing (general approaches: segmentation, boundary detection with automatically learned features, or epenthesis modelling, which has been shown to be more advantageous for classification accuracy), c) handling of grammatical processes in sign languages (do not 'squeeze' a sign into a fixed window; tolerate variable timing), d) signer independence

Learning Domain-Specific Policy for Sign Language Recognition

Learned task-oriented policies are applied in navigation [1-2], dialogue [3], control [4-5], multi-agent learning [6], and many more applications mentioned by Deisenroth M.P., Neumann G. and Peters J. in the survey on policy search in robotics [7]. The policy for a specific task can be learned through demonstrations [8-10], guidance, or feedback (reward-driven approach) on overall completion of the task [11].
We were interested in navigation and manual-following tasks, touching on appealing recent general-purpose approaches that are currently used in dialogue management. These specific examples have concentrated mostly on policy learning for uni-modal interaction, with the main modality being speech, text, or vision.
We have focused on one-way training, where the human provides instructions to an agent and expects it to follow these instructions to reach the final goal as closely as possible to what the human expects.
Vogel A. and Jurafsky D. [1] as well as Branavan S.R.K. et al. [12] made use of reinforcement learning for mapping instructions to actions in a navigational map task or for mapping manual text onto actions. Dipendra K. Misra used reinforcement learning with reward shaping to train a neural network that maps visual observations and textual descriptions onto actions in a simulated environment [13]. In none of these cases did the constructed model include semantic and syntactic knowledge, although Vogel A. and Jurafsky D. seeded their method with a set of spatial terms for learning more complicated features, such as the position relative to a landmark. Nevertheless, in their case, learning basic navigation was achieved without semantic and syntactic knowledge.
from state: start go to: start
for utterance: okay

from state: start go to: caravan_park
for utterance: starting off we are above a caravan park
...

The listing above presents a small excerpt from the overall output of the policy, following which the agent moves from one landmark to another given a single utterance. For example, for the first utterance, "okay", the agent moves from the start landmark to the start landmark, i.e. it does not move. The next utterance mentions "caravan park", so the agent moves from the start landmark to the caravan park landmark.

The figure above presents the original map and the intended trajectory on the left, and the trajectory created by the agent following the learned policy on the right. The right side shows the final outcome after all the utterances have been presented and all instructions followed.
How is this useful to the computer-based understanding of the sign languages?
There are instruction-based sign language datasets built around a task, similar to the HCRC Map Task [14]. With these datasets, a model-free policy could be learned that responds to navigation commands given in a sign language. Unfortunately, we were not able to acquire the datasets for further experiments.
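As a toy, heavily simplified illustration of the idea (not the actual reward design or features of [1], [12], or [13]; the utterances and landmarks are made up), a tabular, bandit-style learner can map each utterance to the landmark the agent should move to, producing output in the same format as the listing above.

```python
import random
from collections import defaultdict

# States and actions are landmarks; the reward is 1 when the agent moves
# to the landmark the instruction refers to.
LANDMARKS = ["start", "caravan_park", "old_mill", "finish"]
EPISODES, EPSILON, ALPHA = 2000, 0.1, 0.5

def mentioned(utterance):
    """Landmark whose name appears in the utterance (the 'gold' target)."""
    return next((l for l in LANDMARKS if l.replace("_", " ") in utterance), None)

utterances = ["okay lets go", "we are above a caravan park",
              "head past the old mill", "and stop at the finish"]

Q = defaultdict(float)                      # Q[(utterance, action)]
for _ in range(EPISODES):
    state = "start"
    for utt in utterances:
        if random.random() < EPSILON:
            action = random.choice(LANDMARKS)             # explore
        else:
            action = max(LANDMARKS, key=lambda a: Q[(utt, a)])  # exploit
        target = mentioned(utt) or state                  # no mention: stay put
        reward = 1.0 if action == target else 0.0
        Q[(utt, action)] += ALPHA * (reward - Q[(utt, action)])
        state = action

# Greedy policy after training, printed in the style of the listing above.
state = "start"
for utt in utterances:
    action = max(LANDMARKS, key=lambda a: Q[(utt, a)])
    print(f"from state: {state} go to: {action}  for utterance: {utt}")
    state = action
```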

https://github.com/mocialov/RL_HCRC_MapTask_DirectionFollowing_Simplified

Sign Language Learning Systems

Fig. 3: Sign Language Tutoring Systems
Figure 3 lists a number of tutoring systems from 2005 until 2016.

  • Three categories are distinguished among the identified systems. The categorisation is rough, as the categories are not mutually exclusive.
    1. Systems in the first category teach sign language to their users without any explicit knowledge evaluation.
    2. The teach & test category contains the most systems and lists those that perform both teaching and testing of the user's knowledge.
    3. The third category is more specific than the first two, as it names systems that foster interaction and communication among their users (e.g. deaf children, or non-deaf family members and deaf children).
  • Colour-coding in the figure groups systems based on their target groups: non-deaf parents of deaf children, deaf children, both non-deaf parents of deaf children and deaf children, any children, and any users.
  • Two categories include items that are highlighted with underlining and with bold, italic text. These items correspond to the names of the systems that feature robotic devices to perform teaching and/or testing of the user's knowledge.

Gesture recognition algorithm using OpenCV, a string of feature graphs, HyperNEAT with novelty search, and resilient backpropagation

Fig. 1: Gesture classifier
Figure 1 shows a gesture classifier that takes a raw video input through a number of stages.
  1. Pre-defined features are extracted from the video.
  2. The extracted features for every frame of the video are compared, and the differences are recorded in an affinity matrix that describes the similarity of every single frame with every other frame in the video (see the sketch after this list).
  3. Single-layer detector neural networks are evolved using novelty search to extract unique features from the affinity matrix.
  4. The extracted features are fed into the final classifier neural network, which is trained to classify the gestures in the video.
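A minimal sketch of step 2 above (the per-frame features and the Gaussian kernel are assumptions; the original pipeline records frame-to-frame differences of its own pre-defined features): building a frame-by-frame affinity matrix from extracted features.

```python
import numpy as np

def affinity_matrix(frame_features, sigma=1.0):
    """frame_features: (n_frames, n_features) array of per-frame features.
    Returns an (n_frames, n_frames) matrix whose entry (i, j) measures how
    similar frame i is to frame j (Gaussian kernel on Euclidean distance)."""
    diffs = frame_features[:, None, :] - frame_features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return np.exp(-(dists ** 2) / (2.0 * sigma ** 2))

# Hypothetical video: 5 frames with 3 extracted features each.
features = np.random.default_rng(0).random((5, 3))
A = affinity_matrix(features)
print(A.shape, A[0, 0])      # (5, 5) and 1.0 on the diagonal
```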

Fig 2: Gesture classifier evaluation

Figure 2 shows the evaluation results of the gesture classifier. 
  • Small human-subject and robot-subject datasets were created for controlled algorithm testing.
  • The algorithm was also tested on the public ChaLearn [3] dataset.
  • The algorithm was not tailored to the individual datasets.

Implementation: https://github.com/mocialov/MSc-in-Robotics-and-Autonomous-Systems/tree/master/gesture_recognition_pipeline