Why another HMM Library?

Several general machine learning toolkits have become popular over the years, such as Weka in Java, Scikit-Learn in Python, or more recently MLPack in C++. However, none of these libraries was suited to the purpose of this thesis. Most HMM implementations are oriented towards classification, and they often implement only offline inference using the Viterbi algorithm.

In speech processing, the Hidden Markov Model Toolkit (HTK) has become a standard in Automatic Speech Recognition, and gave birth to a branch oriented towards synthesis, called HTS. Both libraries present many features specific to speech synthesis that do not yet match our use-cases in movement and sound processing, and their complex structure makes them difficult to embed in other software.

Above all, we did not find any library explicitly implementing the Hierarchical HMM, nor the regression methods based on GMMs and HMMs. For these reasons, we decided to start a novel implementation of these methods with the following constraints:

Real-Time: Inference must be performed continuously, meaning that the models must update their internal state and prediction at each new observation to allow continuous recognition and generation.
Interactive: The library must be compatible with an interactive learning workflow that allows users to easily define and edit training sets, train models, and evaluate the results through direct interaction. All models must be able to learn from few examples (possibly a single demonstration).
Portable: In order to be integrated within various software and platforms, the library must be portable, cross-platform, and lightweight.
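
To make the real-time constraint concrete, the following minimal sketch shows causal, frame-by-frame HMM filtering: the state distribution is updated at each new observation using only past data, in contrast with offline Viterbi decoding over a complete sequence. This is an illustration under stated assumptions, not the library's code: the class `OnlineForward`, its parameters, the 1-D Gaussian emissions, and the numeric values are all hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian; stands in for a state's emission model."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


class OnlineForward:
    """Hypothetical helper (not part of XMM): forward-only HMM filtering."""

    def __init__(self, prior, transition, means, stds):
        self.prior = prior            # initial state probabilities
        self.transition = transition  # transition[i][j] = P(state j | state i)
        self.means = means            # emission mean of each state
        self.stds = stds              # emission standard deviation of each state
        self.alpha = None             # filtered state distribution so far
        self.log_likelihood = 0.0     # cumulative log-likelihood of the stream

    def update(self, obs):
        """Consume one observation and return the updated state posterior."""
        n = len(self.prior)
        if self.alpha is None:
            predicted = list(self.prior)
        else:
            # Propagate the previous posterior through the transition matrix.
            predicted = [
                sum(self.alpha[i] * self.transition[i][j] for i in range(n))
                for j in range(n)
            ]
        # Weight each predicted state by how well it explains the new frame.
        unnorm = [
            predicted[j] * gaussian_pdf(obs, self.means[j], self.stds[j])
            for j in range(n)
        ]
        scale = sum(unnorm)
        if scale <= 0.0:
            raise ValueError("observation has zero likelihood under all states")
        self.alpha = [u / scale for u in unnorm]
        self.log_likelihood += math.log(scale)
        return self.alpha


# Illustrative left-to-right model with three states (values are made up).
hmm = OnlineForward(
    prior=[1.0, 0.0, 0.0],
    transition=[[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]],
    means=[0.0, 1.0, 2.0],
    stds=[0.3, 0.3, 0.3],
)
for frame in [0.0, 0.1, 0.9, 1.1, 1.9, 2.0]:
    posterior = hmm.update(frame)
    print(frame, [round(p, 3) for p in posterior])
```

Normalizing at each step keeps the computation numerically stable on long streams, and the accumulated log of the scaling factors gives the running log-likelihood that a recognizer could compare across models.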

We chose C++, which is both efficient and easy to integrate within other software and languages such as Max and Python. We now detail the four models implemented to date, the architecture of the library, and the proposed Max/MuBu implementation with several examples.

Four Models

The implemented models are summarized in the following table. Each of the four models addresses a different combination of the multimodal and temporal aspects. We implemented two instantaneous models based on Gaussian Mixture Models and two temporal models with a hierarchical structure, based on an extension of the basic Hidden Markov Model (HMM) formalism.

\	Movement	Multimodal
Instantaneous	Gaussian Mixture Model (GMM)	Gaussian Mixture Regression (GMR)
Temporal	Hierarchical Hidden Markov Model (HHMM)	Multimodal Hierarchical Hidden Markov Model (MHMM)

Gaussian Mixture Models (GMMs) are instantaneous movement models. The input data associated with a class defined by the training sets is abstracted by a mixture (i.e. a weighted sum) of Gaussian distributions. This representation allows recognition in the performance phase: for each input frame, the model calculates the likelihood of each class (Figure 1 (a)).
Gaussian Mixture Regression (GMR) is a straightforward extension of Gaussian Mixture Models used for regression. Trained with multimodal data, GMR allows for predicting the features of one modality (e.g. sound) from the features of another (e.g. movement) through non-linear regression between both feature sets (Figure 1 (b)).
Hierarchical HMM (HHMM) integrates a high-level structure that governs the transitions between classical HMM structures representing the temporal evolution of low-level movement segments. In the performance phase of the system, the hierarchical model estimates the likeliest gesture according to the transitions defined by the user. The system continuously estimates the likelihood of each model, as well as the time progression within the original training phrases (Figure 1 (c)).
Multimodal Hierarchical HMM (MHMM) allows for predicting a stream of sound parameters from a stream of movement features. It simultaneously takes into account the temporal evolution of movement and sound as well as their dynamic relationship according to the given example phrases. In this way, it guarantees the temporal consistency of the generated sound, while realizing the trained temporal movement-sound mappings (Figure 1 (d)).
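
The two instantaneous models can be illustrated with a small sketch. It assumes a 1-D movement feature `x` and a 1-D sound parameter `y`, with each mixture component stored as a dictionary of joint Gaussian parameters. The function names, dictionary keys, and numeric values are hypothetical and chosen for illustration; they do not reflect the library's API. Recognition compares the likelihood of a frame under each class mixture, and regression averages the per-component conditional means, weighted by each component's responsibility for the input.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_likelihood(x, components):
    """Likelihood of a movement frame under one class mixture (weighted sum)."""
    return sum(
        c["weight"] * normal_pdf(x, c["mean_x"], c["var_x"]) for c in components
    )


def classify(x, class_models):
    """Normalized likelihood of each class for a single input frame."""
    likelihoods = {
        label: gmm_likelihood(x, comps) for label, comps in class_models.items()
    }
    total = sum(likelihoods.values())
    return {label: value / total for label, value in likelihoods.items()}


def gmr_predict(x, components):
    """Expected sound parameter given a movement frame (1-D GMR)."""
    # Responsibility of each component for the observed input.
    resp = [c["weight"] * normal_pdf(x, c["mean_x"], c["var_x"]) for c in components]
    total = sum(resp)
    y = 0.0
    for r, c in zip(resp, components):
        # Conditional mean of y given x for a joint Gaussian component.
        cond_mean = c["mean_y"] + c["cov_xy"] / c["var_x"] * (x - c["mean_x"])
        y += (r / total) * cond_mean
    return y


# Illustrative two-component mixture linking movement to sound (made-up values).
mapping = [
    {"weight": 0.5, "mean_x": 0.0, "var_x": 1.0, "mean_y": 1.0, "cov_xy": 0.5},
    {"weight": 0.5, "mean_x": 4.0, "var_x": 1.0, "mean_y": 3.0, "cov_xy": -0.2},
]
print(gmr_predict(1.0, mapping))

# Recognition: one mixture per class, compared frame by frame.
classes = {"slow": [mapping[0]], "fast": [mapping[1]]}
print(classify(0.5, classes))
```

Because the responsibilities change smoothly with the input, the prediction blends the local linear regressions of neighbouring components, which is what makes the overall mapping non-linear.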

Figure 1: Schematic representation of the four implemented models

Architecture

Our implementation pays particular attention to the interactive training procedure and to the real-time constraints of the performance mode. The library is built upon four components representing phrases, training sets, models and model groups, as represented in Figure 2. A phrase is a multimodal data container used to store training examples. A training set aggregates phrases associated with labels. It provides a set of functions for interactive recording, editing and annotation of the phrases. Each instance of a model is connected to a training set that provides access to the training phrases. Performance functions are designed for real-time usage: they update the internal state of the model and the results for each new observation of a new movement. The library is portable and cross-platform. It defines a specific format for exchanging trained models, and provides Python bindings for scripting or offline processing.
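
The relationship between these components can be sketched as follows. This is a conceptual illustration only: the names `Phrase`, `TrainingSet`, `ModelGroup`, their methods, and the toy `mean_model` are hypothetical and do not reproduce the library's actual classes or signatures. The point is the data flow: phrases are recorded and labeled inside a training set, and a model group retrains one model per label from the phrases currently in that set.

```python
from dataclasses import dataclass, field


@dataclass
class Phrase:
    """Multimodal container: one list of frames per modality."""
    label: str
    movement: list = field(default_factory=list)
    sound: list = field(default_factory=list)

    def record(self, movement_frame, sound_frame=None):
        """Append one frame; the sound modality is optional."""
        self.movement.append(movement_frame)
        if sound_frame is not None:
            self.sound.append(sound_frame)


class TrainingSet:
    """Aggregates labeled phrases and supports interactive editing."""

    def __init__(self):
        self._phrases = {}
        self._next_index = 0

    def add_phrase(self, phrase):
        index = self._next_index
        self._phrases[index] = phrase
        self._next_index += 1
        return index

    def remove_phrase(self, index):
        del self._phrases[index]

    def relabel(self, index, label):
        self._phrases[index].label = label

    def labels(self):
        return sorted({p.label for p in self._phrases.values()})

    def phrases_for(self, label):
        return [p for p in self._phrases.values() if p.label == label]


class ModelGroup:
    """Holds one model per label, trained from the connected training set."""

    def __init__(self, training_set, model_factory):
        self.training_set = training_set
        self.model_factory = model_factory  # callable: list of phrases -> model
        self.models = {}

    def train(self):
        self.models = {
            label: self.model_factory(self.training_set.phrases_for(label))
            for label in self.training_set.labels()
        }


def mean_model(phrases):
    """Toy stand-in for a real model: mean of all scalar movement frames."""
    frames = [f for p in phrases for f in p.movement]
    return sum(frames) / len(frames)


training_set = TrainingSet()
wave = Phrase(label="wave")
wave.record(1.0)
wave.record(3.0)
training_set.add_phrase(wave)
group = ModelGroup(training_set, mean_model)
group.train()
print(group.models)
```

Keeping the training set separate from the models means that editing or relabeling a phrase only requires calling `train()` again, which fits the interactive workflow where a single demonstration may be enough.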

Figure 2: Architecture of the XMM library

Related Publications

J. Françoise, N. Schnell, R. Borghesi, and F. Bevilacqua, Probabilistic Models for Designing Motion and Sound Relationships. In Proceedings of the 2014 International Conference on New Interfaces for Musical Expression (NIME’14), London, UK, 2014.

J. Françoise, N. Schnell, and F. Bevilacqua, A Multimodal Probabilistic Model for Gesture-based Control of Sound Synthesis. In Proceedings of the 21st ACM International Conference on Multimedia (MM’13), Barcelona, Spain, 2013.

XMM - Probabilistic Models for Motion Recognition and Mapping