Machine Learning Speech Recognition

Keeping up my yearly blogging cadence, it’s about time I wrote to let people know what I’ve been up to for the last year or so at Mozilla. People keeping up would have heard of the sad news regarding the Connected Devices team here. While I’m sad for my colleagues and quite disappointed in how this transition period has been handled as a whole, thankfully this hasn’t adversely affected the Vaani project. We recently moved to the Emerging Technologies team and have refocused on the technical side of things, a side that I think most would agree is far more interesting, and also far more suited to Mozilla and our core competence.

Project DeepSpeech

So, out with Project Vaani, and in with Project DeepSpeech (name will likely change…) – Project DeepSpeech is a machine learning speech-to-text engine based on the Baidu Deep Speech research paper. We use a particular layer configuration and initial parameters to train a neural network to translate from processed audio data to English text. You can see roughly how we’re progressing with that here. We’re aiming for a 10% Word Error Rate (WER) on English speech at the moment.

You may ask, why bother? Google and others provide state-of-the-art speech-to-text in multiple languages, and in many cases you can use it for free. There are multiple problems with existing solutions, however. First and foremost, most are not open-source/free software (at least none that could rival the error rate of Google). Secondly, you cannot use these solutions offline. Third, you cannot use these solutions for free in a commercial product. The reason a viable free software alternative hasn’t arisen is mostly down to the cost and restrictions around training data. This makes the project a great fit for Mozilla as not only can we use some of our resources to overcome those costs, but we can also use the power of our community and our expertise in open source to provide access to training data that can be used openly. We’re tackling this issue from multiple sides, some of which you should start hearing about Real Soon Now™.

The whole team has made contributions to the main code. In particular, I’ve been concentrating on exporting our models and writing clients so that the trained model can be used in a generic fashion. This lets us test and demo the project more easily, and also provides a lower barrier for entry for people that want to try out the project and perhaps make contributions. One of the great advantages of using TensorFlow is how relatively easy it makes it to both understand and change the make-up of the network. On the other hand, one of the great disadvantages of TensorFlow is that it’s an absolute beast to build and integrates very poorly with other open-source software projects. I’ve been trying to overcome this by writing straight-forward documentation, and hopefully in the future we’ll be able to distribute binaries and trained models for multiple platforms.

Getting Involved

We’re still at a fairly early stage at the moment, which means there are many ways to get involved if you feel so inclined. The first thing to do, in any case, is to just check out the project and get it working. There are instructions provided in READMEs to get it going, and fairly extensive instructions on the TensorFlow site on installing TensorFlow. It can take a while to install all the dependencies correctly, but at least you only have to do it once! Once you have it installed, there are a number of scripts for training different models. You’ll need a powerful GPU(s) with CUDA support (think GTX 1080 or Titan X), a lot of disk space and a lot of time to train with the larger datasets. You can, however, limit the number of samples, or use the single-sample dataset (LDC93S1) to test simple code changes or behaviour.

One of the fairly intractable problems about machine learning speech recognition (and machine learning in general) is that you need lots of CPU/GPU time to do training. This becomes a problem when there are so many initial variables to tweak that can have dramatic effects on the outcome. If you have the resources, this is an area that you can very easily help with. What kind of results do you get when you tweak dropout slightly? Or layer sizes? Or distributions? What about when you add or remove layers? We have fairly powerful hardware at our disposal, and we still don’t have conclusive results about the affects of many of the initial variables. Any testing is appreciated! The Deep Speech 2 paper is a great place to start for ideas if you’re already experienced in this field. Note that we already have a work-in-progress branch implementing some of these ideas.

Let’s say you don’t have those resources (and very few do), what else can you do? Well, you can still test changes on the LDC93S1 dataset, which consists of a single sample. You won’t be able to effectively tweak initial parameters (as unsurprisingly, a dataset of a single sample does not represent the behaviour of a dataset with many thousands of samples), but you will be able to test optimisations. For example, we’re experimenting with model quantisation, which will likely be one of multiple optimisations necessary to make trained models usable on mobile platforms. It doesn’t particularly matter how effective the model is, as long as it produces consistent results before and after quantisation. Any optimisation that can be made to reduce the size or the processor requirement of training and using the model is very valuable. Even small optimisations can save lots of time when you start talking about days worth of training.

Our clients are also in a fairly early state, and this is another place where contribution doesn’t require expensive hardware. We have two clients at the moment. One written in Python that takes advantage of TensorFlow serving, and a second that uses TensorFlow’s native C++ API. This second client is the beginnings of what we hope to be able to run on embedded hardware, but it’s very early days right now.

And Finally

Imagine a future where state-of-the-art speech-to-text is available, for free (in cost and liberty), on even low-powered devices. It’s already looking like speech is going to be the next frontier of human-computer interaction, and currently it’s a space completely tied up by entities like Google, Amazon, Microsoft and IBM. Putting this power into everyone’s hands could be hugely transformative, and it’s great to be working towards this goal, even in a relatively modest capacity. This is the vision, and I look forward to helping make it a reality.

Open Source Speech Recognition

I’m currently working on the Vaani project at Mozilla, and part of my work on that allows me to do some exploration around the topic of speech recognition and speech assistants. After looking at some of the commercial offerings available, I thought that if we were going to do some kind of add-on API, we’d be best off aping the Amazon Alexa skills JS API. Amazon Echo appears to be doing quite well and people have written a number of skills with their API. There isn’t really any alternative right now, but I actually happen to think their API is quite well thought out and concise, and maps well to the sort of data structures you need to do reliable speech recognition.

So skipping forward a bit, I decided to prototype with Node.js and some existing open source projects to implement an offline version of the Alexa skills JS API. Today it’s gotten to the point where it’s actually usable (for certain values of usable) and I’ve just spent the last 5 minutes asking it to tell me Knock-Knock jokes, so rather than waste any more time on that, I thought I’d write this about it instead. If you want to try it out, check out this repository and run npm install in the usual way. You’ll need pocketsphinx installed for that to succeed (install sphinxbase and pocketsphinx from github), and you’ll need espeak installed and some skills for it to do anything interesting, so check out the Alexa sample skills and sym-link the ‘samples‘ directory as a directory called ‘skills‘ in your ferris checkout directory. After that, just run the included example file with node and talk to it via your default recording device (hint: say ‘launch wise guy‘).

Hopefully someone else finds this useful – I’ll be using this as a base to prototype further voice experiments, and I’ll likely be extending the Alexa API further in non-standard ways. What was quite neat about all this was just how easy it all was. The Alexa API is extremely well documented, Node.js is also extremely well documented and just as easy to use, and there are tons of libraries (of varying quality…) to do what you need to do. The only real stumbling block was pocketsphinx’s lack of documentation (there’s no documentation at all for the Node bindings and the C API documentation is pretty sparse, to say the least), but thankfully other members of my team are much more familiar with this codebase than I am and I could lean on them for support.

I’m reasonably impressed with the state of lightweight open source voice recognition. This is easily good enough to be useful if you can limit the scope of what you need to recognise, and I find the Alexa API is a great way of doing that. I’d be interested to know how close the internal implementation is to how I’ve gone about it if anyone has that insider knowledge.