Mozilla's open-source speech recognition

Harnum, Alan aharnum at
Fri Dec 1 12:28:44 UTC 2017

Hey Tony, that’s good stuff to know – I simultaneously have a lot of interest in the potential of widespread voice recognition in consumer devices and browsers for building new forms of AT, and a lot of concerns about the proprietary nature (and online storage of individuals’ voice samples) of what makes the magic happen in the new generation of devices like the Google Home and the Amazon Echo (and smartphones).

What I’m hopeful for long-term with the Mozilla work is that a viable open source speech recognition engine will unlock the technology’s full potential for good, rather than leaving it locked into the platforms of a few giant vendors.

Shorter term, I hope it means we get good voice recognition soon in-browser in Firefox – it’s already quite good in Chrome and we’ve done some interesting spike experiments with it.

From: Tony Atkins <tony at>
Date: Friday, December 1, 2017 at 5:20 AM
To: "Harnum, Alan" <aharnum at>
Cc: Fluid Work <fluid-work at>
Subject: Re: Mozilla's open-source speech recognition

Hi, Alan:

I studied voice recognition quite a bit in the 90s as part of my master's, and wanted to comment with my impressions.  In those days, the tradeoff was between being able to recognise a very limited vocabulary for a wide range of speakers, or training a computer to recognise a wide vocabulary from a single speaker.  With their massive body of training data (and with newer and faster computers), solutions like this are now really good at a lot of limited vocabularies, and pretty good at open-ended recognition, and all without training for the individual speaker.

In the late 90s there were systems that used speaker-independent recognition to do things like check flight times, relying on their ability to understand one or two specific sets of words (dates and times, places you might fly to/from).  These days, there are still clear limits, but systems can recognise which of dozens of contexts you're talking about, and then understand a much deeper vocabulary within each context.

For open-ended speech, they claim around 95% accuracy, which is actually really good for speaker-independent recognition.  However, as a starting point for things like automatically adding subtitles, 95% is still noticeably and sometimes laughably off.  The good news is that with tools like YouTube's subtitles editor, human reviewers can focus on transcription errors and the timing of the subtitles, instead of also typing in the 95% the speech-to-text engine captures successfully.  And even that 95% is usually better than nothing.
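
To put that figure in perspective, here is a minimal Python sketch of how such a number might be measured and what it implies for a reviewer. It assumes the 95% refers to word-level accuracy (one minus the word error rate), which the announcement may or may not mean. The function names and the 150 words-per-minute speaking rate are illustrative assumptions, not anything taken from Mozilla's tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def expected_errors(word_count: int, accuracy: float) -> float:
    """Rough number of wrong words a reviewer would need to fix."""
    return word_count * (1.0 - accuracy)


# One wrong word out of five gives a word error rate of 0.2.
wer = word_error_rate("the flight leaves at nine", "the fight leaves at nine")

# Assumed: a ten-minute talk at about 150 words per minute (1500 words).
to_fix = expected_errors(10 * 150, 0.95)
```

Under those assumptions, a ten-minute talk would still leave a reviewer with roughly 75 words to correct, which is a lot less typing than all 1500 but still plenty of room for the occasional laughable subtitle.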

I also love that they provide not only the specific tool, but also the dataset they used to train it.  The same data can be used to find better answers to this problem, but can also be used in unexpected ways, for example, identifying and gaming an engine for specific accents.

Anyway, thanks for sharing this.



On 30 November 2017 at 15:03, Harnum, Alan <aharnum at> wrote:
Interesting news on this front:

Node is one of the initial supported bindings:

