The State Of Linux Voice Recognition

Freespeech-VR Voice Recognition

I spend a lot of time researching for articles and quite often I think about the subject matter for an article whilst walking to the train station or when out and about in general.

One evening whilst walking the 1.5 miles to the station from my work I thought "wouldn't it be good if I could record what I wanted to say and then have it transcribed automatically to a text file which I could edit and format later on".

I have spent many long hours looking at the different options available for voice recognition and dictation including recording directly through a microphone using dictation software in Linux, recording the file to MP3 or WAV format and converting it via the command line, as well as using Chrome and Android applications.

This article highlights my findings after days of hard labor. 

Linux Options

Trying to find dictation and voice recognition software in Linux isn't as easy as it could be and the options available aren't that clever.

This Wikipedia page has a list of potential options including CMU Sphinx, Julius and Simon.

I am using SparkyLinux at the moment, which is based on Debian Testing, and I can tell you that the only voice recognition package available in the repositories is Sphinx.

The native Linux programs I ended up trying were PocketSphinx, which I used to convert WAV files to text, and Freespeech-VR, a Python application that lets you record straight from a microphone.

I also tried a couple of Chrome apps including VoiceNote II and Dictanote.

Finally, I tried the "Dictation and Mail" and "Talk And Talk Dictation" Android apps.


Freespeech-VR

Freespeech-VR isn't available in the standard repositories. I downloaded the files from here.

After downloading and extracting the contents of the zip file, I opened a terminal and navigated to the folder where the files had been extracted. I typed the following command to open freespeech-vr.

sudo python freespeech-vr

I have a pair of headphones with a fairly decent microphone and a fairly clear southern English accent.

The following text appeared in the freespeech-vr window:

Welcome to the unit dogs of outcome Today Have ensuring How to Managed Tests An have to test When To text Uses a the system way Speech I the To one each was Only In a To hope of staying And The to Means of One chickens golden as system The Ea when it my name the next ofch calls phone This file Soon enough a cases phone to Hands- Space the sphinx Going That isn't a phones will be shared A trained and and tools Use speaking When you finished Say A used file Last a story A And using a by the When it is very how success This Linux was as Do you avoid is

I would just like to say now that this is not the Unit Of Dogs website and at no point did I mention anything to do with Golden chickens. I was actually trying to describe the process of using voice recognition software.

I tried the software a few times including varying pitch and speed but the accuracy was poor.


PocketSphinx

PocketSphinx is able to take a WAV file and convert it to text using the command line. PocketSphinx is available via the Debian repositories and should be available for most distributions.

The main issue I found with PocketSphinx is that you virtually need a degree in voice recognition to understand the concepts involved: language models, dictionaries and how to train the system.

After installing PocketSphinx you should go to the CMU Sphinx website and read as much information as possible. You also need to download the following model file

  • US English Generic Language Model

(If you are not a native English speaker choose the language model that is appropriate for you).

The documentation for PocketSphinx and Sphinx in general is difficult for the layperson to understand, but from what I could make out, dictionary files list the words the recognizer knows together with their pronunciations, and language models describe which sequences of words are likely to occur.
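As a rough illustration of what a dictionary file contains, here is a minimal Python sketch that reads entries in the plain-text layout used by CMU-style pronunciation dictionaries: a word followed by its phonemes, with alternative pronunciations marked as word(2), word(3) and so on. The sample entries and the function name are made up for this example and are not copied from the dictionary file used later in this article.

```python
import re
from collections import defaultdict

# Illustrative entries in the style of a CMU pronunciation dictionary.
# These lines are assumptions for this sketch, not taken from cmu07a.dic.
SAMPLE_DICT = """\
hello HH AH L OW
hello(2) HH EH L OW
linux L IH N AH K S
"""


def parse_dictionary(text):
    """Map each word to a list of pronunciations (each a list of phonemes)."""
    entries = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and ";;;" comment lines.
        if not line or line.startswith(";;;"):
            continue
        word, *phonemes = line.split()
        # Strip the "(2)", "(3)" suffix that marks an alternative pronunciation.
        word = re.sub(r"\(\d+\)$", "", word)
        entries[word].append(phonemes)
    return dict(entries)


pronunciations = parse_dictionary(SAMPLE_DICT)
print(pronunciations["hello"])
```

A word that is missing from the dictionary has no phoneme sequence to match against, which is one reason unusual names and jargon tend to come out as something else entirely.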

To test PocketSphinx I used a recording of my own voice, a snippet of Al Pacino in "The Devil's Advocate" and a snippet of Morgan Freeman. The point of this was to try different voices, and for me there is nobody who can tell a story as clearly as Morgan Freeman and nobody who delivers a line like Al Pacino.

For PocketSphinx to work it needs a WAV file in a certain format. If the file is in MP3 format, use the ffmpeg command to convert it to WAV (the options below produce 16-bit PCM audio sampled at 16 kHz):

ffmpeg -i inputfilename.mp3 -acodec pcm_s16le -ar 16000 outputfilename.wav
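The format of a converted file can be checked before running the recognizer. The following is a minimal Python sketch, using only the standard library wave module, that reports the sample rate, bit depth and channel count of a WAV file. The function names are made up for this illustration, and the check for a single (mono) channel is an assumption: the ffmpeg command above only sets the sample format and the sample rate.

```python
import wave


def describe_wav(path):
    """Return the sample rate, bit depth and channel count of a WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "bits": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
        }


def looks_ready(info, expect_mono=True):
    """Check for 16 kHz, 16-bit audio (the values set by the ffmpeg command).

    expect_mono is an assumption of this sketch: the ffmpeg command in the
    article does not set the channel count.
    """
    ok = info["sample_rate"] == 16000 and info["bits"] == 16
    if expect_mono:
        ok = ok and info["channels"] == 1
    return ok
```

Calling looks_ready(describe_wav("outputfilename.wav")) after the conversion gives a quick yes or no before spending time on a transcription run.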

To run PocketSphinx use the following command:

pocketsphinx_continuous -dict /usr/share/pocketsphinx/model/lm/en_US/cmu07a.dic -infile voice2.wav -lm cmusphinx-5.0-en-us.lm 2>voice2.log

pocketsphinx_continuous takes a WAV file and converts it to text.

In the command above pocketsphinx_continuous is told to use a dictionary file called "/usr/share/pocketsphinx/model/lm/en_US/cmu07a.dic" with the language model "cmusphinx-5.0-en-us.lm". The file being converted to text is called voice2.wav (which is a recording I made with my voice). Finally, the 2> redirects all the verbose diagnostic output that you don't necessarily need into a file called voice2.log. The actual results of the test are displayed within the terminal window.
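If you have several recordings to get through, the same call can be wrapped in a small script. This is only a sketch under a few assumptions: PocketSphinx is installed and on the path, the file names are the ones used above, and the helper names build_command and transcribe are invented for this example. It uses the same three options as the command above and nothing else.

```python
import subprocess


def build_command(wav_file, dictionary, language_model):
    """Assemble the same pocketsphinx_continuous call used in the article."""
    return [
        "pocketsphinx_continuous",
        "-dict", dictionary,
        "-infile", wav_file,
        "-lm", language_model,
    ]


def transcribe(wav_file, dictionary, language_model, log_file):
    """Run the recognizer: stdout is returned, stderr is written to log_file."""
    command = build_command(wav_file, dictionary, language_model)
    with open(log_file, "w") as log:
        result = subprocess.run(
            command, stdout=subprocess.PIPE, stderr=log, text=True, check=True
        )
    return result.stdout


# Example call (needs PocketSphinx and the files named in the article):
# print(transcribe(
#     "voice2.wav",
#     "/usr/share/pocketsphinx/model/lm/en_US/cmu07a.dic",
#     "cmusphinx-5.0-en-us.lm",
#     "voice2.log",
# ))
```

Sending stderr to the log file plays the same role as the 2> in the shell command, so only the recognized text comes back from transcribe.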

The results using my voice are as follows:

welcome to the next about well no this week subject about which recognition software in a minute

The results aren't as horrendous as with freespeech-vr but are still not really usable. I then tried using PocketSphinx with Al Pacino but this returned no results at all.

Finally I tried using Morgan Freeman's voice from the movie "Bruce Almighty" and here are the results:

000000000: we'll on her
000000001: are all that tough yeah the day that right now yeah this is the most we've been alive i'm part by the hot
000000002: in the elevator who's the key out of a bit of baseball o'clock or know what to do to in lives
000000003: what are the ones that will recover
000000004: they didn't write it
000000005: they have on me right out
000000006: you must be rules
000000007: i've been expecting you
000000008: and he learned here that was an illustration is was the killer christmas party
000000009: it turns out one of the way to write o. ass i thought few always wear one
000000010: like the problem united will not give he the good i'm the estimated them at that moment when we did not all that you think i'm in the world will homes and i have seen that
000000011: a father who has it
000000012: what a lot about this
000000013: does that given
000000014: everything you those that don't fall for a lot
000000015: right in the fall
000000016: well hold on just for me
000000017: it a unhappy if i think too that they're going to have an that the that will all of that married on a was no we do i like the unlike the way
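Each line of output above starts with an utterance number followed by a colon. If you want the text on its own for editing, a short Python sketch along these lines can strip the numbers and join the utterances into one paragraph. It assumes the output always follows the "number: text" pattern shown above, and the function names are made up for this example.

```python
import re

# A run of digits, a colon and a space, then the recognized text.
UTTERANCE = re.compile(r"^(\d+): (.*)$")


def parse_utterances(output):
    """Return (utterance_number, text) pairs from lines like '000000007: ...'."""
    pairs = []
    for line in output.splitlines():
        match = UTTERANCE.match(line.strip())
        if match:
            pairs.append((int(match.group(1)), match.group(2)))
    return pairs


def as_paragraph(output):
    """Join the text of every utterance into a single line."""
    return " ".join(text for _, text in parse_utterances(output))


sample = """000000006: you must be rules
000000007: i've been expecting you"""
print(as_paragraph(sample))
```

Lines that do not match the pattern are simply ignored, so any stray diagnostic messages in the output are dropped.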

My test can hardly be considered scientific and the developers of PocketSphinx may state that I am not using the software correctly. There is also a technique called voice training which can be used to create better dictionaries and language files.
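One way to make a comparison like this a little less subjective is to count how many words the software got wrong. The sketch below computes a word error rate: the number of word substitutions, deletions and insertions needed to turn what was said into what was transcribed, divided by the number of words actually spoken. This is not something I measured for this article, and the two sentences in the example are made up purely to show the calculation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words seen so
    # far (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up sentences: one wrong word out of four.
print(word_error_rate("the quick brown fox", "the quick brown box"))
```

A rate of 0 means a perfect transcription, and the figure can go above 1 when the software invents more words than were spoken.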

My overriding opinion though is that it is just too difficult for standard everyday use.

VoiceNote II

VoiceNote II is a Chrome App which uses the Google Voice recognition API. 

If you are using the Chrome or Chromium browsers you can install VoiceNote II via the Web Store.

The icons in VoiceNote II are laid out in a strange fashion: the language selector and the edit button are at the bottom of the window, but the record button is in the top right corner.

The first thing you need to do is select a language and this can be achieved by clicking on the world icon. 

To begin recording, click on the microphone icon and start speaking into your microphone. For the best results I found speaking slowly was key so that the software would have a chance to keep up.

The results weren't great as can be seen below:

Hello and welcome to connect. todays articles about voice to text conversion dunelm farrell recession 2008 as conversions and it said well supported the best way i found voice text addon to show 2014debian or rpm package open it voice type to speech to text open it if you want to choose vs chose in edinburgh french german get you the time in united kingdomstart at sea microphonewhat you finished writing your text as a text file to itsuccess well that's very standard english accent from south of england best for it but i'm going to the textvia this torrentalong with the actual document and you can see for the mistakes that makethank you for listeningfriends 


Dictanote

Dictanote is another Chrome App which can be used for dictation. It came across as being more intuitive than VoiceNote II.

I only used the demo version of Dictanote, which prevents you from creating new documents but lets you talk over text that is already in the editor. I was able to test the voice recognition, but the results were no better than VoiceNote II and so I didn't sign up for the pro version.

Dictation And Mail

"Dictation And Mail" is an Android Application which uses the native Google voice recognition API.

The results from "Dictation and Mail" were much better than those from any of the other programs I had tried up to this point.

hello welcome to Linux lifewire., today we talking about converting sound to text

The trick with "Dictation and Mail" is to speak slowly and enunciate as well as you can with an even accent.

After you have finished talking you can email the results to yourself.

Talk And Talk Dictation

The other Android Application that I tried was "Talk And Talk Dictation".

The interface for this app was the best of the bunch and the voice recognition worked very well indeed. After recording the dictation I was able to share the results in various ways including via email.

welcome to linux today we're talking about converting speech to text

As you can see the text above is about as clear as you can possibly expect to get. Talking slowly is the key.


Summary

Native Linux has some way to go with regard to voice recognition and specifically dictation. There are some applications that use the Google Voice API but they are not yet listed in repositories.

The Chrome applications were a little bit better, but by far the best results were achieved using my Android phone. Maybe the phone has a better microphone and therefore the voice recognition software stands a better chance of an accurate conversion.

For voice recognition to become really usable it needs to be more intuitive with less setup required. You shouldn't need to mess around with language models and dictionaries in order to make it intelligible.

I appreciate, however, that voice recognition is very challenging because everybody has a different voice and there are so many dialects from region to region within one country, never mind the hundreds of languages used throughout the world.

My analysis, therefore, is that voice recognition software is still a work in progress.