Speech-recognition app to convert MP3 voice to text

Question

Does anyone know of an application that can convert audio to text?

I assume it is spoken text. Which language is that text in? – Martin Ueding Jul 09 '12 at 11:33
The speech text is in simple English. – Kopano Jul 09 '12 at 14:33

score 34 · Answer 1 · edited Oct 07 '20 at 17:36

The software you can use is Vosk-api, a modern speech recognition toolkit based on neural networks. It supports 7+ languages and works on a variety of platforms, including Raspberry Pi and mobile.

First you convert the file to the required format and then you recognize it:

ffmpeg -i file.mp3 -ar 16000 -ac 1 file.wav
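
Before moving on, it can be worth checking that the conversion produced what you expect. The following is a minimal sketch using only Python's standard wave module. The helper names describe_wav and looks_like_16k_mono are made up for illustration, and the assumption that the recognizer wants 16-bit mono at 16 kHz is inferred from the ffmpeg flags above rather than taken from the Vosk documentation:

```python
import wave


def describe_wav(path):
    """Return (channels, sample_width_bytes, sample_rate_hz) of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels(), wf.getsampwidth(), wf.getframerate()


def looks_like_16k_mono(path):
    """True if the file is mono, 16-bit, 16 kHz (assumed target of -ar 16000 -ac 1)."""
    return describe_wav(path) == (1, 2, 16000)
```

Calling looks_like_16k_mono("file.wav") after the ffmpeg step should return True if the flags took effect.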

Then install vosk-api with pip:

pip3 install vosk

Then use these steps:

git clone https://github.com/alphacep/vosk-api
cd vosk-api/python/example
wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.3.zip
unzip vosk-model-small-en-us-0.3.zip
mv vosk-model-small-en-us-0.3 model
python3 ./test_simple.py test.wav > result.json

The result will be stored in json format.
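
If you only want the plain text, the file needs a little post-processing. The sketch below assumes that result.json holds several JSON objects printed one after another (so it is not a single valid JSON document), that finished utterances carry a "text" key, and that intermediate ones carry a "partial" key. That is how I understand the example script's output, but check your own file first. The function names are hypothetical:

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string holding several concatenated ones."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def final_text(text):
    """Join the non-empty "text" fields, ignoring objects that only hold partials."""
    parts = [obj["text"] for obj in iter_json_objects(text)
             if isinstance(obj, dict) and obj.get("text")]
    return " ".join(parts)
```

Usage would be something like print(final_text(open("result.json").read())).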

The same directory also contains an srt subtitle output example, which is easier to evaluate and can be directly useful to some users:

python3 -m pip install srt
python3 ./test_srt.py test.wav

The example audio given in the repository contains three sentences, spoken in a perfect American English accent and with perfect sound quality, which I transcribe as:

one zero zero zero one
nine oh two one oh
zero one eight zero three

The "nine oh two one oh" is said very fast, but is still clear. The "z" of the second-to-last "zero" sounds a bit like an "s".

The SRT generated above reads:

1
00:00:00,870 --> 00:00:02,610
what zero zero zero one

2
00:00:03,930 --> 00:00:04,950
no no to uno

3
00:00:06,240 --> 00:00:08,010
cyril one eight zero three

so we can see that several mistakes were made, presumably in part because, unlike the model, we know in advance that all the words are numbers.

Next I also tried vosk-model-en-us-aspire-0.2, which is listed at https://alphacephei.com/vosk/models and is a 1.4GB download, compared to 36MB for vosk-model-small-en-us-0.3:

mv model model.vosk-model-small-en-us-0.3
wget https://alphacephei.com/vosk/models/vosk-model-en-us-aspire-0.2.zip
unzip vosk-model-en-us-aspire-0.2.zip
mv vosk-model-en-us-aspire-0.2 model

and the result was:

1
00:00:00,840 --> 00:00:02,610
one zero zero zero one

2
00:00:04,026 --> 00:00:04,980
i know what you window

3
00:00:06,270 --> 00:00:07,980
serial one eight zero three

which got one more word correct.
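
One rough way to put a number on that comparison is a word-level edit distance between the manual transcription and each model's output. This is only an illustrative sketch and not part of Vosk. It treats the manual transcription above as ground truth and compares line by line:

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in whole words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] = distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # ref word missing from hypothesis
                current[j - 1] + 1,          # extra word in hypothesis
                previous[j - 1] + (r != h),  # match or substitution
            ))
        previous = current
    return previous[-1]


def total_errors(reference_lines, hypothesis_lines):
    """Sum the per-line word errors over paired lines."""
    return sum(word_edit_distance(r, h)
               for r, h in zip(reference_lines, hypothesis_lines))


REFERENCE = ["one zero zero zero one", "nine oh two one oh", "zero one eight zero three"]
SMALL = ["what zero zero zero one", "no no to uno", "cyril one eight zero three"]
ASPIRE = ["one zero zero zero one", "i know what you window", "serial one eight zero three"]

print(total_errors(REFERENCE, SMALL), total_errors(REFERENCE, ASPIRE))  # prints "7 6"
```

Out of 15 reference words that is 7 errors for the small model and 6 for the aspire model, which matches the "one more word correct" observation.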

Tested on vosk-api 7af3e9a334fbb9557f2a41b97ba77b9745e120b3.

edited Oct 07 '20 at 17:36 by Ciro Santilli OurBigBook.com
answered Feb 20 '14 at 20:24 by Nikolay Shmyrev

also, as an addition to this answer, there's a cool demo of both speech recognition and voice command tools here: https://www.youtube.com/watch?v=gr1FZ2F7KYA&list=PLvB_ffZs45ufhOnw1epfjncXYRa5pfGw8&index=1 – Daithí Jan 08 '15 at 10:22
How do you add an acoustic model to the system? – jarno Feb 08 '15 at 13:38
You just download it and unpack, there is no such thing as "add to the system" – Nikolay Shmyrev Feb 08 '15 at 13:56
@NikolayShmyrev Where should I unpack it so that pocketsphinx_continuous finds it? – jarno Feb 08 '15 at 14:14
You can unpack it anywhere, pocketsphinx does not look in any location automatically, you have to specify model path in -hmm argument as pointed in the answer above. – Nikolay Shmyrev Feb 08 '15 at 14:33
Well, I installed packages pocketsphinx-utils, pocketsphinx-hmm-en-hub4wsj and pocketsphinx-lm-en-hub4 in universe repository of Ubuntu 14.04. Then pocketsphinx_continuous -infile file.wav -hmm en_US/hub4wsj_sc_8k -lm en_US/hub4.5000.DMP 2> pocketsphinx.log worked. Maybe they are not optimal packages, but they were best matches I could find in the repositories. – jarno Feb 08 '15 at 15:05
Please see help so that I can see notifications. – jarno Feb 08 '15 at 15:06
They are VERY unoptimal packages, it is not recommended to use them unless you want to complain about bad accuracy. It is better to install pocketsphinx from github, ubuntu version is very much outdated. – Nikolay Shmyrev Feb 08 '15 at 17:59
brew has this package but the quality is terrible. Also wondering if there's a way to insert line breaks at pauses so I can basically get paragraphs when speaker pauses. – chovy Mar 09 '17 at 07:57

score 15 · Answer 2 · edited Jan 17 '18 at 10:43

I know this is old, but to expand on Nikolay's answer and hopefully save someone some time in the future, in order to get an up-to-date version of pocketsphinx working you need to compile it from the github or sourceforge repository (not sure which is kept more up to date). Note the -j8 means run 8 separate jobs in parallel if possible; if you have more CPU cores you can increase the number.

git clone https://github.com/cmusphinx/sphinxbase.git
cd sphinxbase
./autogen.sh
./configure
make -j8
make -j8 check
sudo make install
cd ..
git clone https://github.com/cmusphinx/pocketsphinx.git
cd pocketsphinx
./autogen.sh
./configure
make -j8
make -j8 check
sudo make install
cd ..

Then, from: https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English/ download the newest versions of cmusphinx-en-us-....tar.gz and en-70k-....lm.gz

tar -xzf cmusphinx-en-us-....tar.gz
gunzip en-70k-....lm.gz

Then you can finally proceed with the steps from Nikolay's answer:

ffmpeg -i book.mp3 -ar 16000 -ac 1 book.wav
pocketsphinx_continuous -infile book.wav \
    -hmm cmusphinx-en-us-8khz-5.2 -lm en-70k-0.2.lm \
    2>pocketsphinx.log >book.txt

Sphinx works alright. I wouldn't rely on it to make a readable version of the text, but it's good enough that you can search it if you're looking for a particular quote. That works especially well if you use a search tool like Recoll (http://www.lesbonscomptes.com/recoll/), which is built on Xapian, accepts wildcards and doesn't require exact search expressions.
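
To illustrate why inexact matching matters with a noisy transcript, here is a small sketch using Python's standard difflib. It is not how Recoll or Xapian work internally. It just slides a window of words over the transcript and keeps the window most similar to the quote; the function name is made up for this example:

```python
from difflib import SequenceMatcher


def best_fuzzy_match(transcript, query):
    """Return (start_word_index, matched_text, similarity) of the closest window."""
    words = transcript.split()
    size = len(query.split())
    best = (0, "", 0.0)
    for start in range(max(1, len(words) - size + 1)):
        window = " ".join(words[start:start + size])
        score = SequenceMatcher(None, window.lower(), query.lower()).ratio()
        if score > best[2]:  # keep the first window with the highest score
            best = (start, window, score)
    return best
```

For example, searching a transcript that contains "cyril one eight zero three" for the quote "serial one eight zero" still lands on the "cyril one eight zero" window, even though the first word was misrecognized.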

Hope this helps.

Everything works like a charm, but in my case I had to run the following commands to fix the error "pocketsphinx_continuous: error while loading shared libraries: libpocketsphinx.so.3: cannot open shared object file: No such file or directory": export LD_LIBRARY_PATH=/usr/local/lib and export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig – Vijay Dohare, Sep 19 '17 at 11:30
This is also recommended at https://cmusphinx.github.io/wiki/tutorialpocketsphinx/#installation-on-unix-system — andrybak, Sep 19 '19 at 21:58

score 12 · Answer 3 · edited Jul 31 '23 at 07:21

If you are looking to convert speech to text, you could try installing the julius package:

sudo apt install julius

Description:

"Julius" is a high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers.

Or another option that isn't in Ubuntu's repositories or in the Snap Store is Simon:

... is an open-source speech recognition program and replaces the mouse and keyboard.

Reference Links:

Julius:

Simon:

edited Jul 31 '23 at 07:21 by Flimm
answered Jul 09 '12 at 11:54 by CoalaWeb

Could you add to this answer an example of how to run Julius? It is phenomenally unclear from the documentation. – Mar 07 '20 at 00:17
In the future, it would be better to post one answer for the software Julius and one answer for the software Simon, so that each answer can be voted on and commented on separately. – Flimm Jul 31 '23 at 07:22

score 2 · Answer 4 · answered Jan 05 '20 at 13:02

You can use Mozilla DeepSpeech, an open-source speech-to-text tool, but you will need to train it yourself or download Mozilla's pre-trained model. For my project, the accuracy was still not sufficient because the audio files were not of good quality, so I used Transcribear instead. It is a web-based editor with speech-to-text capabilities, but you need to be online to upload recordings to the Transcribear server.

answered Jan 05 '20 at 13:02 by John

I am considering transcribing about 70 hours by one speaker (accented, clear non-native English). Can DeepSpeech be trained using the existing mp3 files? – LenB Apr 26 '20 at 19:18
In the future, it would be better to post one answer for Mozilla DeepSpeech, and one answer for Transcribear, so that each answer can be voted on and commented on separately. – Flimm Jul 31 '23 at 07:22
Mozilla DeepSpeech hasn't seen any development ever since Mozilla fired the DeepSpeech team. See this issue for more details: github.com/mozilla/DeepSpeech/issues/3693 – Flimm Aug 23 '23 at 13:24

Ciro Santilli OurBigBook.com · Answer 5 · 2025-03-19T20:49:15.607

vosk-transcriber official CLI from Vosk

I was randomly tab completing after installing Vosk today (previously mentioned at: https://askubuntu.com/a/423849/52975) when I saw that they had finally added a nice CLI wrapper. Tested on Ubuntu 23.10, you can install it with the English model as:

pipx install vosk
mkdir -p ~/var/lib/vosk
cd ~/var/lib/vosk
wget https://alphacephei.com/vosk/models/vosk-model-en-us-0.22.zip
unzip vosk-model-en-us-0.22.zip
cd -

and then use as:

wget -O think.ogg https://upload.wikimedia.org/wikipedia/commons/4/49/Think_Thomas_J_Watson_Sr.ogg
vosk-transcriber -m ~/var/lib/vosk/vosk-model-en-us-0.22 -i think.ogg -o think.srt -t srt

-i will eat pretty much anything including compressed audio files like .ogg or even video files like .ogv, presumably FFmpeg at work.
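
If you have a whole folder of recordings, a thin wrapper can call the CLI once per file. This is a sketch under a few assumptions: it uses only the -m, -i, -o and -t flags shown above, it expects vosk-transcriber to be on your PATH, and the function names and the "*.ogg" default pattern are my own choices for illustration:

```python
import subprocess
from pathlib import Path


def transcriber_command(model_dir, audio_path):
    """Build the vosk-transcriber argument list for one input file."""
    audio_path = Path(audio_path)
    return [
        "vosk-transcriber",
        "-m", str(model_dir),
        "-i", str(audio_path),
        "-o", str(audio_path.with_suffix(".srt")),  # write the SRT next to the input
        "-t", "srt",
    ]


def transcribe_all(model_dir, folder, pattern="*.ogg"):
    """Run vosk-transcriber once per matching file; stops on the first failure."""
    for audio_path in sorted(Path(folder).glob(pattern)):
        subprocess.run(transcriber_command(model_dir, audio_path), check=True)
```

Note that subprocess does not expand ~, so pass the model directory as something like Path.home() / "var/lib/vosk/vosk-model-en-us-0.22".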

Nice! Now they only need a vosk-transcriber --download-model en option and a default -m directory to finally make things fully clean, but this is already a huge quality-of-life improvement.

I played with a few examples to informally evaluate accuracy at: https://unix.stackexchange.com/questions/256138/is-there-any-decent-speech-recognition-software-for-linux/613392#613392

OpenAI Whisper

https://github.com/openai/whisper

Tested on Ubuntu 24.04, install:

sudo apt install ffmpeg
pipx install openai-whisper==20231117

Sample usage:

wget https://upload.wikimedia.org/wikipedia/commons/f/f6/Appuru.wav
time whisper Appuru.wav

Terminal output with this perfectly clean en-US demo: https://commons.wikimedia.org/wiki/File:Appuru.wav

/home/ciro/.local/pipx/venvs/openai-whisper/lib/python3.12/site-packages/whisper/transcribe.py:115: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:03.000]  The apple does not fall far from the tree.

real    0m7.516s
user    0m31.209s
sys     0m4.194s

and cwd now contains several output files such as Appuru.srt:

1
00:00:00,000 --> 00:00:03,000
The apple does not fall far from the tree.

so it worked perfectly.
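
If what you actually want is plain text rather than subtitles, the SRT file is easy to post-process. The following is a minimal sketch of an SRT reader using only the standard library. It assumes cues are separated by blank lines and that each cue has a timing line of the form shown above; parse_srt is a hypothetical helper, not something Whisper provides:

```python
import re

TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h, m, s, ms):
    """Convert the four timestamp fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, caption_text) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                g = match.groups()
                # Everything after the timing line is the caption text.
                caption = " ".join(part.strip() for part in lines[i + 1:])
                cues.append((_seconds(*g[:4]), _seconds(*g[4:]), caption))
                break
    return cues
```

Joining the third element of each tuple, for example " ".join(c[2] for c in parse_srt(open("Appuru.srt").read())), gives the transcript without the numbering and timestamps.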

Here I did a longer benchmark with a video: Automatically generate subtitles/close caption from a video using speech-to-text? and it worked amazingly!

https://unix.stackexchange.com/questions/256138/is-there-any-decent-speech-recognition-software-for-linux/718354#718354 by Franck Dernoncourt reports that to make use of a Nvidia 3090 GPU, add the following after conda activate whisperpy39:

pip install -f https://download.pytorch.org/whl/torch_stable.html
conda install pytorch==1.10.1 torchvision torchaudio cudatoolkit=11.0 -c pytorch

OpenAI Whisper-based tools

A list: https://github.com/sindresorhus/awesome-whisper#cli-tools learned from: Automatically generate subtitles/close caption from a video using speech-to-text? Perhaps some of those will make the model a bit easier to use.

Speech Note

https://github.com/mkiol/dsnote

This project is a front-end for a bunch of possible backend TTS and STT models on multiple languages. Install and launch:

flatpak install flathub net.mkiol.SpeechNote
flatpak run net.mkiol.SpeechNote

This opens a GUI. Then under:

Languages
English
Text to Speech

I can download a model.

They have both Whisper and Vosk and a few others.

Then you can either:

Click "Listen" to take voice input from the microphone
File > Import from a file to select a sound file containing the speech

and the recognized text will appear in the text box.

CLI-only usage is limited unfortunately: https://github.com/mkiol/dsnote/issues/83

Tested on Speech Note 4.7.0, Ubuntu 24.10.

Benchmarks

https://github.com/Picovoice/speech-to-text-benchmark mentions a few:

LibriSpeech. This one is also part of MLPerf v3.1.
TED-LIUM
Common Voice
