# See Also

- [[Text to Speech]]
- [[Personal Assistant]]

# Subfolders

```dataview
LIST
FROM #foldernote
WHERE contains(file.folder, this.file.folder) AND file != this.file
SORT file.name ASC
```

# Notes in this Folder

```dataview
LIST
FROM -#foldernote
WHERE file.folder = this.file.folder AND database-plugin != "basic"
SORT file.name ASC
```

# Resources

## Libraries and Applications

- [aprilasr](https://github.com/abb128/april-asr) is a minimal library that provides an API for offline streaming speech-to-text applications. Written in [[3. Reference/Software/Programming Languages/C|C]].
    - Used by: https://github.com/abb128/LiveCaptions
- [Pocketsphinx](https://github.com/cmusphinx/pocketsphinx) is one of Carnegie Mellon University's open-source, large-vocabulary, speaker-independent continuous speech recognition engines.
- [Spchcat](https://github.com/petewarden/spchcat) is a speech recognition tool that converts audio to text transcripts on Linux and Raspberry Pi. Written in [[3. Reference/Software/Programming Languages/C|C]] using [[Coqui]]'s STT library.
- [DeepSpeech](https://github.com/mozilla/DeepSpeech) is an open-source embedded (offline, on-device) speech-to-text engine that can run in real time on devices ranging from a Raspberry Pi 4 to high-power GPU servers.
- [Faster Whisper](https://github.com/SYSTRAN/faster-whisper) is a reimplementation of OpenAI's Whisper that uses less memory and processes audio around 5 times faster. Written in [[Python]] using [CTranslate2](https://github.com/OpenNMT/CTranslate2/).
- Flashlight's [Automatic Speech Recognition](https://github.com/flashlight/flashlight/tree/main/flashlight/app/asr) app provides training and inference capabilities for end-to-end speech recognition systems. Written in [[C++]].
- [Julius](https://github.com/julius-speech/julius) is a high-performance, small-footprint, large-vocabulary continuous speech recognition (LVCSR) decoder for speech researchers and developers. Written in [[3. Reference/Software/Programming Languages/C|C]].
- [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech) is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of speech and audio tasks, with state-of-the-art and influential models. Written in [[C++]] and [[Python]].
- [Vosk](https://alphacephei.com/vosk/) is a speech recognition toolkit. Written in [[Python]]. Comes up a lot, but doesn't seem to have great accuracy.
- [_Athena_](https://github.com/athena-team/athena) is an open-source implementation of an end-to-end speech processing engine. Written in [[C++]]. Appears to be partly based on Kaldi.

## Frontends

- [voice2json](https://github.com/synesthesiam/voice2json) is a *frontend* for several other speech recognition systems that combines them with "intent" recognition. Written in [[Python]].
    - https://hackaday.com/2021/09/25/making-linux-offline-voice-recognition-easier/
- [Speech Note](https://github.com/mkiol/dsnote) lets you take, read, and translate notes in multiple languages, using speech-to-text, text-to-speech, and machine translation. Text and voice processing take place entirely offline, locally on your computer, without a network connection. It is a GUI *frontend* for several speech recognition systems. Written in [[C++]].
- [Nerd Dictation](https://github.com/ideasman42/nerd-dictation) is a simple command-line *frontend* for Vosk that can simulate keystrokes. Written in [[Python]].
- [[Mycroft AI]] also supports speech recognition, using a [variety](https://mycroft.ai/initiatives/) of backends.
    - [Precise](https://github.com/MycroftAI/mycroft-precise) is a wake word listener. Written in [[Python]].
    - [Adapt](https://github.com/MycroftAI/adapt) is an intent parser. Written in [[Python]].
    - [Padatious](https://github.com/MycroftAI/padatious) is an intent parser. Written in [[Python]].
- [ESPnet](https://github.com/espnet/espnet) is an end-to-end speech processing toolkit. It can use Kaldi or potentially other systems, and it doesn't just transcribe speech: it can also generate speech from text. Written in [[Python]].
- [Dragonfly](https://dragonfly2.readthedocs.io/en/latest/index.html) is a *frontend* geared towards commands and programming.
- [Webspeech](https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API) is a browser API which may call out to a backend service. Supports speech recognition ([[Chromium]]-based browsers only) as well as [[Speech Synthesis]].

## Discontinued / Abandoned

- [OpenSeq2Seq](https://github.com/NVIDIA/OpenSeq2Seq)'s main goal is to allow researchers to effectively explore various sequence-to-sequence models. (Discontinued NVIDIA project.) Written in [[Python]].

## Lists

- https://unix.stackexchange.com/questions/256138/is-there-any-decent-speech-recognition-software-for-linux
- https://fosspost.org/open-source-speech-recognition/

# References

- https://medium.com/@nick.nagari/comparing-4-popular-open-source-speech-to-text-neural-network-models-92676a9f9265
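Several of the frontends above (voice2json, Adapt, Padatious) pair transcription with *intent* recognition: mapping a transcript like "turn on the kitchen light" to a structured action plus slot values. As a rough illustration of the idea only — a toy keyword matcher with made-up intents, not how any of those projects actually work:

```python
import re

# Toy intent table: intent name -> regex with named slot groups.
# These intents and phrasings are invented for illustration.
INTENTS = {
    "SetLight": re.compile(r"turn (?P<state>on|off) the (?P<room>\w+) light"),
    "GetTime": re.compile(r"what time is it"),
}

def recognize_intent(transcript: str):
    """Return (intent_name, slots) for the first matching intent, else None."""
    text = transcript.lower().strip()
    for name, pattern in INTENTS.items():
        match = pattern.search(text)
        if match:
            return name, match.groupdict()
    return None

print(recognize_intent("Turn on the kitchen light"))
# ('SetLight', {'state': 'on', 'room': 'kitchen'})
```

Real intent parsers are more forgiving — Padatious, for instance, trains a model from example sentences rather than matching fixed patterns — but the output shape (intent plus extracted slots) is the same idea.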
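Libraries such as aprilasr and Vosk are *streaming* engines: audio is fed in chunks and partial transcripts are available before the utterance ends, which is what makes live captioning and dictation responsive. A toy sketch of that call pattern only — an invented class operating on text chunks instead of audio, not any library's real API:

```python
class ToyStreamingRecognizer:
    """Mimics the feed-chunks / read-partials shape of a streaming STT API.

    This toy "recognizes" whitespace-separated words from text chunks; it
    illustrates the call pattern, not real acoustic decoding.
    """

    def __init__(self):
        self.words = []

    def accept_chunk(self, chunk: str) -> None:
        # A real engine would accept a buffer of PCM audio samples here.
        self.words.extend(chunk.split())

    def partial_result(self) -> str:
        # Best transcript so far, available mid-utterance.
        return " ".join(self.words)

    def final_result(self) -> str:
        # Called when the stream ends; a real engine would finalize decoding
        # and may revise earlier partial results.
        return self.partial_result()

rec = ToyStreamingRecognizer()
for chunk in ["hello", "world this", "is streaming"]:
    rec.accept_chunk(chunk)
    print(rec.partial_result())
print(rec.final_result())  # hello world this is streaming
```

The point of the shape is latency: consumers can act on `partial_result()` while audio is still arriving, instead of waiting for a whole file to be processed.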