soundswallower JavaScript package
SoundSwallower can be called either from Python (see the soundswallower package for API details) or from JavaScript. For JavaScript, we use Emscripten to compile the C library to WebAssembly, which is loaded by a JavaScript wrapper module. This means there are certain idiosyncrasies that must be taken into account when using the library, mostly with respect to deployment and initialization.
Using SoundSwallower on the Web
Since version 0.3.0, SoundSwallower’s JavaScript API can be used directly from a web page without any need to wrap it in a Web Worker. You may still wish to do so if you are processing large blocks of data or running on a slower machine. Doing so is currently outside the scope of this document.
Initialization of Decoder() is separate from configuration and asynchronous. This means that you must either call it from within an asynchronous function using await, or use the Promise it returns in the usual manner. If this means nothing to you, please consult https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Asynchronous.
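The two calling styles can be sketched as follows, using a stub in place of ssjs.Decoder so the pattern runs stand-alone (the stub is purely illustrative, not the library's actual class):

```javascript
// Stub standing in for ssjs.Decoder, so the two calling styles below
// can run without the actual library.
class StubDecoder {
    initialize() {
        // The real initialize() loads model files asynchronously.
        return Promise.resolve(this);
    }
}

// Style 1: await inside an async function.
async function initWithAwait() {
    const decoder = new StubDecoder();
    await decoder.initialize();
    return decoder;
}

// Style 2: use the returned Promise directly.
function initWithThen() {
    const decoder = new StubDecoder();
    return decoder.initialize().then(() => decoder);
}
```

Either style works; the important point is that the decoder must not be used before the Promise resolves.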
By default, a narrow-bandwidth English acoustic model is loaded and made available. If you want to use a different one, just put it where your web server can find it, then pass the relative URL to the directory containing the model files using the hmm configuration parameter and the URL of the dictionary using the dict parameter.
Here is an example, presuming that you have downloaded and unpacked the Brazilian Portuguese model and dictionary from https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Portuguese/ and placed them under /model in your web server root:
// Avoid loading the default model
const ssjs = { defaultModel: null };
await require('soundswallower')(ssjs);
const decoder = new ssjs.Decoder({hmm: "/model/cmusphinx-pt-br-5.2",
dict: "/model/br-pt.dic"});
await decoder.initialize();
For the moment, to use SoundSwallower with Webpack, various incantations are required in your webpack.config.js. Sorry, I don't make the rules:
const CopyPlugin = require("copy-webpack-plugin");
// Then... in your `module_exports` or `config` or whatever:
plugins: [
    // Just copy the damn WASM because webpack can't recognize
    // Emscripten modules.
    new CopyPlugin({
        patterns: [
            { from: "node_modules/soundswallower/soundswallower.wasm*",
              to: "[name][ext]"},
            // And copy the model files too. (add any excludes you like)
            { from: modelDir,
              to: "model"},
        ],
    }),
],
// Eliminate webpack's node polyfills
resolve: {
    fallback: {
        crypto: false,
        fs: false,
        path: false,
    },
},
node: {
    global: false,
    __filename: false,
    __dirname: false,
},
For a more elaborate example, see [the soundswallower-demo code](https://github.com/dhdaines/soundswallower-demo).
Using SoundSwallower under Node.js
Using SoundSwallower-JS in Node.js is mostly straightforward. Here is a fairly minimal example. First you can record yourself saying some digits (note that we record in 32-bit floating-point at 44.1kHz, which is the default format for WebAudio and thus the default in SoundSwallower-JS as well):
sox -c 1 -r 44100 -b 32 -e floating-point -d digits.raw
Now run this with node:
(async () => { // Wrap everything in an async function call
// Load the library and pre-load the default model
const ssjs = await require("soundswallower")();
const decoder = new ssjs.Decoder();
// Initialization is asynchronous
await decoder.initialize();
decoder.set_grammar(`#JSGF V1.0;
grammar digits;
public <digits> = <digit>*;
<digit> = one | two | three | four | five | six | seven | eight
        | nine | ten | eleven;`); // It goes to eleven
// Default input is 44.1kHz, 32-bit floating-point PCM
const fs = require("fs/promises");
let pcm = await fs.readFile("digits.raw");
// Start speech processing
decoder.start();
// Takes a typed array, as returned by readFile
decoder.process_audio(pcm);
// Finalize speech processing
decoder.stop();
// Get recognized text (NOTE: synchronous method)
console.log(decoder.get_text());
// We must manually release memory...
decoder.delete();
})();
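One subtlety in the example above: fs.readFile returns a Buffer of raw bytes. If you need an explicit Float32Array view of those samples (for instance to inspect or scale them), the conversion can be sketched as below, assuming the file contains little-endian float32 data (the sample values are made up for illustration):

```javascript
// Construct a Buffer holding three little-endian float32 samples,
// standing in for the result of fs.readFile("digits.raw").
const buf = Buffer.from(new Float32Array([0.0, 0.5, -0.5]).buffer);

// View the same bytes as a Float32Array without copying. byteOffset
// matters because a Buffer may be a view into a larger pooled
// ArrayBuffer.
const pcm = new Float32Array(
    buf.buffer, buf.byteOffset,
    buf.byteLength / Float32Array.BYTES_PER_ELEMENT);
console.log(pcm.length); // 3
```

Using buf.byteOffset rather than assuming an offset of zero is the important detail; Node pools small Buffers inside shared ArrayBuffers.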
Decoder class
- class Decoder(config)
Speech recognizer object.
Create the decoder.
You may optionally call this with an Object containing configuration keys and values.
- Arguments
config (Object) – Configuration parameters.
- Decoder.add_words(...words)
Add words to the pronunciation dictionary.
Example:
decoder.add_words(["hello", "H EH L OW"], ["world", "W ER L D"]);
- Arguments
words (...Array) – Any number of 2-element arrays containing the word text in position 0 and a string of whitespace-separated phones in position 1.
- Decoder.assert_initialized()
Throw an error if decoder is not initialized.
- Throws
Error – If decoder is not initialized.
- Decoder.delete()
Free resources used by the decoder.
- Decoder.get_alignment(config)
Get the current recognition result as a word (and possibly phone) segmentation.
- Arguments
config (Object) – Configuration parameters.
config.start (number) – Start time to add to returned segment times.
config.align_level (number) – 0 for word alignments only, 1 for word and phone alignments, 2 for word, phone, and state alignments.
- Returns
Array.<Segment> – Array of segments for the words recognized, each with the keys t, b, and d, for text, start time, and duration, respectively.
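Consuming the returned segmentation can be sketched as follows (the segment values below are made-up sample data in the documented shape, not real recognizer output):

```javascript
// Made-up segments in the shape documented above: t = text, b = start
// time in seconds, d = duration in seconds.
const segments = [
    { t: "one", b: 0.0, d: 0.35 },
    { t: "two", b: 0.35, d: 0.4 },
];
for (const seg of segments)
    console.log(`${seg.t}: starts ${seg.b}s, lasts ${seg.d}s`);
```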
- Decoder.get_config(key)
Get a configuration parameter’s value.
- Arguments
key (string) – Parameter name.
- Throws
ReferenceError – If key is not a known parameter.
- Returns
number|string – Parameter value.
- Decoder.get_config_json()
Get configuration as JSON.
- Decoder.get_text()
Get the currently recognized text.
- Returns
string – Currently recognized text.
- Decoder.has_config(key)
Test if a key is a known parameter.
- Arguments
key (string) – Key whose existence to check.
- Decoder.init_acmod()
Create acoustic model from configuration.
- Decoder.init_cleanup()
Clean up any lingering search modules.
- Decoder.init_dict()
Load dictionary from configuration.
- Decoder.init_fe()
Create front-end from configuration.
- Decoder.init_feat()
Create dynamic feature module from configuration.
- Decoder.init_featparams()
Read feature parameters from acoustic model.
- Decoder.init_grammar()
Load grammar from configuration.
- Decoder.initialize()
Initialize or reinitialize the decoder asynchronously.
- Returns
Promise – Promise resolved once decoder is ready.
- Decoder.load_acmod_files()
Load acoustic model files.
- Decoder.load_gmm(means_path, variances_path, sendump_path, mixw_path)
Load Gaussian mixture models.
- Decoder.load_mdef()
Load binary model definition file.
- Decoder.load_tmat(tmat_path)
Load transition matrices.
- Decoder.lookup_word(word)
Look up a word in the pronunciation dictionary.
- Arguments
word (string) – Text of word to look up.
- Returns
string – Space-separated list of phones, or null if word is not in the dictionary.
- Decoder.process_audio(pcm)
Process a block of audio data.
- Arguments
pcm (Float32Array) – Audio data, in float32 format, in the range [-1.0, 1.0].
- Returns
number – Number of frames processed.
- Decoder.reinitialize_audio()
Re-initialize only the audio feature extraction.
- Returns
Promise – Promise resolved once reinitialized.
- Decoder.set_align_text(text)
Set word sequence for alignment.
- Arguments
text (string) – Sentence to align, as whitespace-separated words. All words must be present in the dictionary.
- Decoder.set_config(key, val)
Set a configuration parameter.
- Arguments
key (string) – Parameter name.
val (number|string) – Parameter value.
- Throws
ReferenceError – If key is not a known parameter.
- Decoder.set_grammar(jsgf_string, toprule="null")
Set recognition grammar from JSGF.
- Arguments
jsgf_string (string) – String containing JSGF grammar.
toprule (string) – Name of starting rule for grammar; if not specified, the first public rule will be used.
- Decoder.spectrogram(pcm)
Compute a spectrogram from audio.
- Arguments
pcm (Float32Array) – Audio data, in float32 format, in the range [-1.0, 1.0].
- Returns
Promise.<FeatureBuffer> – Promise resolved to an object containing data, nfr, and nfeat properties.
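The data property is a flat array; viewing it frame by frame can be sketched as below, assuming it holds nfr * nfeat values in frame-major order (the sample values are made up, not real spectrogram output):

```javascript
// Stand-in for a resolved FeatureBuffer: 2 frames of 3 features each.
const fb = { data: new Float32Array([1, 2, 3, 4, 5, 6]), nfr: 2, nfeat: 3 };

// Slice out each frame as a zero-copy subarray view.
const frames = [];
for (let i = 0; i < fb.nfr; i++)
    frames.push(fb.data.subarray(i * fb.nfeat, (i + 1) * fb.nfeat));
console.log(frames[1][0]); // 4
```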
- Decoder.start()
Start processing input.
- Decoder.stop()
Finish processing input.
- Decoder.unset_config(key)
Reset a configuration parameter to its default value.
- Arguments
key (string) – Parameter name.
- Throws
ReferenceError – If key is not a known parameter.
Endpointer class
- class Endpointer(config)
Simple endpointer using voice activity detection.
Create the endpointer.
- Arguments
config (Object) – Configuration parameters.
config.samprate (number) – Sampling rate of the input audio.
config.frame_length (number) – Length in seconds of an input frame; must be 0.01, 0.02, or 0.03.
config.mode (number) – Aggressiveness of voice activity detection; must be 0, 1, 2, or 3. Higher numbers give "tighter" endpoints at the possible expense of clipping the start of utterances.
config.window (number) – Length in seconds of the window used to make a speech/non-speech decision.
config.ratio (number) – Proportion of the window that must be detected as speech (or non-speech) in order to trigger a decision.
- Throws
Error – On invalid parameters.
- Endpointer.end_stream(pcm)
Read a final frame of data and return speech if any.
This function should only be called at the end of the input stream (and then, only if you are currently in a speech region). It will return any remaining speech data detected by the endpointer.
- Arguments
pcm (Float32Array) – Audio data, in float32 format, in the range [-1.0, 1.0]. Must contain get_frame_size() samples or less.
- Returns
Float32Array – Speech data, if any, or null if none.
- Endpointer.get_frame_length()
Get the effective length of a frame in seconds (this may differ from the length requested in the constructor).
- Returns
number – Length of a frame in seconds.
- Endpointer.get_frame_size()
Get the effective length of a frame in samples.
Note that you must pass this many samples in each input frame, no more, no less.
- Returns
number – Size of required frame in samples.
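The relation between frame length and frame size can be sketched with simple arithmetic (illustrative only; in real use, call get_frame_size() rather than computing this yourself, since the effective frame length may differ from the one requested):

```javascript
// Illustrative arithmetic: samples per frame = sampling rate × frame
// length in seconds. Assumes 16 kHz audio and a 0.03 s frame.
const samprate = 16000;
const frame_length = 0.03;
const frame_size = Math.round(samprate * frame_length);
console.log(frame_size); // 480
```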
- Endpointer.get_in_speech()
Is the endpointer currently in a speech segment?
To detect transitions from non-speech to speech, check this before process(). If it was false but process() returns data, then speech has started.
Likewise, to detect transitions from speech to non-speech, call this after process(). If process() returned data but this returns false, then speech has stopped.
For example:
let prev_in_speech = ep.get_in_speech();
let frame_size = ep.get_frame_size();
// Presume `frame` is a Float32Array of frame_size or less
let speech;
if (frame.length < frame_size)
    speech = ep.end_stream(frame);
else
    speech = ep.process(frame);
if (speech !== null) {
    if (!prev_in_speech)
        console.log("Speech started at " + ep.get_speech_start());
    if (!ep.get_in_speech())
        console.log("Speech ended at " + ep.get_speech_end());
}
- Returns
Boolean – Are we currently in a speech region?
- Endpointer.get_speech_end()
Get end time of current speech region.
- Returns
number – Time in seconds.
- Endpointer.get_speech_start()
Get start time of current speech region.
- Returns
number – Time in seconds.
- Endpointer.process(pcm)
Read a frame of data and return speech if detected.
- Arguments
pcm (Float32Array) – Audio data, in float32 format, in the range [-1.0, 1.0]. Must contain get_frame_size() samples.
- Returns
Float32Array – Speech data, if any, or null if none.
Functions
- get_model_path(subpath)
Get a model or model file from the built-in model path.
The base path can be set by modifying the modelBase property of the module object, at initialization or at any other time. You can also simply override this function if you have special needs.
This function is used by Decoder to find the default model, which is equivalent to Model.modelBase + Model.defaultModel.
- Arguments
subpath (string) – Path to a model directory or parameter file, e.g. "en-us", "en-us/variances", etc.
- Returns
string – Concatenated path. Note that on the Web this is a simple string concatenation, so make sure modelBase has a trailing slash if it is a directory.
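On the Web, the behavior described above amounts to plain string concatenation, which can be sketched as follows (the helper below is an illustrative re-implementation, not the library's actual code):

```javascript
// Illustrative re-implementation of the Web behavior of
// get_model_path(): a plain concatenation of modelBase and subpath.
const modelBase = "model/"; // note the trailing slash
function get_model_path(subpath) {
    return modelBase + subpath;
}
console.log(get_model_path("en-us/variances")); // "model/en-us/variances"
```

This also shows why the trailing slash matters: without it, the result would be "modelen-us/variances".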