soundswallower JavaScript package
SoundSwallower can be called either from Python (see the soundswallower package for API details) or from JavaScript. For JavaScript, we use Emscripten to compile the C library to WebAssembly, which is loaded by a JavaScript wrapper module. This means there are certain idiosyncrasies that must be taken into account when using the library, mostly with respect to deployment and initialization.
Using SoundSwallower on the Web
Since version 0.3.0, SoundSwallower’s JavaScript API can be used directly from a web page without any need to wrap it in a Web Worker. You may still wish to do so if you are processing large blocks of data or running on a slower machine. Doing so is currently outside the scope of this document.
Initialization of Decoder() is separate from configuration and asynchronous. This means that you must either call it from within an asynchronous function using await, or use the Promise it returns in the usual manner. If this means nothing to you, please consult https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Asynchronous.
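The two calling styles can be sketched as follows, using a stub in place of ssjs.Decoder so the pattern runs stand-alone (the stub is purely illustrative, not the library's actual class):

```javascript
// Stub standing in for ssjs.Decoder, so the two calling styles below
// can run without the actual library.
class StubDecoder {
    initialize() {
        // The real initialize() loads model files asynchronously.
        return Promise.resolve(this);
    }
}

// Style 1: await inside an async function.
async function initWithAwait() {
    const decoder = new StubDecoder();
    await decoder.initialize();
    return decoder;
}

// Style 2: use the returned Promise directly.
function initWithThen() {
    const decoder = new StubDecoder();
    return decoder.initialize().then(() => decoder);
}
```

Either style works; the important point is that the decoder must not be used before the Promise resolves.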
By default, a narrow-bandwidth English acoustic model is loaded and made available. If you want to use a different one, just put it where your web server can find it, then pass the relative URL to the directory containing the model files using the hmm configuration parameter and the URL of the dictionary using the dict parameter.
Here is an example, presuming that you have downloaded and unpacked the Brazilian Portuguese model and dictionary from https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Portuguese/ and placed them under /model in your web server root:
// Avoid loading the default model
const ssjs = { defaultModel: null };
await require('soundswallower')(ssjs);
const decoder = new ssjs.Decoder({hmm: "/model/cmusphinx-pt-br-5.2",
dict: "/model/br-pt.dic"});
await decoder.initialize();
For the moment, to use SoundSwallower with Webpack, various incantations are required in your webpack.config.js. Sorry, I don't make the rules:
const CopyPlugin = require("copy-webpack-plugin");
// Then... in your `module_exports` or `config` or whatever:
plugins: [
    // Just copy the damn WASM because webpack can't recognize
    // Emscripten modules.
    new CopyPlugin({
        patterns: [
            { from: "node_modules/soundswallower/soundswallower.wasm*",
              to: "[name][ext]"},
            // And copy the model files too. (add any excludes you like)
            { from: modelDir,
              to: "model"},
        ],
    }),
],
// Eliminate webpack's node polyfills
resolve: {
    fallback: {
        crypto: false,
        fs: false,
        path: false,
    },
},
node: {
    global: false,
    __filename: false,
    __dirname: false,
},
For a more elaborate example, see [the soundswallower-demo code](https://github.com/dhdaines/soundswallower-demo).
Using SoundSwallower under Node.js
Using SoundSwallower-JS in Node.js is mostly straightforward. Here is a fairly minimal example. First you can record yourself saying some digits (note that we record in 32-bit floating-point at 44.1kHz, which is the default format for WebAudio and thus the default in SoundSwallower-JS as well):
sox -c 1 -r 44100 -b 32 -e floating-point -d digits.raw
Now run this with node:
(async () => { // Wrap everything in an async function call
// Load the library and pre-load the default model
const ssjs = await require("soundswallower")();
const decoder = new ssjs.Decoder();
// Initialization is asynchronous
await decoder.initialize();
decoder.set_grammar(`#JSGF V1.0;
grammar digits;
public <digits> = <digit>*;
<digit> = one | two | three | four | five | six | seven | eight
        | nine | ten | eleven;`); // It goes to eleven
// Default input is 44.1kHz, 32-bit floating-point PCM
const fs = require("fs/promises");
let pcm = await fs.readFile("digits.raw");
// Start speech processing
decoder.start();
// Takes a typed array, as returned by readFile
decoder.process_audio(pcm);
// Finalize speech processing
decoder.stop();
// Get recognized text (NOTE: synchronous method)
console.log(decoder.get_text());
// We must manually release memory...
decoder.delete();
})();
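One subtlety in the example above: fs.readFile returns a Buffer of raw bytes. If you need an explicit Float32Array view of those samples (for instance to inspect or scale them), the conversion can be sketched as below, assuming the file contains little-endian float32 data (the sample values are made up for illustration):

```javascript
// Construct a Buffer holding three little-endian float32 samples,
// standing in for the result of fs.readFile("digits.raw").
const buf = Buffer.from(new Float32Array([0.0, 0.5, -0.5]).buffer);

// View the same bytes as a Float32Array without copying. byteOffset
// matters because a Buffer may be a view into a larger pooled
// ArrayBuffer.
const pcm = new Float32Array(
    buf.buffer, buf.byteOffset,
    buf.byteLength / Float32Array.BYTES_PER_ELEMENT);
console.log(pcm.length); // 3
```

Using buf.byteOffset rather than assuming an offset of zero is the important detail; Node pools small Buffers inside shared ArrayBuffers.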
Decoder class
- class Decoder(config)
Speech recognizer object.
Create the decoder.
You may optionally call this with an Object containing configuration keys and values.
- Arguments
config (Object) – Configuration parameters.
- Decoder.add_words(...words)
Add words to the pronunciation dictionary.
Example:
decoder.add_words(["hello", "H EH L OW"], ["world", "W ER L D"]);
- Arguments
words (...Array) – Any number of 2-element arrays containing the word text in position 0 and a string of whitespace-separated phones in position 1.
- Decoder.assert_initialized()
Throw an error if decoder is not initialized.
- Throws
Error – If decoder is not initialized.
- Decoder.delete()
Free resources used by the decoder.
- Decoder.get_alignment(config)
Get the current recognition result as a word (and possibly phone) segmentation.
- Arguments
config (Object) – Configuration parameters.
config.start (number) – Start time to add to returned segment times.
config.align_level (number) – 0 for word alignments only, 1 for word and phone alignments, 2 for word, phone, and state alignments.
- Returns
Array.<Segment> – Array of segments for the words recognized, each with the keys t, b, and d, for text, start time, and duration, respectively.
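Consuming the returned segmentation can be sketched as follows (the segment values below are made-up sample data in the documented shape, not real recognizer output):

```javascript
// Made-up segments in the shape documented above: t = text, b = start
// time in seconds, d = duration in seconds.
const segments = [
    { t: "one", b: 0.0, d: 0.35 },
    { t: "two", b: 0.35, d: 0.4 },
];
for (const seg of segments)
    console.log(`${seg.t}: starts ${seg.b}s, lasts ${seg.d}s`);
```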
- Decoder.get_config(key)
Get a configuration parameter’s value.
- Arguments
key (string) – Parameter name.
- Throws
ReferenceError – If key is not a known parameter.
- Returns
number|string – Parameter value.
- Decoder.get_config_json()
Get configuration as JSON.
- Decoder.get_text()
Get the currently recognized text.
- Returns
string – Currently recognized text.
- Decoder.has_config(key)
Test if a key is a known parameter.
- Arguments
key (string) – Key whose existence to check.
- Decoder.init_acmod()
Create acoustic model from configuration.
- Decoder.init_cleanup()
Clean up any lingering search modules.
- Decoder.init_dict()
Load dictionary from configuration.
- Decoder.init_fe()
Create front-end from configuration.
- Decoder.init_feat()
Create dynamic feature module from configuration.
- Decoder.init_featparams()
Read feature parameters from acoustic model.
- Decoder.init_grammar()
Load grammar from configuration.
- Decoder.initialize()
Initialize or reinitialize the decoder asynchronously.
- Returns
Promise – Promise resolved once decoder is ready.
- Decoder.load_acmod_files()
Load acoustic model files.
- Decoder.load_gmm(means_path, variances_path, sendump_path, mixw_path)
Load Gaussian mixture models.
- Decoder.load_mdef()
Load binary model definition file.
- Decoder.load_tmat(tmat_path)
Load transition matrices.
- Decoder.lookup_word(word)
Look up a word in the pronunciation dictionary.
- Arguments
word (string) – Text of word to look up.
- Returns
string – Space-separated list of phones, or null if word is not in the dictionary.
- Decoder.process_audio(pcm)
Process a block of audio data.
- Arguments
pcm (Float32Array) – Audio data, in float32 format, in the range [-1.0, 1.0].
- Returns
number – Number of frames processed.
- Decoder.reinitialize_audio()
Re-initialize only the audio feature extraction.
- Returns
Promise – Promise resolved once reinitialized.
- Decoder.set_align_text(text)
Set word sequence for alignment.
- Arguments
text (string) – Sentence to align, as whitespace-separated words. All words must be present in the dictionary.
- Decoder.set_config(key, val)
Set a configuration parameter.
- Arguments
key (string) – Parameter name.
val (number|string) – Parameter value.
- Throws
ReferenceError – If key is not a known parameter.
- Decoder.set_grammar(jsgf_string, toprule="null")
Set recognition grammar from JSGF.
- Arguments
jsgf_string (string) – String containing JSGF grammar.
toprule (string) – Name of starting rule for grammar; if not specified, the first public rule will be used.
- Decoder.spectrogram(pcm)
Compute a spectrogram from audio.
- Arguments
pcm (Float32Array) – Audio data, in float32 format, in the range [-1.0, 1.0].
- Returns
Promise.<FeatureBuffer> – Promise resolved to an object containing data, nfr, and nfeat properties.
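The data property is a flat array; viewing it frame by frame can be sketched as below, assuming it holds nfr * nfeat values in frame-major order (the sample values are made up, not real spectrogram output):

```javascript
// Stand-in for a resolved FeatureBuffer: 2 frames of 3 features each.
const fb = { data: new Float32Array([1, 2, 3, 4, 5, 6]), nfr: 2, nfeat: 3 };

// Slice out each frame as a zero-copy subarray view.
const frames = [];
for (let i = 0; i < fb.nfr; i++)
    frames.push(fb.data.subarray(i * fb.nfeat, (i + 1) * fb.nfeat));
console.log(frames[1][0]); // 4
```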
- Decoder.start()
Start processing input.
- Decoder.stop()
Finish processing input.
- Decoder.unset_config(key)
Reset a configuration parameter to its default value.
- Arguments
key (string) – Parameter name.
- Throws
ReferenceError – If key is not a known parameter.
Endpointer class
- class Endpointer(config)
Simple endpointer using voice activity detection.
Create the endpointer.
- Arguments
config (Object) – Configuration parameters.
config.samprate (number) – Sampling rate of the input audio.
config.frame_length (number) – Length in seconds of an input frame; must be 0.01, 0.02, or 0.03.
config.mode (number) – Aggressiveness of voice activity detection; must be 0, 1, 2, or 3. Higher numbers give "tighter" endpoints at the possible expense of clipping the start of utterances.
config.window (number) – Length in seconds of the window used to make a speech/non-speech decision.
config.ratio (number) – Proportion of the window that must be detected as speech (or non-speech) in order to trigger a decision.
- Throws
Error – On invalid parameters.
- Endpointer.end_stream(pcm)
Read a final frame of data and return speech if any.
This function should only be called at the end of the input stream (and then, only if you are currently in a speech region). It will return any remaining speech data detected by the endpointer.
- Arguments
pcm (Float32Array) – Audio data, in float32 format, in the range [-1.0, 1.0]. Must contain get_frame_size() samples or less.
- Returns
Float32Array – Speech data, if any, or null if none.
- Endpointer.get_frame_length()
Get the effective length of a frame in seconds (this may differ from the length requested in the constructor).
- Returns
number – Length of a frame in seconds.
- Endpointer.get_frame_size()
Get the effective length of a frame in samples.
Note that you must pass this many samples in each input frame, no more, no less.
- Returns
number – Size of required frame in samples.
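The relation between frame length and frame size can be sketched with simple arithmetic (illustrative only; in real use, call get_frame_size() rather than computing this yourself, since the effective frame length may differ from the one requested):

```javascript
// Illustrative arithmetic: samples per frame = sampling rate × frame
// length in seconds. Assumes 16 kHz audio and a 0.03 s frame.
const samprate = 16000;
const frame_length = 0.03;
const frame_size = Math.round(samprate * frame_length);
console.log(frame_size); // 480
```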
- Endpointer.get_in_speech()
Is the endpointer currently in a speech segment?
To detect transitions from non-speech to speech, check this before process(). If it was false but process() returns data, then speech has started.
Likewise, to detect transitions from speech to non-speech, call this after process(). If process() returned data but this returns false, then speech has stopped.
For example:
let prev_in_speech = ep.get_in_speech();
let frame_size = ep.get_frame_size();
// Presume `frame` is a Float32Array of frame_size or less
let speech;
if (frame.length < frame_size)
    speech = ep.end_stream(frame);
else
    speech = ep.process(frame);
if (speech !== null) {
    if (!prev_in_speech)
        console.log("Speech started at " + ep.get_speech_start());
    if (!ep.get_in_speech())
        console.log("Speech ended at " + ep.get_speech_end());
}
- Returns
Boolean – Are we currently in a speech region?
- Endpointer.get_speech_end()
Get end time of current speech region.
- Returns
number – Time in seconds.
- Endpointer.get_speech_start()
Get start time of current speech region.
- Returns
number – Time in seconds.
- Endpointer.process(pcm)
Read a frame of data and return speech if detected.
- Arguments
pcm (Float32Array) – Audio data, in float32 format, in the range [-1.0, 1.0]. Must contain get_frame_size() samples.
- Returns
Float32Array – Speech data, if any, or null if none.
Functions
- get_model_path(subpath)
Get a model or model file from the built-in model path.
The base path can be set by modifying the modelBase property of the module object, at initialization or at any other time. You can also simply override this function if you have special needs.
This function is used by Decoder to find the default model, which is equivalent to Model.modelBase + Model.defaultModel.
- Arguments
subpath (string) – Path to a model directory or parameter file, e.g. "en-us", "en-us/variances", etc.
- Returns
string – Concatenated path. Note that on the Web this is a simple string concatenation, so make sure modelBase has a trailing slash if it is a directory.
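On the Web, the behavior described above amounts to plain string concatenation, which can be sketched as follows (the helper below is an illustrative re-implementation, not the library's actual code):

```javascript
// Illustrative re-implementation of the Web behavior of
// get_model_path(): a plain concatenation of modelBase and subpath.
const modelBase = "model/"; // note the trailing slash
function get_model_path(subpath) {
    return modelBase + subpath;
}
console.log(get_model_path("en-us/variances")); // "model/en-us/variances"
```

This also shows why the trailing slash matters: without it, the result would be "modelen-us/variances".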