soundswallower package

Main module for the SoundSwallower speech recognizer.

SoundSwallower is a small and not particularly powerful speech recognition engine for constrained grammars. It can also be used to align text to audio. Most of the functionality is contained in the Decoder class. Basic usage:

from soundswallower import Decoder, get_model_path
decoder = Decoder(hmm=get_model_path("en-us"),
                  dict=get_model_path("en-us.dict"),
                  jsgf="some_grammar_file.gram")
hyp, seg = decoder.decode_file("example.wav")
print("Recognized text:", hyp)
for word, start, end in seg:
    print("Word %s from %.3f to %.3f" % (word, start, end))
class soundswallower.Arg(name, default, doc, type, required)

Description of a configuration parameter.

default

Default value of parameter.

doc

Description of parameter.

name

Parameter name.

required

Is this parameter required?

type

Type (as a Python type object) of parameter value.

class soundswallower.Config(**kwargs)

Configuration object for SoundSwallower.

SoundSwallower can be configured either implicitly, by passing keyword arguments to Decoder, or explicitly, by creating and manipulating Config objects. There are a large number of parameters, most of which are unimportant or subject to change. They mostly correspond to the command-line arguments used by PocketSphinx.

A Config can be initialized with keyword arguments:

config = Config(hmm="path/to/things", dict="my.dict")

It can also be initialized by parsing JSON (either as bytes or str):

config = Config.parse_json('''{"hmm": "path/to/things",
                               "dict": "my.dict"}''')

The “parser” is very much not strict, so you can also pass a sort of pseudo-YAML to it, e.g.:

config = Config.parse_json("hmm: path/to/things, dict: my.dict")

You can also initialize an empty Config and set arguments in it directly:

config = Config()
config["hmm"] = "path/to/things"

In general, a Config mostly acts like a dictionary, and can be iterated over in the same fashion. However, attempting to access a parameter that does not already exist will raise a KeyError.

See Configuration parameters for a description of existing parameters.

describe(self)

Iterate over parameter descriptions.

This function returns a generator over the parameters defined in a configuration, as Arg objects.

Returns

Descriptions of parameters including their default values and documentation

Return type

Iterable[Arg]
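The describe method lends itself to building a quick parameter listing. A minimal sketch (show_params is a hypothetical helper, not part of the API):

```python
def show_params(config):
    """Return one summary line per parameter in a Config."""
    lines = []
    for arg in config.describe():
        required = " (required)" if arg.required else ""
        lines.append("%s: %s [default: %r]%s"
                     % (arg.name, arg.doc, arg.default, required))
    return lines
```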

dumps(self)

Serialize configuration to a JSON-formatted str.

This produces JSON from a configuration object, with default values included.

Returns

Serialized JSON

Return type

str

Raises

RuntimeError – if serialization fails somehow.

items(self)

Iterate over configuration (name, value) pairs, as with dict.items().

static parse_json(json)

Parse JSON (or pseudo-YAML) configuration

Parameters

json (bytes|str) – JSON data.

Returns

Parsed config, or None on error.

Return type

Config

class soundswallower.Decoder(*args, **kwargs)

Main class for speech recognition and alignment in SoundSwallower.

See Configuration parameters for a description of keyword arguments.

Parameters
  • hmm (str) – Path to directory containing acoustic model files.

  • dict (str) – Path to pronunciation dictionary.

  • jsgf (str) – Path to JSGF grammar file.

  • fsg (str) – Path to FSG grammar file (only one of jsgf or fsg should be specified).

  • toprule (str) – Name of top-level rule in JSGF file to use as entry point.

  • samprate (float) – Sampling rate for raw audio data.

  • logfn (str) – File to write log messages to (set to os.devnull to silence these messages).

Raises

ValueError – On invalid configuration options.

add_word(self, unicode word, unicode phones, update=True)

Add a word to the pronunciation dictionary.

Parameters
  • word (str) – Text of word to be added.

  • phones (str) – Space-separated list of phones for this word’s pronunciation. This will depend on the underlying acoustic model but is probably in ARPABET. FIXME: Should accept IPA, duh.

  • update (bool) – Update the recognizer immediately. You can set this to False if you are adding a lot of words, to speed things up. FIXME: This API is bad and will be changed.

Returns

Word ID of added word.

Return type

int

Raises

KeyError – If word already exists in dictionary.
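When adding many words, the update flag above suggests batching. A sketch of one way to do that (add_vocabulary is a hypothetical helper, not part of the API):

```python
def add_vocabulary(decoder, pron_dict):
    """Add several words from a {word: phones} mapping, recompiling
    the recognizer only on the last one (update=True) to avoid
    repeated rebuilds."""
    items = list(pron_dict.items())
    for i, (word, phones) in enumerate(items):
        decoder.add_word(word, phones, update=(i == len(items) - 1))
```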

alignment

The current sub-word alignment, if any.

This property may take some time to access as it runs a second pass of decoding.

Returns

Alignment, if an alignment exists.

cmn

Get the current cepstral mean.

This is a read-only property; to update the mean based on the current utterance, call update_cmn.

Returns

Cepstral mean as a comma-separated list of numbers.

Return type

str

config

Property containing configuration object.

You may use this to set or unset configuration options, but be aware that you must call initialize to apply them, e.g.:

del decoder.config["dict"]
decoder.config["loglevel"] = "INFO"
decoder.config["bestpath"] = True
decoder.initialize()

Note that this will also remove any dictionary words or grammars which you have loaded. See reinit_feat for an alternative if you do not wish to do this.

Returns

decoder configuration.

Return type

Config

static create(*args, **kwargs)

Create and configure, but do not initialize, the decoder.

This method exists if you wish to override or unset some of the parameters which are filled in automatically when the model is loaded, in particular dict, if you wish to create a dictionary programmatically rather than loading from a file. For example:

d = Decoder.create()
del d.config["dict"]
d.initialize()
d.add_word("word", "W ER D", True)  # Just one word! (or more)

See Configuration parameters for a description of keyword arguments.

Parameters
  • hmm (str) – Path to directory containing acoustic model files.

  • dict (str) – Path to pronunciation dictionary.

  • jsgf (str) – Path to JSGF grammar file.

  • fsg (str) – Path to FSG grammar file (only one of jsgf or fsg should be specified).

  • toprule (str) – Name of top-level rule in JSGF file to use as entry point.

  • samprate (float) – Sampling rate for raw audio data.

  • logfn (str) – File to write log messages to (set to os.devnull to silence these messages).

Raises

ValueError – on invalid configuration options.

create_fsg(self, name, start_state, final_state, transitions)

Create a finite-state grammar.

This method allows the creation of a grammar directly from a list of transitions. States and words will be created implicitly from the state numbers and word strings present in this list. Make sure that the pronunciation dictionary contains the words, or you will not be able to recognize. Basic usage:

fsg = decoder.create_fsg("mygrammar",
                         start_state=0, final_state=3,
                         transitions=[(0, 1, 0.75, "hello"),
                                      (0, 1, 0.25, "goodbye"),
                                      (1, 2, 0.75, "beautiful"),
                                      (1, 2, 0.25, "cruel"),
                                      (2, 3, 1.0, "world")])

Parameters
  • name (str) – Name to give this FSG (not very important).

  • start_state (int) – Index of starting state.

  • final_state (int) – Index of end state.

  • transitions (list) – List of transitions, each of which is a 3- or 4-tuple of (from, to, probability[, word]). If the word is not specified, this is an epsilon (null) transition that will always be followed.

Returns

Newly created finite-state grammar.

Return type

FsgModel

Raises

ValueError – On invalid input.
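The transition format can be sanity-checked in plain Python before handing the list to create_fsg. A sketch (check_transitions is a hypothetical helper; the returned words could then be verified against the dictionary with lookup_word):

```python
def check_transitions(transitions):
    """Sanity-check the (from, to, probability[, word]) transition
    format expected by create_fsg, returning the set of words used."""
    words = set()
    for t in transitions:
        if len(t) not in (3, 4):
            raise ValueError("transition must be a 3- or 4-tuple: %r" % (t,))
        prob = t[2]
        if not (0.0 <= prob <= 1.0):
            raise ValueError("probability out of range: %r" % (t,))
        if len(t) == 4:          # 3-tuples are epsilon transitions
            words.add(t[3])
    return words
```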

decode_file(self, input_file)

Decode audio from a file in the filesystem.

Currently supports single-channel WAV and raw audio files. If the sampling rate for a WAV file differs from the one set in the decoder’s configuration, the configuration will be updated to match it.

Note that we always decode the entire file at once. It would have to be really huge for this to cause memory problems, in which case the decoder would explode anyway. Otherwise, CMN doesn’t work as well, which causes unnecessary recognition errors.

Parameters

input_file – Path to an audio file.

Returns

Recognized text, Word segmentation.

Return type

(str, Iterable[Seg])

dumps(self, start_time=0., align_level=0)

Get decoding result as JSON.

end_utt(self)

Finish processing raw audio input.

This method must be called at the end of each separate “utterance” of raw audio input. It takes care of flushing any internal buffers and finalizing recognition results.

hyp

Current recognition hypothesis.

Returns

Current recognition output.

Return type

Hyp

initialize(self, Config config=None)

(Re-)initialize the decoder.

Parameters

config (Config) – Optional new configuration to apply, otherwise the existing configuration in the config attribute will be reloaded.

Raises

RuntimeError – On invalid configuration or other failure to reinitialize decoder.

lookup_word(self, unicode word)

Look up a word in the dictionary and return phone transcription for it.

Parameters

word (str) – Text of word to search for.

Returns

Space-separated list of phones, or None if not found.

Return type

str

n_frames

The number of frames processed up to this point.

Returns

Like it says.

Return type

int

parse_jsgf(self, jsgf_string, toprule=None)

Parse a JSGF grammar from bytes or string.

Because SoundSwallower uses UTF-8 internally, it is more efficient to parse from bytes, as a string will get encoded and subsequently decoded.

Parameters
  • jsgf_string (bytes|str) – JSGF grammar as string or UTF-8 encoded bytes.

  • toprule (str) – Name of starting rule in grammar (will default to first public rule).

Returns

Newly loaded finite-state grammar.

Return type

FsgModel

Raises
process_raw(self, data, no_search=False, full_utt=False)

Process a block of raw audio.

Parameters
  • data (bytes) – Raw audio data, a block of 16-bit signed integer binary data.

  • no_search (bool) – If True, do not do any decoding on this data.

  • full_utt (bool) – If True, assume this is the entire utterance, for purposes of acoustic normalization.

Raises

RuntimeError – If processing fails.
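Together with start_utt and end_utt, a typical block-by-block decoding loop might look like this sketch (stream_decode is a hypothetical helper; decoder is an initialized Decoder and stream any binary file-like object of raw 16-bit audio):

```python
def stream_decode(decoder, stream, block_size=4096):
    """Feed raw 16-bit audio to a decoder in blocks and return its
    hypothesis.  The decoder must already be configured and
    initialized."""
    decoder.start_utt()
    while True:
        buf = stream.read(block_size)
        if not buf:
            break
        decoder.process_raw(buf)
    decoder.end_utt()
    return decoder.hyp
```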

read_fsg(self, filename)

Read a grammar from an FSG file.

Parameters

filename (str) – Path to FSG file.

Returns

Newly loaded finite-state grammar.

Return type

FsgModel

read_jsgf(self, filename)

Read a grammar from a JSGF file.

The top rule used is the one specified by the “toprule” configuration parameter.

Parameters

filename (str) – Path to JSGF file.

Returns

Newly loaded finite-state grammar.

Return type

FsgModel

reinit_feat(self, Config config=None)

Reinitialize only the feature computation.

Parameters

config (Config) – Optional new configuration to apply, otherwise the existing configuration in the config attribute will be reloaded.

Raises

RuntimeError – On invalid configuration or other failure to initialize feature computation.

seg

Current word segmentation.

Returns

Generator over word segmentations.

Return type

Iterable[Seg]

set_align_text(self, text)

Set a word sequence for alignment.

You must do any text normalization yourself. For word-level alignment, once you call this, simply decode and get the segmentation in the usual manner. For phone-level alignment, you can use the alignment property.

Parameters

text (str) – Sentence to align, as whitespace-separated words. All words must be present in the dictionary.

Raises

RuntimeError – If text is invalid somehow.
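A minimal word-level alignment round trip might look like this sketch (align_words is a hypothetical helper; it assumes decode_file yields Seg objects with the documented text, start, and duration attributes):

```python
def align_words(decoder, text, audio_file):
    """Align a known transcript to an audio file and return
    (word, start, end) triples built from the Seg attributes."""
    decoder.set_align_text(text)
    _, seg = decoder.decode_file(audio_file)
    return [(s.text, s.start, s.start + s.duration) for s in seg]
```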

set_fsg(self, FsgModel fsg)

Set the grammar for recognition.

Parameters

fsg (FsgModel) – Previously loaded or constructed grammar.

set_jsgf_file(self, filename)

Set the grammar for recognition from a JSGF file.

Parameters

filename (str) – Path to a JSGF file to load.

set_jsgf_string(self, jsgf_string)

Set the grammar for recognition from JSGF bytes or string.

Parameters

jsgf_string (bytes) – JSGF grammar as string or UTF-8 encoded bytes.

start_utt(self)

Start processing raw audio input.

This method must be called at the beginning of each separate “utterance” of raw audio input.

Raises

RuntimeError – If processing fails to start (usually if it has already been started).

update_cmn(self)

Update current cepstral mean.

Returns

New cepstral mean as a comma-separated list of numbers.

Return type

str

class soundswallower.Endpointer(window=0.3, ratio=0.9, vad_mode=Vad.LOOSE, sample_rate=Vad.DEFAULT_SAMPLE_RATE, frame_length=Vad.DEFAULT_FRAME_LENGTH)

Simple endpointer using voice activity detection.

Parameters
  • window (float) – Length in seconds of window for decision.

  • ratio (float) – Fraction of window that must be speech or non-speech to make a transition.

  • vad_mode (int) – Aggressiveness of voice activity detection (0-3).

  • sample_rate (int) – Sampling rate of input, default is 16000. Rates other than 8000, 16000, 32000, 48000 are only approximately supported, see note in frame_length. Outlandish sampling rates like 3924 and 115200 will raise a ValueError.

  • frame_length (float) – Desired input frame length in seconds, default is 0.03. The actual frame length may be different if an approximately supported sampling rate is requested. You must always use the frame_bytes and frame_length attributes to determine the input size.

Raises

ValueError – Invalid input parameter. Also raised if the ratio makes it impossible to do endpointing (i.e. it is more than N-1 or less than 1 frame).

DEFAULT_RATIO = 0.9
DEFAULT_WINDOW = 0.3
end_stream(self, frame)

Read a final frame of data and return speech if any.

This function should only be called at the end of the input stream (and then, only if you are currently in a speech region). It will return any remaining speech data detected by the endpointer.

Parameters

frame (bytes) – Buffer containing speech data (16-bit signed integers). Must be of length frame_bytes (in bytes) or less.

Returns

Remaining speech data (could be more than one frame), or None if none detected.

Return type

bytes

Raises
frame_bytes

Number of bytes (not samples) required in an input frame.

You must pass input of this size, as bytes, to the Endpointer.

Type

int

frame_length

Length of a frame in seconds (may be different from the one requested in the constructor!)

Type

float

in_speech

Is the endpointer currently in a speech segment?

To detect transitions from non-speech to speech, check this before process. If it was False but process returns data, then speech has started:

prev_in_speech = ep.in_speech
speech = ep.process(frame)
if speech is not None:
    if not prev_in_speech:
        print("Speech started at", ep.speech_start)

Likewise, to detect transitions from speech to non-speech, call this after process. If process returned data but this returns False, then speech has stopped:

speech = ep.process(frame)
if speech is not None:
    if not ep.in_speech:
        print("Speech ended at", ep.speech_end)

Type

bool

process(self, frame)

Read a frame of data and return speech if detected.

Parameters

frame (bytes) – Buffer containing speech data (16-bit signed integers). Must be of length frame_bytes (in bytes).

Returns

Frame of speech data, or None if none detected.

Return type

bytes

Raises
sample_rate

Sampling rate of input data.

Type

int

speech_end

End time of current speech region.

Type

float

speech_start

Start time of current speech region.

Type

float
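Putting the pieces together, an endpointing loop over a raw audio stream might be sketched as follows (endpoint_stream is a hypothetical helper; note that the end_stream call is guarded by in_speech, as required above):

```python
def endpoint_stream(ep, stream):
    """Yield buffers of detected speech from a raw audio stream,
    reading exactly ep.frame_bytes at a time."""
    while True:
        frame = stream.read(ep.frame_bytes)
        final = len(frame) < ep.frame_bytes
        if final:
            # Only call end_stream from within a speech region.
            speech = ep.end_stream(frame) if ep.in_speech else None
        else:
            speech = ep.process(frame)
        if speech is not None:
            yield speech
        if final:
            break
```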

class soundswallower.FsgModel

Finite-state recognition grammar.

Note that you cannot create one of these directly, as it depends on some internal configuration from the Decoder. Use the factory methods such as create_fsg or parse_jsgf instead.

class soundswallower.Hyp(text, score, prob)

Recognition hypothesis.

prob

Posterior probability of hypothesis (often 1.0, sorry).

score

Best path score.

text

Recognized text.

class soundswallower.Seg(text, start, duration, ascore, lscore)

Segment in a word segmentation.

ascore

Acoustic match score.

duration

Duration in seconds.

lscore

Language (grammar) match score.

start

Start time in the audio stream in seconds.

text

Word text.

class soundswallower.Vad(mode=VAD_LOOSE, sample_rate=VAD_DEFAULT_SAMPLE_RATE, frame_length=VAD_DEFAULT_FRAME_LENGTH)

Voice activity detection class.

Parameters
  • mode (int) – Aggressiveness of voice activity detection (0-3).

  • sample_rate (int) – Sampling rate of input, default is 16000. Rates other than 8000, 16000, 32000, 48000 are only approximately supported, see note in frame_length. Outlandish sampling rates like 3924 and 115200 will raise a ValueError.

  • frame_length (float) – Desired input frame length in seconds, default is 0.03. The actual frame length may be different if an approximately supported sampling rate is requested. You must always use the frame_bytes and frame_length attributes to determine the input size.

Raises

ValueError – Invalid input parameter (see above).

DEFAULT_FRAME_LENGTH = 0.03
DEFAULT_SAMPLE_RATE = 16000
LOOSE = 0
MEDIUM_LOOSE = 1
MEDIUM_STRICT = 2
STRICT = 3
frame_bytes

Number of bytes (not samples) required in an input frame.

You must pass input of this size, as bytes, to the Vad.

Type

int

frame_length

Length of a frame in seconds (may be different from the one requested in the constructor!)

Type

float
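The arithmetic connecting sample_rate, frame_length, and frame_bytes can be sketched as follows, assuming 16-bit (2-byte) mono samples; expected_frame_bytes is illustrative, not part of the API, and at approximately supported rates the real attributes may differ:

```python
SAMPLE_WIDTH = 2  # bytes per 16-bit sample

def expected_frame_bytes(sample_rate=16000, frame_length=0.03):
    """Approximate frame size in bytes for a given rate and length."""
    # round() avoids float artifacts like 16000 * 0.03 == 479.999...
    return round(sample_rate * frame_length) * SAMPLE_WIDTH
```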

is_speech(self, frame, sample_rate=None)

Classify a frame as speech or not.

Parameters

frame (bytes) – Buffer containing speech data (16-bit signed integers). Must be of length frame_bytes (in bytes).

Returns

Classification as speech or not speech.

Return type

bool

Raises
sample_rate

Sampling rate of input data.

Type

int

soundswallower.get_audio_data(input_file: str) → Tuple[bytes, Optional[int]]

Try to get single-channel audio data in the most portable way possible.

Currently supports only single-channel WAV and raw audio.

Parameters

input_file – Path to an audio file.

Returns

Raw audio data, and the sampling rate (None for a raw file).

Return type

(bytes, int)
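For WAV input, the behavior is roughly that of this stdlib-only sketch (read_wav is a hypothetical stand-in; raw files are instead returned as-is with a None sampling rate):

```python
import wave

def read_wav(path):
    """Read a single-channel WAV file, returning
    (raw audio data, sampling rate)."""
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            raise ValueError("only single-channel WAV is supported")
        return w.readframes(w.getnframes()), w.getframerate()
```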

soundswallower.get_model_path(subpath: Optional[str] = None) → str

Return path to the model directory, or optionally, a specific file or directory within it.

Parameters

subpath – An optional path to add to the model directory.

Returns

The requested path within the model directory.
