soundswallower package

Main module for the SoundSwallower speech recognizer.

SoundSwallower is a small and not particularly powerful speech recognition engine for constrained grammars. It can also be used to align text to audio. Most of the functionality is contained in the Decoder class. Basic usage:

from soundswallower import Decoder, get_model_path
decoder = Decoder(hmm=get_model_path("en-us"),
                  dict=get_model_path("en-us.dict"),
                  jsgf="some_grammar_file.gram")
hyp, seg = decoder.decode_file("example.wav")
print("Recognized text:", hyp)
for word, start, end in seg:
    print("Word %s from %.3f to %.3f" % (word, start, end))
class soundswallower.Arg(name, default, doc, type, required)

Description of a configuration parameter.

default

Default value of parameter.

doc

Description of parameter.

name

Parameter name.

required

Is this parameter required?

type

Type (as a Python type object) of parameter value.

class soundswallower.Config(**kwargs)

Configuration object for SoundSwallower.

SoundSwallower can be configured either implicitly, by passing keyword arguments to Decoder, or explicitly, by creating and manipulating Config objects. There are a large number of parameters, most of which are unimportant or subject to change. They mostly correspond to the command-line arguments used by PocketSphinx.

A Config can be initialized with keyword arguments:

config = Config(hmm="path/to/things", dict="my.dict")

It can also be initialized by parsing JSON (either as bytes or str):

config = Config.parse_json('''{"hmm": "path/to/things",
                               "dict": "my.dict"}''')

The “parser” is very much not strict, so you can also pass a sort of pseudo-YAML to it, e.g.:

config = Config.parse_json("hmm: path/to/things, dict: my.dict")

You can also initialize an empty Config and set arguments in it directly:

config = Config()
config["hmm"] = "path/to/things"

In general, a Config mostly acts like a dictionary, and can be iterated over in the same fashion. However, attempting to access a parameter that does not already exist will raise a KeyError.

See Configuration parameters for a description of existing parameters.

describe(self)

Iterate over parameter descriptions.

This function returns a generator over the parameters defined in a configuration, as Arg objects.

Returns

Descriptions of parameters including their default values and documentation

Return type

Iterable[Arg]
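The describe method lends itself to building a quick parameter listing. A minimal sketch (show_params is a hypothetical helper, not part of the API):

```python
def show_params(config):
    """Return one summary line per parameter in a Config."""
    lines = []
    for arg in config.describe():
        required = " (required)" if arg.required else ""
        lines.append("%s: %s [default: %r]%s"
                     % (arg.name, arg.doc, arg.default, required))
    return lines
```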

dumps(self)

Serialize configuration to a JSON-formatted str.

This produces JSON from a configuration object, with default values included.

Returns

Serialized JSON

Return type

str

Raises

RuntimeError – if serialization fails somehow.

items(self)

Iterate over configuration (name, value) pairs, as with dict.items().

static parse_json(json)

Parse JSON (or pseudo-YAML) configuration

Parameters

json (bytes|str) – JSON data.

Returns

Parsed config, or None on error.

Return type

Config

class soundswallower.Decoder(*args, **kwargs)

Main class for speech recognition and alignment in SoundSwallower.

See Configuration parameters for a description of keyword arguments.

Parameters
  • hmm (str) – Path to directory containing acoustic model files.

  • dict (str) – Path to pronunciation dictionary.

  • jsgf (str) – Path to JSGF grammar file.

  • fsg (str) – Path to FSG grammar file (only one of jsgf or fsg should be specified).

  • toprule (str) – Name of top-level rule in JSGF file to use as entry point.

  • samprate (float) – Sampling rate for raw audio data.

  • logfn (str) – File to write log messages to (set to os.devnull to silence these messages).

Raises

ValueError – On invalid configuration options.

add_word(self, unicode word, unicode phones, update=True)

Add a word to the pronunciation dictionary.

Parameters
  • word (str) – Text of word to be added.

  • phones (str) – Space-separated list of phones for this word’s pronunciation. This will depend on the underlying acoustic model but is probably in ARPABET. FIXME: Should accept IPA, duh.

  • update (bool) – Update the recognizer immediately. You can set this to False if you are adding a lot of words, to speed things up. FIXME: This API is bad and will be changed.

Returns

Word ID of added word.

Return type

int

Raises

KeyError – If word already exists in dictionary.
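When adding many words, the update flag above suggests batching. A sketch of one way to do that (add_vocabulary is a hypothetical helper, not part of the API):

```python
def add_vocabulary(decoder, pron_dict):
    """Add several words from a {word: phones} mapping, recompiling
    the recognizer only on the last one (update=True) to avoid
    repeated rebuilds."""
    items = list(pron_dict.items())
    for i, (word, phones) in enumerate(items):
        decoder.add_word(word, phones, update=(i == len(items) - 1))
```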

alignment

The current sub-word alignment, if any.

This property may take some time to access as it runs a second pass of decoding.

Returns

Alignment, if an alignment exists.

cmn

Get the current cepstral mean.

This is a read-only property; to update the mean based on the current utterance, call update_cmn.

Returns

Cepstral mean as a comma-separated list of numbers.

Return type

str

config

Property containing configuration object.

You may use this to set or unset configuration options, but be aware that you must call initialize to apply them, e.g.:

del decoder.config["dict"]
decoder.config["loglevel"] = "INFO"
decoder.config["bestpath"] = True
decoder.initialize()

Note that this will also remove any dictionary words or grammars which you have loaded. See reinit_feat for an alternative if you do not wish to do this.

Returns

decoder configuration.

Return type

Config

static create(*args, **kwargs)

Create and configure, but do not initialize, the decoder.

This method exists if you wish to override or unset some of the parameters which are filled in automatically when the model is loaded, in particular dict, if you wish to create a dictionary programmatically rather than loading from a file. For example:

d = Decoder.create()
del d.config["dict"]
d.initialize()
d.add_word("word", "W ER D", True)  # Just one word! (or more)

See Configuration parameters for a description of keyword arguments.

Parameters
  • hmm (str) – Path to directory containing acoustic model files.

  • dict (str) – Path to pronunciation dictionary.

  • jsgf (str) – Path to JSGF grammar file.

  • fsg (str) – Path to FSG grammar file (only one of jsgf or fsg should be specified).

  • toprule (str) – Name of top-level rule in JSGF file to use as entry point.

  • samprate (float) – Sampling rate for raw audio data.

  • logfn (str) – File to write log messages to (set to os.devnull to silence these messages).

Raises

ValueError – on invalid configuration options.

create_fsg(self, name, start_state, final_state, transitions)

Create a finite-state grammar.

This method allows the creation of a grammar directly from a list of transitions. States and words will be created implicitly from the state numbers and word strings present in this list. Make sure that the pronunciation dictionary contains the words, or you will not be able to recognize. Basic usage:

fsg = decoder.create_fsg("mygrammar",
                         start_state=0, final_state=3,
                         transitions=[(0, 1, 0.75, "hello"),
                                      (0, 1, 0.25, "goodbye"),
                                      (1, 2, 0.75, "beautiful"),
                                      (1, 2, 0.25, "cruel"),
                                      (2, 3, 1.0, "world")])

Parameters
  • name (str) – Name to give this FSG (not very important).

  • start_state (int) – Index of starting state.

  • final_state (int) – Index of end state.

  • transitions (list) – List of transitions, each of which is a 3- or 4-tuple of (from, to, probability[, word]). If the word is not specified, this is an epsilon (null) transition that will always be followed.

Returns

Newly created finite-state grammar.

Return type

FsgModel

Raises

ValueError – On invalid input.
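The transition format can be sanity-checked in plain Python before handing the list to create_fsg. A sketch (check_transitions is a hypothetical helper; the returned words could then be verified against the dictionary with lookup_word):

```python
def check_transitions(transitions):
    """Sanity-check the (from, to, probability[, word]) transition
    format expected by create_fsg, returning the set of words used."""
    words = set()
    for t in transitions:
        if len(t) not in (3, 4):
            raise ValueError("transition must be a 3- or 4-tuple: %r" % (t,))
        prob = t[2]
        if not (0.0 <= prob <= 1.0):
            raise ValueError("probability out of range: %r" % (t,))
        if len(t) == 4:          # 3-tuples are epsilon transitions
            words.add(t[3])
    return words
```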

decode_file(self, input_file)

Decode audio from a file in the filesystem.

Currently supports single-channel WAV and raw audio files. If the sampling rate for a WAV file differs from the one set in the decoder’s configuration, the configuration will be updated to match it.

Note that we always decode the entire file at once. It would have to be really huge for this to cause memory problems, in which case the decoder would explode anyway. Otherwise, CMN doesn’t work as well, which causes unnecessary recognition errors.

Parameters

input_file – Path to an audio file.

Returns

Recognized text, Word segmentation.

Return type

(str, Iterable[Seg])

dumps(self, start_time=0., align_level=0)

Get decoding result as JSON.

end_utt(self)

Finish processing raw audio input.

This method must be called at the end of each separate “utterance” of raw audio input. It takes care of flushing any internal buffers and finalizing recognition results.

hyp

Current recognition hypothesis.

Returns

Current recognition output.

Return type

Hyp

initialize(self, Config config=None)

(Re-)initialize the decoder.

Parameters

config (Config) – Optional new configuration to apply, otherwise the existing configuration in the config attribute will be reloaded.

Raises

RuntimeError – On invalid configuration or other failure to reinitialize decoder.

lookup_word(self, unicode word)

Look up a word in the dictionary and return phone transcription for it.

Parameters

word (str) – Text of word to search for.

Returns

Space-separated list of phones, or None if not found.

Return type

str

n_frames

The number of frames processed up to this point.

Returns

Like it says.

Return type

int

parse_jsgf(self, jsgf_string, toprule=None)

Parse a JSGF grammar from bytes or string.

Because SoundSwallower uses UTF-8 internally, it is more efficient to parse from bytes, as a string will get encoded and subsequently decoded.

Parameters
  • jsgf_string (bytes|str) – JSGF grammar as string or UTF-8 encoded bytes.

  • toprule (str) – Name of starting rule in grammar (will default to first public rule).

Returns

Newly loaded finite-state grammar.

Return type

FsgModel

Raises
process_raw(self, data, no_search=False, full_utt=False)

Process a block of raw audio.

Parameters
  • data (bytes) – Raw audio data, a block of 16-bit signed integer binary data.

  • no_search (bool) – If True, do not do any decoding on this data.

  • full_utt (bool) – If True, assume this is the entire utterance, for purposes of acoustic normalization.

Raises

RuntimeError – If processing fails.
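Together with start_utt and end_utt, a typical block-by-block decoding loop might look like this sketch (stream_decode is a hypothetical helper; decoder is an initialized Decoder and stream any binary file-like object of raw 16-bit audio):

```python
def stream_decode(decoder, stream, block_size=4096):
    """Feed raw 16-bit audio to a decoder in blocks and return its
    hypothesis.  The decoder must already be configured and
    initialized."""
    decoder.start_utt()
    while True:
        buf = stream.read(block_size)
        if not buf:
            break
        decoder.process_raw(buf)
    decoder.end_utt()
    return decoder.hyp
```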

read_fsg(self, filename)

Read a grammar from an FSG file.

Parameters

filename (str) – Path to FSG file.

Returns

Newly loaded finite-state grammar.

Return type

FsgModel

read_jsgf(self, filename)

Read a grammar from a JSGF file.

The top rule used is the one specified by the “toprule” configuration parameter.

Parameters

filename (str) – Path to JSGF file.

Returns

Newly loaded finite-state grammar.

Return type

FsgModel

reinit_feat(self, Config config=None)

Reinitialize only the feature computation.

Parameters

config (Config) – Optional new configuration to apply, otherwise the existing configuration in the config attribute will be reloaded.

Raises

RuntimeError – On invalid configuration or other failure to initialize feature computation.

seg

Current word segmentation.

Returns

Generator over word segmentations.

Return type

Iterable[Seg]

set_align_text(self, text)

Set a word sequence for alignment.

You must do any text normalization yourself. For word-level alignment, once you call this, simply decode and get the segmentation in the usual manner. For phone-level alignment, you can use the alignment property.

Parameters

text (str) – Sentence to align, as whitespace-separated words. All words must be present in the dictionary.

Raises

RuntimeError – If text is invalid somehow.
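A minimal word-level alignment round trip might look like this sketch (align_words is a hypothetical helper; it assumes decode_file yields Seg objects with the documented text, start, and duration attributes):

```python
def align_words(decoder, text, audio_file):
    """Align a known transcript to an audio file and return
    (word, start, end) triples built from the Seg attributes."""
    decoder.set_align_text(text)
    _, seg = decoder.decode_file(audio_file)
    return [(s.text, s.start, s.start + s.duration) for s in seg]
```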

set_fsg(self, FsgModel fsg)

Set the grammar for recognition.

Parameters

fsg (FsgModel) – Previously loaded or constructed grammar.

set_jsgf_file(self, filename)

Set the grammar for recognition from a JSGF file.

Parameters

filename (str) – Path to a JSGF file to load.

set_jsgf_string(self, jsgf_string)

Set the grammar for recognition from JSGF bytes or string.

Parameters

jsgf_string (bytes) – JSGF grammar as string or UTF-8 encoded bytes.

start_utt(self)

Start processing raw audio input.

This method must be called at the beginning of each separate “utterance” of raw audio input.

Raises

RuntimeError – If processing fails to start (usually if it has already been started).

update_cmn(self)

Update current cepstral mean.

Returns

New cepstral mean as a comma-separated list of numbers.

Return type

str

class soundswallower.Endpointer(window=0.3, ratio=0.9, vad_mode=Vad.LOOSE, sample_rate=Vad.DEFAULT_SAMPLE_RATE, frame_length=Vad.DEFAULT_FRAME_LENGTH)

Simple endpointer using voice activity detection.

Parameters
  • window (float) – Length in seconds of window for decision.

  • ratio (float) – Fraction of window that must be speech or non-speech to make a transition.

  • vad_mode (int) – Aggressiveness of voice activity detection (0-3).

  • sample_rate (int) – Sampling rate of input, default is 16000. Rates other than 8000, 16000, 32000, 48000 are only approximately supported, see note in frame_length. Outlandish sampling rates like 3924 and 115200 will raise a ValueError.

  • frame_length (float) – Desired input frame length in seconds, default is 0.03. The actual frame length may be different if an approximately supported sampling rate is requested. You must always use the frame_bytes and frame_length attributes to determine the input size.

Raises

ValueError – Invalid input parameter. Also raised if the ratio makes it impossible to do endpointing (i.e. it is more than N-1 or less than 1 frame).

DEFAULT_RATIO = 0.9
DEFAULT_WINDOW = 0.3
end_stream(self, frame)

Read a final frame of data and return speech if any.

This function should only be called at the end of the input stream (and then, only if you are currently in a speech region). It will return any remaining speech data detected by the endpointer.

Parameters

frame (bytes) – Buffer containing speech data (16-bit signed integers). Must be of length frame_bytes (in bytes) or less.

Returns

Remaining speech data (could be more than one frame), or None if none detected.

Return type

bytes

Raises
frame_bytes

Number of bytes (not samples) required in an input frame.

You must pass input of this size, as bytes, to the Endpointer.

Type

int

frame_length

Length of a frame in seconds (may be different from the one requested in the constructor!)

Type

float

in_speech

Is the endpointer currently in a speech segment?

To detect transitions from non-speech to speech, check this before process. If it was False but process returns data, then speech has started:

prev_in_speech = ep.in_speech
speech = ep.process(frame)
if speech is not None:
    if not prev_in_speech:
        print("Speech started at", ep.speech_start)

Likewise, to detect transitions from speech to non-speech, call this after process. If process returned data but this returns False, then speech has stopped:

speech = ep.process(frame)
if speech is not None:
    if not ep.in_speech:
        print("Speech ended at", ep.speech_end)

Type

bool

process(self, frame)

Read a frame of data and return speech if detected.

Parameters

frame (bytes) – Buffer containing speech data (16-bit signed integers). Must be of length frame_bytes (in bytes).

Returns

Frame of speech data, or None if none detected.

Return type

bytes

Raises
sample_rate

Sampling rate of input data.

Type

int

speech_end

End time of current speech region.

Type

float

speech_start

Start time of current speech region.

Type

float
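Putting the pieces together, an endpointing loop over a raw audio stream might be sketched as follows (endpoint_stream is a hypothetical helper; note that the end_stream call is guarded by in_speech, as required above):

```python
def endpoint_stream(ep, stream):
    """Yield buffers of detected speech from a raw audio stream,
    reading exactly ep.frame_bytes at a time."""
    while True:
        frame = stream.read(ep.frame_bytes)
        final = len(frame) < ep.frame_bytes
        if final:
            # Only call end_stream from within a speech region.
            speech = ep.end_stream(frame) if ep.in_speech else None
        else:
            speech = ep.process(frame)
        if speech is not None:
            yield speech
        if final:
            break
```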

class soundswallower.FsgModel

Finite-state recognition grammar.

Note that you cannot create one of these directly, as it depends on some internal configuration from the Decoder. Use the factory methods such as create_fsg or parse_jsgf instead.

class soundswallower.Hyp(text, score, prob)

Recognition hypothesis.

prob

Posterior probability of hypothesis (often 1.0, sorry).

score

Best path score.

text

Recognized text.

class soundswallower.Seg(text, start, duration, ascore, lscore)

Segment in a word segmentation.

ascore

Acoustic match score.

duration

Duration in seconds.

lscore

Language (grammar) match score.

start

Start time in the audio stream in seconds.

text

Word text.

class soundswallower.Vad(mode=VAD_LOOSE, sample_rate=VAD_DEFAULT_SAMPLE_RATE, frame_length=VAD_DEFAULT_FRAME_LENGTH)

Voice activity detection class.

Parameters
  • mode (int) – Aggressiveness of voice activity detection (0-3).

  • sample_rate (int) – Sampling rate of input, default is 16000. Rates other than 8000, 16000, 32000, 48000 are only approximately supported, see note in frame_length. Outlandish sampling rates like 3924 and 115200 will raise a ValueError.

  • frame_length (float) – Desired input frame length in seconds, default is 0.03. The actual frame length may be different if an approximately supported sampling rate is requested. You must always use the frame_bytes and frame_length attributes to determine the input size.

Raises

ValueError – Invalid input parameter (see above).

DEFAULT_FRAME_LENGTH = 0.03
DEFAULT_SAMPLE_RATE = 16000
LOOSE = 0
MEDIUM_LOOSE = 1
MEDIUM_STRICT = 2
STRICT = 3
frame_bytes

Number of bytes (not samples) required in an input frame.

You must pass input of this size, as bytes, to the Vad.

Type

int

frame_length

Length of a frame in seconds (may be different from the one requested in the constructor!)

Type

float
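The arithmetic connecting sample_rate, frame_length, and frame_bytes can be sketched as follows, assuming 16-bit (2-byte) mono samples; expected_frame_bytes is illustrative, not part of the API, and at approximately supported rates the real attributes may differ:

```python
SAMPLE_WIDTH = 2  # bytes per 16-bit sample

def expected_frame_bytes(sample_rate=16000, frame_length=0.03):
    """Approximate frame size in bytes for a given rate and length."""
    # round() avoids float artifacts like 16000 * 0.03 == 479.999...
    return round(sample_rate * frame_length) * SAMPLE_WIDTH
```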

is_speech(self, frame, sample_rate=None)

Classify a frame as speech or not.

Parameters

frame (bytes) – Buffer containing speech data (16-bit signed integers). Must be of length frame_bytes (in bytes).

Returns

Classification as speech or not speech.

Return type

bool

Raises
sample_rate

Sampling rate of input data.

Type

int

soundswallower.get_audio_data(input_file: str) → Tuple[bytes, Optional[int]]

Try to get single-channel audio data in the most portable way possible.

Currently supports only single-channel WAV and raw audio.

Parameters

input_file – Path to an audio file.

Returns

Raw audio data, and the sampling rate (None for a raw file).

Return type

(bytes, int)
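For WAV input, the behavior is roughly that of this stdlib-only sketch (read_wav is a hypothetical stand-in; raw files are instead returned as-is with a None sampling rate):

```python
import wave

def read_wav(path):
    """Read a single-channel WAV file, returning
    (raw audio data, sampling rate)."""
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            raise ValueError("only single-channel WAV is supported")
        return w.readframes(w.getnframes()), w.getframerate()
```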

soundswallower.get_model_path(subpath: Optional[str] = None) → str

Return path to the model directory, or optionally, a specific file or directory within it.

Parameters

subpath – An optional path to add to the model directory.

Returns

The requested path within the model directory.
