soundswallower package
Main module for the SoundSwallower speech recognizer.
SoundSwallower is a small and not particularly powerful speech
recognition engine for constrained grammars. It can also be used to
align text to audio. Most of the functionality is contained in the
Decoder class. Basic usage:
    from soundswallower import Decoder, get_model_path
    decoder = Decoder(hmm=get_model_path("en-us"),
                      dict=get_model_path("en-us.dict"),
                      jsgf="some_grammar_file.gram")
    hyp, seg = decoder.decode_file("example.wav")
    print("Recognized text:", hyp)
    for word, start, end in seg:
        print("Word %s from %.3f to %.3f" % (word, start, end))
- class soundswallower.Arg(name, default, doc, type, required)
Description of a configuration parameter.
- default
Default value of parameter.
- doc
Description of parameter.
- name
Parameter name.
- required
Is this parameter required?
- type
Type (as a Python type object) of parameter value.
- class soundswallower.Config(**kwargs)
Configuration object for SoundSwallower.
SoundSwallower can be configured either implicitly, by passing keyword arguments to Decoder, or by creating and manipulating Config objects. There are a large number of parameters, most of which are not important or subject to change. These mostly correspond to the command-line arguments used by PocketSphinx.
A Config can be initialized with keyword arguments:
    config = Config(hmm="path/to/things", dict="my.dict")
It can also be initialized by parsing JSON (either as bytes or str):
    config = Config.parse_json('''{"hmm": "path/to/things", "dict": "my.dict"}''')
The “parser” is very much not strict, so you can also pass a sort of pseudo-YAML to it, e.g.:
    config = Config.parse_json("hmm: path/to/things, dict: my.dict")
You can also initialize an empty Config and set arguments in it directly:
    config = Config()
    config["hmm"] = "path/to/things"
In general, a Config mostly acts like a dictionary, and can be iterated over in the same fashion. However, attempting to access a parameter that does not already exist will raise a KeyError.
See Configuration parameters for a description of existing parameters.
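Because Config.parse_json accepts JSON text, a configuration can be assembled as an ordinary dictionary and serialized with the standard json module first. The following is a minimal sketch of that step only. The paths are the placeholders used above, and the commented-out last line assumes the Config class described in this section.

```python
import json

# Placeholder paths; substitute the locations of a real model and dictionary.
settings = {
    "hmm": "path/to/things",
    "dict": "my.dict",
}

# Strict JSON is one of the input forms the lenient parser is described as accepting.
config_text = json.dumps(settings)
print(config_text)

# With SoundSwallower installed, this text could then be parsed:
# config = Config.parse_json(config_text)
```

Keeping the settings in a dictionary until the last moment makes it easy to merge defaults with user-supplied overrides before the text is handed to the parser.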
- describe(self)
Iterate over parameter descriptions.
This function returns a generator over the parameters defined in a configuration, as Arg objects.
- Returns
Descriptions of parameters including their default values and documentation
- Return type
Iterable[Arg]
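As an illustration of consuming these descriptions, the sketch below formats one summary line per parameter. It uses a namedtuple stand-in with the same five fields as Arg so that it runs without the library; the example parameter, its default and its documentation string are made up for the demonstration.

```python
from collections import namedtuple

# Stand-in with the same five fields as the Arg class documented above.
Arg = namedtuple("Arg", ["name", "default", "doc", "type", "required"])


def format_args(args):
    """Return one summary line per parameter description."""
    lines = []
    for arg in args:
        flag = "required" if arg.required else "optional"
        lines.append("%s (%s, %s, default=%r): %s"
                     % (arg.name, arg.type.__name__, flag, arg.default, arg.doc))
    return lines


# Hypothetical description, for illustration only.
example = [Arg("samprate", 16000.0, "Sampling rate of input audio.", float, False)]
for line in format_args(example):
    print(line)
```

With a real configuration, the same function could presumably be called as format_args(config.describe()).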
- dumps(self)
Serialize configuration to a JSON-formatted str.
This produces JSON from a configuration object, with default values included.
- Returns
Serialized JSON
- Return type
str
- Raises
RuntimeError – if serialization fails somehow.
- items(self)
- class soundswallower.Decoder(*args, **kwargs)
Main class for speech recognition and alignment in SoundSwallower.
See Configuration parameters for a description of keyword arguments.
- Parameters
hmm (str) – Path to directory containing acoustic model files.
dict (str) – Path to pronunciation dictionary.
jsgf (str) – Path to JSGF grammar file.
fsg (str) – Path to FSG grammar file (only one of jsgf or fsg should be specified).
toprule (str) – Name of top-level rule in JSGF file to use as entry point.
samprate (float) – Sampling rate for raw audio data.
logfn (str) – File to write log messages to (set to os.devnull to silence these messages).
- Raises
ValueError – on invalid configuration options.
RuntimeError – on failure to create decoder.
- add_word(self, unicode word, unicode phones, update=True)
Add a word to the pronunciation dictionary.
- Parameters
word (str) – Text of word to be added.
phones (str) – Space-separated list of phones for this word’s pronunciation. This will depend on the underlying acoustic model but is probably in ARPABET. FIXME: Should accept IPA, duh.
update (bool) – Update the recognizer immediately. You can set this to False if you are adding a lot of words, to speed things up. FIXME: This API is bad and will be changed.
- Returns
Word ID of added word.
- Return type
int
- Raises
KeyError – If word already exists in dictionary.
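The update flag suggests a batching pattern: defer the update while adding many words and request it only on the last one. The helper below is a sketch of that idea. It takes any object with an add_word method shaped like the one above, and it assumes that a final call with update=True brings the recognizer up to date with all earlier additions.

```python
def add_words(decoder, entries):
    """Add (word, phones) pairs, requesting an update only on the last one.

    Returns the word IDs in the order the words were added.
    """
    entries = list(entries)
    word_ids = []
    for index, (word, phones) in enumerate(entries):
        # Assumption: updating once, at the end, covers the earlier words too.
        is_last = index == len(entries) - 1
        word_ids.append(decoder.add_word(word, phones, update=is_last))
    return word_ids
```

A call might look like add_words(decoder, [("hello", "HH AH L OW"), ("world", "W ER L D")]), where the phone strings depend on the acoustic model in use. Since add_word raises KeyError for a word that already exists, callers may want to filter out known words first.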
- alignment
The current sub-word alignment, if any.
This property may take some time to access as it runs a second pass of decoding.
- Returns
Alignment - if an alignment exists.
- cmn
Get current cepstral mean.
- Parameters
update (boolean) – Update the mean based on current utterance.
- Returns
Cepstral mean as a comma-separated list of numbers.
- Return type
str
- config
Property containing configuration object.
You may use this to set or unset configuration options, but be aware that you must call initialize to apply them, e.g.:
    del decoder.config["dict"]
    decoder.config["loglevel"] = "INFO"
    decoder.config["bestpath"] = True
    decoder.initialize()
Note that this will also remove any dictionary words or grammars which you have loaded. See reinit_feat for an alternative if you do not wish to do this.
- Returns
decoder configuration.
- Return type
Config
- static create(*args, **kwargs)
Create and configure, but do not initialize, the decoder.
This method exists if you wish to override or unset some of the parameters which are filled in automatically when the model is loaded, in particular dict, if you wish to create a dictionary programmatically rather than loading from a file. For example:
    d = Decoder.create()
    del d.config["dict"]
    d.initialize()
    d.add_word("word", "W ER D", True)  # Just one word! (or more)
See Configuration parameters for a description of keyword arguments.
- Parameters
hmm (str) – Path to directory containing acoustic model files.
dict (str) – Path to pronunciation dictionary.
jsgf (str) – Path to JSGF grammar file.
fsg (str) – Path to FSG grammar file (only one of jsgf or fsg should be specified).
toprule (str) – Name of top-level rule in JSGF file to use as entry point.
samprate (float) – Sampling rate for raw audio data.
logfn (str) – File to write log messages to (set to os.devnull to silence these messages).
- Raises
ValueError – on invalid configuration options.
- create_fsg(self, name, start_state, final_state, transitions)
Create a finite-state grammar.
This method allows the creation of a grammar directly from a list of transitions. States and words will be created implicitly from the state numbers and word strings present in this list. Make sure that the pronunciation dictionary contains the words, or you will not be able to recognize. Basic usage:
    fsg = decoder.create_fsg("mygrammar",
                             start_state=0,
                             final_state=3,
                             transitions=[(0, 1, 0.75, "hello"),
                                          (0, 1, 0.25, "goodbye"),
                                          (1, 2, 0.75, "beautiful"),
                                          (1, 2, 0.25, "cruel"),
                                          (2, 3, 1.0, "world")])
- Parameters
name (str) – Name to give this FSG (not very important).
start_state (int) – Index of starting state.
final_state (int) – Index of end state.
transitions (list) – List of transitions, each of which is a 3- or 4-tuple of (from, to, probability[, word]). If the word is not specified, this is an epsilon (null) transition that will always be followed.
- Returns
Newly created finite-state grammar.
- Return type
- Raises
ValueError – On invalid input.
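For grammars that are a simple left-to-right sequence of alternatives, the transition list can be generated rather than typed by hand. The sketch below builds 4-tuples in the (from, to, probability, word) layout described above. The helper name linear_transitions and the slots structure are illustrative and not part of SoundSwallower.

```python
def linear_transitions(slots):
    """Build (from, to, probability, word) tuples for a left-to-right grammar.

    slots is a list of dicts, one per position, mapping word -> probability.
    """
    transitions = []
    for state, choices in enumerate(slots):
        for word, prob in choices.items():
            # Every choice at position N leads from state N to state N + 1.
            transitions.append((state, state + 1, prob, word))
    return transitions


slots = [
    {"hello": 0.75, "goodbye": 0.25},
    {"beautiful": 0.75, "cruel": 0.25},
    {"world": 1.0},
]
transitions = linear_transitions(slots)
final_state = len(slots)  # one state past the last slot
for transition in transitions:
    print(transition)
```

The resulting list and final_state reproduce the arguments of the basic usage example, so they could be passed as transitions=transitions and final_state=final_state with start_state=0.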
- decode_file(self, input_file)
Decode audio from a file in the filesystem.
Currently supports single-channel WAV and raw audio files. If the sampling rate for a WAV file differs from the one set in the decoder’s configuration, the configuration will be updated to match it.
Note that the entire file is always decoded at once. The file would have to be really huge for this to cause memory problems, in which case the decoder would fail anyway. Decoding the file in pieces would make CMN (cepstral mean normalization) work less well, which causes unnecessary recognition errors.
- dumps(self, start_time=0., align_level=0)
Get decoding result as JSON.
- end_utt(self)
Finish processing raw audio input.
This method must be called at the end of each separate “utterance” of raw audio input. It takes care of flushing any internal buffers and finalizing recognition results.
- initialize(self, Config config=None)
(Re-)initialize the decoder.
- Parameters
config (Config) – Optional new configuration to apply, otherwise the existing configuration in the config attribute will be reloaded.
- Raises
RuntimeError – On invalid configuration or other failure to reinitialize decoder.
- lookup_word(self, unicode word)
Look up a word in the dictionary and return phone transcription for it.
- parse_jsgf(self, jsgf_string, toprule=None)
Parse a JSGF grammar from bytes or string.
Because SoundSwallower uses UTF-8 internally, it is more efficient to parse from bytes, as a string will get encoded and subsequently decoded.
- Parameters
- Returns
Newly loaded finite-state grammar.
- Return type
FsgModel
- Raises
ValueError – On failure to parse or find toprule.
RuntimeError – If JSGF has no public rules.
- process_raw(self, data, no_search=False, full_utt=False)
Process a block of raw audio.
- read_fsg(self, filename)
Read a grammar from an FSG file.
- read_jsgf(self, filename)
Read a grammar from a JSGF file.
The top rule used is the one specified by the “toprule” configuration parameter.
- reinit_feat(self, Config config=None)
Reinitialize only the feature computation.
- Parameters
config (Config) – Optional new configuration to apply, otherwise the existing configuration in the config attribute will be reloaded.
- Raises
RuntimeError – On invalid configuration or other failure to initialize feature computation.
- seg
Current word segmentation.
- Returns
Generator over word segmentations.
- Return type
Iterable[Seg]
- set_align_text(self, text)
Set a word sequence for alignment.
You must do any text normalization yourself. For word-level alignment, once you call this, simply decode and get the segmentation in the usual manner. For phone-level alignment, you can use the alignment property.
- Parameters
text (str) – Sentence to align, as whitespace-separated words. All words must be present in the dictionary.
- Raises
RuntimeError – If text is invalid somehow.
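Since normalization is left to the caller, a small helper is often the first thing needed before alignment. The sketch below lowercases the text and keeps only letters, digits and apostrophes. That rule is an assumption about what the dictionary contains, not something the library prescribes. The align_file helper relies on decode_file returning a (hypothesis, segmentation) pair as in the basic usage example.

```python
import re


def normalize_text(text):
    """Lowercase text and keep only letters, digits and apostrophes as words."""
    # Assumption: dictionary entries are lowercase and unpunctuated.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(words)


def align_file(decoder, text, audio_path):
    """Align already-known text to an audio file and return the segmentation."""
    decoder.set_align_text(normalize_text(text))
    _, seg = decoder.decode_file(audio_path)
    return seg


print(normalize_text("Hello, beautiful WORLD!"))
```

Every word that survives normalization must still be present in the pronunciation dictionary, so real applications will usually also need to expand numbers and abbreviations.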
- set_fsg(self, FsgModel fsg)
Set the grammar for recognition.
- Parameters
fsg (FsgModel) – Previously loaded or constructed grammar.
- set_jsgf_file(self, filename)
Set the grammar for recognition from a JSGF file.
- Parameters
filename (str) – Path to a JSGF file to load.
- set_jsgf_string(self, jsgf_string)
Set the grammar for recognition from JSGF bytes or string.
- Parameters
jsgf_string (bytes) – JSGF grammar as string or UTF-8 encoded bytes.
- start_utt(self)
Start processing raw audio input.
This method must be called at the beginning of each separate “utterance” of raw audio input.
- Raises
RuntimeError – If processing fails to start (usually if it has already been started).
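The start_utt, process_raw and end_utt methods are meant to be used together when audio arrives in blocks. The following sketch shows one way to wrap them so that the utterance is always closed, even if processing fails partway through. It accepts any object exposing the methods and the seg property documented above. The helper name decode_chunks is illustrative.

```python
def decode_chunks(decoder, chunks):
    """Decode one utterance supplied as an iterable of raw audio blocks."""
    decoder.start_utt()
    try:
        for chunk in chunks:
            decoder.process_raw(chunk)
    finally:
        # Always close the utterance so the decoder can be reused afterwards.
        decoder.end_utt()
    # Seg objects are documented as having text, start and duration fields.
    return [(seg.text, seg.start, seg.duration) for seg in decoder.seg]
```

The chunks could come from a microphone callback, a socket, or a file read in fixed-size pieces. The result is a list of (word, start, duration) tuples for the finished utterance.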
- class soundswallower.Endpointer(window=0.3, ratio=0.9, vad_mode=Vad.LOOSE, sample_rate=Vad.DEFAULT_SAMPLE_RATE, frame_length=Vad.DEFAULT_FRAME_LENGTH)
Simple endpointer using voice activity detection.
- Parameters
window (float) – Length in seconds of window for decision.
ratio (float) – Fraction of window that must be speech or non-speech to make a transition.
vad_mode (int) – Aggressiveness of voice activity detection (0-3).
sample_rate (int) – Sampling rate of input, default is 16000. Rates other than 8000, 16000, 32000, 48000 are only approximately supported, see note in frame_length. Outlandish sampling rates like 3924 and 115200 will raise a ValueError.
frame_length (float) – Desired input frame length in seconds, default is 0.03. The actual frame length may be different if an approximately supported sampling rate is requested. You must always use the frame_bytes and frame_length attributes to determine the input size.
- Raises
ValueError – Invalid input parameter. Also raised if the ratio makes it impossible to do endpointing (i.e. it is more than N-1 or less than 1 frame).
- DEFAULT_RATIO = 0.9
- DEFAULT_WINDOW = 0.3
- end_stream(self, frame)
Read a final frame of data and return speech if any.
This function should only be called at the end of the input stream (and then, only if you are currently in a speech region). It will return any remaining speech data detected by the endpointer.
- Parameters
frame (bytes) – Buffer containing speech data (16-bit signed integers). Must be of length frame_bytes (in bytes) or less.
- Returns
Remaining speech data (could be more than one frame), or None if none detected.
- Return type
bytes
- Raises
IndexError – frame is of invalid size.
ValueError – Other internal VAD error.
- frame_bytes
Number of bytes (not samples) required in an input frame.
You must pass input of this size, as bytes, to the Endpointer.
- Type
int
- frame_length
Length of a frame in seconds (may be different from the one requested in the constructor!)
- Type
float
- in_speech
Is the endpointer currently in a speech segment?
To detect transitions from non-speech to speech, check this before process. If it was False but process returns data, then speech has started:
    prev_in_speech = ep.in_speech
    speech = ep.process(frame)
    if speech is not None:
        if not prev_in_speech:
            print("Speech started at", ep.speech_start)
Likewise, to detect transitions from speech to non-speech, check this after process. If process returned data but this returns False, then speech has stopped:
    speech = ep.process(frame)
    if speech is not None:
        if not ep.in_speech:
            print("Speech ended at", ep.speech_end)
- Type
bool
- process(self, frame)
Read a frame of data and return speech if detected.
- Parameters
frame (bytes) – Buffer containing speech data (16-bit signed integers). Must be of length frame_bytes (in bytes).
- Returns
Frame of speech data, or None if none detected.
- Return type
bytes
- Raises
IndexError – frame is of invalid size.
ValueError – Other internal VAD error.
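Putting process, end_stream and in_speech together, the sketch below splits a buffer of raw audio into frames of frame_bytes and collects each detected speech region into one bytes object. It works with any object exposing those three members plus frame_bytes. The helper names are illustrative, and routing a short final frame through end_stream is one reasonable reading of the descriptions above, not a documented recipe.

```python
def iter_frames(data, frame_bytes):
    """Yield consecutive chunks of frame_bytes bytes; the last may be shorter."""
    for pos in range(0, len(data), frame_bytes):
        yield data[pos:pos + frame_bytes]


def collect_segments(ep, data):
    """Feed raw 16-bit audio to an endpointer and return the speech segments."""
    segments = []
    current = []
    for frame in iter_frames(data, ep.frame_bytes):
        if len(frame) == ep.frame_bytes:
            speech = ep.process(frame)
        elif ep.in_speech:
            # Short final frame while in speech: flush whatever remains.
            speech = ep.end_stream(frame)
        else:
            speech = None
        if speech is not None:
            current.append(speech)
        if current and not ep.in_speech:
            # Speech has stopped, so close the segment collected so far.
            segments.append(b"".join(current))
            current = []
    if current:
        # Input ended on a frame boundary while still in speech.
        segments.append(b"".join(current))
    return segments
```

Each returned segment could then be passed to a decoder as a separate utterance.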
- class soundswallower.FsgModel
Finite-state recognition grammar.
Note that you cannot create one of these directly, as it depends on some internal configuration from the Decoder. Use the factory methods such as create_fsg or parse_jsgf instead.
- class soundswallower.Hyp(text, score, prob)
Recognition hypothesis.
- prob
Posterior probability of hypothesis (often 1.0, sorry).
- score
Best path score.
- text
Recognized text.
- class soundswallower.Seg(text, start, duration, ascore, lscore)
Segment in a word segmentation.
- ascore
Acoustic match score.
- duration
Duration in seconds.
- lscore
Language (grammar) match score.
- start
Start time in the audio stream in seconds.
- text
Word text.
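Because a segment stores a start time and a duration rather than an end time, the end of each word has to be computed as start + duration. The sketch below does this with a namedtuple stand-in that has the same five fields as Seg, so it runs without the library. The score values are invented for the example.

```python
from collections import namedtuple

# Stand-in with the same fields as the Seg class documented above.
Seg = namedtuple("Seg", ["text", "start", "duration", "ascore", "lscore"])


def word_spans(segments):
    """Return (text, start, end) triples, with end = start + duration."""
    return [(seg.text, seg.start, seg.start + seg.duration) for seg in segments]


# Invented segments, for illustration only.
example = [Seg("hello", 0.5, 0.25, -1200, 0), Seg("world", 0.75, 0.5, -1500, 0)]
for text, start, end in word_spans(example):
    print("Word %s from %.3f to %.3f" % (text, start, end))
```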
- class soundswallower.Vad(mode=VAD_LOOSE, sample_rate=VAD_DEFAULT_SAMPLE_RATE, frame_length=VAD_DEFAULT_FRAME_LENGTH)
Voice activity detection class.
- Parameters
mode (int) – Aggressiveness of voice activity detection (0-3).
sample_rate (int) – Sampling rate of input, default is 16000. Rates other than 8000, 16000, 32000, 48000 are only approximately supported, see note in frame_length. Outlandish sampling rates like 3924 and 115200 will raise a ValueError.
frame_length (float) – Desired input frame length in seconds, default is 0.03. The actual frame length may be different if an approximately supported sampling rate is requested. You must always use the frame_bytes and frame_length attributes to determine the input size.
- Raises
ValueError – Invalid input parameter (see above).
- DEFAULT_FRAME_LENGTH = 0.03
- DEFAULT_SAMPLE_RATE = 16000
- LOOSE = 0
- MEDIUM_LOOSE = 1
- MEDIUM_STRICT = 2
- STRICT = 3
- frame_bytes
Number of bytes (not samples) required in an input frame.
You must pass input of this size, as bytes, to the Vad.
- Type
int
- frame_length
Length of a frame in seconds (may be different from the one requested in the constructor!)
- Type
float
- is_speech(self, frame, sample_rate=None)
Classify a frame as speech or not.
- Parameters
frame (bytes) – Buffer containing speech data (16-bit signed integers). Must be of length frame_bytes (in bytes).
- Returns
Classification as speech or not speech.
- Return type
bool
- Raises
IndexError – frame is of invalid size.
ValueError – Other internal VAD error.
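To make the frame size concrete: with the default 16000 Hz sampling rate, 0.03 second frames and 16-bit (2-byte) samples, one frame is 16000 × 0.03 × 2 = 960 bytes. The sketch below encodes that arithmetic and uses is_speech to estimate how much of a buffer is speech. The arithmetic is only an assumption about exactly supported rates. As noted above, the frame_bytes attribute remains the authoritative value. Both helper names are illustrative.

```python
def expected_frame_bytes(sample_rate, frame_length, sample_width=2):
    """Bytes in one frame of mono audio with sample_width bytes per sample."""
    return round(sample_rate * frame_length) * sample_width


def speech_fraction(vad, data):
    """Fraction of complete frames in data that the detector labels as speech."""
    size = vad.frame_bytes
    # Only complete frames are classified; a trailing partial frame is ignored.
    frames = [data[pos:pos + size] for pos in range(0, len(data) - size + 1, size)]
    if not frames:
        return 0.0
    return sum(1 for frame in frames if vad.is_speech(frame)) / len(frames)


print(expected_frame_bytes(16000, 0.03))
```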
- soundswallower.get_audio_data(input_file: str) → Tuple[bytes, Optional[int]]
Try to get single-channel audio data in the most portable way possible.
Currently supports only single-channel WAV and raw audio.
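To illustrate the shape of the return value, here is a sketch of a function with the same contract written against the standard wave module. It is not SoundSwallower's implementation. The name read_mono_audio is hypothetical, as is the choice to treat anything that is not a WAV file as headerless raw audio with an unknown sampling rate.

```python
import wave


def read_mono_audio(path):
    """Return (pcm_bytes, sample_rate); sample_rate is None for raw files."""
    try:
        with wave.open(path, "rb") as wav:
            if wav.getnchannels() != 1:
                raise ValueError("only single-channel WAV is handled here")
            return wav.readframes(wav.getnframes()), wav.getframerate()
    except (wave.Error, EOFError):
        # Not a WAV file: treat the contents as headerless raw audio.
        with open(path, "rb") as raw:
            return raw.read(), None
```

A caller can use the second element to decide whether the configured sampling rate needs to change, much as decode_file is described as doing for WAV input.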
- soundswallower.get_model_path(subpath: Optional[str] = None) → str
Return path to the model directory, or optionally, a specific file or directory within it.
- Parameters
subpath – An optional path to add to the model directory.
- Returns
The requested path within the model directory.