Protocol Documentation¶
techmo/asr/api/v1p1/asr.proto¶
AgeRecognitionAlternative¶
An alternative hypothesis of age recognition.
Field | Type | Label | Description |
---|---|---|---|
age | | | The assumed age of the person speaking in the audio, in years. For a reliable value, ensure that only one person is speaking in the audio. |
confidence | | optional | The confidence estimate, ranging from 0.0 to 1.0. Support for this feature is optional. |
AgeRecognitionConfig¶
Configuration of age recognition.
Field | Type | Label | Description |
---|---|---|---|
enable_age_recognition | | | The switch that enables age recognition for the request. If disabled or unspecified, the related results are excluded. The service responds with the |
AgeRecognitionResult¶
A result of age recognition.
Field | Type | Label | Description |
---|---|---|---|
error | | | The recognition process status. It may communicate warnings. In case of an error hindering recognition, all other message fields should be left unset. |
recognition_alternatives | | repeated | The confidence-ordered list of alternative recognition hypotheses. |
Audio¶
Audio contents.
Field | Type | Label | Description |
---|---|---|---|
bytes | | | The audio data bytes. |
AudioConfig¶
Audio configuration of a StreamingRecognize request.
Field | Type | Label | Description |
---|---|---|---|
encoding | | | The encoding of the audio data sent in the request. Single channel (mono) audio is assumed. The service should respond with the |
sampling_rate_hz | | | The sampling rate of the audio data sent in the request. The service should silently ignore the field for encodings that are sent along with headers, and detect the value from them instead. The service should respond with the |
GenderRecognitionAlternative¶
An alternative hypothesis of gender recognition.
Field | Type | Label | Description |
---|---|---|---|
gender | | | The assumed gender of the person speaking in the audio. For a reliable value, ensure that only one person is speaking in the audio. |
confidence | | optional | The confidence estimate, ranging from 0.0 to 1.0. Support for this feature is optional. |
GenderRecognitionConfig¶
Configuration of gender recognition.
Field | Type | Label | Description |
---|---|---|---|
enable_gender_recognition | | | The switch that enables gender recognition for the request. If disabled or unspecified, the related results are excluded. The service responds with the |
GenderRecognitionResult¶
A result of gender recognition.
Field | Type | Label | Description |
---|---|---|---|
error | | | The recognition process status. It may communicate warnings. In case of an error hindering recognition, all other message fields should be left unset. |
recognition_alternatives | | repeated | The confidence-ordered list of alternative recognition hypotheses. |
LanguageRecognitionAlternative¶
An alternative hypothesis of language recognition.
Field | Type | Label | Description |
---|---|---|---|
language | | | The language spoken in the audio, a BCP-47 tag. |
confidence | | optional | The confidence estimate, ranging from 0.0 to 1.0. Support for this feature is optional. |
LanguageRecognitionConfig¶
Configuration of language recognition.
Field | Type | Label | Description |
---|---|---|---|
enable_language_recognition | | | The switch that enables language recognition for the request. If disabled or unspecified, the related results are excluded. The service responds with the |
LanguageRecognitionResult¶
A result of language recognition.
Field | Type | Label | Description |
---|---|---|---|
error | | | The recognition process status. It may communicate warnings. In case of an error hindering recognition, all other message fields should be left unset. |
recognition_alternatives | | repeated | The confidence-ordered list of alternative recognition hypotheses. |
ResultConfig¶
Result configuration of a StreamingRecognize request.
Field | Type | Label | Description |
---|---|---|---|
enable_single_utterance | | | The switch that toggles continuous recognition into single utterance mode. The service returns a final result for each end of utterance it detects in the audio, which may occur multiple times during a request. If enabled, the request terminates right after its first final result. |
enable_interim_results | | | The switch that allows interim results. If enabled, results containing tentative hypotheses may be returned in addition to final ones. The service should silently ignore this field if it is unsupported. |
enable_held_responses_merging | | | The switch that allows the service to merge responses held in the “hold response” state. If enabled and more than one response is held, the service does not return them in a batch. Instead, it tries to merge their results into a single response. The service should respond with the |
SpeechRecognitionAlternative¶
An alternative hypothesis of speech recognition.
Field | Type | Label | Description |
---|---|---|---|
transcript | | | The transcript of the audio. |
confidence | | optional | The confidence estimate, ranging from 0.0 to 1.0. Support for this feature is optional. |
words | | repeated | The details of the transcript’s words. Empty unless |
SpeechRecognitionConfig¶
Configuration for speech recognition.
Field | Type | Label | Description |
---|---|---|---|
enable_speech_recognition | | | The switch that enables speech recognition for the request. If disabled or unspecified, the related results are excluded. The service responds with the |
recognition_alternatives_limit | | | The maximum number of alternative transcriptions allowed to be included per response. The actual count received can be less than the specified value and may also be equal to 0. If unspecified or 0, the limit defaults to one alternative. |
enable_time_alignment | | | The switch that enables additional time alignment of recognitions in word details. If enabled, the |
language_group_name | | | The name of a language group of models to be used. If left unspecified, it falls back to the service’s default group. The service responds with the |
model_name | | | The name of a model to be used. If left unspecified, it falls back to the selected language group’s default. The service responds with the |
config_fields | | repeated | Deprecated. The additional advanced service-dependent configuration for its speech recognizer. It may be silently ignored. |
SpeechRecognitionConfig.ConfigFieldsEntry¶
Field | Type | Label | Description |
---|---|---|---|
key | | | |
value | | | |
SpeechRecognitionResult¶
A result of speech recognition.
Field | Type | Label | Description |
---|---|---|---|
error | | | The recognition process status. It may communicate warnings. In case of an error hindering recognition, all other message fields should be left unset. |
recognition_alternatives | | repeated | The confidence-ordered list of alternative recognition hypotheses. |
language_group_name | | | The actual name of the language group of the model, unrelated to the actual language spoken in the audio. |
model_name | | | The actual name of the model used to obtain the result. |
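
Since `recognition_alternatives` is ordered by confidence, a client that needs a single transcript can take the first entry. The sketch below is only an illustration: `Alternative` and `SpeechResult` are hypothetical dataclasses standing in for the classes generated from `asr.proto`, and it assumes the list is ordered from most to least confident.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Alternative:
    """Hypothetical stand-in for SpeechRecognitionAlternative."""
    transcript: str
    confidence: Optional[float] = None  # optional in the API


@dataclass
class SpeechResult:
    """Hypothetical stand-in for SpeechRecognitionResult."""
    recognition_alternatives: List[Alternative] = field(default_factory=list)


def best_transcript(result: SpeechResult) -> Optional[str]:
    """Return the first transcript (assumed most confident), or None if the list is empty."""
    if not result.recognition_alternatives:
        return None
    return result.recognition_alternatives[0].transcript
```

The empty-list branch matters because the number of alternatives received may be 0.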
SpeechRecognitionWord¶
Details of a single word in speech recognition.
Field | Type | Label | Description |
---|---|---|---|
transcript | | | The transcript of the word itself. |
confidence | | optional | The confidence estimate, ranging from 0.0 to 1.0. Support for this feature is optional. |
start_time | | | The start time of the word relative to the beginning of the entire audio. |
end_time | | | The end time of the word relative to the beginning of the entire audio. |
StreamingConfig¶
Streaming configuration of a StreamingRecognize request.
Field | Type | Label | Description |
---|---|---|---|
enable_manual_input_timer | | | The switch that enables manual control of the input timer. The timer imposes two constraints: one that finalizes recognition after a specified period unless speech is detected, and the other that limits the total time for an utterance. Manual control allows recognition to begin but delays enforcement of these constraints. The timer restarts after each detected end of utterance (each final result). If enabled, the timer does not start automatically. Instead, it can be initiated by sending a |
enable_auto_hold_response | | | The switch that automatically puts the service into the “hold response” state at the beginning of the request and after each final result. The “hold response” state means that the internal recognition process continues, but results are kept, not returned. When needed, the state can be toggled into the “give response” state by sending the |
StreamingRecognizeRequest¶
A message streamed from the client through the StreamingRecognize method.
Field | Type | Label | Description |
---|---|---|---|
config | | | The immutable initial configuration of the request. Must be sent once in the request’s first message. |
control_message | | | The message controlling the processing flow of the request. May be sent multiple times except in the request’s first message. |
data | | | The data contents of the request itself. May be sent multiple times except in the request’s first message. |
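
The ordering rule above (configuration exactly once in the first message, data only afterwards) can be sketched as a request generator. This is a minimal illustration, not the generated API: `Request` is a hypothetical dataclass, the configuration is a plain dict, and `data` is simplified to raw bytes instead of the nested data and audio messages. Setting only one field per message is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Request:
    """Hypothetical stand-in for StreamingRecognizeRequest."""
    config: Optional[dict] = None
    control_message: Optional[dict] = None
    data: Optional[bytes] = None


def build_requests(config: dict, chunks: Iterable[bytes]) -> Iterator[Request]:
    """Yield the config exactly once, first, then one data message per audio chunk."""
    yield Request(config=config)
    for chunk in chunks:
        yield Request(data=chunk)
```

A generator of this shape is what a gRPC client-streaming call typically consumes, so the first-message constraint is enforced by construction.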
StreamingRecognizeRequestConfig¶
A message holding configuration of a StreamingRecognize request.
Field | Type | Label | Description |
---|---|---|---|
audio_config | | | Part of the configuration for the request’s audio content. |
result_config | | | Part of the configuration for the request’s result form. |
streaming_config | | | Part of the configuration for the request’s processing flow. |
speech_recognition_config | | | Part of the configuration for speech recognition. |
age_recognition_config | | | Part of the configuration for age recognition. |
gender_recognition_config | | | Part of the configuration for gender recognition. |
language_recognition_config | | | Part of the configuration for language recognition. |
StreamingRecognizeRequestControlMessage¶
A message controlling the processing flow of a StreamingRecognize request.
Field | Type | Label | Description |
---|---|---|---|
start_input_timer | | optional | The flag that starts the input timer on demand and resets after each final result. It is silently ignored if the manual input timer setting is disabled for the request. |
give_response | | optional | The flag to allow the service to return a response. After receiving this message, the service remains in the “give response” state. Ignored when the service is already in the “give response” state. Mutually exclusive with the |
hold_response | | optional | The flag to forbid the service from returning a response. After receiving this message, the service remains in the “hold response” state. Ignored when the service is already in the “hold response” state. Mutually exclusive with the |
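
The two response states can be pictured as a small gate in front of the result stream. The following is a toy model under stated assumptions, not the service's implementation: `ResponseGate` is a hypothetical name, held results are released as a batch when the gate opens, and raising an error when both flags are set is a choice made for this sketch (the documentation only says the flags are mutually exclusive).

```python
from typing import List


class ResponseGate:
    """Toy model of the "hold response" / "give response" states."""

    def __init__(self, holding: bool = False) -> None:
        self.holding = holding
        self._held: List[str] = []

    def on_result(self, result: str) -> List[str]:
        """Return the result at once, or keep it while in the hold state."""
        if self.holding:
            self._held.append(result)
            return []
        return [result]

    def on_control(self, give_response: bool = False,
                   hold_response: bool = False) -> List[str]:
        """Apply a control message and return any results it releases."""
        if give_response and hold_response:
            raise ValueError("give_response and hold_response are mutually exclusive")
        if hold_response:
            self.holding = True  # no effect if already holding
            return []
        if give_response and self.holding:
            self.holding = False
            released, self._held = self._held, []
            return released
        return []  # give_response while already giving is ignored
```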
StreamingRecognizeRequestData¶
A message that carries data contents of a StreamingRecognizeRequest request.
Field | Type | Label | Description |
---|---|---|---|
audio | | | Part of the audio to perform recognition on. |
StreamingRecognizeResponse¶
A message streamed from the service through the StreamingRecognize method.
Field | Type | Label | Description |
---|---|---|---|
result | | | The combined recognition results for another part of the audio. |
processed_audio_duration | | | The cumulative duration of the audio processed during the request, which does not necessarily match the actual length of the sent audio. It must be updated with each final result. |
StreamingRecognizeResult¶
Combined recognition result.
Field | Type | Label | Description |
---|---|---|---|
error | | | The recognition process status. It may communicate warnings. In case of an error hindering recognition, all other message fields should be left unset. |
is_final | | | The flag indicating whether the result is interim or final. |
result_finalization_cause | | | The field indicating the cause of result finalization. For interim results, the service should leave the field as |
speech_recognition_result | | | The speech recognition result for another part of the processed audio. It is new with each final result and updated with each interim one. To obtain a complete result for all processed audio, a client should pick one of the recognition alternatives from each final result it receives and buffer it on its own. It must be omitted if speech recognition is disabled. |
age_recognition_result | | | The current age recognition result for all processed audio, updated with each final result. It may be omitted in an interim result and must be omitted if age recognition is disabled. |
gender_recognition_result | | | The current gender recognition result for all processed audio, updated with each final result. It may be omitted in an interim result and must be omitted if gender recognition is disabled. |
language_recognition_result | | | The current language recognition result for all processed audio, updated with each final result. It may be omitted in an interim result and must be omitted if language recognition is disabled. |
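
Because each final speech recognition result covers only a new part of the audio, the client has to assemble the full transcript itself. A minimal sketch of such buffering follows; `TranscriptBuffer` is a hypothetical helper, alternatives are passed as plain strings, the first alternative is the one picked, and joining parts with a space is an assumption of the sketch.

```python
from typing import List


class TranscriptBuffer:
    """Keeps one chosen transcript per final result; interim text is only a preview."""

    def __init__(self) -> None:
        self.final_parts: List[str] = []
        self.interim: str = ""

    def update(self, is_final: bool, alternatives: List[str]) -> None:
        """Record a result; alternatives are transcripts in confidence order."""
        top = alternatives[0] if alternatives else ""
        if is_final:
            if top:
                self.final_parts.append(top)
            self.interim = ""  # the interim hypothesis is superseded
        else:
            self.interim = top  # each interim result replaces the previous one

    def text(self) -> str:
        """Finalized text so far, followed by the current interim hypothesis if any."""
        parts = self.final_parts + ([self.interim] if self.interim else [])
        return " ".join(parts)
```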
AudioConfig.AudioEncoding¶
The possible audio encodings.
Name | Number | Description |
---|---|---|
UNSPECIFIED | 0 | Unspecified audio encoding. |
LINEAR16 | 1 | Linear pulse-code modulation of uncompressed 16-bit signed little-endian samples. |
FLAC | 2 | Free Lossless Audio Codec (FLAC). The encoding requires only about half the bandwidth of |
OGG_OPUS | 6 | Ogg Encapsulated Opus Audio Codec (OggOpus). When set, the service ignores the |
MP3 | 8 | MP3 (ISO/IEC 11172-3 and ISO/IEC 13818-3). Only constant bitrate. When set, the service ignores the |
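
For `LINEAR16`, each mono sample occupies two bytes, so the size of an audio chunk follows directly from the sampling rate and the chunk duration. The helpers below are a sketch for a client that splits raw PCM into data messages; the function names and the idea of fixed-duration chunks are illustrative assumptions, not part of the API.

```python
from typing import Iterator


def linear16_chunk_size(sampling_rate_hz: int, chunk_ms: int) -> int:
    """Bytes in chunk_ms milliseconds of mono 16-bit audio (whole samples only)."""
    bytes_per_sample = 2
    samples = sampling_rate_hz * chunk_ms // 1000
    return samples * bytes_per_sample


def split_audio(pcm: bytes, chunk_size: int) -> Iterator[bytes]:
    """Yield consecutive chunks of at most chunk_size bytes."""
    for start in range(0, len(pcm), chunk_size):
        yield pcm[start:start + chunk_size]
```

For example, 100 ms of 16 kHz audio is 1600 samples, or 3200 bytes per chunk.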
StreamingRecognizeResult.ResultFinalizationCause¶
The anticipated causes for the service to finalize a result.
Name | Number | Description |
---|---|---|
UNSPECIFIED | 0 | The cause is not specified. |
SUCCESS | 1 | The speech recognition result is not empty and the end of utterance is detected. |
NO_INPUT_TIMEOUT | 2 | The speech recognition result is empty after the duration to expect a result is reached. |
SUCCESS_MAXTIME | 3 | The speech recognition result is not empty after the utterance duration limit is reached. The returned speech recognition is incomplete and should be completed in the following result. |
PARTIAL_MATCH | 4 | Unused. |
NO_MATCH_MAXTIME | 5 | The speech recognition result is empty after the utterance duration limit is reached. |
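
A client can branch on the finalization cause to decide whether a final result carries speech and whether the utterance continues in the next result. The enum below mirrors the names and numbers from the table; the two helper functions are hypothetical and only restate the descriptions above.

```python
from enum import IntEnum


class ResultFinalizationCause(IntEnum):
    """Mirror of the enum values listed in the table above."""
    UNSPECIFIED = 0
    SUCCESS = 1
    NO_INPUT_TIMEOUT = 2
    SUCCESS_MAXTIME = 3
    PARTIAL_MATCH = 4  # unused
    NO_MATCH_MAXTIME = 5


def has_speech(cause: ResultFinalizationCause) -> bool:
    """True for causes whose speech recognition result is not empty."""
    return cause in (ResultFinalizationCause.SUCCESS,
                     ResultFinalizationCause.SUCCESS_MAXTIME)


def continues_in_next_result(cause: ResultFinalizationCause) -> bool:
    """True when the utterance was cut at the duration limit and is incomplete."""
    return cause == ResultFinalizationCause.SUCCESS_MAXTIME
```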
Asr¶
An automatic speech recognition (ASR) service providing a solution for speech-to-text conversion extended by the assessment of additional speech and speaker features.
Method Name | Request Type | Response Type | Description |
---|---|---|---|
StreamingRecognize | StreamingRecognizeRequest stream | StreamingRecognizeResponse stream | Perform bidirectional streaming recognition. |
Scalar Value Types¶
.proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby |
---|---|---|---|---|---|---|---|---|
double | | double | double | float | float64 | double | float | Float |
float | | float | float | float | float32 | float | float | Float |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum |
uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required) |
uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required) |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required) |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum |
sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum |
bool | | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8) |
bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT) |
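
The notes on variable-length encoding can be checked with a few lines of arithmetic. The sketch below computes encoded sizes only (it does not produce wire bytes); the function names are illustrative, and it relies on the standard Protocol Buffers rules that a varint carries 7 payload bits per byte, that a negative int32 is sign-extended to 64 bits, and that sint32 applies ZigZag encoding first.

```python
def varint_size(value: int) -> int:
    """Bytes needed to varint-encode a non-negative integer (7 bits per byte)."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def int32_size(value: int) -> int:
    """int32: negative values are sign-extended to 64 bits before encoding."""
    return varint_size(value & 0xFFFFFFFFFFFFFFFF)


def sint32_size(value: int) -> int:
    """sint32: ZigZag maps small magnitudes of either sign to small unsigned values."""
    zigzag = ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF
    return varint_size(zigzag)
```

With these helpers, `-1` costs ten bytes as an int32 but one byte as a sint32, and any value of 2^28 or more costs five bytes as a varint, which is why fixed32 (always four bytes) becomes cheaper in that range.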