Startup options and configuration

Available service startup options:

| Option | Description |
| --- | --- |
| --help | Prints command line options. |
| --version | Prints TTS service version. |
| --generate-license-info | Prints information needed to generate the customer’s licence. The --address option must be set along with this parameter. |
| --licence-path arg | A path to the file with TTS licence. *) |
| --remote-licensing-config-path arg | A path to the JSON file with remote licensing service configuration. *) |
| --tls-directory-path arg | A path to the directory with SSL/TLS files (server.key, server.crt, and ca.crt) for gRPC authentication; if empty (default), an insecure connection is used. *) |
| --tls-mutual-authentication | Enables verification of the client’s certificate on the service side (disabled by default). |
| --address arg | An IP endpoint on which the TTS service listens for requests, e.g. 192.168.3.110:12345 or 0.0.0.0:12345. |
| --status-address arg | An IP endpoint on which the TTS service listens for service-status queries on /status, e.g. 192.168.3.110:12346 or 0.0.0.0:12346. |
| --metrics-address arg | An IP endpoint on which the TTS service listens for metrics queries on /metrics, e.g. 192.168.3.110:12347 or 0.0.0.0:12347. |
| --resources-config-path arg | A path to the file with configuration of resources in json format. *) |
| --log-format arg (=plain) | A log format, plain (default) or json. |
| --log-level-console arg (=info) | A logging level for console output: trace, debug, info (default), warning, error, or fatal. |
| --log-level-file arg (=debug) | A logging level for file output: trace, debug (default), info, warning, error, or fatal. |
| --log-dir arg (=log) | A path to the directory where the logs will be stored, defaults to log/. *) |
| --log-prefix arg (=tts-service) | A file log prefix, defaults to tts-service. |
| --log-rotation-max-file-size arg (=10MB) | Specifies the maximum log file size. When the size is reached, file rotation is performed. Defaults to 10MB. **) |
| --log-rotation-max-files-count arg (=0) | Specifies the maximum number of stored rotated log files. If the number of files exceeds this limit, the oldest files are deleted. Default is 0 (no limit). |
| --log-rotation-max-dir-size arg (=0) | Specifies the maximum size of the directory with logs. If the total size of the log files exceeds the given size, the oldest files are deleted. Default is 0 (no limit). **) |
| --old-logs-handling-mode arg (=archive) | The way of handling old logs from previous service instance runs: archive (default), remove, or keep. |
| --logging-mark-interval arg (=0) | Interval for ‘heartbeat’ logging, in seconds (optional). |
| --runtime-limits-config-path arg | A path to the file with runtime limits configuration in json format (optional). *) |

*) All paths, if relative, are resolved relative to the working directory of the service.
**) All sizes can be specified in B, kB, MB, or GB. If no unit is present, the value is in bytes.

License configuration

The service requires a valid licence to operate. The licence can be loaded from a local licence file in JSON format (--licence-path option) or obtained from a remote licensing server (--remote-licensing-config-path option). Exactly one of these options has to be specified.

Local license file

A local licence is a JSON file containing these fields:

| Field | Type | Description |
| --- | --- | --- |
| application-id | string | ID of the service, must be equal to TTS. |
| customer-name | string | Name of the customer for which the licence is issued. |
| channels | string | Number of available channels, either an unsigned integer or the unrestricted keyword. |
| licence-version | uint | Major version of the licence, which specifies the licence file format. |
| valid-from | string | The GMT time when the licence becomes valid, in format ‘YYYY-MM-DD HH:MM:SS’. |
| valid-to | string | The GMT time when the licence stops being valid, in format ‘YYYY-MM-DD HH:MM:SS’. |
| mode | string | In a local licence file, must be equal to channels. |
| instance-ip | string | The IP address of the network interface from which the service accepts requests, or the unrestricted keyword. |
| instance-port | string | The port number on which the service listens for requests, or the unrestricted keyword. |
| container-id | string | The string which uniquely identifies the Docker container where the service is run, or the unrestricted keyword. |
| machine-id | string | The string which uniquely identifies the physical machine on which the service is run, or the unrestricted keyword. |
| languages | variable | Allowed languages, either the unrestricted keyword or an array of ISO 639-1 language code strings. |
| allow-reconfiguration | boolean | Flag indicating if requests making permanent changes to the service state are allowed. Optional, defaults to false. |

To create a licence file, a machine-unique set of licensing data has to be generated first. To generate it, start the service with the --generate-licence-info option followed by --address IP:PORT (where IP:PORT is the address at which the TTS service is to be available).

The output should be similar to the following:

techmo-tts-dnn-cpu-service-3-1-0 | Techmo TTS DNN CPU Service, version 3.1.0.
techmo-tts-dnn-cpu-service-3-1-0 | Copyright (C) 2023 Techmo sp. z o.o.
techmo-tts-dnn-cpu-service-3-1-0 | 
techmo-tts-dnn-cpu-service-3-1-0 | ==========BEGIN OF LICENCE PARAMETERS==========
techmo-tts-dnn-cpu-service-3-1-0 | {
techmo-tts-dnn-cpu-service-3-1-0 |     "application-id": "TTS",
techmo-tts-dnn-cpu-service-3-1-0 |     "application-version": "3.1.0",
techmo-tts-dnn-cpu-service-3-1-0 |     "container-id": "techmo-tts-dnn-cpu-service-3-1-0",
techmo-tts-dnn-cpu-service-3-1-0 |     "instance-ip": "0.0.0.0",
techmo-tts-dnn-cpu-service-3-1-0 |     "instance-port": "12345",
techmo-tts-dnn-cpu-service-3-1-0 |     "licence-key-id": "Ex/3raXDDehgLTQbFXW4XerIXEi1qH5rpNeLSagZk=",
techmo-tts-dnn-cpu-service-3-1-0 |     "licence-version": "3",
techmo-tts-dnn-cpu-service-3-1-0 |     "machine-id": "veXohm23suR1bER+3yTH3dm3AA6lm10tpAotDerOvts="
techmo-tts-dnn-cpu-service-3-1-0 | }
techmo-tts-dnn-cpu-service-3-1-0 | ==========END OF LICENCE PARAMETERS==========

The above data should be sent to the Techmo team in order to generate an individual licence valid on the customer’s machine.
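
The licence parameters can be cut out of the captured output before sending them. The following Python sketch is only an illustration: it assumes the output format shown above (a JSON object between the BEGIN and END marker lines, optionally prefixed with a container name and " | "), and the function name extract_licence_parameters is hypothetical, not part of the service.

```python
import json

BEGIN_MARKER = "BEGIN OF LICENCE PARAMETERS"
END_MARKER = "END OF LICENCE PARAMETERS"


def extract_licence_parameters(output: str) -> dict:
    """Return the JSON object printed between the BEGIN and END marker lines."""
    collected = []
    inside = False
    for line in output.splitlines():
        # Drop an optional "container-name | " prefix like the one in the sample output.
        _prefix, separator, rest = line.partition(" | ")
        text = rest if separator else line
        if END_MARKER in text:
            break
        if inside:
            collected.append(text)
        elif BEGIN_MARKER in text:
            inside = True
    if not collected:
        raise ValueError("licence parameters block not found")
    return json.loads("\n".join(collected))
```

The returned dictionary can then be written to a file with json.dump and attached to the message.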

A sample licence file:

{
	"licence": {
		"application-id": "TTS",
		"channels": "100",
		"licence-version": "3",
		"valid-from": "2024-01-01 00:00:00",
		"valid-to": "2024-12-31 23:59:59",
		"mode": "channels",
		"instance-ip": "unrestricted",
		"instance-port": "unrestricted",
		"container-id": "techmo-tts-3.1.0",
		"machine-id": "veXohm23suR1bER+3yTH3dm3AA6lm10tpAotDerOvts=",
		"allow-reconfiguration": "true",
		"customer-name": "dev",
		"allow-to-phonemes": "false",
		"languages": ["pl", "en"]
	},
	"signature": "{a_licence_signature}"
}
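
Before deploying a licence file, it can be useful to check its validity period. The sketch below is a hypothetical helper, not the service's own verification: it reads only the valid-from and valid-to fields in the documented ‘YYYY-MM-DD HH:MM:SS’ GMT format, assumes both bounds are inclusive, and does not check the signature or any other field.

```python
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_gmt(value: str) -> datetime:
    """Parse a licence timestamp and mark it as GMT (UTC)."""
    return datetime.strptime(value, TIME_FORMAT).replace(tzinfo=timezone.utc)


def licence_is_current(licence: dict, now: datetime = None) -> bool:
    """Return True if `now` lies within the licence validity period.

    `licence` is the object stored under the "licence" key of the file.
    Treating both bounds as inclusive is an assumption of this sketch.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return parse_gmt(licence["valid-from"]) <= now <= parse_gmt(licence["valid-to"])
```

For the sample file above, licence_is_current(data["licence"]) would be called after loading the file with json.load.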

Remote licensing

Remote licensing is accomplished by connecting to the licensing server. The licensing server can be run:

  • on Techmo infrastructure as a SaaS application,

  • on the client’s infrastructure as an on-prem application.

It is also possible to prepare a licensing server in the SaaS model on the client’s infrastructure for cloud architectures.

If there is a need to configure and run the licensing server on the client’s machine, please contact us directly via email to receive relevant resources and instructions.

There are two accounting modes available:

  • PAYG - pay-as-you-go - the client is billed for the total number of characters in all the requests sent for synthesis (including SSML tags) by all TTS instances during the billing period.

  • HA - high availability - allows running more than one TTS instance and verifies that at any given time (1 minute resolution) the running instances together do not exceed the total of N available channels (N requests handled simultaneously).

SaaS model configuration

A typical deployment package using remote licensing in the SaaS model contains the licensing directory with the following resources:

licensing
├── auth.json
├── tls
│   └── gen-certs.sh
└── storage
    └── reports

  • auth.json - configuration file

  • tls - directory for storing SSL/TLS certificates

  • gen-certs.sh - bash script for generating SSL/TLS certificates

  • storage - directory for storing reports and temporary licence obtained from the licensing server

  • reports - directory where usage reports are stored

A few steps are required to prepare the service to work in remote licensing mode:

  1. Generate SSL/TLS certificates to secure communication with the licensing server:

     • set the certificate password in the gen-certs.sh file (first line of the script)

PASSWORD=insert_password_here

     • run the script

cd /licensing/tls
./gen-certs.sh

  2. Generate keys that sign communication with the licensing service:

openssl genrsa -out company_name_techmo_tts.pem
openssl rsa -in company_name_techmo_tts.pem -outform PEM -pubout -out company_name_techmo_tts.pub

     • copy the private key to the licensing directory and name it auth.key

     • send the public key company_name_techmo_tts.pub to Techmo via email

  3. In response we will send a message containing identifiers: client_id and key_id. Set the received values inside the auth.json file.

After Techmo confirms that the provided public key has been loaded, the service will be ready to start.

Resources configuration

Resources are configured using a file in JSON format, pointed to by the --resources-config-path option. The resources configuration is mandatory.

A resources configuration is a JSON object whose fields have the following meaning:

| Field | Type | Description |
| --- | --- | --- |
| version-major | uint | The major version denotes incompatible changes. When the version differs from the version supported by the service, the service fails to start. |
| resources-id | string | Resources identifier. Can be queried by gRPC API. Should be unique among different resources. |
| lexicons-config-path | string | A path to the lexicons configuration file in json format. *) |
| normalizers-directory-path | string | A path to the directory with rules files for text normalization. *) |
| transcribers-directory-path | string | A path to the directory with tables for phonetic transcription. *) |
| voices-config-path | string | A path to the voices configuration file in json format. *) |
| execution-provider | string | The ONNX Runtime execution provider used for audio synthesis, default or oneDNN. Optional, defaults to default. |
| pause-between-segments | uint | Time of silence inserted between sentences, in milliseconds. Recommended value is about 200, which yields clearly separated sentences without overlong pauses. |
| prosody-pitch-keyword-x-max | float | The multiplier value for prosody pitch attribute equal to x-high. **) |
| prosody-range-keyword-x-max | float | The multiplier value for prosody range attribute equal to x-high. **) |
| prosody-rate-keyword-x-max | float | The multiplier value for prosody rate attribute equal to x-fast. **) |
| prosody-volume-keyword-x-max | float | The multiplier value for prosody volume attribute equal to x-loud. **) |
| transcribers-tables-encrypted | boolean | Flag indicating if phonetic transcription tables are encrypted. Optional, defaults to true. |

*) All paths, if relative, are resolved relative to the resources configuration file itself.
**) The other keywords are evaluated according to the formulas: max = sqrt(x-max), min = 1/max, and x-min = 1/x-max.
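
As a worked example of the formulas in the second footnote, the following Python sketch derives the remaining keyword multipliers from a configured x-max value. The function name keyword_multipliers is hypothetical, and the keys use the generic names from the footnote; for the rate attribute, for instance, max and x-max presumably correspond to the fast and x-fast keywords.

```python
import math


def keyword_multipliers(x_max: float) -> dict:
    """Derive keyword multipliers from x-max using the documented formulas:
    max = sqrt(x-max), min = 1/max, x-min = 1/x-max.
    """
    if x_max <= 0:
        raise ValueError("x-max must be positive")
    max_value = math.sqrt(x_max)
    return {
        "x-min": 1.0 / x_max,
        "min": 1.0 / max_value,
        "max": max_value,
        "x-max": x_max,
    }
```

With x-max equal to 4.0 the multipliers are 0.25, 0.5, 2.0, and 4.0, so each keyword step scales the property by the same factor.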

A sample resources configuration:

{
	"version-major": 2,
	"resources-id": "example-resources-v2.0.5",
	"lexicons-config-path": "lexicons-config.json",
	"normalizers-directory-path": ".",
	"transcribers-directory-path": ".",
	"voices-config-path": "voices-config.json",
	"execution-provider": "default",
	"pause-between-segments": 200,
	"prosody-pitch-keyword-x-max": 1.33,
	"prosody-range-keyword-x-max": 5.0,
	"prosody-rate-keyword-x-max": 1.8,
	"prosody-volume-keyword-x-max": 2.0,
	"transcribers-tables-encrypted": true
}

Runtime limits configuration

Runtime limits are configured using a file in JSON format, pointed to by the --runtime-limits-config-path option. Runtime limits are optional.

A runtime limits configuration is a JSON object whose fields have the following meaning:

| Field | Type | Description |
| --- | --- | --- |
| cache-size | string | Maximum synthesized audio cache size. *) |
| max-request-size | string | Maximum size of text in Synthesize, SynthesizeStreaming, and ToPhonemes requests, specified in characters. **) |
| max-sentence-size | string | Maximum size of single sentence from request text, which is synthesized at once, specified in phonemes. **) |
| max-output-size | string | Maximum size of single output packet in SynthesizeStreaming and total size of output in non-streaming Synthesize, specified in samples. **) |
| max-recording-size | string | Maximum size of predefined recording which can be added using PutRecording request. *) |
| max-lexicon-size | string | Maximum size of pronunciation lexicon which can be added using PutLexicon request. *) |
| max-recordings-count | uint | Maximum number of concurrently stored predefined recordings. |
| max-lexicons-count | uint | Maximum number of concurrently stored pronunciation lexicons. |

*) All sizes can be specified in B, kB, MB, or GB, with a scaling factor of 1024. If no unit is present, the value is in bytes.
**) All sizes can use a k, M, or G suffix, with a scaling factor of 1000.

When runtime limits are not configured, the cache is disabled and no limits are verified.
Setting a value to 0 disables caching or verification of the particular limit.
When a limit is verified and is exceeded during a request, the request results in RESOURCE_EXHAUSTED gRPC error.
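
The two unit conventions from the footnotes can be illustrated with a short Python sketch. It is not the service's parser: the function names are hypothetical, only whole numbers are accepted, and whether the service tolerates other spellings (for example a trailing B on character counts) is not stated here, so the sketch accepts only the documented suffixes.

```python
import re

# Footnote *): byte sizes, scaling factor 1024, no unit means bytes.
BYTE_UNITS = {"B": 1, "kB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
# Footnote **): counts of characters, phonemes or samples, scaling factor 1000.
COUNT_SUFFIXES = {"": 1, "k": 1000, "M": 1000 ** 2, "G": 1000 ** 3}


def parse_byte_size(text: str) -> int:
    """Convert a value such as "10MB" to a number of bytes."""
    match = re.fullmatch(r"(\d+)\s*(B|kB|MB|GB)?", text.strip())
    if match is None:
        raise ValueError(f"invalid byte size: {text!r}")
    return int(match.group(1)) * BYTE_UNITS[match.group(2) or "B"]


def parse_count(text: str) -> int:
    """Convert a value such as "100k" to a plain count."""
    match = re.fullmatch(r"(\d+)\s*([kMG]?)", text.strip())
    if match is None:
        raise ValueError(f"invalid count: {text!r}")
    return int(match.group(1)) * COUNT_SUFFIXES[match.group(2)]
```

Under these assumptions, a cache-size of "1GB" corresponds to 1073741824 bytes, while a count written as "2M" corresponds to 2000000 units.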

A sample runtime limits configuration:

{
	"cache-size": "1GB",
	"max-request-size": "2MB",
	"max-sentence-size": "100kB",
	"max-output-size": "100MB",
	"max-recording-size": "10MB",
	"max-lexicon-size": "1MB",
	"max-recordings-count": 10000,
	"max-lexicons-count": 500
}

Lexicons configuration

Pronunciation lexicons are configured using a file in JSON format, pointed to by the lexicons-config-path field of the resources configuration.

A lexicons configuration is a JSON object in which each field name is an individual lexicon URI. A field value is either a string containing a path to the file with the lexicon content, or an array of two strings: the first contains a list of attributes, and the second is the path, as in the first case. If the path is relative, it is resolved relative to the lexicons configuration file itself.

The attributes list string contains a comma-separated list of attributes in square brackets, like [attribute1, …, attributeN]. Currently only one attribute, lookup-only, is defined. When this attribute is specified, the respective lexicon cannot be used as a default lexicon, i.e. it must be enabled explicitly by the <lookup> tag to participate in lexicon substitution.

A sample lexicons configuration:

{
	"#addresses_pl": ["[lookup-only]", "addresses_pl.xml"],
	"#medical_pl": ["[lookup-only]", "medical_pl.xml"],
	"#generic_pl": "generic_pl.xml",
	"#lexicon_en": "lexicon_en.xml"
}
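
To show how the two value forms can be told apart, here is a minimal Python sketch that normalizes an already loaded lexicons configuration. The function name parse_lexicons and the shape of the result are hypothetical; base_dir stands for the directory containing the lexicons configuration file.

```python
from pathlib import Path


def parse_lexicons(entries: dict, base_dir) -> dict:
    """Map each lexicon URI to its resolved path and lookup-only flag."""
    result = {}
    for uri, value in entries.items():
        if isinstance(value, str):
            # Short form: the value is just the path.
            attributes, path = [], value
        else:
            # Long form: ["[attribute1, ..., attributeN]", "path"].
            attribute_text, path = value
            inner = attribute_text.strip().strip("[]")
            attributes = [item.strip() for item in inner.split(",") if item.strip()]
        result[uri] = {
            # Joining with an absolute path keeps the absolute path unchanged.
            "path": Path(base_dir) / path,
            "lookup_only": "lookup-only" in attributes,
        }
    return result
```

Applied to the sample above, #addresses_pl and #medical_pl would be flagged as lookup-only, while #generic_pl and #lexicon_en would remain usable as default lexicons.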

Pronunciation lexicons themselves are xml files defined according to rules of PLS (Pronunciation Lexicon Specification). However, the TTS service interprets only a subset of what is allowed by PLS. The basic principles of use are:

  • <lexeme> tag has to contain exactly one <alias> or <phoneme> child,

  • alphabet attribute can be used, but may contain only ipa value (only the IPA is supported, so there is no need to declare it explicitly),

  • prefer and role attributes are silently ignored.

A sample lexicon with ‘alias’ and ‘phoneme’ entry definitions:

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" alphabet="ipa" xml:lang="pl">
	<lexeme>
		<grapheme>bordeaux</grapheme>
		<alias>bordo</alias>
	</lexeme>
	<lexeme>
		<grapheme>troyes</grapheme>
		<phoneme>tʁwa</phoneme>
	</lexeme>
</lexicon>

Additionally, <grapheme> tags can have a custom, Techmo-specific tm-match attribute, which defines how graphemes are matched against the text. The tm-match attribute is optional and has the format “match-type[, match-case]”. The match-type can be one of:

  • full (default, matching according to PLS),

  • partial,

  • full-partial,

  • partial-full, or

  • regexp,

and match-case can be either icase (case insensitive matching, default, matching according to PLS) or case (case sensitive matching).

The full match means that a matched phrase cannot be a part of a word, i.e. it has to begin and end at word boundaries. The partial match means that phrases which are parts of words are matched as well. The full-partial match means that a phrase must start at a word boundary but may end in the middle of a word, and the partial-full match means that a phrase may start in the middle of a word but must end at a word boundary. Alternatively, the regexp match means that matching is performed using regular expressions in ECMAScript syntax.

A sample lexicon entry definition with a defined tm-match attribute:

<lexeme>
	<grapheme tm-match="full, case">AC</grapheme>
	<alias>a ce</alias>
</lexeme>
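
The match types can be approximated with regular expressions, as in the Python sketch below. This only illustrates the rules described above and is not the service's matcher: the function name build_pattern is hypothetical, Python's re syntax differs from ECMAScript in some details, and the service may define word boundaries differently than the \w character class does.

```python
import re

MATCH_TYPES = ("full", "partial", "full-partial", "partial-full", "regexp")


def build_pattern(grapheme: str, tm_match: str = "full") -> "re.Pattern":
    """Translate a grapheme and its tm-match value into a compiled pattern."""
    parts = [part.strip() for part in tm_match.split(",")]
    match_type = parts[0]
    match_case = parts[1] if len(parts) > 1 else "icase"
    if match_type not in MATCH_TYPES or match_case not in ("icase", "case"):
        raise ValueError(f"unsupported tm-match value: {tm_match!r}")
    # A regexp grapheme is used as written; any other grapheme is literal text.
    body = grapheme if match_type == "regexp" else re.escape(grapheme)
    if match_type in ("full", "full-partial"):
        body = r"(?<!\w)" + body  # must start at a word boundary
    if match_type in ("full", "partial-full"):
        body = body + r"(?!\w)"  # must end at a word boundary
    flags = 0 if match_case == "case" else re.IGNORECASE
    return re.compile(body, flags)
```

For the entry above, build_pattern("AC", "full, case") finds AC in "AC power" but neither in "ac power" nor in "ACDC".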

Voices configuration

Voices are configured using a file in JSON format, pointed to by the voices-config-path field of the resources configuration.

A voices configuration is a JSON object in which each field name is an individual voice name. Each voice is another JSON object, with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| voice-file-path | string | A path to the binary file with the voice content. *) |
| variants | array | An array of json objects which define extra voice properties separately for each voice variant. Optional. |

*) If the path is relative, it is resolved relative to the voices configuration file itself.

When the variants field is specified, its length has to be equal to the number of variants defined in the binary voice file.
The extra properties of each variant are again a JSON object, with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| recordings-config-path | string | A path to the recordings configuration file in json format. Optional; if not provided, no recordings are configured for the variant. *) |
| speech-pitch | float | Scales the pitch of the synthesized speech, in the range (0, +inf). Optional, defaults to 1. |
| speech-range | float | Scales the pitch range (variability) of the synthesized speech, in the range [0, +inf). Optional, defaults to 1. |
| speech-rate | float | Scales the speaking rate of the synthesized speech, in the range (0, +inf). Optional, defaults to 1. |
| speech-stress | float | Scales the stress of the synthesized speech, in the range [0, +inf). Optional, defaults to 1. |
| speech-volume | float | Default volume of the synthesized speech, in the range [0, +inf). |

*) If the path is relative, it is resolved relative to the voices configuration file itself.

All speech-* parameters work as multipliers for the respective properties of the voice defined in the voice file. For example, setting speech-pitch=0.5 produces a voice pitched two times lower than normal, while speech-pitch=2.0 produces a voice pitched two times higher.

The speech-volume is a multiplier applied to the waveform, so setting speech-volume=10 increases the sound energy a hundred times, which is equivalent to +20 dB.
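
The relation between the waveform multiplier and decibels follows from the standard amplitude formula, gain = 20 * log10(multiplier). The Python sketch below only restates that arithmetic; the function names are hypothetical.

```python
import math


def volume_to_decibels(speech_volume: float) -> float:
    """Gain in dB for an amplitude multiplier applied to the waveform."""
    if speech_volume <= 0:
        raise ValueError("a multiplier of 0 is silence and has no finite dB value")
    return 20.0 * math.log10(speech_volume)


def energy_ratio(speech_volume: float) -> float:
    """Energy scales with the square of the amplitude multiplier."""
    return speech_volume ** 2
```

For example, speech-volume=0.5 halves the amplitude, which corresponds to roughly -6 dB and a quarter of the energy.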

A sample voices configuration:

{
	"Masza": {
		"voice-file-path": "masza.voice",
		"variants": [
			{
			},
			{
				"speech-pitch": 0.9,
				"speech-range": 1.0,
				"speech-rate": 1.1,
				"speech-volume": 1.15
			}
		]
	},
	"Michal": {
		"voice-file-path": "michal.voice",
		"variants": [
			{
				"recordings-config-path": "michal-recordings.json"
			}
		]
	},
	"Cori": {
		"voice-file-path": "cori.voice"
	}
}

Default language and voice

The first voice in the voices configuration JSON file automatically becomes the service default voice. The first supported language of the service default voice automatically becomes the service default language. If no language and voice switches are used in a gRPC ToPhonemes, SynthesizeStreaming, or Synthesize request, the service uses the default language and voice.
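
Because the default voice depends on the order of fields in the file, it can be checked with a parser that preserves that order. The sketch below is a hypothetical helper that relies on Python's json module keeping object fields in document order; the default language cannot be read this way, since the supported languages are defined by the voice file.

```python
import json


def default_voice_name(voices_config_text: str) -> str:
    """Return the name of the first voice in a voices configuration."""
    voices = json.loads(voices_config_text)  # dict keeps document order
    if not voices:
        raise ValueError("voices configuration defines no voices")
    return next(iter(voices))
```

For the sample voices configuration above, the function returns Masza.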

Recordings configuration

The predefined recordings mechanism allows using custom recorded speech instead of synthesized speech for given phrases. Recordings are configured using a file in JSON format, pointed to by the recordings-config-path field of the variant configuration inside the voices configuration.

A recordings configuration is a JSON object whose field names are language codes and whose values configure the recordings for the given language. If the set of languages in the recordings configuration does not match the set of languages supported by the configured voice, the configuration results in an error.

A configuration of recordings for a given language is a set of string pairs "key": "value". A key is either specified explicitly, starting with the # mark, or is a phrase that is matched implicitly. In the first case, the recording can be used with the SSML <audio> tag. In the second case, when the phonetic transcription of the implicit key phrase equals the phonetic transcription of the phrase to be synthesized, the recording is inserted instead.

Values are paths to the audio files with recordings. The paths, if relative, are resolved relative to the recordings configuration file.

A recording should be single channel. If the recording sampling rate does not match the voice sampling rate, the recording is automatically resampled after being loaded.

A sample recordings configuration for voice supporting Polish and English languages:

{
	"pl":
	{
		...
		"#ej1": "predefined_recordings/ej_1.wav",
		"#ej2": "predefined_recordings/ej_2.wav",
		"#ekhem1": "predefined_recordings/ekhm_1.wav",
		"#ekhem2": "predefined_recordings/ekhm_2.wav",
		"#ekhem3": "predefined_recordings/ekhm_3.wav",
		"#emm1": "predefined_recordings/eee_1.wav",
		"#emm2": "predefined_recordings/emm_1.wav",
		"#fiu1": "predefined_recordings/fiu.wav",
		"#fuj1": "predefined_recordings/fuj.wav",
		"#fuu1": "predefined_recordings/fu.wav",
		"#gwizd1": "predefined_recordings/gwizd_1.wav",
		"#gwizd2": "predefined_recordings/gwizd_2.wav",
		"#ha1": "predefined_recordings/ha.wav",
		"#haha1": "predefined_recordings/haha_1.wav",
		...
	},
	"en":
	{
		...
	}
}
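
The language rule and the two kinds of keys can be checked before deployment with a small script. The following Python sketch works on an already loaded recordings configuration; the function names are hypothetical, and the list of languages supported by the voice has to be supplied by the caller, since it is not part of this file.

```python
def check_recording_languages(recordings: dict, voice_languages) -> None:
    """Raise if the configured languages differ from the voice languages."""
    configured = set(recordings)
    expected = set(voice_languages)
    if configured != expected:
        missing = sorted(expected - configured)
        extra = sorted(configured - expected)
        raise ValueError(f"language mismatch: missing {missing}, unexpected {extra}")


def split_keys(language_recordings: dict):
    """Separate explicit keys (starting with #) from implicit phrase keys."""
    explicit = {k: v for k, v in language_recordings.items() if k.startswith("#")}
    implicit = {k: v for k, v in language_recordings.items() if not k.startswith("#")}
    return explicit, implicit
```

Explicit keys are the ones usable from the SSML <audio> tag, while implicit keys are phrases matched by their phonetic transcription.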