Startup options and configuration

Available service startup options:

Option	Description
--help	Prints command line options.
--version	Prints TTS service version.
--generate-licence-info	Prints information needed to generate the customer’s licence. The `--address` option must be set along with this option.
--licence-path arg	A path to the file with TTS licence. *)
--remote-licensing-config-path arg	A path to the JSON file with remote licensing service configuration. *)
--tls-directory-path arg	A path to the directory with SSL/TLS files (server.key, server.crt, and ca.crt) for gRPC authentication; if empty (default), use insecure connection. *)
--tls-mutual-authentication	Enables client’s certificate verification on service side (default - disabled).
--address arg	An IP endpoint on which TTS service would listen to requests, e.g. `192.168.3.110:12345` or `0.0.0.0:12345`.
--status-address arg	An IP endpoint on which TTS service would listen to service-status queries on `/status`, e.g. `192.168.3.110:12346` or `0.0.0.0:12346`.
--metrics-address arg	An IP endpoint on which TTS service would listen to metrics queries on `/metrics`, e.g. `192.168.3.110:12347` or `0.0.0.0:12347`.
--resources-config-path arg	A path to the file with configuration of resources in json format. *)
--log-format arg (=plain)	A log format, `plain` (default) or `json`.
--log-level-console arg (=info)	A logging level for console output: `trace`, `debug`, `info` (default), `warning`, `error`, or `fatal`.
--log-level-file arg (=debug)	A logging level for file output: `trace`, `debug` (default), `info`, `warning`, `error`, or `fatal`.
--log-dir arg (=log)	A path to the directory where the logs will be stored, defaults to `log/`. *)
--log-prefix arg (=tts-service)	A file log prefix, defaults to `tts-service`.
--log-rotation-max-file-size arg (=10MB)	Specifies the maximum log file size. When the size is reached, file rotation is performed. Defaults to 10MB. **)
--log-rotation-max-files-count arg (=0)	Specifies the maximum number of stored rotated log files. If the number of files exceeds the number, the oldest files are deleted. Default is 0 (no limit).
--log-rotation-max-dir-size arg (=0)	Specifies the maximum size of the directory with logs. If the total size of the logs files exceeds the given size, the oldest files are deleted. Default is 0 (no limit). **)
--old-logs-handling-mode arg (=archive)	The way of handling old logs from previous service instance runs, `archive` (default), `remove`, or `keep`.
--logging-mark-interval arg (=0)	Interval for ‘heartbeat’ logging, in seconds (optional).
--runtime-limits-config-path arg	A path to the file with runtime limits configuration in json format (optional). *)

*) All paths, if relative, are resolved relative to the working directory of the service.
**) All sizes can be specified in B, kB, MB, or GB. If no unit is present, the value defaults to bytes.

License configuration

The service requires a valid license to operate. It can be loaded from a local licence file in json format (--licence-path option), or obtained from a remote licensing server (--remote-licensing-config-path option). One of these options (but not both) has to be specified.
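
The "one of these options, but not both" rule can be modelled with a mutually exclusive, required option group. The sketch below is only an illustration of that rule written with Python's argparse; it is not the service's own argument parser, and the program name `tts-launch-check` is hypothetical.

```python
import argparse


def build_parser():
    """Model the 'exactly one licence source' rule (illustration only)."""
    parser = argparse.ArgumentParser(prog="tts-launch-check")  # hypothetical name
    # required=True rejects "neither"; the group itself rejects "both".
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--licence-path")
    source.add_argument("--remote-licensing-config-path")
    return parser


def licence_source(argv):
    """Return 'local' or 'remote' depending on which option was given."""
    args = build_parser().parse_args(argv)
    return "local" if args.licence_path is not None else "remote"
```

Passing both options, or neither, makes argparse exit with a usage error, which mirrors the requirement stated above.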

Local license file

Local licence is a json file containing these fields:

Field	Type	Description
application-id	string	ID of the service, must be equal to TTS.
customer-name	string	Name of the customer for which the licence is issued.
channels	string	Number of available channels, either unsigned integer or unrestricted keyword.
licence-version	uint	Major version of the licence, which specifies licence file format.
valid-from	string	The GMT time when the licence becomes valid, in format ‘YYYY-MM-DD HH:MM:SS’.
valid-to	string	The GMT time when the licence stops being valid, in format ‘YYYY-MM-DD HH:MM:SS’.
mode	string	In a local licence file, must be equal to channels.
instance-ip	string	The IP Address of network interface from which the service would accept requests, or unrestricted keyword.
instance-port	string	The port number on which service would listen for requests, or unrestricted keyword.
container-id	string	The string which uniquely identifies a Docker container where the service is run, or unrestricted keyword.
machine-id	string	The string which uniquely identifies the physical machine on which the service is run, or unrestricted keyword.
languages	variable	Allowed languages, either unrestricted keyword or array of ISO 639-1 language code strings.
allow-reconfiguration	boolean	Flag indicating if requests making permanent changes to the service state are allowed. Optional, defaults to false.

To create a licence file, it is necessary to generate a machine-unique set of licensing data first. To generate the licensing data, use the --generate-licence-info startup option. This option has to be followed by --address IP:PORT (where IP:PORT is the address where the TTS service is to be available).

The output should look similar to this:

techmo-tts-dnn-cpu-service-3-1-0 | Techmo TTS DNN CPU Service, version 3.1.0.
techmo-tts-dnn-cpu-service-3-1-0 | Copyright (C) 2023 Techmo sp. z o.o.
techmo-tts-dnn-cpu-service-3-1-0 | 
techmo-tts-dnn-cpu-service-3-1-0 | ==========BEGIN OF LICENCE PARAMETERS==========
techmo-tts-dnn-cpu-service-3-1-0 | {
techmo-tts-dnn-cpu-service-3-1-0 |     "application-id": "TTS",
techmo-tts-dnn-cpu-service-3-1-0 |     "application-version": "3.1.0",
techmo-tts-dnn-cpu-service-3-1-0 |     "container-id": "techmo-tts-dnn-cpu-service-3-1-0",
techmo-tts-dnn-cpu-service-3-1-0 |     "instance-ip": "0.0.0.0",
techmo-tts-dnn-cpu-service-3-1-0 |     "instance-port": "12345",
techmo-tts-dnn-cpu-service-3-1-0 |     "licence-key-id": "Ex/3raXDDehgLTQbFXW4XerIXEi1qH5rpNeLSagZk=",
techmo-tts-dnn-cpu-service-3-1-0 |     "licence-version": "3",
techmo-tts-dnn-cpu-service-3-1-0 |     "machine-id": "veXohm23suR1bER+3yTH3dm3AA6lm10tpAotDerOvts="
techmo-tts-dnn-cpu-service-3-1-0 | }
techmo-tts-dnn-cpu-service-3-1-0 | ==========END OF LICENCE PARAMETERS==========

The above data should be sent to the Techmo team in order to generate an individual licence valid on the customer’s machine.
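
If the output is captured to a file or a variable, the parameters block can be cut out programmatically before sending it. The following is a minimal sketch, assuming the block is always delimited by the BEGIN/END marker lines shown above and that each line may carry a `name | ` log prefix as in the sample; the function name is hypothetical.

```python
import json
import re

BEGIN_MARKER = "BEGIN OF LICENCE PARAMETERS"
END_MARKER = "END OF LICENCE PARAMETERS"
# Optional "<name> | " prefix, as seen in the sample output above.
LOG_PREFIX = re.compile(r"^\S+\s+\|\s?")


def extract_licence_parameters(output):
    """Return the JSON object printed between the BEGIN/END marker lines."""
    collected = []
    inside = False
    for raw_line in output.splitlines():
        line = LOG_PREFIX.sub("", raw_line)
        if BEGIN_MARKER in line:
            inside = True
        elif END_MARKER in line:
            break
        elif inside:
            collected.append(line)
    if not collected:
        raise ValueError("no licence parameters block found in the output")
    return json.loads("\n".join(collected))
```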

A sample licence file:

{
	"licence": {
		"application-id": "TTS",
		"channels": "100",
		"licence-version": "3",
		"valid-from": "2024-01-01 00:00:00",
		"valid-to": "2024-12-31 23:59:59",
		"mode": "channels",
		"instance-ip": "unrestricted",
		"instance-port": "unrestricted",
		"container-id": "techmo-tts-3.1.0",
		"machine-id": "veXohm23suR1bER+3yTH3dm3AA6lm10tpAotDerOvts=",
		"allow-reconfiguration": "true",
		"customer-name": "dev",
		"allow-to-phonemes": "false",
		"languages": ["pl", "en"]
	},
	"signature": "{a_licence_signature}"
}
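
Before deploying a licence file it can be useful to check its validity window. The sketch below parses the valid-from and valid-to fields in the documented ‘YYYY-MM-DD HH:MM:SS’ format and treats them as GMT. Treating both boundaries as inclusive is an assumption, and the helper names are hypothetical; the service performs its own verification.

```python
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_gmt(value):
    """Parse a licence timestamp and mark it as GMT (UTC)."""
    return datetime.strptime(value, TIME_FORMAT).replace(tzinfo=timezone.utc)


def is_within_validity(licence, now):
    """Check valid-from <= now <= valid-to (inclusive bounds are an assumption)."""
    return parse_gmt(licence["valid-from"]) <= now <= parse_gmt(licence["valid-to"])
```

Here `licence` is the inner "licence" object of the file, and `now` is a timezone-aware datetime such as `datetime.now(timezone.utc)`.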

Remote licensing

Remote licensing is accomplished by connecting to the licensing server. The licensing server can be run:

on Techmo infrastructure as a SaaS application,
on the client’s infrastructure as an on-prem application.

Additionally, it is also possible to prepare a licensing server in the SaaS model on the client’s infrastructure for cloud architectures.

If there is a need to configure and run the licensing server on the client’s machine, please contact us directly via email to receive relevant resources and instructions.

There are two accounting modes available:

PAYG - pay-as-you-go - the client is billed for the total number of characters in all the requests sent for synthesis (including SSML tags) by all TTS instances during the billing period.
HA - high availability - allows running more than one TTS instance and verifies that at any given time (with 1-minute resolution) the running instances together do not exceed the total of N available channels (N requests handled simultaneously).

SaaS model configuration

A typical deployment package using remote licensing in the SaaS model contains the licensing directory with the following resources:

licensing
├── auth.json
├── tls
│   └── gen-certs.sh
└── storage
    └── reports

auth.json - configuration file
tls - directory for storing SSL/TLS certificates
gen-certs.sh - bash script for generating SSL/TLS certificates
storage - directory for storing reports and temporary licence obtained from the licensing server
reports - directory where usage reports are stored

A few steps are required to prepare the service to work in remote licensing mode:

Generate SSL/TLS certificates to secure communication with the licensing server:

set the certificate password in the gen-certs.sh file (first line of the script)

PASSWORD=insert_password_here

run the script

cd /licensing/tls
./gen-certs.sh

Generate keys that sign communication with the licensing service:

openssl genrsa -out company_name_techmo_tts.pem
openssl rsa -in company_name_techmo_tts.pem -outform PEM -pubout -out company_name_techmo_tts.pub

copy the private key to the licensing directory and name it auth.key
send the public key company_name_techmo_tts.pub to Techmo via email

In response we will send a message containing identifiers: client_id and key_id. Set the received values inside the auth.json file.

After Techmo confirms that the provided public key has been loaded, the service will be ready to start.

Resources configuration

Resources are configured using a file in json format, pointed to by the --resources-config-path option. Resources are obligatory.

A resources configuration is a json object, where fields have the following meaning:

Field	Type	Description
version-major	uint	The major version denotes incompatible changes. When the version differs from the version supported by the service, the service fails to start.
resources-id	string	Resources identifier. Can be queried by gRPC API. Should be unique among different resources.
lexicons-config-path	string	A path to the lexicons configuration file in json format. *)
normalizers-directory-path	string	A path to the directory with rules files for text normalization. *)
transcribers-directory-path	string	A path to the directory with tables for phonetic transcription. *)
voices-config-path	string	A path to the voices configuration file in json format. *)
execution-provider	string	The ONNX Runtime execution provider used to synthesize audio, default or oneDNN. Optional, defaults to default.
pause-between-segments	uint	Time of silence inserted between sentences, in milliseconds. Recommended value is about 200, which yields clearly separated sentences, without overlong pauses.
prosody-pitch-keyword-x-max	float	The multiplier value for prosody pitch attribute equal to x-high. **)
prosody-range-keyword-x-max	float	The multiplier value for prosody range attribute equal to x-high. **)
prosody-rate-keyword-x-max	float	The multiplier value for prosody rate attribute equal to x-fast. **)
prosody-volume-keyword-x-max	float	The multiplier value for prosody volume attribute equal to x-loud. **)
transcribers-tables-encrypted	boolean	Flag indicating if phonetic transcription tables are encrypted. Optional, defaults to true.

*) All paths, if relative, are resolved relative to the resources configuration file itself.
**) The other keywords are evaluated according to the formulas: max = sqrt(x-max), min = 1/max, and x-min = 1/x-max.
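
As a worked illustration of the formulas in the second footnote, the sketch below derives the remaining keyword multipliers from a configured x-max value. The function name is hypothetical, and the keys reuse the generic names from the footnote (for a given prosody attribute they correspond to its own keywords, for example x-high for pitch).

```python
import math


def prosody_multipliers(x_max):
    """Derive all keyword multipliers from the configured x-max value."""
    maximum = math.sqrt(x_max)       # max = sqrt(x-max)
    return {
        "x-min": 1.0 / x_max,        # x-min = 1/x-max
        "min": 1.0 / maximum,        # min = 1/max
        "max": maximum,
        "x-max": x_max,
    }
```

For example, an x-max of 4.0 gives max = 2.0, min = 0.5, and x-min = 0.25, so the keywords are spaced symmetrically around 1 on a logarithmic scale.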

A sample resources configuration:

{
	"version-major": 2,
	"resources-id": "example-resources-v2.0.5",
	"lexicons-config-path": "lexicons-config.json",
	"normalizers-directory-path": ".",
	"transcribers-directory-path": ".",
	"voices-config-path": "voices-config.json",
	"execution-provider": "default",
	"pause-between-segments": 200,
	"prosody-pitch-keyword-x-max": 1.33,
	"prosody-range-keyword-x-max": 5.0,
	"prosody-rate-keyword-x-max": 1.8,
	"prosody-volume-keyword-x-max": 2.0,
	"transcribers-tables-encrypted": true
}

Runtime limits configuration

Runtime limits are configured using a file in json format, pointed to by the --runtime-limits-config-path option. Runtime limits are optional.

A runtime limits configuration is a json object, where fields have the following meaning:

Field	Type	Description
cache-size	string	Maximum synthesized audio cache size. *)
max-request-size	string	Maximum size of text in Synthesize, SynthesizeStreaming, and ToPhonemes requests, specified in characters. **)
max-sentence-size	string	Maximum size of single sentence from request text, which is synthesized at once, specified in phonemes. **)
max-output-size	string	Maximum size of single output packet in SynthesizeStreaming and total size of output in non-streaming Synthesize, specified in samples. **)
max-recording-size	string	Maximum size of predefined recording which can be added using PutRecording request. *)
max-lexicon-size	string	Maximum size of pronunciation lexicon which can be added using PutLexicon request. *)
max-recordings-count	uint	Maximum number of concurrently stored predefined recordings.
max-lexicons-count	uint	Maximum number of concurrently stored pronunciation lexicons.

*) All sizes can be specified in B, kB, MB, or GB, with 1024 scaling factor. If no unit is present, the value defaults to bytes.
**) All sizes can use a k, M, or G suffix, with 1000 scaling factor.

When runtime limits are not configured, the cache is disabled and no limits are verified.
Setting a value to 0 means that caching is disabled or that the particular limit is not verified.
When a limit is verified and is exceeded during a request, the request results in RESOURCE_EXHAUSTED gRPC error.
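
The two footnotes describe different scaling rules, which the sketch below makes explicit. It assumes integer values, case-sensitive units, and an optional space before the unit; these details and the function names are assumptions, not the service's actual parser.

```python
import re

# Byte sizes: B, kB, MB, GB with a 1024 scaling factor; no unit means bytes.
BYTE_UNITS = {"": 1, "B": 1, "kB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
# Counts (characters, phonemes, samples): k, M, G with a 1000 scaling factor.
COUNT_SUFFIXES = {"": 1, "k": 1000, "M": 1000 ** 2, "G": 1000 ** 3}


def _parse(value, units):
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", value)
    if match is None or match.group(2) not in units:
        raise ValueError("unsupported size: %r" % (value,))
    return int(match.group(1)) * units[match.group(2)]


def parse_byte_size(value):
    """Parse a size such as '1GB' into bytes (1024-based)."""
    return _parse(value, BYTE_UNITS)


def parse_count(value):
    """Parse a count such as '2M' into a plain number (1000-based)."""
    return _parse(value, COUNT_SUFFIXES)
```

With these rules, "1GB" is 1073741824 bytes, while "2M" is 2000000 units; a result of 0 corresponds to a disabled cache or an unverified limit.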

A sample runtime limits configuration:

{
	"cache-size": "1GB",
	"max-request-size": "2MB",
	"max-sentence-size": "100kB",
	"max-output-size": "100MB",
	"max-recording-size": "10MB",
	"max-lexicon-size": "1MB",
	"max-recordings-count": 10000,
	"max-lexicons-count": 500
}

Lexicons configuration

Pronunciation lexicons are configured using a file in json format, pointed to by the lexicons-config-path field of the resources configuration.

A lexicons configuration is a json object, where each field name is an individual lexicon URI. A field value is either a string containing a path to the file with the lexicon content, or an array of two strings. In the array form, the first string contains a list of attributes, and the second one is the path, as in the first case. If the path is relative, it is resolved relative to the lexicons configuration file itself.

The attributes list string contains a comma-separated list of attributes in square brackets, like [attribute1, …, attributeN]. However, currently only one attribute - lookup-only - is defined. When this attribute is specified, the respective lexicon cannot be used as a default lexicon, i.e. it must be enabled explicitly by the <lookup> tag to participate in lexicon substitution.

A sample lexicons configuration:

{
	"#addresses_pl": ["[lookup-only]", "addresses_pl.xml"],
	"#medical_pl": ["[lookup-only]", "medical_pl.xml"],
	"#generic_pl": "generic_pl.xml",
	"#lexicon_en": "lexicon_en.xml"
}
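
To make the two value forms concrete, the sketch below reads an already parsed lexicons configuration, resolves each path against the location of the configuration file, and reports whether the lookup-only attribute is set. The function names and the shape of the returned dictionary are hypothetical.

```python
from pathlib import Path


def parse_attributes(text):
    """Turn an attributes string such as '[a, b]' into the set {'a', 'b'}."""
    inner = text.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError("attributes must be enclosed in square brackets")
    return {item.strip() for item in inner[1:-1].split(",") if item.strip()}


def resolve_lexicons(config, config_file):
    """Map each lexicon URI to its resolved path and lookup-only flag."""
    base = Path(config_file).parent
    resolved = {}
    for uri, value in config.items():
        if isinstance(value, str):
            attributes, path = set(), value      # plain path form
        else:
            attributes, path = parse_attributes(value[0]), value[1]
        resolved[uri] = {
            # Joining with an absolute path leaves that path unchanged.
            "path": base / path,
            "lookup_only": "lookup-only" in attributes,
        }
    return resolved
```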

Pronunciation lexicons themselves are xml files defined according to rules of PLS (Pronunciation Lexicon Specification). However, the TTS service interprets only a subset of what is allowed by PLS. The basic principles of use are:

<lexeme> tag has to contain exactly one <alias> or <phoneme> child,
alphabet attribute can be used, but may contain only ipa value (only the IPA is supported, so there is no need to declare it explicitly),
prefer and role attributes are silently ignored.

A sample lexicon with ‘alias’ and ‘phoneme’ entry definitions:

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" alphabet="ipa" xml:lang="pl">
	<lexeme>
		<grapheme>bordeaux</grapheme>
		<alias>bordo</alias>
	</lexeme>
	<lexeme>
		<grapheme>troyes</grapheme>
		<phoneme>tʁwa</phoneme>
	</lexeme>
</lexicon>

Additionally, <grapheme> tags can have a custom, Techmo-specific tm-match attribute, which defines the way in which graphemes are matched against the text. The tm-match attribute is optional, and has the format: “match-type[, match-case]”. The match-type can be one of:

full (default, matching according to PLS),
partial,
full-partial,
partial-full, or
regexp,

and match-case can be either icase (case insensitive matching, default, matching according to PLS) or case (case sensitive matching).

The full match means that a matched phrase cannot be a part of a word, i.e. it has to begin and end at the word boundaries. The partial match means that phrases which are parts of words are matched as well. The full-partial match means that a phrase must start at the word boundary, but might end in the middle of a word, and partial-full match means that a phrase might start in the middle of a word, but must end at the word boundary. Alternatively, the regexp match means that matching is performed using regular expressions, according to ECMAScript syntax.

A sample lexicon entry definition with a defined tm-match attribute:

<lexeme>
	<grapheme tm-match="full, case">AC</grapheme>
	<alias>a ce</alias>
</lexeme>
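
The match types can be approximated with regular-expression word boundaries. The sketch below is only an illustration of the semantics described above: it uses Python's `re` module (whose syntax differs from ECMAScript in places) and the `\b` anchor as a stand-in for the service's notion of a word boundary, which may differ. The function name is hypothetical.

```python
import re


def grapheme_pattern(grapheme, tm_match="full"):
    """Build a regex approximating a tm-match value such as 'full, case'."""
    parts = [part.strip() for part in tm_match.split(",")]
    match_type = parts[0]
    match_case = parts[1] if len(parts) > 1 else "icase"   # icase is the default
    if match_type not in ("full", "partial", "full-partial", "partial-full", "regexp"):
        raise ValueError("unknown match-type: %r" % (match_type,))
    if match_case not in ("icase", "case"):
        raise ValueError("unknown match-case: %r" % (match_case,))
    flags = 0 if match_case == "case" else re.IGNORECASE
    if match_type == "regexp":
        return re.compile(grapheme, flags)
    # Anchor the start and/or the end of the phrase at a word boundary.
    start = r"\b" if match_type in ("full", "full-partial") else ""
    end = r"\b" if match_type in ("full", "partial-full") else ""
    return re.compile(start + re.escape(grapheme) + end, flags)
```

For the entry above, `grapheme_pattern("AC", "full, case")` finds "AC" in "AC power" but neither "ac" nor the "AC" inside "ACME".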

Voices configuration

Voices are configured using a file in json format, pointed to by the voices-config-path field of the resources configuration.

A voices configuration is a json object, where each field name is an individual voice name. Each voice is another json object, with the following fields:

Field	Type	Description
voice-file-path	string	A path to the binary file with the voice content. *)
variants	array	An array of json objects which define extra voice properties separately for each voice variant. Optional.

*) If the path is relative, it is resolved relative to the voices configuration file itself.

When the variants field is specified, its length has to be equal to the number of variants defined in the binary voice file.
The extra variant properties are again given as a json object, with the following fields:

Field	Type	Description
recordings-config-path	string	A path to the recordings configuration file in json format. Optional, if not provided, no recordings are configured for the variant. *)
speech-pitch	float	Scales the pitch of the synthesized speech, in the range (0, +inf). Optional, defaults to 1.
speech-range	float	Scales the pitch range (variability) of the synthesized speech, in the range [0, +inf). Optional, defaults to 1.
speech-rate	float	Scales the speaking rate of the synthesized speech, in the range (0, +inf). Optional, defaults to 1.
speech-stress	float	Scales the stress of the synthesized speech, in the range [0, +inf). Optional, defaults to 1.
speech-volume	float	Default volume of the synthesized speech, in the range [0, +inf).

*) If the path is relative, it is resolved relative to the voices configuration file itself.

All speech-* parameters work as multipliers for the respective properties of the voice defined in the voice file. For example, setting speech-pitch=0.5 produces a voice 2 times lower than normal, while speech-pitch=2.0 produces a voice 2 times higher.

The speech-volume is a multiplier applied to the waveform, so setting speech-volume=10 increases the sound energy a hundred times, which is equivalent to +20 dB.
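
The relation between the waveform multiplier, the energy ratio, and the gain in decibels can be checked with a few lines of arithmetic. This is a minimal sketch using the standard amplitude-to-decibel formula; the function names are hypothetical.

```python
import math


def energy_ratio(speech_volume):
    """Energy scales with the square of the amplitude multiplier."""
    return speech_volume ** 2


def volume_gain_db(speech_volume):
    """Gain in dB for an amplitude multiplier: 20 * log10(multiplier)."""
    if speech_volume <= 0:
        raise ValueError("gain in dB is undefined for a multiplier of 0 or less")
    return 20.0 * math.log10(speech_volume)
```

For speech-volume=10 this gives an energy ratio of 100 and a gain of 20 dB, matching the statement above; a multiplier of 1 leaves the level unchanged at 0 dB.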

A sample voices configuration:

{
	"Masza": {
		"voice-file-path": "masza.voice",
		"variants": [
			{
			},
			{
				"speech-pitch": 0.9,
				"speech-range": 1.0,
				"speech-rate": 1.1,
				"speech-volume": 1.15
			}
		]
	},
	"Michal": {
		"voice-file-path": "michal.voice",
		"variants": [
			{
				"recordings-config-path": "michal-recordings.json"
			}
		]
	},
	"Cori": {
		"voice-file-path": "cori.voice"
	}
}

Default language and voice

The first voice in the voices configuration json file automatically becomes the service default voice. The first supported language of the service default voice automatically becomes the service default language. If no language and voice switches are used in a gRPC ToPhonemes, SynthesizeStreaming, or Synthesize request, the service assumes that the default language and voice shall be used.
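
Because the default voice depends on the order of fields in the file, it can be determined by reading the first key of the json object. The sketch below relies on Python dictionaries preserving insertion order (Python 3.7 and later); the function name is hypothetical. The default language cannot be derived this way, since the supported languages are defined by the voice itself rather than by this file.

```python
import json


def default_voice(voices_config_text):
    """Return the name of the first voice listed in the voices configuration."""
    voices = json.loads(voices_config_text)  # key order follows the file
    if not voices:
        raise ValueError("the voices configuration defines no voices")
    return next(iter(voices))
```

Applied to the sample voices configuration, this returns "Masza", so reordering the entries in the file changes the service default voice.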

Recordings configuration

The predefined recordings mechanism allows using custom recorded speech instead of synthesized speech for given phrases. Recordings are configured using a file in json format, pointed to by the recordings-config-path field of the variant configuration inside the voices configuration.

A recordings configuration is a set of json objects whose names are language codes and whose values configure the recordings themselves for the given language. If the set of languages in the recordings configuration does not match the set of languages supported by the configured voice, the configuration results in an error.

A configuration of recordings for a given language is a set of string pairs "key": "value". A key is either explicit, starting with the # mark, or a phrase which is matched implicitly. In the first case, the recording can be used with the SSML <audio> tag. In the second case, when the phonetic transcription of the implicit key phrase equals the phonetic transcription of the phrase to be synthesized, the recording is inserted instead.

Values are paths to the audio files with recordings. The paths, if relative, are resolved relative to the recordings configuration file.

A recording should be single-channel. If the recording sampling rate does not match the voice sampling rate, the recording is automatically resampled after being loaded.

A sample recordings configuration for a voice supporting Polish and English languages:

{
	"pl":
	{
		...
		"#ej1": "predefined_recordings/ej_1.wav",
		"#ej2": "predefined_recordings/ej_2.wav",
		"#ekhem1": "predefined_recordings/ekhm_1.wav",
		"#ekhem2": "predefined_recordings/ekhm_2.wav",
		"#ekhem3": "predefined_recordings/ekhm_3.wav",
		"#emm1": "predefined_recordings/eee_1.wav",
		"#emm2": "predefined_recordings/emm_1.wav",
		"#fiu1": "predefined_recordings/fiu.wav",
		"#fuj1": "predefined_recordings/fuj.wav",
		"#fuu1": "predefined_recordings/fu.wav",
		"#gwizd1": "predefined_recordings/gwizd_1.wav",
		"#gwizd2": "predefined_recordings/gwizd_2.wav",
		"#ha1": "predefined_recordings/ha.wav",
		"#haha1": "predefined_recordings/haha_1.wav",
		...
	},
	"en":
	{
		...
	}
}
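
The distinction between explicit and implicit keys, and the language-set check, can be sketched as follows. The function names are hypothetical, and the check only mirrors the rule described above; it is not the service's own validation.

```python
def split_recording_keys(language_config):
    """Separate explicit keys (starting with '#') from implicit phrase keys."""
    explicit = {k: v for k, v in language_config.items() if k.startswith("#")}
    implicit = {k: v for k, v in language_config.items() if not k.startswith("#")}
    return explicit, implicit


def check_languages(recordings_config, voice_languages):
    """Raise if the configured languages differ from the voice's languages."""
    configured = set(recordings_config)
    expected = set(voice_languages)
    if configured != expected:
        raise ValueError(
            "language mismatch: configured %s, voice supports %s"
            % (sorted(configured), sorted(expected))
        )
```

Explicit keys are the ones referenced from the SSML <audio> tag, while implicit keys are phrases matched by their phonetic transcription.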