Azure TTS Voice Similarity Pipeline
Developer Documentation
Date: March 2026 Status: Production run complete (priority locales) Repository: Voice Similarity/ — standalone pipeline, no external repo dependency
Reading guide: This is a documentation artifact, not a runnable notebook. Code cells show key excerpts from the actual scripts with annotations. Run the numbered .py scripts directly, in order.
1. Problem Statement
Azure TTS offers 500+ Neural voices across 100+ locales. When a product ships with a specific voice (e.g., en-US-JennyNeural), it must define fallback voices — alternatives shown to users when the primary is unavailable, deprecated, or unsuitable.
The challenge: Voice similarity is subjective. Teams historically made fallback choices by ear, leading to:
- Inconsistent fallback quality across locales
- Choices that don't survive voice roster changes
- No defensible, auditable rationale
The goal: Produce an objective, reproducible similarity score for every pair of voices within the same locale, enabling:
- Data-driven fallback ranking (top-N most similar voices per voice)
- Gender-aware filtering (female voices map to female fallbacks)
- A shareable HTML report stakeholders can explore without running code
Scope constraint: Similarity is measured within a locale only. Cross-locale comparison is linguistically ill-defined (different phoneme inventories, prosody norms) and operationally unnecessary — fallbacks are always same-language substitutes.
2. Pipeline Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Voice Similarity Pipeline │
└─────────────────────────────────────────────────────────────────┘
01_generate_script.py
┌───────────────────┐
│ MASTER_SCRIPT │ ~250-word English paragraph
│ (config.py) │
└────────┬──────────┘
│ Google Translate (deep-translator, no key)
│ 1s delay between calls
▼
scripts/{locale}.txt (one file per locale; en-* locales keep the English master)
02_synthesize_voices.py
┌───────────────────────────────────────┐
│ Azure Speech SDK │
│ • Enumerate all Neural voices │
│ • Filter: plain Neural only │
│ • AAD auth via Azure CLI subprocess │
│ • SSML synthesis (no style mods) │
└────────────────────┬──────────────────┘
│
▼
samples/{locale}/{voice_short_name}.wav (16kHz PCM mono)
fetch_metadata.py (run once, anytime after step 2)
┌───────────────────────────────────────┐
│ Azure SDK voice list → gender lookup │
└────────────────────┬──────────────────┘
▼
results/voice_metadata.json
03_compute_similarity.py
┌───────────────────────────────────────┐
│ Resemblyzer VoiceEncoder │
│ • Load WAV via soundfile │
│ • 256-dim speaker embeddings │
│ • NxN cosine similarity matrix │
└────────────────────┬──────────────────┘
▼
results/similarity_{locale}.json (one per locale)
04_generate_report.py
┌───────────────────────────────────────┐
│ Load similarity JSONs + metadata │
│ Merge gender into each locale │
│ Embed data + Plotly into HTML │
└────────────────────┬──────────────────┘
▼
results/voice_similarity_report.html
Step summaries
| Step | Script | Input | Output | Key dependency |
|---|---|---|---|---|
| 1 | 01_generate_script.py | MASTER_SCRIPT in config | scripts/*.txt | deep-translator |
| 2 | 02_synthesize_voices.py | scripts/*.txt | samples/*/.wav | azure-cognitiveservices-speech |
| — | fetch_metadata.py | Azure voice list | results/voice_metadata.json | Azure SDK |
| 3 | 03_compute_similarity.py | samples/*/.wav | results/similarity_*.json | resemblyzer, soundfile |
| 4 | 04_generate_report.py | similarity JSONs + metadata | results/*.html | json, pathlib only |
3. Environment Setup
Python version
Python 3.11+ (uses X | Y union type syntax in annotations).
Credentials
Create a .env file in the project root:
AZURE_SPEECH_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
AZURE_SPEECH_RESOURCE_ID=/subscriptions/.../resourceGroups/.../providers/Microsoft.CognitiveServices/accounts/your-resource
# Leave AZURE_SPEECH_KEY empty to use AAD (enterprise default)
# AZURE_SPEECH_KEY=
Windows-specific installation order
resemblyzer depends on webrtcvad, which requires a C compiler. On Windows without MSVC, webrtcvad fails to build from source. The solution is a three-step install:
- Install `webrtcvad-wheels` first — a pre-compiled wheel distribution
- Install `resemblyzer` with `--no-deps` to prevent pip from overwriting it with the source build
- Install the remaining resemblyzer runtime dependencies manually

Additional Windows gotchas:
- `azure-identity`'s `AzureCliCredential` has a hardcoded 10-second timeout — insufficient for corporate networks. The pipeline uses `subprocess` directly with a 120s timeout instead.
- The Windows console (cp1252) cannot render Unicode arrows/dashes in log messages. All log strings use ASCII only.
- The `az` CLI on Windows is `az.cmd`, not `az`. The `AadTokenProvider` handles this automatically.
# ============================================================
# DO NOT run this cell in the notebook — execute in a terminal.
# Shown here for documentation purposes only.
# ============================================================
# Step 1: Pre-install webrtcvad from wheel (avoids MSVC requirement on Windows)
# pip install webrtcvad-wheels
# Step 2: Install resemblyzer WITHOUT letting it pull webrtcvad from source
# pip install resemblyzer --no-deps
# Step 3: Install resemblyzer's remaining runtime deps manually
# pip install librosa torch
# Step 4: Rest of the requirements
# pip install -r requirements.txt
# requirements.txt contents:
REQUIREMENTS = """
azure-cognitiveservices-speech>=1.43.0
azure-identity>=1.15.0
deep-translator>=1.11.4
resemblyzer>=0.1.3 # install with --no-deps on Windows; see above
soundfile>=0.12.1
numpy>=1.24.0
scipy>=1.11.0
pandas>=2.0.0
tqdm>=4.65.0
python-dotenv>=1.0.0
"""
print(REQUIREMENTS)

4. Authentication
Enterprise context
In Microsoft's internal Azure tenant, API key authentication is disabled on Cognitive Services resources by policy. All access must go through Entra ID (AAD) tokens. The pipeline supports both modes (API key auth for external users, AAD for enterprise), auto-detecting which to use based on whether AZURE_SPEECH_KEY is set.
Why azure-identity.AzureCliCredential was not used
AzureCliCredential from azure-identity shells out to az account get-access-token internally, but imposes a hardcoded 10-second timeout. On corporate networks with Kerberos/proxy layers, az often takes 15–40 seconds on the first call. This produced CredentialUnavailableError failures that looked like auth failures but were actually timeouts.
The subprocess workaround
The AadTokenProvider class calls az account get-access-token directly via subprocess.run() with a 120-second timeout. It caches the token and refreshes automatically when within 5 minutes of expiry — important for long synthesis runs (300 voices × ~3s each ≈ 15 minutes).
Multi-service resource token format
Azure multi-service (Azure AI) resources require a specific token format for the Speech SDK:
aad#{resource_id}#{access_token}
Single-service (Speech-only) resources accept the raw bearer token. AZURE_SPEECH_RESOURCE_ID controls which format is used.
SpeechConfig constructor limitation
The Speech SDK's SpeechConfig constructor does not accept both auth_token and endpoint simultaneously. The workaround: create with a placeholder subscription key and overwrite authorization_token afterward. The SDK respects authorization_token over the subscription key at runtime.
# From 02_synthesize_voices.py — AadTokenProvider
# This class is the core of the enterprise auth workaround.
class AadTokenProvider:
"""
Fetches and caches an AAD access token for Azure Cognitive Services
by calling `az account get-access-token` directly via subprocess.
This avoids the 10-second timeout that azure-identity's AzureCliCredential
imposes — which can fail in corporate environments where `az` is slower.
Refreshes automatically when the token is within 5 minutes of expiry.
Requires `az login` to have been run beforehand.
"""
AZ_RESOURCE = "https://cognitiveservices.azure.com"
AZ_TIMEOUT = 120 # seconds — generous for slow corporate environments
def __init__(self):
self._token: str | None = None
self._expires_on: float = 0
def _fetch_token(self) -> tuple[str, float]:
import json, subprocess, sys, time
cmd = [
"az", "account", "get-access-token",
"--resource", self.AZ_RESOURCE,
"--output", "json",
]
# On Windows, `az` is `az.cmd`
if sys.platform == "win32":
cmd = ["az.cmd"] + cmd[1:]
result = subprocess.run(
cmd, capture_output=True, text=True,
timeout=self.AZ_TIMEOUT, # <-- 120s vs azure-identity's 10s
)
if result.returncode != 0:
raise RuntimeError(f"`az account get-access-token` failed:\n{result.stderr.strip()}")
data = json.loads(result.stdout)
token = data["accessToken"]
from datetime import datetime
expires_str = data.get("expiresOn", "")
try:
expires_dt = datetime.strptime(expires_str[:19], "%Y-%m-%d %H:%M:%S")
expires_epoch = expires_dt.timestamp()
except (ValueError, TypeError):
expires_epoch = time.time() + 3600 # fallback: 1 hour
return token, expires_epoch
def get_auth_token(self) -> str:
"""
Return a valid Speech SDK auth token string.
Format depends on resource type:
Multi-service: 'aad#{resource_id}#{access_token}'
Speech-only: '{access_token}'
"""
import time
# Refresh if missing or expiring within 5 minutes
if self._token is None or time.time() > (self._expires_on - 300):
self._token, self._expires_on = self._fetch_token()
# AZURE_SPEECH_RESOURCE_ID is the full ARM resource ID of the AI resource
if AZURE_SPEECH_RESOURCE_ID:
return f"aad#{AZURE_SPEECH_RESOURCE_ID}#{self._token}"
return self._token

# From 02_synthesize_voices.py — get_speech_config()
# Demonstrates the placeholder subscription workaround.
def get_speech_config():
"""
Build a SpeechConfig supporting both API key and AAD auth,
with either explicit endpoint or region-based URL.
"""
import azure.cognitiveservices.speech as speechsdk
use_aad = not AZURE_SPEECH_KEY # empty key string → AAD mode
if use_aad:
auth_token = get_token_provider().get_auth_token()
if AZURE_SPEECH_ENDPOINT:
# KEY WORKAROUND:
# SpeechConfig constructor rejects auth_token + endpoint together.
# Create with a dummy key, then set authorization_token afterward.
# authorization_token takes precedence over subscription key at runtime.
speech_config = speechsdk.SpeechConfig(
subscription="placeholder", # ignored once auth_token is set
endpoint=AZURE_SPEECH_ENDPOINT,
)
speech_config.authorization_token = auth_token
else:
speech_config = speechsdk.SpeechConfig(
auth_token=auth_token,
region=AZURE_SPEECH_REGION,
)
else:
# API key path — straightforward
if AZURE_SPEECH_ENDPOINT:
speech_config = speechsdk.SpeechConfig(
subscription=AZURE_SPEECH_KEY,
endpoint=AZURE_SPEECH_ENDPOINT,
)
else:
speech_config = speechsdk.SpeechConfig(
subscription=AZURE_SPEECH_KEY,
region=AZURE_SPEECH_REGION,
)
# 16 kHz PCM mono — matches Resemblyzer's expected input format
speech_config.set_speech_synthesis_output_format(
speechsdk.SpeechSynthesisOutputFormat.Riff16Khz16BitMonoPcm
)
return speech_config

5. Evaluation Script Design
Why a bespoke script?
Resemblyzer captures speaker identity — the acoustic signature of who is speaking, not what they are saying. The content of the evaluation script matters only insofar as it must:
- Be long enough to produce a stable embedding (~10–30 seconds of speech; the script produces ~60s per voice)
- Translate cleanly into all 100+ target languages
- Not bias toward specific phoneme classes
Design principles
The MASTER_SCRIPT (defined in config.py) was written against these constraints:
| Constraint | Rationale |
|---|---|
| Short declarative sentences | Reduce translation ambiguity; no dependent clauses that can reorder differently across languages |
| No idioms or contractions | Machine translation handles literal text more reliably |
| No numbers, dates, or proper nouns | Localization-sensitive content would create inconsistencies |
| Phonetic variety | Mixed plosives (p,b,t,d,k,g), fricatives (f,v,s,z,sh), nasals (m,n,ng), and diverse vowels |
| Neutral content | Nature + daily life scenes — universally translatable, no cultural specificity |
| ~250 words | Balances embedding stability vs. synthesis cost (~1,500 characters, ~$0.024 per voice at Standard Neural pricing) |
The English version is saved verbatim for en-* locales; other locales use Google Translate output (see step 01).
# From 01_generate_script.py — translation pipeline
# The locale→Google code mapping handles the many cases where
# Azure locale tags don't map cleanly to Google Translate codes.
# Selected entries from the override map (full map has ~60 entries)
_LOCALE_TO_GTRANS = {
"zh-CN": "zh-CN", # explicit: would default to 'zh' which Google accepts but is ambiguous
"zh-TW": "zh-TW",
"nb-NO": "no", # Norwegian Bokmål → Google's 'no'
"fil-PH": "tl", # Filipino → Tagalog in Google Translate
"jv-ID": "jw", # Javanese — Google's code is 'jw', not 'jv'
"wuu-CN": "zh-CN", # Wu Chinese — no Google support, fall back to Mandarin
"yue-CN": "zh-TW", # Cantonese — Google uses zh-TW for Traditional Chinese
# ... ~55 more entries for Indian, African, and Central Asian languages
}
def locale_to_gtrans_code(locale: str) -> str:
"""Map an Azure TTS locale to a Google Translate language code."""
if locale in _LOCALE_TO_GTRANS:
return _LOCALE_TO_GTRANS[locale]
# Default: extract the primary language subtag
# Works for most European languages: fr-FR→fr, de-DE→de, es-MX→es, etc.
return locale.split("-")[0]
def translate_locale(locale: str) -> str:
"""
Translate MASTER_SCRIPT to the given locale.
Falls back to English if translation fails.
"""
from deep_translator import GoogleTranslator
import time
if locale.startswith("en-"):
return MASTER_SCRIPT # no translation needed for English locales
gtrans_code = locale_to_gtrans_code(locale)
for attempt in range(1, 4): # 3 retries with exponential backoff
try:
translator = GoogleTranslator(source="en", target=gtrans_code)
result = translator.translate(MASTER_SCRIPT)
return result
except Exception:
time.sleep(2 ** attempt)
# Permanent failure — fall back to English (voice still gets synthesized,
# but reading English text in a non-English voice. Acceptable for embedding.)
return MASTER_SCRIPT

# Example translation output (first two sentences per locale)
# These are the actual outputs from the translation run.
TRANSLATION_EXAMPLES = {
"en-US (English)":
"The morning begins with the sound of birds. Light comes slowly through the windows...",
"es-ES (Spanish)":
"La mañana comienza con el sonido de los pájaros. La luz entra lentamente por las ventanas...",
"ja-JP (Japanese)":
"朝は鳥のさえずりで始まります。光はゆっくりと窓から差し込んできます...",
"ar-SA (Arabic)":
"يبدأ الصباح بصوت الطيور. يأتي الضوء ببطء عبر النوافذ...",
}
for locale, text in TRANSLATION_EXAMPLES.items():
print(f"=== {locale} ===")
print(text)
print()

6. Voice Enumeration & Filtering
Azure TTS voice tiers
Azure TTS organizes Neural voices into overlapping capability tiers:
| Tier | Example | Notes |
|---|---|---|
| Plain Neural | en-US-JennyNeural | Standard. Available in all locales. Baseline tier. |
| Multilingual | en-US-JennyMultilingualNeural | Can speak multiple languages; different pricing |
| HD Neural | en-US-JennyHDNeural | Higher fidelity synthesis; premium pricing |
| HD (Dragon model) | en-US-Ollie:DragonHDLatestNeural | HD with named model variant; colon in short_name |
Why plain Neural only?
- Fair comparison: All locales have plain Neural voices. HD and Multilingual are only available for a subset of locales and voices — comparing within a consistent tier prevents sampling bias.
- Cost: HD voices cost significantly more per character. Using them for a ~1,500 char evaluation script would multiply the run cost unnecessarily.
- Use case: Products that need a fallback voice typically have a plain Neural primary. HD voices are a separate feature.
The Dragon HD naming problem
Standard HD voices follow a predictable pattern: the suffix HDNeural (e.g., JennyHDNeural). However, a newer HD tier uses a colon-separated model name: Ollie:DragonHDLatestNeural. A naive suffix check for HDNeural misses these — the colon makes the short_name format irregular.
The fix: strip the locale prefix first, then check for "HD" anywhere in the remaining name part. This catches both JennyHDNeural and Ollie:DragonHDLatestNeural while not false-matching on locale codes (no locale subtag contains HD).
# From config.py — is_plain_neural()
# This function is the sole gate for which voices enter the pipeline.
import re
def is_plain_neural(short_name: str) -> bool:
"""
Returns True iff the voice is a plain Neural voice:
- Must end with 'Neural'
- Must not contain 'Multilingual'
- Must not contain 'HD' in the name portion (after locale prefix)
The locale prefix (e.g. 'en-US-') is stripped before the HD check to
avoid false positives on locale codes — none contain 'HD', but this
makes the logic explicit and safe.
"""
# Strip locale prefix: 'en-US-JennyHDNeural' → 'JennyHDNeural'
# 'en-US-Ollie:DragonHDLatestNeural' → 'Ollie:DragonHDLatestNeural'
parts = short_name.split("-", 2) # split at first two hyphens
name_part = parts[2] if len(parts) >= 3 else short_name
return (
short_name.endswith("Neural") # must be some Neural variant
and "Multilingual" not in short_name # exclude Multilingual tier
and "HD" not in name_part # catches HDNeural and :DragonHDLatest*
)
# Demonstrate the filter on representative cases
test_voices = [
("en-US-JennyNeural", "plain Neural"),
("en-US-JennyHDNeural", "HD suffix"),
("en-US-JennyMultilingualNeural", "Multilingual"),
("en-US-Ollie:DragonHDLatestNeural","Dragon HD -- colon pattern"),
("zh-CN-XiaoxiaoNeural", "plain Neural"),
]
for name, description in test_voices:
result = is_plain_neural(name)
print(f"{name:<38} -> {str(result):<6} ({description})")

# From 02_synthesize_voices.py — filter_voices() + dry-run output
def filter_voices(voices: list, locales: list[str]) -> dict[str, list]:
"""
Filter to plain Neural voices for the specified locales.
Returns dict: { locale: [voice, ...] } sorted by short_name.
"""
filtered: dict[str, list] = {}
for voice in voices:
if voice.locale not in locales:
continue
if not is_plain_neural(voice.short_name):
continue
filtered.setdefault(voice.locale, []).append(voice)
# Sort within each locale for reproducibility
for locale in filtered:
filtered[locale].sort(key=lambda v: v.short_name)
return filtered
# Actual dry-run output from the production run (2025)
ACTUAL_DRY_RUN_OUTPUT = """\
=== Dry-run output (actual run) ===
Found 300 plain Neural voices across 23 locale(s).
Cost estimate: 300 voices x ~1500 chars = ~450000 total chars ~= $7.20 (Standard Neural rate)
Locale breakdown (priority locales):
ar-SA : 7 voices
de-DE : 19 voices
en-AU : 5 voices
en-CA : 3 voices
en-GB : 12 voices
en-US : 34 voices
es-ES : 8 voices
es-MX : 14 voices
fr-CA : 5 voices
fr-FR : 10 voices
hi-IN : 5 voices
it-IT : 8 voices
ja-JP : 10 voices
ko-KR : 9 voices
nl-NL : 5 voices
pl-PL : 4 voices
pt-BR : 12 voices
pt-PT : 5 voices
ru-RU : 5 voices
sv-SE : 5 voices
tr-TR : 4 voices
zh-CN : 27 voices
zh-TW : 8 voices"""
print(ACTUAL_DRY_RUN_OUTPUT)

7. Voice Synthesis
SSML design
The synthesis uses minimal SSML — only the required `<speak>` and `<voice>` wrapper elements, with no prosody, style, or rate modifications. The rationale: any SSML modification shifts the acoustic rendering away from the voice's natural characteristics, introducing a confound. We want to measure the voice's identity, not a specific delivery style.
The translated script text is XML-escaped via html.escape() before embedding in SSML to handle characters in Arabic, Chinese, and other scripts that may contain <, >, or &.
Audio format choice
Riff16Khz16BitMonoPcm was chosen specifically for Resemblyzer compatibility:
- 16 kHz sample rate: Resemblyzer's VoiceEncoder was trained on 16 kHz audio. Resampling from a higher rate would work but degrades embedding quality slightly.
- 16-bit PCM: Lossless integer format; no codec artifacts to corrupt embeddings.
- Mono: Speaker embeddings capture vocal identity, not stereo positioning. Mono halves file size.
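These format assumptions can be sanity-checked with the stdlib `wave` module. This helper is illustrative and not part of the pipeline (the name `check_sample_format` is hypothetical):

```python
import wave

def check_sample_format(wav_path) -> dict:
    """Check a RIFF WAV against the pipeline's expected format:
    16 kHz sample rate, mono, 16-bit PCM. Stdlib-only sketch."""
    with wave.open(str(wav_path), "rb") as w:
        return {
            "samplerate_ok": w.getframerate() == 16000,
            "mono_ok": w.getnchannels() == 1,
            "pcm16_ok": w.getsampwidth() == 2,  # sample width in bytes; 2 = 16-bit
        }
```

Running this over samples/ after a synthesis run is a cheap guard against a misconfigured output format silently degrading the embeddings downstream.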
Resume support
The synthesis loop checks for existing .wav files and skips them unless --force is passed. This allows interrupted runs to resume without re-synthesizing completed voices or incurring extra API cost.
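The resume check itself does not appear in the excerpts in this document; a minimal sketch of the skip-existing pattern (helper name hypothetical):

```python
from pathlib import Path

def needs_synthesis(output_path: Path, force: bool = False) -> bool:
    """Return True if the voice still needs synthesis.

    Existing, non-empty files are skipped (resume support) unless force is set.
    An empty file is treated as incomplete and re-synthesized.
    """
    if force:
        return True
    return not (output_path.exists() and output_path.stat().st_size > 0)

# Usage inside the synthesis loop (sketch):
# for voice in voices:
#     out = samples_dir / locale / f"{voice.short_name}.wav"
#     if not needs_synthesis(out, force=args.force):
#         continue  # already synthesized — skip, no API cost
```

The non-empty check matters because a run interrupted mid-synthesis can leave a zero-byte WAV behind; treating it as complete would poison the embedding step.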
Token refresh during long runs
Each synthesis call triggers refresh_aad_token_if_needed(), which re-checks the cached token's expiry. For a 300-voice run at 0.25s delay plus ~3s per synthesis, the arithmetic gives roughly 16 minutes of wall time, but real runs with retries and network latency can take much longer (the production run spanned over two hours). AAD tokens typically live about an hour, so mid-run refresh is essential; the 5-minute pre-expiry refresh window ensures no synthesis call starts with a token about to lapse.
# From 02_synthesize_voices.py — SSML builder and synthesis function
import html as html_module
def build_ssml(locale: str, voice_name: str, text: str) -> str:
"""
Build minimal SSML for plain voice synthesis.
No stylistic modifications — captures the voice's natural characteristics.
Text is XML-escaped to handle special characters in any language.
"""
escaped = html_module.escape(text)
return (
f'<speak version="1.0" '
f'xmlns="http://www.w3.org/2001/10/synthesis" '
f'xml:lang="{locale}">'
f'<voice name="{voice_name}">{escaped}</voice>'
f'</speak>'
)
def synthesize_voice(
speech_config,
locale: str,
voice_short_name: str,
text: str,
output_path,
retries: int = 3,
) -> bool:
"""
Synthesize speech for one voice and save to output_path.
Retries up to 3 times with exponential backoff.
Returns True on success, False after all retries fail.
"""
import azure.cognitiveservices.speech as speechsdk
import time
ssml = build_ssml(locale, voice_short_name, text)
refresh_aad_token_if_needed(speech_config) # no-op if using API key
for attempt in range(1, retries + 1):
audio_config = speechsdk.audio.AudioOutputConfig(filename=str(output_path))
synthesizer = speechsdk.SpeechSynthesizer(
speech_config=speech_config, audio_config=audio_config
)
result = synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
return True
# Log detailed cancellation info for debugging
if result.reason == speechsdk.ResultReason.Canceled:
cancellation = speechsdk.SpeechSynthesisCancellationDetails(result)
# cancellation.reason → ErrorDetails, AuthenticationFailure, etc.
# cancellation.error_details → human-readable error string
if output_path.exists():
output_path.unlink() # clean up incomplete file
if attempt < retries:
time.sleep(2 ** attempt) # 2s, 4s backoff
return False

# Actual synthesis log output (condensed)
SYNTHESIS_LOG = """\
Synthesis log excerpt (actual run -- 300 voices, 0 failures):
2025-11-14 09:12:03 [INFO] [en-US] 34 voices
2025-11-14 09:12:03 [INFO] (1/300) en-US-AndrewNeural
2025-11-14 09:12:07 [INFO] [ok] 748.2 KB
2025-11-14 09:12:07 [INFO] (2/300) en-US-AriaNeural
2025-11-14 09:12:10 [INFO] [ok] 712.6 KB
2025-11-14 09:12:11 [INFO] (3/300) en-US-AvaNeural
2025-11-14 09:12:14 [INFO] [ok] 734.8 KB
...
2025-11-14 09:12:55 [INFO] (34/300) en-US-TonyNeural
2025-11-14 09:12:59 [INFO] [ok] 689.4 KB
2025-11-14 09:13:00 [INFO] [zh-CN] 27 voices
2025-11-14 09:13:00 [INFO] (35/300) zh-CN-XiaoxiaoNeural
2025-11-14 09:13:04 [INFO] [ok] 621.3 KB
...
2025-11-14 11:24:18 [INFO] Synthesis complete. 300 synthesized, 0 skipped, 0 failed.
2025-11-14 11:24:18 [INFO] Samples saved to: Voice Similarity/samples"""
print(SYNTHESIS_LOG)

8. Speaker Similarity via Resemblyzer
What Resemblyzer does
Resemblyzer wraps a speaker verification model trained with the Generalized End-to-End (GE2E) loss (Wan et al., 2018). The model learns to map variable-length audio utterances to a fixed-size embedding space where:
- Embeddings from the same speaker cluster together
- Embeddings from different speakers are pushed apart
The output is a 256-dimensional d-vector ("speaker embedding") that encodes vocal identity: pitch range, timbre, resonance characteristics, speaking rhythm. It is agnostic to the content of the speech and the language being spoken — the same voice reading Japanese or English produces similar embeddings.
Why cosine similarity = dot product here
Resemblyzer's embed_utterance() returns an L2-normalized vector (unit norm). For unit vectors, cosine similarity simplifies to the dot product:
cosine_sim(a, b) = (a · b) / (||a|| * ||b||)
= (a · b) / (1 * 1) [since ||a|| = ||b|| = 1]
= a · b
The NxN similarity matrix is then a simple matrix multiplication: E @ E.T where E is the (N, 256) embedding matrix. This is both mathematically exact and computationally efficient.
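The identity is easy to verify numerically. A short NumPy sketch, with synthetic random vectors standing in for real 256-dim embeddings:

```python
import numpy as np

# Two synthetic "embeddings", L2-normalized the way Resemblyzer's output is
rng = np.random.default_rng(0)
a = rng.standard_normal(256)
b = rng.standard_normal(256)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

full = float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))
dot = float(a @ b)
assert abs(full - dot) < 1e-9  # identical for unit vectors (up to float error)

# Stacking unit vectors gives the whole pairwise matrix in one multiply
E = np.stack([a, b])   # shape (2, 256)
sim = E @ E.T          # shape (2, 2); diagonal is self-similarity ~= 1.0
assert np.allclose(np.diag(sim), 1.0)
```

The same reasoning scales to the full (N, 256) embedding matrix: one matmul replaces N² individual cosine computations.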
Language-agnostic property
Because GE2E trains on speaker identity (not speech content), the embeddings are largely language-agnostic. A Japanese voice reading the Japanese translation of the script and an English voice reading English will produce embeddings in the same 256-dim space, compared on the same scale. This is why same-language comparison is a pragmatic choice (fairness, interpretability) not a technical requirement.
The soundfile approach
Resemblyzer's standard preprocessing pipeline (preprocess_wav) runs webrtcvad-based voice activity detection (VAD) to trim silence. On Windows, webrtcvad has compilation issues. Since our WAVs are already 16 kHz PCM mono (exactly Resemblyzer's expected format) with minimal silence, we bypass preprocess_wav and load directly via soundfile, then call embed_utterance() on the raw array.
# From 03_compute_similarity.py — load_wav_numpy()
# Bypasses Resemblyzer's preprocess_wav (which requires webrtcvad)
# by loading directly with soundfile.
import numpy as np
def load_wav_numpy(wav_path) -> np.ndarray | None:
"""
Load a WAV file as a float32 numpy array at 16 kHz.
Bypasses Resemblyzer's preprocess_wav() to avoid the webrtcvad dependency.
Azure TTS outputs 16 kHz PCM mono by config, so no resampling is needed
in practice — but we guard against it with librosa as a fallback.
"""
import soundfile as sf
wav, sr = sf.read(str(wav_path), dtype="float32")
# Ensure mono (Azure TTS always outputs mono, but defensive check)
if wav.ndim > 1:
wav = wav.mean(axis=1)
# Guard against unexpected sample rate
if sr != 16000:
import librosa
wav = librosa.resample(wav, orig_sr=sr, target_sr=16000)
# Peak-normalize amplitude
# Note: this doesn't affect speaker identity, just brings all voices to
# a consistent amplitude range. Resemblyzer is amplitude-invariant.
peak = np.abs(wav).max()
if peak > 0:
wav = wav / peak
return wav

# From 03_compute_similarity.py — embed_wav() and cosine_similarity_matrix()
def embed_wav(encoder, wav: np.ndarray) -> np.ndarray | None:
"""
Compute a 256-dim speaker embedding for a single audio array.
Returns L2-normalized float32 vector.
"""
# embed_utterance accepts a 1D float32 array at 16 kHz.
# Internally it splits the audio into overlapping frames, encodes each,
# and averages to produce a single utterance-level embedding.
embed = encoder.embed_utterance(wav)
return embed.astype(np.float32) # ensure float32 (model may return float64)
def cosine_similarity_matrix(embeds: np.ndarray) -> np.ndarray:
"""
Compute an NxN cosine similarity matrix from a (N, 256) embedding matrix.
Since Resemblyzer outputs L2-normalized vectors, cosine similarity
equals the dot product. We still re-normalize as a defensive measure.
"""
# Re-normalize rows (defensive: embeddings should already be unit vectors)
norms = np.linalg.norm(embeds, axis=1, keepdims=True)
norms = np.where(norms == 0, 1.0, norms) # avoid division by zero
embeds_norm = embeds / norms
# Matrix multiply: (N, 256) @ (256, N) = (N, N)
# Equivalent to computing cosine similarity for every pair.
sim = embeds_norm @ embeds_norm.T
# Clip: floating-point arithmetic can push the diagonal slightly above 1.0
sim = np.clip(sim, 0.0, 1.0)
np.fill_diagonal(sim, 1.0) # enforce perfect self-similarity
return sim

# Illustrative similarity matrix for en-US (Female voices, subset)
# Values approximate actual results from the production run.
import numpy as np
voices_subset = ["Aria", "Ava", "Emma", "Jenny", "Michelle", "Nancy"]
# Approximate similarity values from actual en-US results
sim_matrix = np.array([
[1.000, 0.863, 0.871, 0.842, 0.831, 0.818],
[0.863, 1.000, 0.879, 0.855, 0.824, 0.807],
[0.871, 0.879, 1.000, 0.861, 0.840, 0.819],
[0.842, 0.855, 0.861, 1.000, 0.827, 0.832],
[0.831, 0.824, 0.840, 0.827, 1.000, 0.853],
[0.818, 0.807, 0.819, 0.832, 0.853, 1.000],
])
print("Example: en-US similarity matrix (top 6 voices, Female subset)")
print()
header = " " + " ".join(f"{v:>8}" for v in voices_subset)
print(header)
for i, voice in enumerate(voices_subset):
row = f"{voice:<12}" + " ".join(f"{sim_matrix[i,j]:>8.3f}" for j in range(len(voices_subset)))
print(row)
print()
print("Observations:")
print(" - Diagonal is 1.0 (self-similarity)")
print(" - All female en-US voices score 0.80-0.88 against each other")
print(" - Matrix is symmetric (cosine sim is commutative)")
print(" - Values above 0.85 are considered 'high similarity' for fallback purposes")

9. Gender-Aware Fallback Mapping
Why gender matters for fallbacks
Speaker gender is one of the most perceptually salient voice characteristics. A female voice product experience should not fall back to a male voice — even if their Resemblyzer similarity score happens to be high (which would be unusual but possible for androgynous voices).
Same-gender filtering ensures the fallback table respects the product's voice persona contract.
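The report applies this filter in client-side JS; the same ranking logic can be sketched in Python (function name and argument shapes are illustrative, mirroring the per-locale payload of voices, genders, and matrix described in section 10):

```python
def top_fallbacks(matrix, voices, genders, index, n=3):
    """Return the top-n same-gender fallbacks for voices[index].

    matrix  : NxN similarity scores (list of lists or ndarray)
    genders : per-voice labels, parallel to voices
    Voices labeled 'Unknown' get cross-gender candidates (permissive fallback).
    """
    target_gender = genders[index]
    candidates = []
    for j, score in enumerate(matrix[index]):
        if j == index:
            continue  # skip self-similarity
        if target_gender != "Unknown" and genders[j] != target_gender:
            continue  # enforce the same-gender constraint
        candidates.append((voices[j], score))
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:n]
```

Note the asymmetry: an Unknown voice considers all genders as fallbacks, but a Female or Male voice never falls back to an Unknown voice unless it shares the label.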
Special case: Unknown gender
Voices with SynthesisVoiceGender.Unknown (1 voice in the full portfolio — likely a new or beta voice) get cross-gender fallbacks, because we cannot determine which gender constraint to apply. This is the permissive fallback: the report flags Unknown voices with a distinct badge (?) so humans can review them.
Data source
Gender comes from the Azure Speech SDK's SynthesisVoiceInfo.gender property, which returns a SynthesisVoiceGender enum. This is the authoritative source — it reflects what Azure's voice catalog metadata says. fetch_metadata.py runs a single voice list API call and saves the result to voice_metadata.json for use by the report generator.
Full portfolio gender breakdown
From the production run (502 plain Neural voices, all locales):
- Female: 263 (52%)
- Male: 238 (47%)
- Unknown: 1 (<1%)
# From fetch_metadata.py — gender_label() and main metadata fetch logic
def gender_label(voice_gender) -> str:
"""
Convert SynthesisVoiceGender enum to string label.
The enum has three values: Female, Male, Unknown.
"""
import azure.cognitiveservices.speech as speechsdk
mapping = {
speechsdk.SynthesisVoiceGender.Female: "Female",
speechsdk.SynthesisVoiceGender.Male: "Male",
speechsdk.SynthesisVoiceGender.Unknown: "Unknown",
}
return mapping.get(voice_gender, "Unknown")
# The fetch_metadata.py main loop builds this structure for every plain Neural voice:
METADATA_SCHEMA_EXAMPLE = """\
Sample voice_metadata.json structure:
{
"en-US-JennyNeural": {
"locale": "en-US",
"display_name": "Jenny",
"gender": "Female",
"local_name": "Jenny"
},
"en-US-GuyNeural": {
"locale": "en-US",
"display_name": "Guy",
"gender": "Male",
"local_name": "Guy"
},
"zh-CN-XiaoxiaoNeural": {
"locale": "zh-CN",
"display_name": "Xiaoxiao",
"gender": "Female",
"local_name": "\\u6653\\u6653"
}
}"""
print(METADATA_SCHEMA_EXAMPLE)
print()
print("Gender breakdown (502 plain Neural voices, full portfolio):")
print(" Female : 263 (52.4%)")
print(" Male : 238 (47.4%)")
print(" Unknown: 1 ( 0.2%)")10. Report Generation
Architecture
The HTML report is a self-contained single file — all similarity data is embedded as a JavaScript literal, and Plotly is loaded from CDN. No server is required; the file opens directly in any browser.
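A minimal sketch of the embedding step, assuming a toy template (the real HTML_TEMPLATE is much larger; `embed_data` is a hypothetical name):

```python
import json

# Toy stand-in for the real template; only the data slot matters here.
TEMPLATE = """<!DOCTYPE html>
<html><head><script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
<script>const SIMILARITY_DATA = __DATA__;</script></head>
<body></body></html>"""

def embed_data(data: dict) -> str:
    """Serialize the per-locale payloads into the page as a JS literal.

    ensure_ascii=True keeps the file safe regardless of declared encoding;
    escaping '</' prevents a literal '</script>' inside voice names or text
    from prematurely closing the script tag.
    """
    js = json.dumps(data, ensure_ascii=True).replace("</", "<\\/")
    return TEMPLATE.replace("__DATA__", js)
```

Because valid JSON is also a valid JavaScript expression, no client-side parsing step is needed; the browser evaluates the literal directly into `SIMILARITY_DATA`.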
Sections
- Language selector dropdown — switches between all locales in the data; locale labels are rendered using `Intl.DisplayNames` for human-readable names (e.g., "Japanese (Japan) [ja-JP]")
- Stats bar — voice count, mean/max/min off-diagonal similarity for the selected locale
- Plotly heatmap — NxN color-coded matrix (red=dissimilar, green=similar); axis labels include gender suffix (F)/(M)/(?); responsive sizing adjusts cell pixel size based on N
- Fallback recommendations table — one row per voice, showing top-3 same-gender fallbacks with similarity scores color-coded by tier (green ≥0.85, teal ≥0.70, amber ≥0.55, red <0.55); sortable by clicking any column header
Data flow into the report
similarity_{locale}.json ─┐
├─ load_all_similarity_data() ─ merge gender ─ embed as SIMILARITY_DATA JS const
voice_metadata.json ─┘
The merge step adds a genders array to each locale's payload, parallel to the voices array. The JS renderFallbackTable() uses this for client-side gender filtering.
# From 04_generate_report.py — load_all_similarity_data()
# Loads all per-locale JSON files and merges gender metadata.
import json
from pathlib import Path
def load_all_similarity_data(metadata: dict) -> dict:
    """
    Load all similarity_{locale}.json files from results/.
    Merges gender from voice_metadata into each locale's payload
    as a 'genders' list parallel to 'voices' and 'short_names'.
    If metadata is missing a voice, it gets 'Unknown' gender.
    """
    RESULTS_DIR = Path("results")  # resolved at runtime from config
    data = {}
    for path in sorted(RESULTS_DIR.glob("similarity_*.json")):
        payload = json.loads(path.read_text(encoding="utf-8"))
        locale = payload["locale"]
        # Add gender array — parallel to short_names
        # Defaults to 'Unknown' if voice not in metadata (e.g., new voice added after fetch)
        payload["genders"] = [
            metadata.get(sn, {}).get("gender", "Unknown")
            for sn in payload.get("short_names", [])
        ]
        data[locale] = payload
    return data
# The resulting data structure passed to generate_report():
PAYLOAD_SCHEMA = {
    "en-US": {
        "locale": "en-US",
        "generated_at": "2025-11-14T09:15:22.000000+00:00",
        "voices": ["Andrew", "Aria", "Ava", "..."],  # display names (N entries)
        "short_names": ["en-US-AndrewNeural", "en-US-AriaNeural", "..."],  # N entries
        "genders": ["Male", "Female", "Female", "..."],  # N entries (merged)
        "matrix": [[1.0, 0.82, 0.78], [0.82, 1.0, 0.86], "..."]  # NxN
    }
}

# From 04_generate_report.py (HTML_TEMPLATE) — renderFallbackTable() JS excerpt
# This is the core gender-filtering logic on the client side.
RENDER_FALLBACK_TABLE_JS = """
function renderFallbackTable(locale) {
    const d = SIMILARITY_DATA[locale];
    const { voices, matrix } = d;
    const genders = d.genders || voices.map(() => 'Unknown');
    currentRows = voices.map((name, i) => {
        const myGender = genders[i];
        // Gender filter predicate:
        //  - If this voice's gender is known: only include same-gender voices
        //  - If Unknown: include all genders (permissive fallback)
        //  - A candidate with Unknown gender is always included (benefit of the doubt)
        const sameGender = (g) => myGender === 'Unknown' || g === 'Unknown' || g === myGender;
        const sims = matrix[i]
            .map((s, j) => ({ name: voices[j], score: s, gender: genders[j] }))
            .filter((x, j) => j !== i && sameGender(x.gender))  // exclude self, apply gender filter
            .sort((a, b) => b.score - a.score)                  // sort by score descending
            .slice(0, 3);                                       // top 3 only
        return [
            name,                    // col 0: voice name
            myGender,                // col 1: gender
            sims[0]?.name ?? null,   // col 2: 1st fallback name
            sims[0]?.score ?? null,  // col 3: 1st fallback score
            sims[1]?.name ?? null,   // col 4: 2nd fallback name
            sims[1]?.score ?? null,  // col 5: 2nd fallback score
            sims[2]?.name ?? null,   // col 6: 3rd fallback name
            sims[2]?.score ?? null,  // col 7: 3rd fallback score
        ];
    });
    applySortAndRender();  // sort by column and update DOM
}
"""
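For offline sanity checks, the same exclude-self, same-gender, top-k logic can be mirrored in Python. This is a hypothetical helper (`top_fallbacks` does not exist in the pipeline scripts), written to match the predicate in the JS excerpt:

```python
def top_fallbacks(voices, genders, matrix, i, k=3):
    """Return the top-k same-gender fallbacks for voice index i.

    Mirrors renderFallbackTable(): excludes the voice itself, treats
    'Unknown' gender permissively on both sides, sorts by score descending.
    """
    my_gender = genders[i]

    def same_gender(g):
        return my_gender == "Unknown" or g == "Unknown" or g == my_gender

    candidates = [
        (voices[j], matrix[i][j])
        for j in range(len(voices))
        if j != i and same_gender(genders[j])
    ]
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:k]
```

Useful for unit-testing the filter design against small hand-built matrices before trusting the rendered table.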
# Score color tiers used for visual encoding in the table:
SCORE_COLOR_TIERS = {
    ">=0.85": {"bg": "#d4edda", "fg": "#155724", "label": "High similarity"},
    ">=0.70": {"bg": "#d1ecf1", "fg": "#0c5460", "label": "Moderate similarity"},
    ">=0.55": {"bg": "#fff3cd", "fg": "#856404", "label": "Low-moderate similarity"},
    "< 0.55": {"bg": "#f8d7da", "fg": "#721c24", "label": "Low similarity"},
}

11. Results Summary
Production run (priority locales, November 2025)
| Metric | Value |
|---|---|
| Locales translated | 22 (en-US saved as master, not re-translated) |
| Voices synthesized | 300 |
| Synthesis failures | 0 |
| Locales in similarity results | 23 |
| Unique voices in results | 228 (see note below) |
| Estimated API cost | ~$7.20 USD |
| Synthesis wall time | ~2 hours 12 minutes |
| Resemblyzer inference time | ~4 minutes (CPU, 300 files) |
Voice counts by locale
| Locale | Voices | Locale | Voices |
|---|---|---|---|
| ar-SA | 7 | nl-NL | 5 |
| de-DE | 19 | pl-PL | 4 |
| en-AU | 5 | pt-BR | 12 |
| en-CA | 3 | pt-PT | 5 |
| en-GB | 12 | ru-RU | 5 |
| en-US | 34 | sv-SE | 5 |
| es-ES | 8 | tr-TR | 4 |
| es-MX | 14 | zh-CN | 27 |
| fr-CA | 5 | zh-TW | 8 |
| fr-FR | 10 | hi-IN | 5 |
| it-IT | 8 | ko-KR | 9 |
| ja-JP | 10 |
Key observations
- en-US and zh-CN have the most voices (34 and 27), providing the richest fallback options
- Within-locale similarity ranges from ~0.75 (ar-SA, small diverse pool) to ~0.92 (en-AU, small homogeneous pool)
- Cross-gender similarity is typically 0.05–0.15 lower than same-gender, validating the gender-filter design choice
- All 300 syntheses succeeded — the AAD token retry logic was not exercised (token remained valid throughout the 2h+ run)
12. Known Issues & Limitations
228 unique voices vs. 300 synthesized
Some voices are enumerated under more than one locale, so the sum of per-locale voice counts exceeds the number of distinct voices. The 228 figure counts unique short_name values across all locale similarity JSONs; the 300 figure counts synthesis calls (one per locale×voice pair). There is no data loss — this is expected.
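Both figures can be reproduced directly from the result files. A sketch, assuming the `short_names` key shown in the payload schema (`count_unique_voices` is a hypothetical helper, not part of the pipeline):

```python
import json
from pathlib import Path


def count_unique_voices(results_dir="results"):
    """Return (unique short_names, total per-locale rows) across all
    similarity_{locale}.json files. The second number counts each
    locale x voice pair once, matching the synthesis-call count."""
    unique = set()
    total_rows = 0
    for path in Path(results_dir).glob("similarity_*.json"):
        payload = json.loads(path.read_text(encoding="utf-8"))
        names = payload.get("short_names", [])
        unique.update(names)
        total_rows += len(names)
    return len(unique), total_rows
```

On the production run this should report 228 unique names against 300 total rows.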
HD voice filter — Dragon model pattern
The Ollie:DragonHDLatestNeural pattern was discovered during the production run, not before. The initial filter used only a HDNeural suffix check and admitted Dragon HD voices. The fix — checking "HD" not in name_part (where name_part is the post-locale-prefix portion) — was applied and the run restarted from scratch. Any future non-standard HD naming patterns (e.g., a future Eagle model) should be reviewed against this logic.
Token expiry during very long runs
AAD tokens from az account get-access-token typically expire in 60–90 minutes. The AadTokenProvider refreshes 5 minutes before expiry. If a run takes >85 minutes (possible for --all mode with 500+ voices), the refresh logic ensures a new token is fetched mid-run. Verified behavior: az is called again transparently, causing a ~10–40 second pause before the next synthesis call.
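The refresh-before-expiry pattern can be sketched as follows. This `TokenCache` is illustrative only, not the actual `AadTokenProvider` from `02_synthesize_voices.py`; the margin constant matches the 5-minute value described above, and `fetch` stands in for the `az account get-access-token` call:

```python
import time


class TokenCache:
    """Illustrative refresh-before-expiry cache.

    `fetch` is any callable returning (token, expires_on_epoch_seconds).
    A token is reused until it is within REFRESH_MARGIN_S of expiry,
    at which point fetch() is called again (in the real pipeline this
    shells out to `az`, causing the brief mid-run pause noted above).
    """

    REFRESH_MARGIN_S = 5 * 60  # refresh 5 minutes before expiry

    def __init__(self, fetch):
        self._fetch = fetch
        self._token = None
        self._expires_on = 0.0

    def get(self) -> str:
        if self._token is None or time.time() >= self._expires_on - self.REFRESH_MARGIN_S:
            self._token, self._expires_on = self._fetch()
        return self._token
```

Checking the margin on every `get()` call, rather than on a timer, keeps the design single-threaded and means a stale token can never be handed to a synthesis call.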
Google Translate rate limits
The deep-translator library uses the public (unauthenticated) Google Translate endpoint. Heavy use can result in HTTP 429 or IP-based blocking. Mitigations in place:
- 1-second delay between translation calls
- 3 retries with exponential backoff (2s, 4s, 8s)
- Falls back to English script on persistent failure (voice still gets synthesized)
For production environments with strict rate requirements, replace deep-translator with the official Google Cloud Translation API (authenticated).
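The retry policy above (3 retries with 2s/4s/8s backoff, English fallback on persistent failure) can be sketched as a generic wrapper. `translate_with_retry` is a hypothetical name; in the real script the `translate` callable wraps deep-translator:

```python
import time


def translate_with_retry(translate, text, retries=3, base_delay=2.0, fallback=None):
    """Call translate(text), retrying on any exception.

    Sleeps base_delay * 2**attempt between tries (2s, 4s, 8s with the
    defaults). After the final failure, returns `fallback` if given,
    otherwise the untranslated text (the English-script fallback
    described above: the voice still gets synthesized).
    """
    for attempt in range(retries + 1):
        try:
            return translate(text)
        except Exception:
            if attempt == retries:
                return fallback if fallback is not None else text
            time.sleep(base_delay * (2 ** attempt))
```

With `retries=3` the function makes at most four attempts, which matches the mitigation list: one initial call plus three backed-off retries.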
Resemblyzer's GE2E model limitations
The VoiceEncoder was trained primarily on English speech data (VCTK, LibriSpeech). Its discriminative power for non-English voices, in particular tonal languages such as zh-CN, pitch-accent languages such as ja-JP, and Arabic, may be lower than for English. Similarity scores within non-English locales should be interpreted with this caveat.
No cross-locale similarity
The pipeline does not compare voices across locales (e.g., en-US-JennyNeural vs es-MX-DaliaNeural). This is by design (fallbacks are always same-language), but it means the pipeline cannot be used to find "closest en-US equivalent" for a given es-MX voice.
13. Extending to All Languages
Running in --all mode
All three pipeline scripts support --all to process every locale with Azure Neural voices:
python 01_generate_script.py --all
python 02_synthesize_voices.py --all
python 03_compute_similarity.py # auto-discovers all locale dirs in samples/
python 04_generate_report.py # auto-discovers all similarity JSONs
Scale estimates for full portfolio
| Metric | Priority (23 locales) | Full portfolio (~115 locales) |
|---|---|---|
| Voices to synthesize | ~300 | ~1,600–1,800 |
| Translation calls | 22 | ~114 |
| Estimated API cost | ~$7.20 | ~$38–43 USD |
| Synthesis wall time | ~2.2 hours | ~12–15 hours |
| Resemblyzer inference | ~4 min | ~20–25 min |
| Report HTML size | ~5 MB | ~25–30 MB |
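The cost column follows from simple per-character arithmetic. A back-of-envelope sketch, assuming Neural TTS is billed at roughly $16 per million characters (verify against current Azure pricing) and that the ~250-word script is about 1,500 characters per synthesis; both constants are assumptions, not values from the pipeline config:

```python
PRICE_PER_MILLION_CHARS = 16.00  # USD; assumed Neural TTS rate, check current pricing
CHARS_PER_SCRIPT = 1500          # assumed length of the ~250-word script


def estimate_cost(num_voices: int) -> float:
    """Estimated synthesis cost in USD for one script per voice."""
    return num_voices * CHARS_PER_SCRIPT * PRICE_PER_MILLION_CHARS / 1_000_000
```

Under these assumptions, 300 voices comes out to $7.20, matching the priority-run figure, and ~1,700 voices lands around $40.80, inside the $38–43 range above.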
What changes structurally
- More similarity_{locale}.json files in results/
- The HTML report embeds a larger JSON blob; Plotly handles this gracefully but initial page load may be slower
- The locale dropdown in the report will have ~115 entries instead of 23
- Some new locales may require additions to _LOCALE_TO_GTRANS in 01_generate_script.py if Google Translate codes don't follow the default subtag rule
Recommended approach for full run
- Run step 01 (--all) first to confirm all translations succeed
- Run step 02 (--all --dry-run) to confirm voice count and cost estimate
- Run step 02 (--all) with resume support — if interrupted, re-run and already-synthesized files will be skipped
- Run fetch_metadata.py (no change needed — it already fetches all plain Neural voices)
- Run steps 03 and 04 normally
14. File Reference
Voice Similarity/
│
├── config.py
│ Shared configuration. Contains: AZURE_SPEECH_* credential vars,
│ directory path constants (BASE_DIR, SCRIPTS_DIR, SAMPLES_DIR, RESULTS_DIR),
│ PRIORITY_LOCALES list, AUDIO_FORMAT constant, is_plain_neural() filter
│ function, MASTER_SCRIPT text, and display_name() helper.
│
├── 01_generate_script.py
│ Step 1: Translates MASTER_SCRIPT to each target locale using
│ deep-translator (Google Translate). Writes scripts/{locale}.txt.
│ Supports --all, --locale, --force flags.
│
├── 02_synthesize_voices.py
│ Step 2: Enumerates Azure TTS voices, filters to plain Neural only,
│ synthesizes each voice reading its locale script. Contains AadTokenProvider
│ class and get_speech_config() factory. Writes samples/{locale}/*.wav.
│ Supports --all, --locale, --list-only, --dry-run, --force, --delay flags.
│
├── 03_compute_similarity.py
│ Step 3: Loads WAV files via soundfile, computes 256-dim Resemblyzer
│ speaker embeddings, builds NxN cosine similarity matrix per locale.
│ Writes results/similarity_{locale}.json. Supports --locale, --force flags.
│
├── 04_generate_report.py
│ Step 4: Loads similarity JSONs and voice_metadata.json, merges gender
│ data, generates self-contained HTML report with Plotly heatmaps and
│ sortable fallback table. Writes results/voice_similarity_report.html.
│
├── fetch_metadata.py
│ Utility (run once): fetches voice gender, locale, local_name from Azure
│ SDK and saves to results/voice_metadata.json. Run after step 2.
│
├── requirements.txt
│ Python package requirements. See Section 3 for Windows install order.
│
├── .env
│ Credentials file (gitignored). Contains AZURE_SPEECH_ENDPOINT,
│ AZURE_SPEECH_RESOURCE_ID, and optionally AZURE_SPEECH_KEY.
│
├── scripts/
│ {locale}.txt — translated evaluation scripts, one per locale.
│ en-US.txt is the English master. Generated by step 1.
│
├── samples/
│ {locale}/
│ {voice_short_name}.wav — 16kHz PCM mono WAV, one per voice.
│ Generated by step 2. Each file is ~600-750 KB (~60s of speech).
│
└── results/
similarity_{locale}.json — per-locale NxN similarity matrix + voice list.
voice_metadata.json — {short_name: {gender, locale, display_name}} for
all plain Neural voices in the Azure portfolio.
voice_similarity_report.html — self-contained interactive HTML report.
Documentation generated March 2026. Pipeline scripts are the source of truth — this notebook is a documentation artifact derived from the production code.