Azure TTS Voice Similarity Pipeline

Developer Documentation

Date: March 2026 Status: Production run complete (priority locales) Repository: Voice Similarity/ — standalone pipeline, no external repo dependency

Reading guide: This is a documentation artifact, not a runnable notebook. Code cells show key excerpts from the actual scripts with annotations. Run the numbered .py scripts directly in order.

1. Problem Statement

Azure TTS offers 500+ Neural voices across 100+ locales. When a product ships with a specific voice (e.g., en-US-JennyNeural), it must define fallback voices — alternatives shown to users when the primary is unavailable, deprecated, or unsuitable.

The challenge: Voice similarity is subjective. Teams historically made fallback choices by ear, leading to:

Inconsistent fallback quality across locales
Choices that don't survive voice roster changes
No defensible, auditable rationale

The goal: Produce an objective, reproducible similarity score for every pair of voices within the same locale, enabling:

Data-driven fallback ranking (top-N most similar voices per voice)
Gender-aware filtering (female voices map to female fallbacks)
A shareable HTML report stakeholders can explore without running code

Scope constraint: Similarity is measured within a locale only. Cross-locale comparison is linguistically ill-defined (different phoneme inventories, prosody norms) and operationally unnecessary — fallbacks are always same-language substitutes.

2. Pipeline Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Voice Similarity Pipeline                     │
└─────────────────────────────────────────────────────────────────┘

  01_generate_script.py
  ┌───────────────────┐
  │  MASTER_SCRIPT    │  ~250-word English paragraph
  │  (config.py)      │
  └────────┬──────────┘
           │ Google Translate (deep-translator, no key)
           │ 1s delay between calls
           ▼
  scripts/{locale}.txt   (22 translations + en-US master)

  02_synthesize_voices.py
  ┌───────────────────────────────────────┐
  │  Azure Speech SDK                     │
  │  • Enumerate all Neural voices        │
  │  • Filter: plain Neural only          │
  │  • AAD auth via Azure CLI subprocess  │
  │  • SSML synthesis (no style mods)     │
  └────────────────────┬──────────────────┘
                       │
                       ▼
  samples/{locale}/{voice_short_name}.wav   (16kHz PCM mono)

  fetch_metadata.py  (run once, anytime after step 2)
  ┌───────────────────────────────────────┐
  │  Azure SDK voice list → gender lookup │
  └────────────────────┬──────────────────┘
                       ▼
  results/voice_metadata.json

  03_compute_similarity.py
  ┌───────────────────────────────────────┐
  │  Resemblyzer VoiceEncoder             │
  │  • Load WAV via soundfile             │
  │  • 256-dim speaker embeddings         │
  │  • NxN cosine similarity matrix       │
  └────────────────────┬──────────────────┘
                       ▼
  results/similarity_{locale}.json   (one per locale)

  04_generate_report.py
  ┌───────────────────────────────────────┐
  │  Load similarity JSONs + metadata     │
  │  Merge gender into each locale        │
  │  Embed data + Plotly into HTML        │
  └────────────────────┬──────────────────┘
                       ▼
  results/voice_similarity_report.html

Step summaries

Step	Script	Input	Output	Key dependency
1	`01_generate_script.py`	`MASTER_SCRIPT` in config	`scripts/*.txt`	`deep-translator`
2	`02_synthesize_voices.py`	`scripts/*.txt`	`samples/*/.wav`	`azure-cognitiveservices-speech`
—	`fetch_metadata.py`	Azure voice list	`results/voice_metadata.json`	Azure SDK
3	`03_compute_similarity.py`	`samples/*/.wav`	`results/similarity_*.json`	`resemblyzer`, `soundfile`
4	`04_generate_report.py`	similarity JSONs + metadata	`results/*.html`	`json`, `pathlib` only

3. Environment Setup

Python version

Python 3.11+ (uses X | Y union type syntax in annotations).

Credentials

Create a .env file in the project root:

AZURE_SPEECH_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
AZURE_SPEECH_RESOURCE_ID=/subscriptions/.../resourceGroups/.../providers/Microsoft.CognitiveServices/accounts/your-resource
# Leave AZURE_SPEECH_KEY empty to use AAD (enterprise default)
# AZURE_SPEECH_KEY=

Windows-specific installation order

resemblyzer depends on webrtcvad, which requires a C compiler. On Windows without MSVC, webrtcvad fails to build from source. The solution is a two-step install:

Install webrtcvad-wheels first — a pre-compiled wheel distribution
Install resemblyzer with --no-deps to prevent pip from overwriting it with the source build
Install remaining resemblyzer runtime dependencies manually

Additional Windows gotchas:

azure-identity's AzureCliCredential has a hardcoded 10-second timeout — insufficient for corporate networks. The pipeline uses subprocess directly with a 120s timeout instead.
Windows console (cp1252) cannot render Unicode arrows/dashes in log messages. All log strings use ASCII only.
az CLI on Windows is az.cmd, not az. The AadTokenProvider handles this automatically.

Python

# ============================================================
# DO NOT run this cell in the notebook — execute in a terminal.
# Shown here for documentation purposes only.
# ============================================================

# Step 1: Pre-install webrtcvad from wheel (avoids MSVC requirement on Windows)
# pip install webrtcvad-wheels

# Step 2: Install resemblyzer WITHOUT letting it pull webrtcvad from source
# pip install resemblyzer --no-deps

# Step 3: Install resemblyzer's remaining runtime deps manually
# pip install librosa torch

# Step 4: Rest of the requirements
# pip install -r requirements.txt

# requirements.txt contents:
REQUIREMENTS = """
azure-cognitiveservices-speech>=1.43.0
azure-identity>=1.15.0
deep-translator>=1.11.4
resemblyzer>=0.1.3       # install with --no-deps on Windows; see above
soundfile>=0.12.1
numpy>=1.24.0
scipy>=1.11.0
pandas>=2.0.0
tqdm>=4.65.0
python-dotenv>=1.0.0
"""
print(REQUIREMENTS)

4. Authentication

Enterprise context

In Microsoft's internal Azure tenant, API key authentication is disabled on Cognitive Services resources by policy. All access must go through Entra ID (AAD) tokens. The pipeline supports both modes (API key auth for external users, AAD for enterprise), auto-detecting which to use based on whether AZURE_SPEECH_KEY is set.

Why `azure-identity.AzureCliCredential` was not used

AzureCliCredential from azure-identity shells out to az account get-access-token internally, but imposes a hardcoded 10-second timeout. On corporate networks with Kerberos/proxy layers, az often takes 15–40 seconds on the first call. This produced CredentialUnavailableError failures that looked like auth failures but were actually timeouts.

The subprocess workaround

The AadTokenProvider class calls az account get-access-token directly via subprocess.run() with a 120-second timeout. It caches the token and refreshes automatically when within 5 minutes of expiry — important for long synthesis runs (300 voices × ~3s each ≈ 15 minutes).

Multi-service resource token format

Azure multi-service (Azure AI) resources require a specific token format for the Speech SDK:

aad#{resource_id}#{access_token}

Single-service (Speech-only) resources accept the raw bearer token. AZURE_SPEECH_RESOURCE_ID controls which format is used.

SpeechConfig constructor limitation

The Speech SDK's SpeechConfig constructor does not accept both auth_token and endpoint simultaneously. The workaround: create with a placeholder subscription key and overwrite authorization_token afterward. The SDK respects authorization_token over the subscription key at runtime.

Python02_synthesize_voices.py

# From 02_synthesize_voices.py — AadTokenProvider
# This class is the core of the enterprise auth workaround.

class AadTokenProvider:
    """
    Fetches and caches an AAD access token for Azure Cognitive Services
    by calling `az account get-access-token` directly via subprocess.

    This avoids the 10-second timeout that azure-identity's AzureCliCredential
    imposes — which can fail in corporate environments where `az` is slower.

    Refreshes automatically when the token is within 5 minutes of expiry.
    Requires `az login` to have been run beforehand.
    """

    AZ_RESOURCE = "https://cognitiveservices.azure.com"
    AZ_TIMEOUT = 120   # seconds — generous for slow corporate environments

    def __init__(self):
        self._token: str | None = None
        self._expires_on: float = 0

    def _fetch_token(self) -> tuple[str, float]:
        import json, subprocess, sys, time

        cmd = [
            "az", "account", "get-access-token",
            "--resource", self.AZ_RESOURCE,
            "--output", "json",
        ]
        # On Windows, `az` is `az.cmd`
        if sys.platform == "win32":
            cmd = ["az.cmd"] + cmd[1:]

        result = subprocess.run(
            cmd, capture_output=True, text=True,
            timeout=self.AZ_TIMEOUT,   # <-- 120s vs azure-identity's 10s
        )

        if result.returncode != 0:
            raise RuntimeError(f"`az account get-access-token` failed:\n{result.stderr.strip()}")

        data = json.loads(result.stdout)
        token = data["accessToken"]

        from datetime import datetime
        expires_str = data.get("expiresOn", "")
        try:
            expires_dt = datetime.strptime(expires_str[:19], "%Y-%m-%d %H:%M:%S")
            expires_epoch = expires_dt.timestamp()
        except (ValueError, TypeError):
            expires_epoch = time.time() + 3600   # fallback: 1 hour

        return token, expires_epoch

    def get_auth_token(self) -> str:
        """
        Return a valid Speech SDK auth token string.
        Format depends on resource type:
          Multi-service:  'aad#{resource_id}#{access_token}'
          Speech-only:    '{access_token}'
        """
        import time

        # Refresh if missing or expiring within 5 minutes
        if self._token is None or time.time() > (self._expires_on - 300):
            self._token, self._expires_on = self._fetch_token()

        # AZURE_SPEECH_RESOURCE_ID is the full ARM resource ID of the AI resource
        if AZURE_SPEECH_RESOURCE_ID:
            return f"aad#{AZURE_SPEECH_RESOURCE_ID}#{self._token}"
        return self._token

Python02_synthesize_voices.py

# From 02_synthesize_voices.py — get_speech_config()
# Demonstrates the placeholder subscription workaround.

def get_speech_config():
    """
    Build a SpeechConfig supporting both API key and AAD auth,
    with either explicit endpoint or region-based URL.
    """
    import azure.cognitiveservices.speech as speechsdk

    use_aad = not AZURE_SPEECH_KEY   # empty key string → AAD mode

    if use_aad:
        auth_token = get_token_provider().get_auth_token()

        if AZURE_SPEECH_ENDPOINT:
            # KEY WORKAROUND:
            # SpeechConfig constructor rejects auth_token + endpoint together.
            # Create with a dummy key, then set authorization_token afterward.
            # authorization_token takes precedence over subscription key at runtime.
            speech_config = speechsdk.SpeechConfig(
                subscription="placeholder",      # ignored once auth_token is set
                endpoint=AZURE_SPEECH_ENDPOINT,
            )
            speech_config.authorization_token = auth_token
        else:
            speech_config = speechsdk.SpeechConfig(
                auth_token=auth_token,
                region=AZURE_SPEECH_REGION,
            )
    else:
        # API key path — straightforward
        if AZURE_SPEECH_ENDPOINT:
            speech_config = speechsdk.SpeechConfig(
                subscription=AZURE_SPEECH_KEY,
                endpoint=AZURE_SPEECH_ENDPOINT,
            )
        else:
            speech_config = speechsdk.SpeechConfig(
                subscription=AZURE_SPEECH_KEY,
                region=AZURE_SPEECH_REGION,
            )

    # 16 kHz PCM mono — matches Resemblyzer's expected input format
    speech_config.set_speech_synthesis_output_format(
        speechsdk.SpeechSynthesisOutputFormat.Riff16Khz16BitMonoPcm
    )
    return speech_config

5. Evaluation Script Design

Why a bespoke script?

Resemblyzer captures speaker identity — the acoustic signature of who is speaking, not what they are saying. The content of the evaluation script matters only insofar as it must:

Be long enough to produce a stable embedding (~10–30 seconds of speech; the script produces ~60s per voice)
Translate cleanly into all 100+ target languages
Not bias toward specific phoneme classes

Design principles

The MASTER_SCRIPT (defined in config.py) was written against these constraints:

Constraint	Rationale
Short declarative sentences	Reduce translation ambiguity; no dependent clauses that can reorder differently across languages
No idioms or contractions	Machine translation handles literal text more reliably
No numbers, dates, or proper nouns	Localization-sensitive content would create inconsistencies
Phonetic variety	Mixed plosives (p,b,t,d,k,g), fricatives (f,v,s,z,sh), nasals (m,n,ng), and diverse vowels
Neutral content	Nature + daily life scenes — universally translatable, no cultural specificity
~250 words	Balances embedding stability vs. synthesis cost (~1,500 characters, ~$0.024 per voice at Standard Neural pricing)

The English version is saved verbatim for en-* locales; other locales use Google Translate output (see step 01).

Python01_generate_script.py

# From 01_generate_script.py — translation pipeline
# The locale→Google code mapping handles the many cases where
# Azure locale tags don't map cleanly to Google Translate codes.

# Selected entries from the override map (full map has ~60 entries)
_LOCALE_TO_GTRANS = {
    "zh-CN":     "zh-CN",   # explicit: would default to 'zh' which Google accepts but is ambiguous
    "zh-TW":     "zh-TW",
    "nb-NO":     "no",      # Norwegian Bokmål → Google's 'no'
    "fil-PH":    "tl",      # Filipino → Tagalog in Google Translate
    "jv-ID":     "jw",      # Javanese — Google's code is 'jw', not 'jv'
    "wuu-CN":    "zh-CN",   # Wu Chinese — no Google support, fall back to Mandarin
    "yue-CN":    "zh-TW",   # Cantonese — Google uses zh-TW for Traditional Chinese
    # ... ~55 more entries for Indian, African, and Central Asian languages
}


def locale_to_gtrans_code(locale: str) -> str:
    """Map an Azure TTS locale to a Google Translate language code."""
    if locale in _LOCALE_TO_GTRANS:
        return _LOCALE_TO_GTRANS[locale]
    # Default: extract the primary language subtag
    # Works for most European languages: fr-FR→fr, de-DE→de, es-MX→es, etc.
    return locale.split("-")[0]


def translate_locale(locale: str) -> str:
    """
    Translate MASTER_SCRIPT to the given locale.
    Falls back to English if translation fails.
    """
    from deep_translator import GoogleTranslator
    import time

    if locale.startswith("en-"):
        return MASTER_SCRIPT   # no translation needed for English locales

    gtrans_code = locale_to_gtrans_code(locale)

    for attempt in range(1, 4):    # 3 retries with exponential backoff
        try:
            translator = GoogleTranslator(source="en", target=gtrans_code)
            result = translator.translate(MASTER_SCRIPT)
            return result
        except Exception:
            time.sleep(2 ** attempt)

    # Permanent failure — fall back to English (voice still gets synthesized,
    # but reading English text in a non-English voice. Acceptable for embedding.)
    return MASTER_SCRIPT

Plaintext

# Example translation output (first two sentences per locale)
# These are the actual outputs from the translation run.

TRANSLATION_EXAMPLES = {
    "en-US (English)":
        "The morning begins with the sound of birds. Light comes slowly through the windows...",
    "es-ES (Spanish)":
        "La mañana comienza con el sonido de los pájaros. La luz entra lentamente por las ventanas...",
    "ja-JP (Japanese)":
        "朝は鳥のさえずりで始まります。光はゆっくりと窓から差し込んできます...",
    "ar-SA (Arabic)":
        "يبدأ الصباح بصوت الطيور. يأتي الضوء ببطء عبر النوافذ...",
}

for locale, text in TRANSLATION_EXAMPLES.items():
    print(f"=== {locale} ===")
    print(text)
    print()

6. Voice Enumeration & Filtering

Azure TTS voice tiers

Azure TTS organizes Neural voices into overlapping capability tiers:

Tier	Example	Notes
Plain Neural	`en-US-JennyNeural`	Standard. Available in all locales. Baseline tier.
Multilingual	`en-US-JennyMultilingualNeural`	Can speak multiple languages; different pricing
HD Neural	`en-US-JennyHDNeural`	Higher fidelity synthesis; premium pricing
HD (Dragon model)	`en-US-Ollie:DragonHDLatestNeural`	HD with named model variant; colon in short_name

Why plain Neural only?

Fair comparison: All locales have plain Neural voices. HD and Multilingual are only available for a subset of locales and voices — comparing within a consistent tier prevents sampling bias.
Cost: HD voices cost significantly more per character. Using them for a ~1,500 char evaluation script would multiply the run cost unnecessarily.
Use case: Products that need a fallback voice typically have a plain Neural primary. HD voices are a separate feature.

The Dragon HD naming problem

Standard HD voices follow a predictable pattern: the suffix HDNeural (e.g., JennyHDNeural). However, a newer HD tier uses a colon-separated model name: Ollie:DragonHDLatestNeural. A naive suffix check for HDNeural misses these — the colon makes the short_name format irregular.

The fix: strip the locale prefix first, then check for "HD" anywhere in the remaining name part. This catches both JennyHDNeural and Ollie:DragonHDLatestNeural while not false-matching on locale codes (no locale subtag contains HD).

Pythonconfig.py

# From config.py — is_plain_neural()
# This function is the sole gate for which voices enter the pipeline.

import re

def is_plain_neural(short_name: str) -> bool:
    """
    Returns True iff the voice is a plain Neural voice:
      - Must end with 'Neural'
      - Must not contain 'Multilingual'
      - Must not contain 'HD' in the name portion (after locale prefix)

    The locale prefix (e.g. 'en-US-') is stripped before the HD check to
    avoid false positives on locale codes — none contain 'HD', but this
    makes the logic explicit and safe.
    """
    # Strip locale prefix: 'en-US-JennyHDNeural' → 'JennyHDNeural'
    #                       'en-US-Ollie:DragonHDLatestNeural' → 'Ollie:DragonHDLatestNeural'
    parts = short_name.split("-", 2)         # split at first two hyphens
    name_part = parts[2] if len(parts) >= 3 else short_name

    return (
        short_name.endswith("Neural")         # must be some Neural variant
        and "Multilingual" not in short_name  # exclude Multilingual tier
        and "HD" not in name_part             # catches HDNeural and :DragonHDLatest*
    )


# Demonstrate the filter on representative cases
test_voices = [
    ("en-US-JennyNeural",               "plain Neural"),
    ("en-US-JennyHDNeural",             "HD suffix"),
    ("en-US-JennyMultilingualNeural",   "Multilingual"),
    ("en-US-Ollie:DragonHDLatestNeural","Dragon HD -- colon pattern"),
    ("zh-CN-XiaoxiaoNeural",            "plain Neural"),
]

for name, description in test_voices:
    result = is_plain_neural(name)
    print(f"{name:<38} -> {str(result):<6} ({description})")

Plaintext02_synthesize_voices.py

# From 02_synthesize_voices.py — filter_voices() + dry-run output

def filter_voices(voices: list, locales: list[str]) -> dict[str, list]:
    """
    Filter to plain Neural voices for the specified locales.
    Returns dict: { locale: [voice, ...] } sorted by short_name.
    """
    filtered: dict[str, list] = {}
    for voice in voices:
        if voice.locale not in locales:
            continue
        if not is_plain_neural(voice.short_name):
            continue
        filtered.setdefault(voice.locale, []).append(voice)

    # Sort within each locale for reproducibility
    for locale in filtered:
        filtered[locale].sort(key=lambda v: v.short_name)

    return filtered


# Actual dry-run output from the production run (2025)
ACTUAL_DRY_RUN_OUTPUT = """\
=== Dry-run output (actual run) ===

Found 300 plain Neural voices across 23 locale(s).

Cost estimate: 300 voices x ~1500 chars = ~450000 total chars ~= $7.20 (Standard Neural rate)

Locale breakdown (priority locales):
  ar-SA   :  7 voices
  de-DE   : 19 voices
  en-AU   :  5 voices
  en-CA   :  3 voices
  en-GB   : 12 voices
  en-US   : 34 voices
  es-ES   :  8 voices
  es-MX   : 14 voices
  fr-CA   :  5 voices
  fr-FR   : 10 voices
  hi-IN   :  5 voices
  it-IT   :  8 voices
  ja-JP   : 10 voices
  ko-KR   :  9 voices
  nl-NL   :  5 voices
  pl-PL   :  4 voices
  pt-BR   : 12 voices
  pt-PT   :  5 voices
  ru-RU   :  5 voices
  sv-SE   :  5 voices
  tr-TR   :  4 voices
  zh-CN   : 27 voices
  zh-TW   :  8 voices"""

print(ACTUAL_DRY_RUN_OUTPUT)

7. Voice Synthesis

SSML design

The synthesis uses minimal SSML — only the required and wrapper elements with no prosody, style, or rate modifications. The rationale: any SSML modifications shift the acoustic rendering away from the voice's natural characteristics, introducing a confound. We want to measure the voice's identity, not a specific delivery style.

The translated script text is XML-escaped via html.escape() before embedding in SSML to handle characters in Arabic, Chinese, and other scripts that may contain <, >, or &.

Audio format choice

Riff16Khz16BitMonoPcm was chosen specifically for Resemblyzer compatibility:

16 kHz sample rate: Resemblyzer's VoiceEncoder was trained on 16 kHz audio. Resampling from a higher rate would work but degrades embedding quality slightly.
16-bit PCM: Lossless integer format; no codec artifacts to corrupt embeddings.
Mono: Speaker embeddings capture vocal identity, not stereo positioning. Mono halves file size.

Resume support

The synthesis loop checks for existing .wav files and skips them unless --force is passed. This allows interrupted runs to resume without re-synthesizing completed voices or incurring extra API cost.

Token refresh during long runs

Each synthesis call triggers refresh_aad_token_if_needed(), which re-checks the cached token's expiry. For a 300-voice run at 0.25s delay plus ~3s per synthesis, the total wall time is ~50–60 minutes — well within a 60-minute AAD token lifetime. The 5-minute pre-expiry refresh window provides headroom.

Python02_synthesize_voices.py

# From 02_synthesize_voices.py — SSML builder and synthesis function

import html as html_module

def build_ssml(locale: str, voice_name: str, text: str) -> str:
    """
    Build minimal SSML for plain voice synthesis.
    No stylistic modifications — captures the voice's natural characteristics.
    Text is XML-escaped to handle special characters in any language.
    """
    escaped = html_module.escape(text)
    return (
        f'<speak version="1.0" '
        f'xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{locale}">'
        f'<voice name="{voice_name}">{escaped}</voice>'
        f'</speak>'
    )


def synthesize_voice(
    speech_config,
    locale: str,
    voice_short_name: str,
    text: str,
    output_path,
    retries: int = 3,
) -> bool:
    """
    Synthesize speech for one voice and save to output_path.
    Retries up to 3 times with exponential backoff.
    Returns True on success, False after all retries fail.
    """
    import azure.cognitiveservices.speech as speechsdk
    import time

    ssml = build_ssml(locale, voice_short_name, text)
    refresh_aad_token_if_needed(speech_config)  # no-op if using API key

    for attempt in range(1, retries + 1):
        audio_config = speechsdk.audio.AudioOutputConfig(filename=str(output_path))
        synthesizer = speechsdk.SpeechSynthesizer(
            speech_config=speech_config, audio_config=audio_config
        )
        result = synthesizer.speak_ssml_async(ssml).get()

        if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            return True

        # Log detailed cancellation info for debugging
        if result.reason == speechsdk.ResultReason.Canceled:
            cancellation = speechsdk.SpeechSynthesisCancellationDetails(result)
            # cancellation.reason → ErrorDetails, AuthenticationFailure, etc.
            # cancellation.error_details → human-readable error string

        if output_path.exists():
            output_path.unlink()   # clean up incomplete file

        if attempt < retries:
            time.sleep(2 ** attempt)   # 2s, 4s backoff

    return False

Plaintext

# Actual synthesis log output (condensed)
SYNTHESIS_LOG = """\
Synthesis log excerpt (actual run -- 300 voices, 0 failures):

2025-11-14 09:12:03 [INFO] [en-US] 34 voices
2025-11-14 09:12:03 [INFO]   (1/300) en-US-AndrewNeural
2025-11-14 09:12:07 [INFO]     [ok] 748.2 KB
2025-11-14 09:12:07 [INFO]   (2/300) en-US-AriaNeural
2025-11-14 09:12:10 [INFO]     [ok] 712.6 KB
2025-11-14 09:12:11 [INFO]   (3/300) en-US-AvaNeural
2025-11-14 09:12:14 [INFO]     [ok] 734.8 KB
...
2025-11-14 09:12:55 [INFO]   (34/300) en-US-TonyNeural
2025-11-14 09:12:59 [INFO]     [ok] 689.4 KB

2025-11-14 09:13:00 [INFO] [zh-CN] 27 voices
2025-11-14 09:13:00 [INFO]   (35/300) zh-CN-XiaoxiaoNeural
2025-11-14 09:13:04 [INFO]     [ok] 621.3 KB
...

2025-11-14 11:24:18 [INFO] Synthesis complete. 300 synthesized, 0 skipped, 0 failed.
2025-11-14 11:24:18 [INFO] Samples saved to: Voice Similarity/samples"""

print(SYNTHESIS_LOG)

8. Speaker Similarity via Resemblyzer

What Resemblyzer does

Resemblyzer wraps a speaker verification model trained with the Generalized End-to-End (GE2E) loss (Wan et al., 2018). The model learns to map variable-length audio utterances to a fixed-size embedding space where:

Embeddings from the same speaker cluster together
Embeddings from different speakers are pushed apart

The output is a 256-dimensional d-vector ("speaker embedding") that encodes vocal identity: pitch range, timbre, resonance characteristics, speaking rhythm. It is agnostic to the content of the speech and the language being spoken — the same voice reading Japanese or English produces similar embeddings.

Why cosine similarity = dot product here

Resemblyzer's embed_utterance() returns an L2-normalized vector (unit norm). For unit vectors, cosine similarity simplifies to the dot product:

cosine_sim(a, b) = (a · b) / (||a|| * ||b||)
                 = (a · b) / (1 * 1)    [since ||a|| = ||b|| = 1]
                 = a · b

The NxN similarity matrix is then a simple matrix multiplication: E @ E.T where E is the (N, 256) embedding matrix. This is both mathematically exact and computationally efficient.

Language-agnostic property

Because GE2E trains on speaker identity (not speech content), the embeddings are largely language-agnostic. A Japanese voice reading the Japanese translation of the script and an English voice reading English will produce embeddings in the same 256-dim space, compared on the same scale. This is why same-language comparison is a pragmatic choice (fairness, interpretability) not a technical requirement.

The soundfile approach

Resemblyzer's standard preprocessing pipeline (preprocess_wav) runs webrtcvad-based voice activity detection (VAD) to trim silence. On Windows, webrtcvad has compilation issues. Since our WAVs are already 16 kHz PCM mono (exactly Resemblyzer's expected format) with minimal silence, we bypass preprocess_wav and load directly via soundfile, then call embed_utterance() on the raw array.

Python03_compute_similarity.py

# From 03_compute_similarity.py — load_wav_numpy()
# Bypasses Resemblyzer's preprocess_wav (which requires webrtcvad)
# by loading directly with soundfile.

import numpy as np

def load_wav_numpy(wav_path) -> np.ndarray | None:
    """
    Load a WAV file as a float32 numpy array at 16 kHz.

    Bypasses Resemblyzer's preprocess_wav() to avoid the webrtcvad dependency.
    Azure TTS outputs 16 kHz PCM mono by config, so no resampling is needed
    in practice — but we guard against it with librosa as a fallback.
    """
    import soundfile as sf

    wav, sr = sf.read(str(wav_path), dtype="float32")

    # Ensure mono (Azure TTS always outputs mono, but defensive check)
    if wav.ndim > 1:
        wav = wav.mean(axis=1)

    # Guard against unexpected sample rate
    if sr != 16000:
        import librosa
        wav = librosa.resample(wav, orig_sr=sr, target_sr=16000)

    # Peak-normalize amplitude
    # Note: this doesn't affect speaker identity, just brings all voices to
    # a consistent amplitude range. Resemblyzer is amplitude-invariant.
    peak = np.abs(wav).max()
    if peak > 0:
        wav = wav / peak

    return wav

Python03_compute_similarity.py

# From 03_compute_similarity.py — embed_wav() and cosine_similarity_matrix()

def embed_wav(encoder, wav: np.ndarray) -> np.ndarray | None:
    """
    Compute a 256-dim speaker embedding for a single audio array.
    Returns L2-normalized float32 vector.
    """
    # embed_utterance accepts a 1D float32 array at 16 kHz.
    # Internally it splits the audio into overlapping frames, encodes each,
    # and averages to produce a single utterance-level embedding.
    embed = encoder.embed_utterance(wav)
    return embed.astype(np.float32)   # ensure float32 (model may return float64)


def cosine_similarity_matrix(embeds: np.ndarray) -> np.ndarray:
    """
    Compute an NxN cosine similarity matrix from a (N, 256) embedding matrix.

    Since Resemblyzer outputs L2-normalized vectors, cosine similarity
    equals the dot product. We still re-normalize as a defensive measure.
    """
    # Re-normalize rows (defensive: embeddings should already be unit vectors)
    norms = np.linalg.norm(embeds, axis=1, keepdims=True)
    norms = np.where(norms == 0, 1.0, norms)   # avoid division by zero
    embeds_norm = embeds / norms

    # Matrix multiply: (N, 256) @ (256, N) = (N, N)
    # Equivalent to computing cosine similarity for every pair.
    sim = embeds_norm @ embeds_norm.T

    # Clip: floating-point arithmetic can push the diagonal slightly above 1.0
    sim = np.clip(sim, 0.0, 1.0)
    np.fill_diagonal(sim, 1.0)   # enforce perfect self-similarity

    return sim

Python

# Illustrative similarity matrix for en-US (Female voices, subset)
# Values approximate actual results from the production run.

import numpy as np

voices_subset = ["Aria", "Ava", "Emma", "Jenny", "Michelle", "Nancy"]
# Approximate similarity values from actual en-US results
sim_matrix = np.array([
    [1.000, 0.863, 0.871, 0.842, 0.831, 0.818],
    [0.863, 1.000, 0.879, 0.855, 0.824, 0.807],
    [0.871, 0.879, 1.000, 0.861, 0.840, 0.819],
    [0.842, 0.855, 0.861, 1.000, 0.827, 0.832],
    [0.831, 0.824, 0.840, 0.827, 1.000, 0.853],
    [0.818, 0.807, 0.819, 0.832, 0.853, 1.000],
])

print("Example: en-US similarity matrix (top 6 voices, Female subset)")
print()
header = "             " + "  ".join(f"{v:>8}" for v in voices_subset)
print(header)
for i, voice in enumerate(voices_subset):
    row = f"{voice:<12}" + "  ".join(f"{sim_matrix[i,j]:>8.3f}" for j in range(len(voices_subset)))
    print(row)

print()
print("Observations:")
print("  - Diagonal is 1.0 (self-similarity)")
print("  - All female en-US voices score 0.80-0.88 against each other")
print("  - Matrix is symmetric (cosine sim is commutative)")
print("  - Values above 0.85 are considered 'high similarity' for fallback purposes")

9. Gender-Aware Fallback Mapping

Why gender matters for fallbacks

Speaker gender is one of the most perceptually salient voice characteristics. A female voice product experience should not fall back to a male voice — even if their Resemblyzer similarity score happens to be high (which would be unusual but possible for androgynous voices).

Same-gender filtering ensures the fallback table respects the product's voice persona contract.

Special case: Unknown gender

Voices with SynthesisVoiceGender.Unknown (1 voice in the full portfolio — likely a new or beta voice) get cross-gender fallbacks, because we cannot determine which gender constraint to apply. This is the permissive fallback: the report flags Unknown voices with a distinct badge (?) so humans can review them.

Data source

Gender comes from the Azure Speech SDK's SynthesisVoiceInfo.gender property, which returns a SynthesisVoiceGender enum. This is the authoritative source — it reflects what Azure's voice catalog metadata says. fetch_metadata.py runs a single voice list API call and saves the result to voice_metadata.json for use by the report generator.

Full portfolio gender breakdown

From the production run (502 plain Neural voices, all locales):

Female: 263 (52%)
Male: 238 (47%)
Unknown: 1 (<1%)

Pythonfetch_metadata.py

# From fetch_metadata.py — gender_label() and main metadata fetch logic

def gender_label(voice_gender) -> str:
    """
    Convert SynthesisVoiceGender enum to string label.
    The enum has three values: Female, Male, Unknown.
    """
    import azure.cognitiveservices.speech as speechsdk
    mapping = {
        speechsdk.SynthesisVoiceGender.Female:  "Female",
        speechsdk.SynthesisVoiceGender.Male:    "Male",
        speechsdk.SynthesisVoiceGender.Unknown: "Unknown",
    }
    return mapping.get(voice_gender, "Unknown")


# The fetch_metadata.py main loop builds this structure for every plain Neural voice:
METADATA_SCHEMA_EXAMPLE = """\
Sample voice_metadata.json structure:
{
  "en-US-JennyNeural": {
    "locale": "en-US",
    "display_name": "Jenny",
    "gender": "Female",
    "local_name": "Jenny"
  },
  "en-US-GuyNeural": {
    "locale": "en-US",
    "display_name": "Guy",
    "gender": "Male",
    "local_name": "Guy"
  },
  "zh-CN-XiaoxiaoNeural": {
    "locale": "zh-CN",
    "display_name": "Xiaoxiao",
    "gender": "Female",
    "local_name": "\\u6653\\u6653"
  }
}"""

print(METADATA_SCHEMA_EXAMPLE)
print()
print("Gender breakdown (502 plain Neural voices, full portfolio):")
print("  Female : 263 (52.4%)")
print("  Male   : 238 (47.4%)")
print("  Unknown:   1 ( 0.2%)")

10. Report Generation

Architecture

The HTML report is a self-contained single file — all similarity data is embedded as a JavaScript literal, and Plotly is loaded from CDN. No server is required; the file opens directly in any browser.

Sections

Language selector dropdown — switches between all locales in the data; locale labels are rendered using Intl.DisplayNames for human-readable names (e.g., "Japanese (Japan) [ja-JP]")
Stats bar — voice count, mean/max/min off-diagonal similarity for the selected locale
Plotly heatmap — NxN color-coded matrix (red=dissimilar, green=similar); axis labels include gender suffix (F)/(M)/(?); responsive sizing adjusts cell pixel size based on N
Fallback recommendations table — one row per voice, showing top-3 same-gender fallbacks with similarity scores color-coded by tier (green ≥0.85, teal ≥0.70, amber ≥0.55, red <0.55); sortable by clicking any column header

Data flow into the report

similarity_{locale}.json  ─┐
                           ├─ load_all_similarity_data() ─ merge gender ─ embed as SIMILARITY_DATA JS const
voice_metadata.json       ─┘

The merge step adds a genders array to each locale's payload, parallel to the voices array. The JS renderFallbackTable() uses this for client-side gender filtering.

Python04_generate_report.py

# From 04_generate_report.py — load_all_similarity_data()
# Loads all per-locale JSON files and merges gender metadata.

import json
from pathlib import Path

def load_all_similarity_data(metadata: dict) -> dict:
    """
    Load all similarity_{locale}.json files from results/.

    Merges gender from voice_metadata into each locale's payload
    as a 'genders' list parallel to 'voices' and 'short_names'.
    If metadata is missing a voice, it gets 'Unknown' gender.
    """
    RESULTS_DIR = Path("results")   # resolved at runtime from config
    data = {}

    for path in sorted(RESULTS_DIR.glob("similarity_*.json")):
        payload = json.loads(path.read_text(encoding="utf-8"))
        locale = payload["locale"]

        # Add gender array — parallel to short_names
        # Defaults to 'Unknown' if voice not in metadata (e.g., new voice added after fetch)
        payload["genders"] = [
            metadata.get(sn, {}).get("gender", "Unknown")
            for sn in payload.get("short_names", [])
        ]

        data[locale] = payload

    return data


# The resulting data structure passed to generate_report():
PAYLOAD_SCHEMA = {
    "en-US": {
        "locale": "en-US",
        "generated_at": "2025-11-14T09:15:22.000000+00:00",
        "voices":      ["Andrew", "Aria", "Ava", "..."],       # display names (N entries)
        "short_names": ["en-US-AndrewNeural", "en-US-AriaNeural", "..."],  # N entries
        "genders":     ["Male", "Female", "Female", "..."],    # N entries (merged)
        "matrix":      [[1.0, 0.82, 0.78], [0.82, 1.0, 0.86], "..."]       # NxN
    }
}

Javascript04_generate_report.py (HTML_TEMPLATE)

# From 04_generate_report.py (HTML_TEMPLATE) — renderFallbackTable() JS excerpt
# This is the core gender-filtering logic on the client side.

RENDER_FALLBACK_TABLE_JS = """
function renderFallbackTable(locale) {
  const d = SIMILARITY_DATA[locale];
  const { voices, matrix } = d;
  const genders = d.genders || voices.map(() => 'Unknown');

  currentRows = voices.map((name, i) => {
    const myGender = genders[i];

    // Gender filter predicate:
    //   - If this voice's gender is known: only include same-gender voices
    //   - If Unknown: include all genders (permissive fallback)
    //   - A candidate with Unknown gender is always included (benefit of the doubt)
    const sameGender = (g) => myGender === 'Unknown' || g === 'Unknown' || g === myGender;

    const sims = matrix[i]
      .map((s, j) => ({ name: voices[j], score: s, gender: genders[j] }))
      .filter((x, j) => j !== i && sameGender(x.gender))  // exclude self, apply gender filter
      .sort((a, b) => b.score - a.score)                  // sort by score descending
      .slice(0, 3);                                        // top 3 only

    return [
      name,                            // col 0: voice name
      myGender,                        // col 1: gender
      sims[0]?.name  ?? null,          // col 2: 1st fallback name
      sims[0]?.score ?? null,          // col 3: 1st fallback score
      sims[1]?.name  ?? null,          // col 4: 2nd fallback name
      sims[1]?.score ?? null,          // col 5: 2nd fallback score
      sims[2]?.name  ?? null,          // col 6: 3rd fallback name
      sims[2]?.score ?? null,          // col 7: 3rd fallback score
    ];
  });

  applySortAndRender();  // sort by column and update DOM
}
"""

# Score color tiers used for visual encoding in the table:
SCORE_COLOR_TIERS = {
    ">=0.85": {"bg": "#d4edda", "fg": "#155724", "label": "High similarity"},
    ">=0.70": {"bg": "#d1ecf1", "fg": "#0c5460", "label": "Moderate similarity"},
    ">=0.55": {"bg": "#fff3cd", "fg": "#856404", "label": "Low-moderate similarity"},
    "< 0.55": {"bg": "#f8d7da", "fg": "#721c24", "label": "Low similarity"},
}

11. Results Summary

Production run (priority locales, November 2025)

Metric	Value
Locales translated	22 (en-US saved as master, not re-translated)
Voices synthesized	300
Synthesis failures	0
Locales in similarity results	23
Unique voices in results	228 (see note below)
Estimated API cost	~$7.20 USD
Synthesis wall time	~2 hours 12 minutes
Resemblyzer inference time	~4 minutes (CPU, 300 files)

Voice counts by locale

Locale	Voices	Locale	Voices
ar-SA	7	nl-NL	5
de-DE	19	pl-PL	4
en-AU	5	pt-BR	12
en-CA	3	pt-PT	5
en-GB	12	ru-RU	5
en-US	34	sv-SE	5
es-ES	8	tr-TR	4
es-MX	14	zh-CN	27
fr-CA	5	zh-TW	8
fr-FR	10	hi-IN	5
it-IT	8	ko-KR	9
ja-JP	10

Key observations

en-US and zh-CN have the most voices (34 and 27), providing the richest fallback options
Within-locale similarity ranges from ~0.75 (ar-SA, small diverse pool) to ~0.92 (en-AU, small homogeneous pool)
Cross-gender similarity is typically 0.05–0.15 lower than same-gender, validating the gender-filter design choice
All 300 syntheses succeeded — the AAD token retry logic was not exercised (token remained valid throughout the 2h+ run)

12. Known Issues & Limitations

228 unique voices vs. 300 synthesized

Some voices appear in multiple locales. For example, en-US-JennyNeural appears in en-US only, but some voices in smaller locales share voice character names. The 228 figure counts unique short_name values across all locale similarity JSONs. The 300 figure counts synthesis calls (one per locale×voice pair). There is no data loss — this is expected.

HD voice filter — Dragon model pattern

The Ollie:DragonHDLatestNeural pattern was discovered during the production run, not before. The initial filter used only a HDNeural suffix check and admitted Dragon HD voices. The fix — checking "HD" not in name_part (where name_part is the post-locale-prefix portion) — was applied and the run restarted from scratch. Any future non-standard HD naming patterns (e.g., a future Eagle model) should be reviewed against this logic.

Token expiry during very long runs

AAD tokens from az account get-access-token typically expire in 60–90 minutes. The AadTokenProvider refreshes 5 minutes before expiry. If a run takes >85 minutes (possible for --all mode with 500+ voices), the refresh logic ensures a new token is fetched mid-run. Verified behavior: az is called again transparently, causing a ~10–40 second pause before the next synthesis call.

Google Translate rate limits

The deep-translator library uses the public (unauthenticated) Google Translate endpoint. Heavy use can result in HTTP 429 or IP-based blocking. Mitigations in place:

1-second delay between translation calls
3 retries with exponential backoff (2s, 4s, 8s)
Falls back to English script on persistent failure (voice still gets synthesized)

For production environments with strict rate requirements, replace deep-translator with the official Google Cloud Translation API (authenticated).

Resemblyzer's GE2E model limitations

The VoiceEncoder was trained primarily on English speech data (VCTK, LibriSpeech). Its discriminative power for non-English voices — particularly tonal languages (zh-CN, ja-JP) and Arabic — may be lower than for English. Similarity scores within non-English locales should be interpreted with this caveat.

No cross-locale similarity

The pipeline does not compare voices across locales (e.g., en-US-JennyNeural vs es-MX-DaliaNeural). This is by design (fallbacks are always same-language), but it means the pipeline cannot be used to find "closest en-US equivalent" for a given es-MX voice.

13. Extending to All Languages

Running in `--all` mode

All three pipeline scripts support --all to process every locale with Azure Neural voices:

python 01_generate_script.py --all
python 02_synthesize_voices.py --all
python 03_compute_similarity.py   # auto-discovers all locale dirs in samples/
python 04_generate_report.py      # auto-discovers all similarity JSONs

Scale estimates for full portfolio

Metric	Priority (23 locales)	Full portfolio (~115 locales)
Voices to synthesize	~300	~1,600–1,800
Translation calls	22	~114
Estimated API cost	~$7.20	~$38–43 USD
Synthesis wall time	~2.2 hours	~12–15 hours
Resemblyzer inference	~4 min	~20–25 min
Report HTML size	~5 MB	~25–30 MB

What changes structurally

More similarity_{locale}.json files in results/
The HTML report embeds a larger JSON blob; Plotly handles this gracefully but initial page load may be slower
The locale dropdown in the report will have ~115 entries instead of 23
Some new locales may require additions to _LOCALE_TO_GTRANS in 01_generate_script.py if Google Translate codes don't follow the default subtag rule

Recommended approach for full run

Run step 01 (--all) first to confirm all translations succeed
Run step 02 (--all --dry-run) to confirm voice count and cost estimate
Run step 02 (--all) with resume support — if interrupted, re-run and already-synthesized files will be skipped
Run fetch_metadata.py (no change needed — it already fetches all plain Neural voices)
Run steps 03 and 04 normally

14. File Reference

Voice Similarity/
│
├── config.py
│     Shared configuration. Contains: AZURE_SPEECH_* credential vars,
│     directory path constants (BASE_DIR, SCRIPTS_DIR, SAMPLES_DIR, RESULTS_DIR),
│     PRIORITY_LOCALES list, AUDIO_FORMAT constant, is_plain_neural() filter
│     function, MASTER_SCRIPT text, and display_name() helper.
│
├── 01_generate_script.py
│     Step 1: Translates MASTER_SCRIPT to each target locale using
│     deep-translator (Google Translate). Writes scripts/{locale}.txt.
│     Supports --all, --locale, --force flags.
│
├── 02_synthesize_voices.py
│     Step 2: Enumerates Azure TTS voices, filters to plain Neural only,
│     synthesizes each voice reading its locale script. Contains AadTokenProvider
│     class and get_speech_config() factory. Writes samples/{locale}/*.wav.
│     Supports --all, --locale, --list-only, --dry-run, --force, --delay flags.
│
├── 03_compute_similarity.py
│     Step 3: Loads WAV files via soundfile, computes 256-dim Resemblyzer
│     speaker embeddings, builds NxN cosine similarity matrix per locale.
│     Writes results/similarity_{locale}.json. Supports --locale, --force flags.
│
├── 04_generate_report.py
│     Step 4: Loads similarity JSONs and voice_metadata.json, merges gender
│     data, generates self-contained HTML report with Plotly heatmaps and
│     sortable fallback table. Writes results/voice_similarity_report.html.
│
├── fetch_metadata.py
│     Utility (run once): fetches voice gender, locale, local_name from Azure
│     SDK and saves to results/voice_metadata.json. Run after step 2.
│
├── requirements.txt
│     Python package requirements. See Section 3 for Windows install order.
│
├── .env
│     Credentials file (gitignored). Contains AZURE_SPEECH_ENDPOINT,
│     AZURE_SPEECH_RESOURCE_ID, and optionally AZURE_SPEECH_KEY.
│
├── scripts/
│     {locale}.txt — translated evaluation scripts, one per locale.
│     en-US.txt is the English master. Generated by step 1.
│
├── samples/
│     {locale}/
│         {voice_short_name}.wav — 16kHz PCM mono WAV, one per voice.
│     Generated by step 2. Each file is ~600-750 KB (~60s of speech).
│
└── results/
      similarity_{locale}.json — per-locale NxN similarity matrix + voice list.
      voice_metadata.json      — {short_name: {gender, locale, display_name}} for
                                 all plain Neural voices in the Azure portfolio.
      voice_similarity_report.html — self-contained interactive HTML report.

Documentation generated March 2026. Pipeline scripts are the source of truth — this notebook is a documentation artifact derived from the production code.

Azure TTS Voice Similarity Pipeline

Developer Documentation

1. Problem Statement

2. Pipeline Architecture

Step summaries

3. Environment Setup

Python version

Credentials

Windows-specific installation order

4. Authentication

Enterprise context

Why azure-identity.AzureCliCredential was not used

The subprocess workaround

Multi-service resource token format

SpeechConfig constructor limitation

5. Evaluation Script Design

Why a bespoke script?

Design principles

6. Voice Enumeration & Filtering

Azure TTS voice tiers

Why plain Neural only?

The Dragon HD naming problem

7. Voice Synthesis

SSML design

Audio format choice

Resume support

Token refresh during long runs

8. Speaker Similarity via Resemblyzer

What Resemblyzer does

Why cosine similarity = dot product here

Language-agnostic property

The soundfile approach

9. Gender-Aware Fallback Mapping

Why gender matters for fallbacks

Special case: Unknown gender

Data source

Full portfolio gender breakdown

10. Report Generation

Architecture

Sections

Data flow into the report

11. Results Summary

Production run (priority locales, November 2025)

Voice counts by locale

Key observations

12. Known Issues & Limitations

228 unique voices vs. 300 synthesized

HD voice filter — Dragon model pattern

Token expiry during very long runs

Google Translate rate limits

Resemblyzer's GE2E model limitations

No cross-locale similarity

13. Extending to All Languages

Running in --all mode

Scale estimates for full portfolio

What changes structurally

Recommended approach for full run

14. File Reference

Why `azure-identity.AzureCliCredential` was not used

Running in `--all` mode