Sound demos for "Data Redaction from Conditional Generative Models"



Section Ⅰ: Redacting speaker 1040

The samples for redacting speaker 1040 are shown below. With the base configuration we can redact only a fraction of the conditionals, and the generation quality is much worse than that of the pre-trained model. Increasing model capacity improves both generation and redaction quality. Improved voice cloning further improves the perceptual generation quality, and the non-uniform distillation loss (the $\lambda_i$-order weighting) improves generation quality by a large margin.
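To make the idea of a non-uniform distillation loss concrete, here is a minimal sketch: instead of averaging the distillation error uniformly over conditionals, each conditional $i$ receives its own weight $\lambda_i$. The per-conditional MSE, the weighting scheme, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def nonuniform_distillation_loss(student, teacher, lam):
    """Non-uniform (weighted) distillation loss sketch.

    student, teacher: (n, d) arrays of model outputs, one row per conditional.
    lam: (n,) array of non-negative weights lambda_i. The choice of the
    lambda_i values (e.g. an "order"-based schedule) is hypothetical here.
    """
    per_cond = np.mean((student - teacher) ** 2, axis=1)  # MSE per conditional
    # Weighted average: conditionals with larger lambda_i dominate the loss.
    return float(np.sum(lam * per_cond) / np.sum(lam))
```

For example, with weights `lam = [3, 1]` the first conditional's error counts three times as much as the second's, so the model is pushed harder to match the teacher on the more heavily weighted conditionals.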

Synthesized LJSpeech samples (training voice): we expect the samples after redaction to have quality similar to those from the pre-trained model.
[Audio samples 1–3, one per condition: pre-trained model, redaction (base), redaction (improved capacity), redaction (improved voice cloning), redaction (non-uniform distillation)]


Synthesized LibriTTS speaker-1040 samples (redacted voice): we expect the samples after redaction to differ substantially from the pre-trained ones and to sound more similar to the LJSpeech voice.
[Audio samples 1–5, one per condition: pre-trained model, redaction (base), redaction (improved capacity), redaction (improved voice cloning), redaction (non-uniform distillation)]


Section Ⅱ: Redacting other speakers

The samples for redacting other speakers are shown below. We use the improved-capacity configuration; both generation and redaction quality are good.

Synthesized LJSpeech samples (training voice): we expect the samples after redaction to have quality similar to those from the pre-trained model.
[Audio samples 1–6, one per condition: pre-trained model, redaction (improved capacity)]


Synthesized LibriTTS samples (redacted voices): we expect the samples after redaction to differ substantially from the pre-trained ones and to sound more similar to LJSpeech voices.
[Audio samples 1–6, one per condition: pre-trained model, redaction (improved capacity)]