Sound demos for "Data Redaction from Conditional Generative Models"
Section Ⅰ: Redacting speaker 1040
The samples for redacting speaker 1040 are shown below.
With the base configuration we can redact a portion of the conditionals, but generation quality is much worse than that of the pre-trained model.
Improving model capacity improves both generation and redaction quality.
Improved voice cloning further improves perceptual generation quality.
The non-uniform distillation loss ($\lambda_i$-order) improves generation quality by a large margin.
Synthesized LJSpeech samples (training voice): we expect samples after redaction to have quality similar to the pre-trained ones.
pre-trained model
redaction (base)
redaction (improved capacity)
redaction (improved voice cloning)
redaction (non-uniform distillation)
sample 1
sample 2
sample 3
Synthesized LibriTTS speaker-1040 samples (redacted voice): we expect samples after redaction to sound very different from the pre-trained ones and closer to LJSpeech voices.
pre-trained model
redaction (base)
redaction (improved capacity)
redaction (improved voice cloning)
redaction (non-uniform distillation)
sample 1
sample 2
sample 3
sample 4
sample 5
Section Ⅱ: Redacting other speakers
The samples for redacting other speakers are shown below. We use the improved-capacity configuration.
Both generation and redaction quality remain good.
Synthesized LJSpeech samples (training voice): we expect samples after redaction to have quality similar to the pre-trained ones.
pre-trained model
redaction (improved capacity)
sample 1
sample 2
sample 3
sample 4
sample 5
sample 6
Synthesized LibriTTS samples (redacted voices): we expect samples after redaction to sound very different from the pre-trained ones and closer to LJSpeech voices.