Voice Conversion

VC aims to convert the non-linguistic information of a speech signal while keeping the linguistic content unchanged, which is helpful for applications such as speaker conversion, accent conversion, emotion conversion, and singing voice conversion (all discussed below).

The encoder encodes the source speech into a latent representation. Ideally this latent representation contains only the linguistic content, so that the decoder can recover the source speech once the speaker identity is provided, while classifier-1 cannot distinguish the speakers from the latent representation [13].

Feature disentanglement therefore means separating the linguistic (phonetic) content from the speaker identity. The objective of the classifier or discriminator is to adversarially guide the model to learn disentangled features. But how can the linguistic content be maintained? One way is to use a content encoder pre-trained on an ASR task [14] [15] [16] [17]. For the speaker identity, one can use a one-hot vector for each speaker, or a speaker embedding (i-vector, d-vector, x-vector, etc.).
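To make the adversarial disentanglement objective concrete, here is a minimal PyTorch-style sketch; the module choices, dimensions, and names are placeholders rather than anything from the cited papers. A content encoder produces a latent code that a speaker classifier tries (and should fail) to identify, while a decoder conditioned on a speaker embedding reconstructs the input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions -- placeholders, not values from the cited papers.
N_SPEAKERS, FEAT_DIM, LATENT_DIM, SPK_DIM = 100, 80, 128, 64

content_encoder = nn.GRU(FEAT_DIM, LATENT_DIM, batch_first=True)
decoder = nn.GRU(LATENT_DIM + SPK_DIM, FEAT_DIM, batch_first=True)
speaker_classifier = nn.Linear(LATENT_DIM, N_SPEAKERS)   # plays the role of classifier-1
speaker_table = nn.Embedding(N_SPEAKERS, SPK_DIM)        # one-hot id -> speaker embedding

def disentangle_losses(mel, spk_id):
    """mel: (B, T, FEAT_DIM) frames, spk_id: (B,) speaker indices."""
    z, _ = content_encoder(mel)                           # latent content code, (B, T, LATENT_DIM)

    # Adversarial branch: the classifier tries to recognize the speaker from z;
    # the encoder is updated to *increase* this loss so that z hides the speaker.
    logits = speaker_classifier(z.mean(dim=1))
    cls_loss = F.cross_entropy(logits, spk_id)

    # Reconstruction branch: the decoder gets z plus the speaker embedding back.
    spk = speaker_table(spk_id).unsqueeze(1).expand(-1, z.size(1), -1)
    recon, _ = decoder(torch.cat([z, spk], dim=-1))
    rec_loss = F.mse_loss(recon, mel)

    # Typical schedule: encoder/decoder minimize rec_loss - lambda * cls_loss,
    # while the classifier alone minimizes cls_loss.
    return rec_loss, cls_loss
```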

A phonetic posteriorgram (PPG) is obtained from a speaker-independent automatic speech recognition (SI-ASR) system [15] [18]. These PPGs represent the articulation of speech sounds in a speaker-normalized space, and thus capture the spoken content in a speaker-independent manner.

As illustrated in the figure above, the approach is divided into three stages: training stage 1, training stage 2, and the conversion stage. Training stage 1 trains the SI-ASR model, whose role is to obtain a PPG representation of the input speech. Training stage 2 models the relationship between the PPGs and the Mel-cepstral coefficient (MCEP) features of the target speaker for speech parameter generation. The conversion stage drives the trained DBLSTM model with PPGs of the source speech (obtained from the same SI-ASR) to perform VC.
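A minimal sketch of the conversion-stage data flow described above, with placeholder networks and dimensions (the actual systems use a senone-level SI-ASR and a deep BLSTM trained on the target speaker):

```python
import torch
import torch.nn as nn

# Placeholder dimensions (not the papers' exact values): 40-dim filterbanks in,
# 144 senone posteriors as the PPG, 40-dim MCEPs out.
FBANK_DIM, PPG_DIM, MCEP_DIM = 40, 144, 40

si_asr = nn.Sequential(                        # trained in stage 1 on many speakers
    nn.Linear(FBANK_DIM, 256), nn.ReLU(),
    nn.Linear(256, PPG_DIM), nn.Softmax(dim=-1))
dblstm = nn.LSTM(PPG_DIM, 128, num_layers=2,   # trained in stage 2 on the target speaker
                 bidirectional=True, batch_first=True)
to_mcep = nn.Linear(2 * 128, MCEP_DIM)

def convert(source_fbank):
    """Conversion stage: source frames (B, T, FBANK_DIM) -> target-speaker MCEPs."""
    ppg = si_asr(source_fbank)   # speaker-independent posteriors from the same SI-ASR
    h, _ = dblstm(ppg)           # maps PPGs into the target speaker's acoustic space
    return to_mcep(h)            # MCEP trajectories, then passed to a vocoder
```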

In [3], the authors propose a PPG-based system for foreign accent conversion (FAC). They use an acoustic model trained on a native speech corpus to extract speaker-independent phonetic posteriorgrams (PPGs), and then train a speech synthesizer to map PPGs from the non-native speaker into the corresponding spectral features, which in turn are converted into the audio waveform using a high-quality neural vocoder. At runtime, the synthesizer is driven with the PPG extracted from a native reference utterance.

In [17], two different systems are proposed to achieve any-to-any VC: an i-vector-based VC (IVC) system and a speaker-embedding-based VC (SEVC) system.

Both systems train a deep bidirectional long short-term memory (DBLSTM) based multi-speaker voice conversion (MSVC) model. The IVC system uses i-vectors to encode speaker IDs, while the SEVC system uses learnable speaker embeddings.

The key observations from the results are as follows:

A singing voice conversion method is proposed in [10]. The proposed PitchNet adds an adversarially trained pitch regression network to force the encoder to learn a pitch-invariant phoneme representation, and a separate module that feeds the pitch extracted from the source audio to the decoder network.
PitchNet consists of five parts: an encoder, a decoder, a look-up table (LUT) of singer embedding vectors, a singer classification network, and a pitch regression network. The audio waveform is fed directly into the encoder. The output of the encoder, the singer embedding vector retrieved from the LUT, and the input pitch are concatenated together to condition the WaveNet [19] decoder, which outputs the audio waveform.
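A rough sketch of the conditioning path described above, with placeholder modules and sizes rather than the paper's actual architecture; the two adversarial heads are shown only to indicate where they attach:

```python
import torch
import torch.nn as nn

# Placeholder sizes, not the paper's configuration.
N_SINGERS, LATENT_DIM, SING_DIM = 12, 64, 32

encoder = nn.Conv1d(1, LATENT_DIM, kernel_size=400, stride=160)  # waveform -> content frames
singer_table = nn.Embedding(N_SINGERS, SING_DIM)                 # LUT of singer vectors
singer_clf = nn.Linear(LATENT_DIM, N_SINGERS)  # adversary: should fail to identify the singer
pitch_reg = nn.Linear(LATENT_DIM, 1)           # adversary: should fail to predict pitch from z

def decoder_conditioning(wave, singer_id, f0):
    """wave: (B, 1, S) raw audio; singer_id: (B,); f0: (B, T_f0) source pitch contour."""
    z = encoder(wave).transpose(1, 2)                    # (B, T, LATENT_DIM), pitch/singer-free
    T = z.size(1)
    singer = singer_table(singer_id).unsqueeze(1).expand(-1, T, -1)
    cond = torch.cat([z, singer, f0[:, :T].unsqueeze(-1)], dim=-1)
    return cond   # local conditioning sequence for a WaveNet-style decoder
```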

Instance normalization (IN) for feature disentanglement is applied in [20]. It shows that simply adding instance normalization without affine transformation to the content encoder can remove the speaker information while preserving the content information. To further enforce the speaker encoder to generate a speaker representation, the speaker information is provided to the decoder through an adaptive instance normalization (AdaIN) layer. The idea comes from style transfer in computer vision [21].
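A minimal sketch of the two operations (IN without affine parameters, and AdaIN driven by the speaker encoder), assuming (B, C, T) feature maps:

```python
import torch

def instance_norm(x, eps=1e-5):
    """x: (B, C, T) content-encoder features. Per-channel, per-utterance
    normalization without affine parameters removes global (speaker) statistics."""
    mu = x.mean(dim=2, keepdim=True)
    sigma = x.std(dim=2, keepdim=True)
    return (x - mu) / (sigma + eps)

def adain(content, gamma, beta, eps=1e-5):
    """Adaptive IN: gamma and beta, shaped (B, C, 1), are predicted from the
    speaker encoder, so speaker information re-enters only through the decoder."""
    return gamma * instance_norm(content, eps) + beta
```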

The idea of IN is also applied in [4] , where the speech signal is decomposed into an emotion-invariant content code and an emotion-related style code in latent space. Emotion conversion is performed by extracting and recombining the content code of the source speech and the style code of the target emotion.

A straightforward method is to use the cycle-consistency loss, as in CycleGAN-based VC [22] [23] [1] and StarGAN-VC [24] [25].

In [5] , the Wasserstein distance metric (WGAN loss) with gradient penalty is considered.
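For reference, a minimal sketch of the standard WGAN gradient-penalty term (generic formulation, not code from [5]):

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """Standard WGAN-GP term: push the critic's gradient norm toward 1 at
    random interpolates between real and converted (fake) features."""
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake).detach().requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(scores.sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()
```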

For each conversion pair, CycleGAN needs its own generator and discriminator, so for 100 speakers a separate group of generators and discriminators would be required for every pair of speakers. StarGAN-VC [24] [25] is introduced to solve this problem: all speakers share the same generator and discriminator. The aim of StarGAN-VC is to obtain a single generator that learns mappings among multiple domains. To achieve this, StarGAN-VC extends CycleGAN-VC to a conditional setting with a domain code (e.g., a speaker identifier). More precisely, StarGAN-VC learns a generator G that converts an input acoustic feature x into an output feature conditioned on the target domain code c, i.e., G(x, c).
In [24], the CycleGAN and StarGAN approaches are compared.
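A minimal sketch of the cycle-consistency objective with a single domain-conditioned generator, where the signature G(x, c) is assumed for illustration:

```python
import torch.nn.functional as F

def cycle_consistency_loss(G, x, c_src, c_tgt):
    """StarGAN-VC-style cycle loss with a single shared generator G(x, c):
    convert x to the target domain and back, then require the round trip
    to reproduce the input features."""
    x_fake = G(x, c_tgt)         # source speaker -> target speaker
    x_cycle = G(x_fake, c_src)   # converted features -> back to the source speaker
    return F.l1_loss(x_cycle, x)
```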

In [25] , a source-and-target conditional adversarial loss is developed.

For the objective evaluation, Mel-cepstral distortion (MCD) and modulation spectra distance (MSD) are used.
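As a reminder of how MCD is typically computed (standard definition; the time alignment of the two sequences, e.g. by DTW, is assumed to have been done beforehand):

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_conv):
    """Mean MCD in dB between time-aligned MCEP sequences of shape (T, D);
    the 0th (energy) coefficient is conventionally excluded."""
    diff = mcep_ref[:, 1:] - mcep_conv[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```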

For the subjective evaluation, a mean opinion score (MOS) test (5: excellent to 1: bad) is conducted.

The conversion function is reformulated as an autoencoder. The encoder is designed to be speaker-independent and converts an observed frame x into a speaker-independent latent variable (code) z. Presumably, z contains information that is irrelevant to the speaker, such as phonetic variations, and is also referred to as the phonetic representation [26].
The decoder takes z together with a speaker representation y, another latent variable, and reconstructs a speaker-dependent frame.

The VAE encourages the latent variable z to follow a Gaussian prior, so z is speaker-normalized (speaker-independent) and can be regarded as carrying the linguistic information.
In [27] , CycleVAE-based VC is proposed. CycleVAE is capable of recycling the converted spectra back into the system, so that the conversion flow is indirectly considered in the parameter optimization.
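A minimal sketch of the basic autoencoder formulation above as a conditional VAE (without the cycle), with placeholder one-layer networks and dimensions; conversion simply swaps the speaker representation at decoding time:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, Z_DIM, SPK_DIM, N_SPEAKERS = 80, 16, 8, 100   # placeholder sizes

enc = nn.Linear(FEAT_DIM, 2 * Z_DIM)         # frame x -> (mu, log_var) of q(z|x)
dec = nn.Linear(Z_DIM + SPK_DIM, FEAT_DIM)   # (z, y) -> reconstructed frame
spk = nn.Embedding(N_SPEAKERS, SPK_DIM)      # speaker representation y

def vae_vc_loss(x, spk_id):
    """x: (B, FEAT_DIM) frames, spk_id: (B,). ELBO-style training loss."""
    mu, log_var = enc(x).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterization
    x_hat = dec(torch.cat([z, spk(spk_id)], dim=-1))
    rec = F.mse_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())  # pull q(z|x) toward N(0, I)
    return rec + kl

def convert_frame(x, target_spk_id):
    """Conversion: speaker-independent encoding, decoded with the target speaker's y."""
    mu, _ = enc(x).chunk(2, dim=-1)
    return dec(torch.cat([mu, spk(target_spk_id)], dim=-1))
```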

Conditional VAEs (CVAEs) are an extended version of VAEs, with the only difference being that the encoder and decoder networks can take an auxiliary variable c as an additional input [28].
Regular CVAEs impose no restrictions on how the encoder and decoder may use the attribute class label c. Hence, they are free to ignore c by finding distributions that do not depend on it, i.e., q(z|x, c) = q(z|x) and p(x|z, c) = p(x|z). This can occur, for instance, when the encoder and decoder have sufficient capacity to reconstruct any data without using c. To avoid such situations, ACVAE [29] introduces an information-theoretic regularization [30] from InfoGAN to encourage the decoder output to be correlated as strongly as possible with c. ACVAE stands for a VAE with an auxiliary classifier (AC).
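A minimal sketch of such an auxiliary-classifier regularizer (placeholder networks; the InfoGAN-style mutual-information lower bound is approximated here by a classifier cross-entropy on the decoder output):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, Z_DIM, N_CLASSES = 80, 16, 4   # placeholder sizes

decoder = nn.Linear(Z_DIM + N_CLASSES, FEAT_DIM)   # p(x|z, c)
aux_clf = nn.Linear(FEAT_DIM, N_CLASSES)           # auxiliary classifier over decoder outputs

def auxiliary_classifier_loss(z, c_id):
    """Regularizer in the spirit of InfoGAN/ACVAE: the auxiliary classifier must
    be able to recover the attribute c from the decoded features, which stops
    the decoder from ignoring its conditioning input."""
    c_onehot = F.one_hot(c_id, N_CLASSES).float()
    x_hat = decoder(torch.cat([z, c_onehot], dim=-1))
    logits = aux_clf(x_hat)
    return F.cross_entropy(logits, c_id)
```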

In [31], a non-parallel VC framework with a variational autoencoding Wasserstein generative adversarial network (VAW-GAN) is proposed. The model explicitly considers a VC objective when building the speech model. The speech model is built with a CVAE and improved with a WGAN by modifying the loss function.
