Voice Style Transfer Based on Improved CycleGAN Network

Ling Lei, Ruomu Wei, Xinran Wu

DOI: 10.25236/icceme.2023.013

Author(s)

Ling Lei, Ruomu Wei, Xinran Wu

Corresponding Author

Ling Lei

Abstract

Voice style transfer converts the timbre of a source speaker's voice to that of a target speaker while keeping the speech content intact. In recent years, deep learning has driven the advancement and large-scale application of voice technology. The CycleGAN network, first used for image translation, has also shown advantages in voice style transfer tasks. However, speech converted by a CycleGAN network is often of low quality, so this paper proposes three improvements. First, a second adversarial loss is introduced to alleviate the over-smoothing problem of statistical models. Second, the generator and discriminator structures are optimized: a 2D-1D-2D convolutional structure and PatchGAN are used to preserve input feature details and reduce spectral distortion. Third, the auxiliary technique Missing Frame Filling (FIF) is applied so that the model pays more attention to the time-frequency structure of the sound. The traditional CycleGAN and the improved CycleGAN network were then both tested on voice style transfer using the AISHELL-3 dataset. The results show that, compared with the traditional CycleGAN network, the improved network achieves significant improvements in the subjective naturalness and similarity scores as well as in the objective metrics MCD and MSD, which verifies the effectiveness of the three improvements.
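
To make two of the ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that frame filling hides one random run of consecutive spectrogram frames so that a generator has to reconstruct them, and that MCD is computed with the commonly used mel-cepstral distortion formula over frames that are already time-aligned and have the energy coefficient removed. The function names `mask_frames` and `mel_cepstral_distortion` and the parameter `max_width` are hypothetical and chosen only for illustration.

```python
import math
import random


def mask_frames(spec, max_width, rng=None):
    """Hide one random run of consecutive frames (assumed FIF-style masking).

    spec is a time-major list of frames, each a list of per-bin values.
    Returns (masked_spec, mask), where mask[t] is 0 for a hidden frame
    and 1 for a frame that is left untouched.
    """
    rng = rng or random.Random()
    n = len(spec)
    width = rng.randint(0, min(max_width, n))  # number of frames to hide
    start = rng.randint(0, n - width)          # first hidden frame
    mask = [0 if start <= t < start + width else 1 for t in range(n)]
    masked = [[v * m for v in frame] for frame, m in zip(spec, mask)]
    return masked, mask


def mel_cepstral_distortion(ref, conv):
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    Per frame: (10 / ln 10) * sqrt(2 * sum_d (ref_d - conv_d) ** 2).
    Assumes both inputs have the same shape and exclude the energy term.
    """
    if not ref or len(ref) != len(conv):
        raise ValueError("inputs must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for r, c in zip(ref, conv):
        if len(r) != len(c):
            raise ValueError("frames must have the same number of coefficients")
        total += scale * math.sqrt(2.0 * sum((a - b) ** 2 for a, b in zip(r, c)))
    return total / len(ref)


# Usage: mask a toy 10-frame, 3-bin spectrogram and score a toy conversion.
toy_spec = [[1.0, 1.0, 1.0] for _ in range(10)]
masked_spec, frame_mask = mask_frames(toy_spec, max_width=4, rng=random.Random(0))
toy_mcd = mel_cepstral_distortion([[1.0, 0.0]], [[0.0, 0.0]])
```

In a setup like this, the mask would typically be passed to the generator together with the masked spectrogram, and a lower MCD indicates converted features that are closer to the target. Both points are general conventions rather than details stated in the abstract.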

Keywords

CycleGAN; Voice style transfer; Missing frame filling; Second adversarial loss