This paper proposes VStyclone, a novel Chinese speech cloning model that operates in three stages: multi-speaker training, target speaker encoding, and target speaker synthesis. We design an efficient tone extractor that reallocates resources across the sequences of log-mel spectrogram frames obtained from multiple speakers’ speech, so that the network learns the features of different speakers in a differentiated manner. This approach allows the network to focus more on the voice features of the target speaker and extract the target f...