In "Speech Synthesis, Recognition, and More With SpeechT5", Matthijs Hollemans writes that Hugging Face has released an easy-to-use implementation of SpeechT5, a spoken-language processing model, in its Transformers library. SpeechT5 is not one, not two, but three kinds of speech models in one architecture.
SpeechT5 can do speech-to-text, text-to-speech, and speech-to-speech (for voice conversion). Unlike most models, SpeechT5 performs all of these tasks with the same underlying architecture: only the pre-nets and post-nets change.
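In the Hugging Face transformers library, this shows up as three task-specific model classes built on one shared backbone; a minimal sketch of the imports:

```python
# The three tasks map onto three task-specific classes in transformers;
# they share one encoder-decoder backbone and differ only in pre-/post-nets.
from transformers import (
    SpeechT5ForTextToSpeech,    # text  -> spectrogram (TTS)
    SpeechT5ForSpeechToText,    # audio -> text        (ASR)
    SpeechT5ForSpeechToSpeech,  # audio -> spectrogram (voice conversion)
)
```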
In the text-to-speech task, the pre- and post-nets are (a usage sketch follows the list):
- text encoder pre-net: an embedding layer that maps text tokens to hidden representations
- speech decoder pre-net: layers that compress the speech spectrogram into hidden representations
- speech decoder post-net: predicts a residual to add to the output spectrogram, refining the result
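To make this concrete, here is a minimal text-to-speech sketch using the transformers classes; the microsoft/speecht5_tts and microsoft/speecht5_hifigan checkpoint names come from the Hugging Face Hub, and the random speaker embedding is only a placeholder (a real 512-dimensional x-vector gives a natural-sounding voice):

```python
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")  # spectrogram -> waveform

# The text encoder pre-net embeds the tokenized input text.
inputs = processor(text="Hello, this is SpeechT5 speaking.", return_tensors="pt")

# TTS is conditioned on a 512-dimensional x-vector speaker embedding;
# a random vector is only a placeholder for a smoke test.
speaker_embeddings = torch.randn(1, 512)

# Generate a mel spectrogram and run it through the vocoder to get audio.
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
# `speech` is a 1-D float tensor: a 16 kHz waveform ready to save or play.
```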
In the speech-to-text task, the pre- and post-nets are (again, a usage sketch follows the list):
- speech encoder pre-net: a convolutional feature encoder that downsamples the raw waveform into frame-level hidden representations
- text decoder pre-net: an embedding layer that maps text tokens to hidden representations
- text decoder post-net: projects the hidden representations onto probabilities over the vocabulary
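Similarly, a minimal speech-to-text sketch; the microsoft/speecht5_asr checkpoint name comes from the Hugging Face Hub, and the silent placeholder waveform stands in for a real 16 kHz mono recording:

```python
import numpy as np
from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

# Placeholder: one second of 16 kHz silence instead of a real recording
# (load real audio with e.g. librosa or torchaudio).
waveform = np.zeros(16000, dtype=np.float32)

# The speech encoder pre-net (convolutional feature encoder) consumes input_values.
inputs = processor(audio=waveform, sampling_rate=16000, return_tensors="pt")

# The decoder generates text tokens; its post-net maps them onto the vocabulary.
predicted_ids = model.generate(**inputs, max_length=100)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```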
The article goes into detail on how to use each of the models, with code examples.
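The third model, speech-to-speech voice conversion, follows the same pattern. A minimal sketch, assuming the microsoft/speecht5_vc checkpoint and reusing the HiFi-GAN vocoder, with placeholder input audio and speaker embedding:

```python
import numpy as np
import torch
from transformers import SpeechT5Processor, SpeechT5ForSpeechToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Placeholder: one second of 16 kHz silence in place of the source utterance.
source_audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio=source_audio, sampling_rate=16000, return_tensors="pt")

# The target voice is chosen by a 512-dimensional x-vector speaker embedding;
# a random vector is only a placeholder.
speaker_embeddings = torch.randn(1, 512)

converted = model.generate_speech(inputs["input_values"], speaker_embeddings, vocoder=vocoder)
# `converted` is the input utterance re-rendered as a 16 kHz waveform in the target voice.
```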