In Speech Synthesis, Recognition, and More With SpeechT5, Mathijs Hollemans writes that Hugging Face has released an easy-to-use implementation of SpeechT5, a spoken-language processing model. SpeechT5 is not one, not two, but three kinds of speech model in one architecture.

SpeechT5 can do speech-to-text, text-to-speech, and speech-to-speech (for voice conversion). Unlike most models, which are built for a single task, SpeechT5 performs all of these with the same underlying encoder-decoder network: only the task-specific pre-nets and post-nets change.
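This one-architecture, three-tasks design is visible directly in the Hugging Face transformers library, which ships one model class per task (class names below are the real ones, available in transformers 4.27 and later); all three derive from the same shared SpeechT5 base:

```python
# The three SpeechT5 task heads in Hugging Face transformers (v4.27+).
# All three wrap the same shared encoder-decoder; only the pre-/post-nets differ.
from transformers import (
    SpeechT5ForSpeechToText,    # speech-to-text (recognition)
    SpeechT5ForTextToSpeech,    # text-to-speech (synthesis)
    SpeechT5ForSpeechToSpeech,  # speech-to-speech (voice conversion)
    SpeechT5PreTrainedModel,    # the shared base class
)

# Each task-specific class derives from the same pretrained base model.
for cls in (SpeechT5ForSpeechToText, SpeechT5ForTextToSpeech, SpeechT5ForSpeechToSpeech):
    assert issubclass(cls, SpeechT5PreTrainedModel)
```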

The article includes a diagram of the architecture: a shared encoder-decoder network surrounded by the swappable pre-nets and post-nets.

In the text-to-speech task, the pre- and post-nets are:

  • pre-net: a text embedding layer that maps text tokens to hidden representations
  • pre-net: linear layers that compress speech spectrogram frames into hidden representations
  • post-net: an output network that predicts a residual, added to the output spectrogram to refine it

In the speech-to-text task, the pre- and post-nets are:

  • pre-net: a convolutional feature encoder that processes the raw input waveform
  • pre-net: an embedding layer that maps the previously generated text tokens to hidden representations
  • post-net: a linear layer that projects the hidden states to probabilities over the vocabulary

The article goes into further detail on the models and includes code showing how to use them.