☢ Introduction to UTAU ☢

This is a series of text tutorials and explanations aimed at absolute beginners. The goal is to teach you what UTAU is all about, the basics of using the software, and how to create UTAUs of your own.

Visual aids will be added soon!

Last updated: 12/25/2023

☆ What is UTAU? ☆

UTAU is a Japanese singing synthesizer application created by Ameya/Ayame (飴屋/菖蒲) in 2008 (Wikipedia). A singing synthesizer is a kind of virtual instrument meant to replicate human singing vocals. These can be broadly sorted into three categories: those that are completely artificial with no human input, those that generate audio using a machine learning model, and those that sample from audio recordings. UTAU is the third kind.

Unlike other major singing synthesis software, like VOCALOID, Synthesizer V, and CeVIO, UTAU is not only completely free to use, but also allows users to make their own synthesizers. A few competitors have shown up over the years, like Niaoniao and DeepVocal, but so far none have been able to match the quality and flexibility that UTAU provides.

That being said, the software does have its limitations, and there haven't been any major updates since 2013. There are, however, many tools and plugins available to expand and improve software capabilities, and there is an open-source clone and improvement project called OpenUTAU currently in development, which we will also be covering here.

Part of the appeal is that the voices are often represented by original characters called UTAUloids (or just UTAUs), which adds another layer to the creative process of developing them. If you're reading this, there's a good chance you've at least heard of the more well-known UTAUloids like Teto Kasane, who has been included in official VOCALOID media and recently received a Synth V voicebank, Momo Momone, who sang the Nyan Cat song, or Defoko (also known as Uta Utane), the default voice of the software.

Using UTAU you can...

  • Create libraries of audio files called voicebanks
  • Share your voicebanks online and download ones created by other people
  • Create covers of songs you like
  • Create vocal tracks for original music
  • Design original characters to represent your voicebanks
  • Use various methods to manipulate the sound of a voicebank, even beyond what humans can do
  • Use and create plugins to expand the capabilities of the software
  • Interact with a huge, diverse community of creators from all around the world!

How Voice Synthesis Works

Voicebanks are libraries of sound files created by either generating audio frequencies or by recording a human vocalist. These voicebanks can then be sampled and rendered by an engine/software/application in order to synthesize vocals.

Some of them function by sampling entire words and phrases, but these are often for more specialized purposes that only need a set vocabulary. For a synthesizer that is intended to be able to say/sing anything in a given language, it is almost always more efficient to break the language down into its phonetic components, meaning that speech sounds (phones) are sampled in small groups and stitched together by the software to form larger words and phrases.
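To make the "small groups stitched together" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the toy voicebank, the sample names, and the render function are hypothetical, and the numbers stand in for real audio recordings. It is not how UTAU itself is implemented.

```python
# Hypothetical toy voicebank: each name maps to a short list of audio values.
# Real voicebanks store recordings; these numbers are placeholders.
VOICEBANK = {
    "ka": [0.1, 0.3, 0.2],
    "sa": [0.0, 0.2, 0.4],
    "ne": [0.5, 0.1, 0.0],
}


def render(names, voicebank):
    """Look up each requested sample and join them end to end."""
    output = []
    for name in names:
        if name not in voicebank:
            # A sampling synthesizer can only produce sounds it has recordings for.
            raise KeyError(f"voicebank has no sample for {name!r}")
        output.extend(voicebank[name])
    return output


audio = render(["ka", "sa", "ne"], VOICEBANK)
print(len(audio))  # three samples of three values each
```

With only three small samples, the same bank can already produce any sequence built from "ka", "sa", and "ne", which is why this approach scales better than recording whole words.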

This doesn't mean that every phone exists in a separate recording, however; for more natural synthesis, sequences of phones are usually extracted from longer utterance strings, and configured so that the end of one sample blends as smoothly as possible with the beginning of the next. As we'll go over later, there are a few different methods that UTAU voicebanks use in order to do this effectively.
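One common way to blend the end of one sample into the start of the next is a crossfade, where the first sample fades out while the second fades in. The sketch below assumes a simple linear crossfade over plain lists of numbers; the function name and the overlap parameter are hypothetical, and I'm not claiming this is the exact method UTAU's engine uses.

```python
def crossfade(first, second, overlap):
    """Join two sample lists, blending the last `overlap` values of `first`
    with the first `overlap` values of `second` (linear fade, an assumption)."""
    if overlap < 0 or overlap > min(len(first), len(second)):
        raise ValueError("overlap must fit inside both samples")

    split = len(first) - overlap
    head = first[:split]      # part of the first sample left untouched
    tail = second[overlap:]   # part of the second sample left untouched

    blended = []
    for i in range(overlap):
        # Weight of the incoming sample rises steadily across the overlap.
        weight = (i + 1) / (overlap + 1)
        blended.append(first[split + i] * (1 - weight) + second[i] * weight)

    return head + blended + tail


joined = crossfade([1.0, 1.0], [0.0, 0.0], overlap=1)
print(joined)  # [1.0, 0.5, 0.0]
```

Without the overlap, the jump from 1.0 straight to 0.0 would be an abrupt change in the audio; the blended middle value smooths the transition.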

Some engines have built-in dictionaries that will convert words written in a language's orthography (spelling) into a set phonetic output, often with some way to "guess" the pronunciation of words it doesn't "know". Others require a user to manually select the phonetic samples needed and will not recognize orthographic input — this is the kind that UTAU is. OpenUTAU, however, does allow orthographic input for languages which have supported phonemizers, and still allows for adjustment of the phones it selects for each word.
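As a rough picture of the dictionary approach, here is a small Python sketch of a lookup with a fallback "guess". The dictionary entries, phone symbols, and function name are all made up for illustration; real phonemizers (including OpenUTAU's) use far more sophisticated rules than this.

```python
# Hypothetical pronunciation dictionary; the phone symbols are invented here.
DICTIONARY = {
    "sing": ["s", "i", "ng"],
    "song": ["s", "o", "ng"],
}


def to_phones(word, dictionary):
    """Return the dictionary pronunciation, or guess one letter at a time."""
    key = word.lower()
    if key in dictionary:
        return list(dictionary[key])
    # Fallback "guess": treat every letter as its own phone.
    # This is deliberately naive and will often be wrong for real words.
    return [ch for ch in key if ch.isalpha()]


print(to_phones("Song", DICTIONARY))  # ['s', 'o', 'ng']
print(to_phones("utau", DICTIONARY))  # ['u', 't', 'a', 'u']
```

An engine without this step simply skips the lookup and asks the user to pick the phonetic samples directly.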

Note on Sampling From Other Sources

While your UTAU doesn't have to be voiced and designed by you specifically, you need to have permission for any asset you use for it, be it audio or visual. Using audio in UTAU without the explicit consent of the owner/vocalist is a violation of UTAU's terms of use, AKA what you inherently agree to by using the software. Doing so could create legal trouble for both you and the developer due to things like privacy laws. This also includes sampling and porting other singing synths into UTAU. All of this holds true even if you do not distribute the voicebank.

"What about [insert open-source synth here]" — check the usage agreement and/or contact the developer.

My own feelings on this are irrelevant, but it would be irresponsible for me not to issue this warning. I also want to add that creating a voicebank out of samples not intended for a singing synth is vastly more difficult than just doing your own work.

Similarly, using artwork for your voicebank that was not made by or for you is at best disrespectful to the artist, and at worst a violation of copyright. This includes tracing or copying another artist's work, even if you change the details of the character. This also generally includes "AI" generated images, such as those made with Midjourney, as the majority of these models were trained on stolen assets.

I'm not trying to scare anybody off here; these are just important things to know before making and distributing voicebanks. As long as your UTAU is 100% yours, you don't have anything to worry about. You can use audio and artwork provided by other people as long as they've given you permission.