☢ How to Record UTAU Voicebanks ☢

☆ Reclists ☆

To get started making your own voicebank, the first thing you'll need is a reclist, which is short for recording list. As the name implies, this is a list of everything that will need to be recorded in order for the voicebank to function.

There are a lot of different reclists out there, as you may have gathered from the variety of different voicebank types. I've written several myself just for Japanese.

I usually recommend that beginners start off with Japanese CV, particularly if you don't have any formal vocal training, because these voicebanks are overall the easiest to make and use. Most CV reclists simply require recording each sample in a separate .wav file, so no additional tools are necessary; all you need is a microphone and recording software that can export .wav files, the type of audio file UTAU voicebanks use.

If there are multiple syllables (or morae) contained in the same recording, we call this a string, which is what the majority of VCV and CVVC voicebanks require. I'm going to focus primarily on CV in this tutorial, but I plan to make tutorials on creating VCV and CVVC voicebanks in the (hopefully) near future. CV reclists can also be arranged into strings, either with beats of silence between each syllable, or recorded continuously (the rentan technique), but we'll just be going over unstringed CV here.

In most cases, the characters in the string (or single syllable) will be the same as those in the file name of the .wav. Japanese voicebanks can be encoded (written) in either hiragana or romaji, but this is not always the same as the aliasing of the samples extracted from the .wav files. The encoding won't typically affect the usage of the voicebank, so it's mostly up to creator preference. This means that even if you don't know how to read hiragana, you can encode your Japanese voicebank in romaji while using hiragana for the aliasing. Even a lot of experienced users (like myself) end up preferring romaji encoding, as it can be more convenient.

Japanese CV Reclists

The bare minimum needed to synthesize Japanese includes all the standard 五十音 (gojuu-on) morae, which is almost every combination of the nine consonants [k, s, t, n, h, m, y, r, w] and the five vowels [a, i, u, e, o], as well as standalone vowels and the syllabic nasal [n] (which is treated like a vowel).

Some of these morae have special pronunciation rules, however: [s + i] becomes [shi], [t + i] becomes [chi], [t + u] becomes [tsu], [ye, wi, we] aren't found in any native words, and [yi, wu] don't occur at all.

Then we have the additional consonants [g, z, d, b, p, j], represented with diacritics on the hiragana characters, and all of the allowed C + [y] clusters. That's a phoneme inventory of 18 "C"s and 6 "V"s, though when accounting for restrictions on sound combinations, we end up with a minimum of 102 samples.
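
If you're curious how that 102 figure shakes out, here's a rough Python sketch that builds one possible minimal CV list from the rules above. This is a hand-rolled illustration, not any official reclist; published lists vary in spelling, ordering, and extras:

```python
def build_minimal_cv_reclist():
    """Generate one possible minimal Japanese CV mora list (illustrative only)."""
    vowels = ["a", "i", "u", "e", "o"]
    # Irregular romaji spellings; None marks morae that merge with others (di/du -> ji/zu)
    irregular = {"si": "shi", "ti": "chi", "tu": "tsu", "hu": "fu",
                 "zi": "ji", "di": None, "du": None}
    morae = vowels + ["n"]  # standalone vowels + the syllabic nasal
    # Consonant rows that combine with all five vowels
    for c in ["k", "s", "t", "n", "h", "m", "r", "g", "z", "d", "b", "p"]:
        for v in vowels:
            cv = irregular.get(c + v, c + v)
            if cv is not None:
                morae.append(cv)
    morae += ["ya", "yu", "yo", "wa", "wo"]  # the defective y- and w- rows
    # C + [y] clusters, which only combine with [a, u, o]
    for c in ["ky", "sh", "ch", "ny", "hy", "my", "ry", "gy", "j", "by", "py"]:
        morae += [c + v for v in ["a", "u", "o"]]
    return morae
```

Counting it up with this particular set of rules gives exactly 102 entries.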

Many Japanese reclists additionally contain the non-native consonant [v] and a few extra non-native morae for balancing. While these aren't necessary, they're not too much more effort to include if you want your voicebank to have a little more flexibility. These usually have about 150 to 200 samples.

If you only want to record what's absolutely necessary, here's a minimal Japanese CV reclist, but if you don't mind including a couple of extras, here's a more standard one. You can also include as many additional extras as you want, but you don't need to go overboard right out the gate; every sample recorded is another one you'll have to configure later.

Japanese Pronunciation

While romaji is fairly intuitive for English speakers, the pronunciation of Japanese does not exactly match that of English words. Check out this pronunciation chart for a quick reference, but here's a more detailed explanation of a few things:

  • The vowels [e] and [o] are pretty close to the ones in <face> and <goat>, but unlike in English, they are not diphthongs. In English, these vowels start with the tongue in one position but move higher towards the end of the vowel, but the Japanese equivalents don't do this. Try thinking of it like clipping off the vowel before it gets to the end.
  • The vowel [u] is similar to the vowel in <goose>, but the lips should be unrounded, and likewise the tongue should not rise at the end of the utterance.
  • The vowel [a] is similar to the vowel in <father> but it should be more fronted in the mouth, almost like the vowel in <back>.
  • The nasal [n] is represented with a single consonant, but the sound of it actually changes depending on the consonant that follows it; [n] before [k, g] becomes [ng], and [n] before [p, b, m, f, v] becomes [m]. This difference isn't always reflected in voicebanks, so [n] is the default.
  • [k, t, p] are all unaspirated. In English, we release a small puff of air when these consonants occur at the start of a syllable, but this is not the case for Japanese. This pronunciation difference isn't super noticeable, but it can help your voicebank sound closer to native if you're able to do it.
  • [sh, ch, j] all have the tongue slightly further back than their English counterparts.
  • [f, v] are actually produced with just the lips, not the lips + teeth as is the case for English. Think of it kind of like blowing out a candle.
  • [r] is very different from the English <R> sound, but is actually still very close to a sound many English varieties have; rather than <r> as in <red>, it's like the <dd> in <ladder> when said quickly; just a quick flap of the tongue. If you struggle to produce this, it is often more natural-sounding to approximate it with an English <L>.

Know that a perfect accent is not a requirement for a good voicebank, though! While you don't want your voicebank's pronunciation to be so off it sounds like it's singing the wrong words, you don't need to stress about how much you sound like a native (a lofty goal for any L2 speaker, let alone someone who doesn't speak the language at all). In fact, many Japanese speakers find non-native accents fun and cute, so they won't necessarily be turned off by a voicebank that has one.

Just try your best and your voicebank will sound fine; your accent will improve the more exposure you have to the language. If you are really worried about it, I recommend listening to native speakers sing and trying to imitate their pronunciation.

☆ General Recording Tips ☆

First, consider doing vocal warmups and staying properly hydrated before you begin your recording session to help avoid vocal strain. Avoid recording immediately after you've eaten, too, especially dairy, chocolate, and other foods that coat your throat and can affect vocal quality.

Try to set up in a relatively quiet space with minimal background noise, as anything getting picked up by your microphone will also be in your voicebank. Fans whirring, people talking, dogs barking, TVs on in the other room, your neighbor mowing their lawn, an airplane flying over your house, etc. can all distort the sound quality.

Wear headphones so you can listen back to your recordings, and so any tools you are using to help you record don't get picked up by your mic.

Hard surfaces like walls, desks, windows, and hard floors will bounce the sound waves of your voice off of them and right back into your mic, creating a "boxy" sort of sound. While most of us don't have any fancy equipment, if you want some makeshift sound absorption, try recording under a heavy blanket, laying a towel over your desk, recording in a carpeted room, or even recording in a closet full of clothes. I literally do those first three myself.


Voicebanks depend on smoothly transitioning from one sample to the next, so consistency is key when it comes to making good quality singing synths. In other words, we (usually) want our voicebank to sound like a complete vocalist, not a bunch of other vocalists mashed together. In order to achieve this, we need to keep our timbre as stable as possible when we record.

Timbre is a fancy French word that refers to the distinctive sound of a specific instrument or vocalist. A violin has a different timbre from a saxophone even when they are playing the same note; it's how we can tell they are different instruments. Likewise, your voice presumably has a different timbre from mine, and thus we sound like different people.

The human voice is quite flexible, though, and the same vocalist is capable of producing different timbres. For an example, consider my UTAU KATSU, who samples my natural singing voice, compared to my UTAU HIRO, who is heavily voice acted; despite them having the same voice provider, they sound quite different from one another. Consider also KATSU's multiexpression and growl voicebanks, which, while not voice acted in the sense that HIRO is, utilize different vocal techniques in order to achieve different sounds.

One's timbre can also change by simply moving up and down one's range. You may have heard of head voice compared to chest voice / full voice, the main register divide of human vocals. Head voice is the point where, as you approach the top of your range, your voice feels like it switches to being lighter and softer. While this is the most dramatic change, you likely have changes in timbre throughout the rest of your range as well; your tone of voice on a very low note is not going to be exactly the same as one in your mid-range. For people like me, the difference is pretty noticeable, but this isn't the case for everybody.

All of this to say, when you're recording a voicebank, keeping the timbre consistent means trying to stick to the same pitch and vocal style for every recording. Of course, as mentioned, multipitch/expression voicebanks exist, but even these depend on consistency within a given pitch/expression. Pitchless vocals, like growls, screams, and whispers, need to be consistent, too.


First off, regardless of vocal style, you'll typically want your voicebank to be sung on a set pitch rather than simply spoken in a monotone. What you put into it is what you get out of it, so if you want your UTAU to sound like it's singing, you gotta sing yourself. Think of Mr. Brightside (or a lot of Killers songs, frankly); the pitch doesn't change much throughout the song, but it is still very much being sung. That being said, generally avoid adding vibrato or extreme modulation to your samples, as those won't always translate well into the software.

When selecting what pitch(es) to record at, you'll first want to be familiar with your vocal range — that is, what notes you can comfortably and consistently sing. Remember, you're going to have to produce these notes for the entire recording session; pushing your voice too far can cause it to become strained and affect the quality of your performance, and in the worst case scenario can actually damage your vocal cords.

When recording monopitch, pick a note somewhere in your comfortable range that is towards the middle of what you want your UTAU's range to be. When I record monopitch for KATSU I usually go for E3 or F#3, which are in the lower-middle section of my range, so that his voice will be strong on the low notes but still sound good for the whole 3rd octave. For HIRO, I want her to be able to sing higher parts compared to KATSU, so I record monopitch for her at B3, where I can comfortably maintain her tone without my voice cracking or switching into head voice.

To figure out your range and also keep pitch consistent while you record, you can use a digital tuning fork, like this one, which will play a tone at the indicated pitch that you can sing along with to match. You probably won't need to have this going the entire time you record; just check in with it periodically to make sure you're not getting wildly out of tune.
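
If you'd rather know the exact frequency behind a note name, it can be computed from the standard A4 = 440 Hz tuning. Here's a small Python sketch using scientific pitch notation (the note names are assumptions on my part; use whatever your tuner displays):

```python
def note_to_hz(name: str) -> float:
    """Convert a note name like 'E3' or 'F#3' to a frequency in Hz (A4 = 440)."""
    offsets = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
               "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}
    pitch, octave = name[:-1], int(name[-1])
    midi = 12 * (octave + 1) + offsets[pitch]  # MIDI note number (C4 = 60)
    return 440.0 * 2 ** ((midi - 69) / 12)
```

For example, E3 works out to about 164.8 Hz and B3 to about 246.9 Hz.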


While tempo (how fast you sing) isn't super important for unstringed voicebanks, it is important for stringed ones, so getting into the habit of recording with a metronome of some kind now might save you some trouble if you go on to make more complex voicebanks. Essentially, we want to make sure that the duration (length of time) of each sample is as consistent as possible, to ensure easier configuration and more consistent vocal quality overall.

For unstringed CVs, each syllable is usually held for one to four beats at a tempo of 100 to 150 BPM, or roughly 400 to 2400 milliseconds (about half a second to two and a half seconds); we don't typically want to go outside of this range. Very short samples won't have much vowel for the resampler to loop, stretch, or blend, so they won't sound as natural, and very long samples are more likely to capture inconsistencies in the human vocal performance. Either extreme can result in poor articulation of consonants, thus affecting the voicebank's clarity and pronunciation.
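
The beat-to-milliseconds conversion is just 60,000 / BPM per beat. As a quick sanity check in Python:

```python
def sample_duration_ms(beats: float, bpm: float) -> float:
    """Length of a syllable held for `beats` beats at `bpm` beats per minute."""
    return beats * 60_000 / bpm

# At 120 BPM, for instance, one beat is 500 ms and four beats is 2000 ms,
# comfortably inside the recommended sample-length window.
```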

You can find several metronomes for free online, like this one.

☆ Recording Your Voicebank ☆

Before you start, you'll need to create a voicebank folder to store all of your files in. Inside of UTAU\voice, create a new folder and name it whatever you like; you can rename this later if you don't know what your voicebank will be called yet. If you want, you can create a subfolder inside of this to store the audio files themselves, but this isn't a requirement.

Software & Tools

The main programs you will see used for UTAU voicebanks are Audacity, a free general-purpose audio recording & editing program, and OREMO, another free program made specifically for recording UTAU voicebanks. There's also Akorin, an alternative to OREMO created by my friend KLAD, though it's still fairly early in development and I admittedly haven't used it myself.

OREMO has many advantages over Audacity:

  • Built-in metronome to help you stay in time
  • Built-in tuning fork to help you stay on pitch
  • Reclist hosted natively in the software window
  • Files are automatically named and exported
  • Ability to use guideBGMs

However, it can also require more setup and isn't quite as intuitive to use, so I'll be going over both programs here.

Recording in Audacity

Arrange the windows on your computer so that you can see the software and the reclist at the same time. If you are using any external tools like a metronome or tuning fork, have those open as well.

Once you are ready, match your pitch with the tuning fork, start your metronome at the desired tempo, hit the big red Record button, and begin singing the first CV. Hold it for one to four beats, pause for a few beats, then sing the next one, pause again, sing the next one, and so on. Continue until you are finished with the list.

If you mess up or just aren't happy with a sample, you can redo it as many times as you need to. You don't need to record the entire list in one take, either; you can always stop the recording and start a new one later. You can even record each sample one at a time if you prefer.

Once you have your samples recorded, you'll need to export them as .wav files into your voicebank folder. To do this, highlight the sample with the Selection Tool, making sure to capture a little bit of silence on either side of it so you don't clip off any of the consonant or vowel. The samples will be the dark sections on the waveform, and should be fairly easy to see if your audio is clean.

Next, navigate to File > Export > Export Selected Audio*. When the export window pops up, select WAV (Microsoft) signed 16-bit PCM from the Save-as type dropdown menu. Navigate to your voicebank folder in the file window, and name the sample [CV].wav. For instance, if you are exporting the sample [ka] / [か], you would name it ka.wav or か.wav.

* Make sure to choose this specifically and not Export as WAV, because that will export the entire audio track.
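
Once everything is exported, it's easy to miss a file or typo a name. Here's a hypothetical Python helper (nothing UTAU-specific, just filename matching) that lists any reclist entries without a matching .wav, assuming a plain .txt reclist with one CV per line:

```python
from pathlib import Path

def missing_samples(voicebank_dir: str, reclist_path: str) -> list[str]:
    """Return reclist entries that have no matching .wav in the voicebank folder."""
    entries = [line.strip()
               for line in Path(reclist_path).read_text(encoding="utf-8").splitlines()
               if line.strip()]
    recorded = {p.stem for p in Path(voicebank_dir).glob("*.wav")}
    return [e for e in entries if e not in recorded]
```

Anything it returns still needs to be recorded, re-exported, or renamed.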

Recording in OREMO

Because OREMO automatically exports audio files, the first thing you should do is set the Recording Folder at the bottom of the window to be your voicebank folder. You can save them to OREMO's automatic results folder if you want, but you'll have to move them later anyways.

You'll want to have whatever reclist you plan to use saved as a .txt file somewhere on your computer (I keep a reclists folder in my OREMO program folder). Go to File > Load Voice List, find the reclist, and open it. Now it should appear on the lefthand side of the window, and the currently selected string or sample should show up in large text at the top.

The options next to the reclist are suffixes that you can automatically append to the name of each .wav file when exported, but we don't need to use these since we are just recording monopitch, so make sure the empty space is selected.

If you want to make sure you're using the correct microphone, go to Options > Audio I/O Settings and select it from the Input Device dropdown menu. You can also adjust other audio settings in this window.

Next, navigate to Option > Recording Style Settings to open up this window. If you don't want to mess with guideBGMs, make sure Manual recording is selected.

To use OREMO's built-in tuning fork and metronome, go to Show > Pitch Guide and Show > Tempo Guide and adjust the settings of these however you like. You can leave them open while you record as well.

Now we're ready to record. Match your pitch, start your metronome, and hold down <R> on your keyboard to begin recording. Sing the CV you have selected, and release the <R> key when you are finished. Make sure to capture a little bit of silence before and after each recording so that you don't clip off any of the consonant or vowel.

Select the next CV on the list (or press the down arrow on your keyboard) to move on to the next recording, and do the same thing. Continue until you've recorded the entire list.

OREMO will automatically export the last recording as a .wav whenever you move to a different recording on the list. To listen back to a recording, press the space bar. To redo a recording, have the string you want to redo selected, and hold down <R> again to overwrite it.

← Part 5: Using UTAU

Part 7: Configuration (coming soon) →