Making Base OTOs

Getting Started

The general process is this: for each sample you wish to configure in a voicebank, write out a line that follows this format while setting each of the parameters to the desired value:

[filename].wav=[alias],[offset],[consonant],[cutoff],[preutterance],[overlap]

But of course, we need to know what these values should be in the first place, and there are some techniques we can implement to make this faster and easier than typing every line by hand.
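If you're comfortable with a little scripting, the line format above can also be generated rather than typed. Here's a minimal Python sketch; the function name oto_line and its argument names are placeholders for illustration, not part of UTAU or any existing tool. Parameters you haven't decided yet are simply left blank.

```python
def oto_line(filename, alias="", offset="", consonant="",
             cutoff="", preutterance="", overlap=""):
    """Build one oto line in the order alias,offset,consonant,cutoff,preutterance,overlap."""
    values = [alias, offset, consonant, cutoff, preutterance, overlap]
    # Blank values stay empty, so unfinished lines look like "ka.wav=ka,,,,,"
    return f"{filename}={','.join(str(v) for v in values)}"


print(oto_line("ka.wav", "ka"))
print(oto_line("ka.wav", "ka", 900, 250, -400, 100, 25))
```

The first call gives an empty skeleton line and the second a fully populated one.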

A well-structured reclist will have a consistent way that strings are written and organized. Before we start, we need to know what types of strings we'll be working with, what types of samples we need to extract from each string, and how many samples per string there will be. From there, we can build a base configuration for one string of each type, and use this to build the rest of the base.

There is not really a step-by-step process to this, and more likely you will jump back and forth between different tasks as you build and refine your base.

Aliasing and Duplicates

Aliases are what UTAU (and anyone using your voicebank) will use to differentiate between samples, so however you alias your sample types should be consistent. No two aliases should be exactly the same, especially if they're meant to be different phonetically, because UTAU will only read whichever comes first in the oto. While duplicates are typically unnecessary, if you feel you have good reason to include them, just be sure to make them different in some way, like by adding a prefix or suffix.

As for giving the same sample two different aliases, this is something you will do after the voicebank is configured; otherwise you will have to manually copy the values from one oto line to the other.

I also recommend still setting an alias even if it's the same as the file name, just to be sure it will be read correctly by UTAU.

Tempo and Timing

The tempo the voicebank is recorded at is really important for determining the values of the base oto, because it will affect the starting point and duration of every phone in the sequence.

Tempo is measured in beats per minute (bpm), and the parameter values are measured in milliseconds (ms). Luckily for us, we can convert between these pretty easily to determine roughly how far apart each sample will be. Here's a chart of the cleanest tempos to work in:

Tempo	1/4 beat	1/2 beat	1 beat	2 beats	4 beats	8 beats
75 bpm	200 ms	400 ms	800 ms	1600 ms	3200 ms	6400 ms
100 bpm	150 ms	300 ms	600 ms	1200 ms	2400 ms	4800 ms
120 bpm	125 ms	250 ms	500 ms	1000 ms	2000 ms	4000 ms
150 bpm	100 ms	200 ms	400 ms	800 ms	1600 ms	3200 ms

I would recommend not going above 150 bpm or below 75 bpm, because it will become more difficult to keep in time, and 100 bpm in regular time (one mora/syllable per beat) is the same as 200 bpm in half time (each mora/syllable held for two beats) or 50 bpm in double time (two morae/syllables per beat) anyways. It's all just math.

120 bpm and 100 bpm in regular time are probably the most common tempos to record at, because they are slow enough for clear articulation and stable vowels, but still relatively fast and easy to count. Recording unstringed or short-stringed reclists at 120 bpm or 100 bpm in half time to get longer vowels is also fairly common.

When setting up our base values, knowing that every mora/syllable was held for 1 beat at 120 bpm tells us that each one will be approximately 500 ms long, the halfway point will occur every 250 ms, an 8-mora string will be about 4000 ms long, and so on. Point being, we can extrapolate a lot of useful numbers from this that will help us write our base oto.
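The conversion behind these numbers is just 60000 ms divided by the tempo. Here's a small Python sketch of it; beat_ms is a made-up helper name, and the loop only reproduces the tempos from the chart.

```python
def beat_ms(bpm, beats=1.0):
    """Length in milliseconds of the given number of beats at the given tempo."""
    return 60000.0 / bpm * beats


# Reproduce the chart: 1/4, 1/2, 1, 2, 4, and 8 beats at each tempo
for bpm in (75, 100, 120, 150):
    print(bpm, "bpm:", [beat_ms(bpm, b) for b in (0.25, 0.5, 1, 2, 4, 8)])

# One syllable per beat at 100 bpm lasts as long as
# two beats at 200 bpm or half a beat at 50 bpm
print(beat_ms(100, 1), beat_ms(200, 2), beat_ms(50, 0.5))
```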

Additionally, if the voicebank was recorded along with a guideBGM, or otherwise has a consistent number of beats before the sung note begins in each .wav file, we can fairly accurately predict the offset values as well.

Breaking Up Strings

Single Syllables

Isolated syllables, like those found in a CV reclist or a CVC-style CVVC, can either be stringed or unstringed. If they are unstringed, you are typically only going to have one oto line per .wav file, or two in the case of CVC, and your oto will end up looking something like this:

ka.wav=ka,,,,,

Or this:

kak.wav=- ka,,,,,
kak.wav=a k-,,,,,

Isolated vowels might be split into more samples, depending on what's needed from the reclist.

a.wav=- a,,,,,
a.wav=a,,,,,
a.wav=a -,,,,,

If the isolated syllables are stringed, then they will be separated by silence, and simply oto'd in sequence.

ka_ki_ku.wav=ka,,,,,
ka_ki_ku.wav=ki,,,,,
ka_ki_ku.wav=ku,,,,,

kak_kik_kuk.wav=- ka,,,,,
kak_kik_kuk.wav=- ki,,,,,
kak_kik_kuk.wav=- ku,,,,,
kak_kik_kuk.wav=a k-,,,,,
kak_kik_kuk.wav=i k-,,,,,
kak_kik_kuk.wav=u k-,,,,,

I prefer to keep like-samples together because it makes the otoing process a bit smoother.

Strings

VCV strings follow a straightforward pattern of an initial CV followed by a sequence of VCVs.

ka-ka-ki-ka-ku.wav=- ka,,,,,
ka-ka-ki-ka-ku.wav=a ka,,,,,
ka-ka-ki-ka-ku.wav=a ki,,,,,
ka-ka-ki-ka-ku.wav=i ka,,,,,
ka-ka-ki-ka-ku.wav=a ku,,,,,

VV strings are much the same, though they might have a vowel end as well.

a-a-i-a-u.wav=- a,,,,,
a-a-i-a-u.wav=a a,,,,,
a-a-i-a-u.wav=a i,,,,,
a-a-i-a-u.wav=i a,,,,,
a-a-i-a-u.wav=a u,,,,,
a-a-i-a-u.wav=u -,,,,,

CVVC strings will be broken into CVs, VCs, and sometimes initial and final consonants. Rentan CV functions the same, but with only the CVs.

sa-si-su-sas.wav=- s,,,,,
sa-si-su-sas.wav=a s,,,,,
sa-si-su-sas.wav=i s,,,,,
sa-si-su-sas.wav=u s,,,,,
sa-si-su-sas.wav=si,,,,,
sa-si-su-sas.wav=su,,,,,
sa-si-su-sas.wav=sa,,,,,
sa-si-su-sas.wav=s -,,,,,

2-mora CVVC, which also includes initial CVs and final VCs, looks something like this:

ka-kak.wav=- ka,,,,,
ka-kak.wav=ka,,,,,
ka-kak.wav=a k,,,,,
ka-kak.wav=a k-,,,,,

Of course, there are many, many other ways strings can be organized, as well as different ways of styling aliases, but I hope this gets the basic idea across: figuring out what samples you're taking from a given string type, figuring out how you want to organize them, and creating one oto line for each.

TIP: In Notepad++, you can write out [filename].wav=,,,,, and hit <CTRL+D> on your keyboard to duplicate the line as many times as you need. Then just fill in or change the aliases.
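If your file names already spell out the string, the blank lines can also be generated. The sketch below covers only the VCV pattern shown earlier (an initial CV followed by VCVs). It assumes romaji morae separated by hyphens, where the last letter of each mora is its vowel; vcv_aliases and blank_vcv_lines are hypothetical names, and anything like a standalone "n" would need special handling.

```python
def vcv_aliases(morae):
    """Return '- CV' for the first mora, then 'V CV' for each one after it."""
    aliases = []
    prev_vowel = None
    for mora in morae:
        if prev_vowel is None:
            aliases.append(f"- {mora}")
        else:
            aliases.append(f"{prev_vowel} {mora}")
        # Assumption: the final letter of a romaji mora is its vowel
        prev_vowel = mora[-1]
    return aliases


def blank_vcv_lines(stem, sep="-"):
    """Build empty oto lines for a file stem such as 'ka-ka-ki-ka-ku'."""
    return [f"{stem}.wav={alias},,,,," for alias in vcv_aliases(stem.split(sep))]


print("\n".join(blank_vcv_lines("ka-ka-ki-ka-ku")))
```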

Setting Base Values

The base values of a particular sample type should be an estimate based on the average value of a given parameter. The goal here isn't to be 100% accurate since these numbers will be tweaked later during actual configuration; it's just to make the process faster and easier by having them already set to around the right area. It also allows us to preset any fixed values so that we don't have to do them manually.

In short, if we know about what value a given parameter should be, that's what we use for our base. The easiest way to determine this is just by looking at a fully configured oto and guessing based on that, but there are ways we can be more precise.

Preutterance and Overlap

As a general rule, the overlap should be between a quarter and a third of the preutterance. These values are usually the easiest to determine because they are not much affected by tempo and can often be fixed.

For Typical CVs: For an unstringed CV reclist or a medial CV in another reclist type, the preutterance should be about 100 and the overlap should be about 25.

For VCVs: Preutterance and overlap typically have a fixed value. I recommend setting them to 300 and 100, respectively.

For Initial CVs: Same as VCV, but I usually recommend setting these to be 200 and 50 to reduce the chance of noise being captured.

For VCs and Vowel Ends: Like VCVs, these values will usually be fixed. If the syllables/morae are shorter than 600 ms, I recommend setting the preutterance to 200 and the overlap to 50. Otherwise, 300 and 100 will work here, too.

For Initial and Final Cs: These can also have a base preutterance of around 100 and an overlap of around 25.
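These recommendations fit neatly into a lookup table if you plan to script your base. The sketch below is one way to store them in Python; the sample-type keys and the function name base_pre_ovl are labels invented for this example.

```python
# (preutterance, overlap) suggestions from the list above, in ms
BASE_PRE_OVL = {
    "cv": (100, 25),
    "vcv": (300, 100),
    "initial_cv": (200, 50),
    "vc": (200, 50),
    "vowel_end": (200, 50),
    "initial_c": (100, 25),
    "final_c": (100, 25),
}


def base_pre_ovl(sample_type, syllable_ms=500):
    """Look up base preutterance and overlap for a sample type."""
    # VCs and vowel ends can use the longer VCV values once
    # syllables are 600 ms or more
    if sample_type in ("vc", "vowel_end") and syllable_ms >= 600:
        return (300, 100)
    return BASE_PRE_OVL[sample_type]


print(base_pre_ovl("vcv"))
print(base_pre_ovl("vc", syllable_ms=600))
```

Every pair in the table keeps the overlap between a quarter and a third of the preutterance.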

Consonant

The consonant parameter should always be longer than the preutterance.

For CVs and VCVs: The consonant parameter usually falls between 100 and 200 ms after the preutterance, so setting our base consonant to be 150 more than our base preutterance is a good estimate.

If preutterance is...	Then consonant will be about...
100	250
200	350
300	450

For Transitional VCs and Initial Cs: Consonants are much shorter than vowels, so with our limited space to work with, our consonant parameter shouldn't be that much greater than our preutterance. I find that setting it to be 25 ms after is usually good.

If preutterance is...	Then consonant will be about...
100	125
200	225
300	325

For Final VCs: Here, the consonant parameter should cover the entire coda. We can treat this the same as a VCV and set this value to be 150 ms after our preutterance.

For Vowel Ends and Final Cs: Same as the above, but it doesn't need to be quite so long. Setting it about 100 ms after the preutterance is enough, which gives smaller values of 200, 300, or 400 for preutterances of 100, 200, or 300.
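The two tables in this section boil down to "preutterance plus a fixed gap". As a rough Python sketch (the keys and the name base_consonant are invented for illustration), that looks like the following. Vowel ends and final Cs are left out of the table here, since they use their own shorter values.

```python
# How far after the preutterance the consonant marker sits, in ms
CONSONANT_GAP_MS = {
    "cv": 150,
    "vcv": 150,
    "final_vc": 150,
    "vc": 25,         # transitional VCs
    "initial_c": 25,
}


def base_consonant(sample_type, preutterance):
    """Estimate the consonant value from the preutterance."""
    return preutterance + CONSONANT_GAP_MS[sample_type]


for pre in (100, 200, 300):
    print(pre, base_consonant("cv", pre), base_consonant("vc", pre))
```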

Cutoff

If you don't know, the cutoff parameter can be set to a negative number. This means that rather than counting the milliseconds from the end of the .wav file, it counts the milliseconds from the start of the offset. This is easier to work with, because it means we don't have to factor in the full duration of the .wav file, and we can set our base to a more consistent value.

For CVs and VCVs: We want the base cutoff value to be a little less than the preutterance plus the note duration — in other words, the maximum duration of the sample. Setting it to be 200 ms less than this value is usually good. Thus, the formula is Cutoff = -1 * (Duration + Preutterance - 200).

If the note duration is...	And the preutterance is...	Then the cutoff will be about...
500 ms	100	-400
500 ms	300	-600
600 ms	100	-500
600 ms	300	-700

Note: sustained/double vowels, like those found in VV strings, are typically twice the normal duration.
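Here's the cutoff formula as a short Python sketch, in case you want to check the table or try other durations. The function name sustained_cutoff and the margin_ms argument are placeholders of mine; 200 is just the margin suggested above.

```python
def sustained_cutoff(duration_ms, preutterance, margin_ms=200):
    """Cutoff = -1 * (Duration + Preutterance - 200), as a negative cutoff."""
    return -1 * (duration_ms + preutterance - margin_ms)


# The four rows of the table
for duration, pre in ((500, 100), (500, 300), (600, 100), (600, 300)):
    print(duration, pre, sustained_cutoff(duration, pre))

# A double-length vowel in a VV string at 120 bpm: twice the 500 ms duration
print(sustained_cutoff(2 * 500, 300))
```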

For VCs, Initial Cs, Final Cs, and Vowel Ends: The looping/stretching area for all of these should be pretty small, so we can usually set the cutoff to be about 25 ms after the consonant value. 50 ms is also fine for final VCs, final Cs, and vowel ends.

If the consonant is...	Then the cutoff will be about...
200	-225 / -250
225	-250 / -275
250	-275 / -300

And so on.
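For these shorter sample types the arithmetic is even simpler. A minimal sketch, with short_cutoff and gap_ms as made-up names:

```python
def short_cutoff(consonant, gap_ms=25):
    """Negative cutoff placed gap_ms after the consonant marker."""
    return -(consonant + gap_ms)


for consonant in (200, 225, 250):
    # 25 ms gap for any of these types; 50 ms is also fine for
    # final VCs, final Cs, and vowel ends
    print(consonant, short_cutoff(consonant), short_cutoff(consonant, gap_ms=50))
```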

Offset

For every other parameter, the base value will be the same regardless of the sample's position in the string, which is determined by the offset.

Samples of the same type should be more-or-less evenly spaced in the base oto, and we can predict their positions using our tempo and timing:

If our tempo is...	Then continuous syllables will be...	And separated syllables will be...
120 bpm (regular)	500 ms apart	1000 ms apart
120 bpm (half)	1000 ms apart	2000 ms apart
100 bpm (regular)	600 ms apart	1200 ms apart
100 bpm (half)	1200 ms apart	2400 ms apart

Next, we need to factor in the duration of time before the first note in the .wav file begins. For example, if the recording captures two beats of silence at 120 bpm before the first note is sung, then we will need to add an additional 1000 ms to every offset.

For CV and VCV samples, we can calculate the offset using this formula: Offset = S + (D * (X - 1)) - P, where S is the duration of silence at the beginning of the recording, D is the duration of each note in the string, X is the position of the syllable we are sampling from within the string, and P is the preutterance value.

I promise this is less complicated than it sounds; let's look at it in practice. A VCV recorded at 120 bpm in regular time will end up with a base oto something like this:

ka-ka-ki-ka-ku.wav=- ka,800,300,-600,200,50
ka-ka-ki-ka-ku.wav=a ka,1200,400,-600,300,100
ka-ka-ki-ka-ku.wav=a ki,1700,400,-600,300,100
ka-ka-ki-ka-ku.wav=i ka,2200,400,-600,300,100
ka-ka-ki-ka-ku.wav=a ku,2700,400,-600,300,100

The value of the initial CV offset is 1000 + (500 * (1 - 1)) - 200 → 1000 + (500 * 0) - 200 → 1000 + 0 - 200 = 800.

The first VCV is on the second syllable, so we get 1000 + (500 * (2 - 1)) - 300 → 1000 + (500 * 1) - 300 → 1000 + 500 - 300 = 1200. Every subsequent VCV will just be 500 ms after the previous one, so we follow the pattern from there.
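To double-check the arithmetic, here's the same formula as a Python sketch. base_offset and its argument names are only illustrative; the inputs are the 1000 ms of silence and 500 ms notes from the example.

```python
def base_offset(silence_ms, note_ms, position, preutterance):
    """Offset = S + (D * (X - 1)) - P"""
    return silence_ms + note_ms * (position - 1) - preutterance


# Initial CV (preutterance 200) followed by four VCVs (preutterance 300)
offsets = [base_offset(1000, 500, 1, 200)]
offsets += [base_offset(1000, 500, x, 300) for x in range(2, 6)]
print(offsets)  # [800, 1200, 1700, 2200, 2700]
```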

For VCs, initial/final Cs, and vowel ends, the offset will be placed about 150 ms before it would be if it were a VCV with the same preutterance. So a CVVC base oto will look something like this:

sa-si-su-sas.wav=- s,850,225,-250,100,25
sa-si-su-sas.wav=a s,1150,225,-250,200,50
sa-si-su-sas.wav=i s,1650,225,-250,200,50
sa-si-su-sas.wav=u s,2150,225,-250,200,50
sa-si-su-sas.wav=si,1400,250,-400,100,25
sa-si-su-sas.wav=su,1900,250,-400,100,25
sa-si-su-sas.wav=sa,2400,250,-400,100,25
sa-si-su-sas.wav=s -,2850,200,-250,100,25

So the process for finding the offset of the first VC is 1000 + (500 * (2 - 1)) - (200 + 150) → 1000 + (500 * 1) - 350 → 1000 + 500 - 350 = 1150.
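In script form, the VC version just subtracts the extra 150 ms along with the preutterance. A sketch, with vc_offset and lead_ms as assumed names:

```python
def vc_offset(silence_ms, note_ms, position, preutterance, lead_ms=150):
    """VCV-style offset, pulled a further lead_ms earlier."""
    return silence_ms + note_ms * (position - 1) - (preutterance + lead_ms)


# The three transitional VCs in the string: "a s", "i s", "u s"
print([vc_offset(1000, 500, x, 200) for x in (2, 3, 4)])  # [1150, 1650, 2150]
```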

We can apply this same logic to however else we may need to break up our string:

ka_ki_ku.wav=ka,900,250,-400,100,25
ka_ki_ku.wav=ki,1900,250,-400,100,25
ka_ki_ku.wav=ku,2900,250,-400,100,25

ka-kak.wav=- ka,800,300,-600,200,50
ka-kak.wav=ka,1400,250,-400,100,25
ka-kak.wav=a k,1150,225,-250,200,50
ka-kak.wav=a k-,1650,300,-350,200,50

At least, this should work in theory. I find that my recordings are off-time by an average of about 150 ms, but I'm not sure if this is because of OREMO or because of human error. Still, that's just a matter of adjusting each offset by that much. For example:

ka-ka-ki-ka-ku.wav=- ka,950,300,-600,200,50
ka-ka-ki-ka-ku.wav=a ka,1350,400,-600,300,100
ka-ka-ki-ka-ku.wav=a ki,1850,400,-600,300,100
ka-ka-ki-ka-ku.wav=i ka,2350,400,-600,300,100
ka-ka-ki-ka-ku.wav=a ku,2850,400,-600,300,100
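If you need to nudge a whole base at once, that adjustment is easy to script. The sketch below is one possible approach, not a feature of OREMO or UTAU: shift_offsets is a hypothetical helper that adds a fixed number of milliseconds to the offset field of every oto line and leaves blank or malformed lines alone.

```python
def shift_offsets(oto_text, shift_ms):
    """Add shift_ms to the offset (second field) of each oto line."""
    shifted = []
    for line in oto_text.splitlines():
        name, sep, params = line.partition("=")
        fields = params.split(",")
        # Expect alias,offset,consonant,cutoff,preutterance,overlap
        if sep and len(fields) == 6 and fields[1].strip():
            value = float(fields[1]) + shift_ms
            # Keep whole numbers free of a trailing ".0"
            fields[1] = str(int(value)) if value == int(value) else str(round(value, 3))
            line = name + "=" + ",".join(fields)
        shifted.append(line)
    return "\n".join(shifted)


base = (
    "ka-ka-ki-ka-ku.wav=- ka,800,300,-600,200,50\n"
    "ka-ka-ki-ka-ku.wav=a ka,1200,400,-600,300,100"
)
print(shift_offsets(base, 150))
```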

Creating The Full Base

The next part of the process is relatively simple: duplicate the lines of the first string, change the relevant character(s), and continue on down until you've covered every string in the reclist.

You can do this by copy+paste followed by find+replace (CTRL+H), or by using regex if you're familiar with it. In Notepad++, you can find+replace characters and perform regex operations within a selection, so you can build your base without even switching windows or using placeholder characters.

If your strings and/or aliases are in kana rather than Latin characters, this may take more time since there's less you can auto replace.

Repeat this for every string type you need, and save the file as an .ini file. UTAU expects it to be named oto.ini and to sit in the same folder as the samples.
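The duplicate-and-replace step can also be done with a short script instead of a text editor. This is only a sketch under a few assumptions: the strings and aliases are in romaji, every string follows the same pattern as the template, and the consonant is a plain letter swap (irregular morae like "shi" or "tsu" would need their own handling). The name retarget is made up for this example.

```python
import re

# Template: the VCV base for one string, copied from the example above
TEMPLATE = (
    "ka-ka-ki-ka-ku.wav=- ka,800,300,-600,200,50\n"
    "ka-ka-ki-ka-ku.wav=a ka,1200,400,-600,300,100\n"
    "ka-ka-ki-ka-ku.wav=a ki,1700,400,-600,300,100\n"
    "ka-ka-ki-ka-ku.wav=i ka,2200,400,-600,300,100\n"
    "ka-ka-ki-ka-ku.wav=a ku,2700,400,-600,300,100\n"
)


def retarget(template, old_consonant, new_consonant):
    """Swap the consonant wherever it starts a mora in file names and aliases."""
    # Match the consonant only at the start of a mora (not after a letter,
    # digit, or the dot of ".wav") and only when a vowel follows it
    pattern = re.compile(rf"(?<![\w.]){re.escape(old_consonant)}(?=[aiueo])")
    return pattern.sub(new_consonant, template)


# Build the base for the k, s, and t strings in one go
full_base = "".join(retarget(TEMPLATE, "k", c) for c in ("k", "s", "t"))
print(full_base)
```

The parameter values carry over untouched, so only the file names and aliases change from string to string.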