☢ Multipitch Voicebanks ☢

A comprehensive guide on creating multipitch and multiexpression UTAU voicebanks. This resource assumes at least a basic knowledge of UTAU. If you need help with terminology or linguistic concepts, check the vocabulary page.

☆ What are Multipitch Voicebanks? ☆

Multipitch is a general term used to describe voicebanks recorded at more than one musical pitch, each identified by its note name, such as F₃, C₅, A#₂, etc. These can sound more naturalistic compared to monopitch voicebanks as they preserve more of the natural tone shifts as a singer goes up and down their range.

Voicebanks with two pitches are sometimes called "dipitch" and voicebanks with three pitches are sometimes called "tripitch", but it's rare to see that nomenclature used for voicebanks with four or more pitches.

Multiexpression is a very similar concept, but rather than referring to the use of multiple pitches, it refers to the use of different vocal styles, such as having both a power expression and a soft expression.

To avoid confusion, the general term I will be using to describe individual pitches and expressions is subbank, as in subsidiary voicebank, but you may also see "pitch" used in this way.

Importantly, this type of voicebank involves recording an entire reclist more than once for the same voicebank. The subbanks are not always perfectly symmetrical, but wildly different sample types and aliasing between subbanks would make the voicebank very difficult to use, and users will typically assume that every subbank contains the same set of samples.

EVEC-style voicebanks — those which are inspired by the VOCALOID 4 voicebank Luka Megurine V4X — are somewhat of an exception, recording only the vowels in multiple styles and then blending them with other CVs, but these don't generally work as well as fully recorded multiexpression banks, and aren't what I'll be talking about here.

This guide will cover how to create multipitch and/or multiexpression voicebanks in classic UTAU and OpenUTAU. For an example of a voicebank which is both, KATSU MultiEX is a Japanese voicebank with multiple expressions, each of which was recorded at multiple pitches.

☆ Voicebank Planning ☆

While you can of course jump right into recording or change directions for a voicebank you've already started, with larger voicebanks it's good to have an idea of what will be included ahead of time.

Voicebank Size Limit

While this isn't a problem in OpenUTAU, classic UTAU has a limit to how big voicebanks can be. More specifically, there is a limit to how many oto lines the software can read in the in-engine oto editor before it will crash.

To be exact, the number of total lines in the oto, so the combined total of every oto.ini file in the voicebank, cannot exceed 2^15, or 32,768 lines. This number is quite large and unlikely to be a problem in the majority of cases, but it's good to be aware of it in the case that you're planning some massive voicebank project, like a multipitch multiexpression English bank for example.

So, on the off chance that your voicebank exceeds the oto limit, you might consider releasing a version that's split into separate smaller banks in order to be compatible with classic UTAU, or at least put a warning on your voicebank about it.
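If you'd rather check than guess, the combined line count is easy to tally with a script. This is only a rough sketch: the function names are my own, and I'm assuming that blank lines don't count toward the limit, which I haven't verified against UTAU itself.

```python
from pathlib import Path

OTO_LINE_LIMIT = 2 ** 15  # 32,768, the limit described above


def count_oto_lines(voicebank_dir):
    """Count the non-empty lines of every oto.ini under voicebank_dir."""
    total = 0
    for oto_path in Path(voicebank_dir).rglob("oto.ini"):
        # Read bytes so the file's text encoding (often Shift-JIS) doesn't matter.
        with open(oto_path, "rb") as handle:
            total += sum(1 for line in handle if line.strip())
    return total


def fits_classic_utau(voicebank_dir):
    """True if the combined oto line count stays within the limit."""
    return count_oto_lines(voicebank_dir) <= OTO_LINE_LIMIT
```

Point `count_oto_lines` at the voicebank's root folder; it walks the subfolders for you.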

Vocal Range

Firstly, it's good to be aware of your vocal range. You can measure this using a digital tuning fork like this one or a physical instrument like a piano. Start by matching a pitch like C₄ (AKA "middle C"), then matching lower and lower pitches until it becomes difficult or uncomfortable. Then, repeat the process in the other direction, matching higher and higher pitches. This will give you a general idea of your comfortable range; it's good to test this on different days and at different times of the day to get a feel for how it varies.

Note: While you can push yourself to record in uncomfortable pitches and styles, it's not a good idea unless you really know what you're doing. Doing this can strain your vocal folds and cause temporary damage. Doing this to your voice repeatedly over a short period of time can cause permanent damage, so exercise appropriate caution — if your voice starts to feel tired or sore while recording, stop and take a break. If your voice feels tired or sore from the get-go on a certain pitch or expression, it's probably not a good idea to record that in the first place.

For an untrained vocalist, the comfortable vocal range will probably be around two octaves, inclusively (25 semitones), but might be more or less depending on physiological factors and how much singing you do in your spare time. For example, my comfortable range is about G₂ to B₄ (29 semitones). A trained vocalist might have an even larger one.
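The "inclusive" semitone counts above are just the distance between the two notes plus one. As a small sketch (the helper names are mine, and I'm using the common convention where C₄ is number 60):

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_number(name):
    """Convert a name like 'G2' or 'A#3' to a semitone number (C4 = 60)."""
    pitch, octave = name[:-1], int(name[-1])
    return 12 * (octave + 1) + NOTE_NAMES.index(pitch)


def inclusive_span(lowest, highest):
    """Number of semitones in a range, counting both end notes."""
    return note_number(highest) - note_number(lowest) + 1


print(inclusive_span("C3", "C5"))  # 25, two octaves inclusive
print(inclusive_span("G2", "B4"))  # 29
```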

You'll also want to pay attention to the point at which you switch from your full voice / chest voice to your falsetto / head voice. To give a simplified explanation, this happens when the muscles in your throat move to a different position in order to produce higher frequencies, which in turn shifts your vocal register and changes the quality of your voice from its fuller, more resonant sound to one that is often lighter and weaker. There may be a handful of notes you can perform in either register.

Singgeek has some good videos on the topic of vocal range and register if you'd like to learn more, including one that can help you find your range.

For a quick note on range classification, you may see these classical terms being used to describe both human singers and voicebanks based on their range:

Name	Type	Range
Soprano	feminine high range	C₄ ~ C₆ (or higher)
Mezzo-soprano	feminine mid range	A₃ ~ A₅
Contralto	feminine low range	F₃ ~ F₅ (or lower)
Tenor	masculine high range	C₃ ~ C₅ (or higher)
Baritone	masculine mid range	A₂ ~ A₄
Bass	masculine low range	E₂ ~ E₄ (or lower)

But these aren't hard labels that every singer is going to fall neatly into, just general categories.

With all of this in mind, you can figure out what pitches you (or the vocalist you are working with) can record at.

Selecting Pitches

Generally, you'll want to avoid recording pitches that are too close together, as it will create unnecessary work for the vocalist and the voicebank developer for not much difference in the software. You'll also typically want to avoid recording pitches that are too far apart, as they won't transition smoothly into each other and can create choppy jumps in the render; if the seams between pitches are super obvious, it ruins the illusion, so to speak.

Where exactly the line should be drawn is of course going to vary by vocalist, but generally I find that keeping the distance between pitches at 3 to 6 semitones works well. So, for example, if I'm recording one pitch at D₃, the next higher pitch should be between F₃ and G#₃, and the next lower pitch should be between G#₂ and B₂.
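To work out that window for any starting pitch, you can count semitones in a short script. This is a sketch using my 3 to 6 semitone rule of thumb as the default; the function names are made up for the example.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_number(name):
    """Convert a name like 'C4' to a semitone number (C4 = 60)."""
    return 12 * (int(name[-1]) + 1) + NOTE_NAMES.index(name[:-1])


def note_name(number):
    """Convert a semitone number back to a name like 'C4'."""
    return NOTE_NAMES[number % 12] + str(number // 12 - 1)


def neighbor_windows(recorded, min_gap=3, max_gap=6):
    """Return ((lowest, highest) below, (lowest, highest) above) a pitch."""
    n = note_number(recorded)
    below = (note_name(n - max_gap), note_name(n - min_gap))
    above = (note_name(n + min_gap), note_name(n + max_gap))
    return below, above


print(neighbor_windows("C4"))  # (('F#3', 'A3'), ('D#4', 'F#4'))
```

So with a pitch recorded at C₄, the neighboring pitch below would fall somewhere from F#₃ to A₃, and the one above from D#₄ to F#₄.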

The best way to select pitches is to pay attention to where your tone changes while moving up and down the scale, and to try to find pitches which best represent each portion of your range. It can help to do some test recordings, like recording a single string of a reclist at a bunch of different pitches and playing around with them in the software until you find what distribution you think sounds best.

Rather than guessing randomly, though, an easy trick is to use either a major or minor chord. This will give a nice distribution of notes that can be extended upwards or downwards for more pitches.

Chord Name	Root	3rd	5th
C Major / minor	C	E / D#	G
C# Major / minor	C#	F / E	G#
D Major / minor	D	F# / F	A
D# Major / minor	D#	G / F#	A#
E Major / minor	E	G# / G	B
F Major / minor	F	A / G#	C
F# Major / minor	F#	A# / A	C#
G Major / minor	G	B / A#	D
G# Major / minor	G#	C / B	D#
A Major / minor	A	C# / C	E
A# Major / minor	A#	D / C#	F
B Major / minor	B	D# / D	F#

Note: Typically synth engines default to writing notes as sharps rather than the equivalent flats; it's not a requirement but it's common convention, so that's how I've listed them here.

For example, I usually record KATSU's voicebanks in D Major, meaning the notes I use are D, F#, and A. F#₃ is around the middle of my chest voice, so recording at D₃, F#₃, A₃, and D₄ gives me a distribution of notes I can sing comfortably in full register. I can also tack on A₂ if I want to extend him into bass range or F#₄ if I want to include a falsetto pitch.
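The chord trick is mechanical enough to script. Here's a minimal sketch that lists every chord tone between two notes; the names are my own, and it only knows the plain major and minor triads from the table above.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
CHORD_INTERVALS = {"major": (0, 4, 7), "minor": (0, 3, 7)}  # root, 3rd, 5th


def note_number(name):
    """Convert a name like 'A2' to a semitone number (C4 = 60)."""
    return 12 * (int(name[-1]) + 1) + NOTE_NAMES.index(name[:-1])


def note_name(number):
    """Convert a semitone number back to a name like 'A2'."""
    return NOTE_NAMES[number % 12] + str(number // 12 - 1)


def chord_pitches(root, quality, lowest, highest):
    """List the chord tones from lowest to highest, both ends included."""
    root_class = NOTE_NAMES.index(root)
    wanted = {(root_class + step) % 12 for step in CHORD_INTERVALS[quality]}
    span = range(note_number(lowest), note_number(highest) + 1)
    return [note_name(n) for n in span if n % 12 in wanted]


print(chord_pitches("D", "major", "A2", "F#4"))
# ['A2', 'D3', 'F#3', 'A3', 'D4', 'F#4']
```

That output is the same set of pitches I described for KATSU, with the optional bass and falsetto pitches at either end.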

Selecting Expressions

When it comes to any kind of expressive voicebank, you'll see a lot of terms in the vocal synth community that are lifted from Crypton's initial Append voicebanks for VOCALOID 2. These are a bit abstract and convey the mood of the vocal performance more so than being technical terms for a singing style.

The most common descriptors are probably "power" (also called "strong" or 強い), "soft" (also called "weak" or 弱い), and "neutral" (also called "normal" or 通常), all of which I feel are pretty self-explanatory. The original Miku Hatsune Append voicebank followed an expression matrix which resulted in six vocal styles:

Tone	Strong Sound	Weak Sound
closed	"light"	"sweet"
mid	"vivid"	"soft"
open	"solid"	"dark"

Generally it is more practical to develop the vocal style first and give it a name later; as I said, these names are just abstract descriptors. Other expression names commonly borrowed from VOCALOID are "cold", "warm", "serious", "adult", and so on.

Of course, solo expression voicebanks are quite common, but here we're talking about multiexpression banks. A multiexpression voicebank does not have to have all expressions recorded at the same pitch(es), but should ideally still use the same reclist for each of them and all pitches within.

Unlike multipitch, however, being able to blend seamlessly between expressions is not as much of a consideration; it depends on the specific way you intend the voicebank to be used. We wouldn't typically expect a smooth transition from a powerful belt to a pitchless whisper, for example. But if you intend for expressions to be blended together in the same musical phrase, then much like selecting pitches, try to find ways of singing which are not so similar as to be unnoticeable in the render, but which are not so different as to sound choppy together.

Note: You might also see the terms "powerscale" or "kire" (キレ) used, as made popular by Ritsu Namine's KIRE voicebank. This is a style of voicebank where the lowest pitch is recorded with a weaker sound that gradually becomes more powerful as the pitch increases. Despite technically involving multiple singing styles, the way these are set up and used is more in line with multipitch solo expression than true multiexpression.

All in all, what expressions you choose to include is just going to depend on the different vocal styles you or your vocalist feels comfortable singing in. "Vocal style" in and of itself is a pretty vague term, and I'm by no means a voice coach, so I'm not really equipped to give a rundown on how to sing in different ways.

☆ Making Multipitch Voicebanks ☆

Folder Set-Up

Once you know everything your voicebank is going to include, you'll want to set up your voicebank folder to accommodate all of its subbanks. While you could dump everything into the root folder and differentiate them by their filenames, it will probably be better for you and anyone else using your voicebank to organize them into subfolders.

UTAU won't read more than one subfolder deep, but we can still use them for organization. For multipitch voicebanks, folders are usually named after the pitch the subbank was recorded at, or by an arrow indicating that it is higher or lower than the medial pitch. Of course, the folder name doesn't actually matter as long as it can be read by UTAU, but it's good practice to make it obvious what's in each one.

Sometimes, the medial pitch of a multipitch bank or the neutral expression of a multiexpression bank are left in the root folder, but they can be given their own subfolder, too; it's up to personal preference.

Here are some examples of how one could organize a hypothetical 5-pitch voicebank recorded on a C Major chord:
[Image: multipitch example]

Note: The .wav files do not need to have pitch suffixes on them, as the folders do the work of distinguishing the filenames in the oto, but I've labeled the ones here for readability.

For a voicebank with four expressions (neutral, power, soft, and falsetto):
[Image: multiexpression example]

For a voicebank with those pitches and expressions together:
[Image: multipitch + multiexpression example]

This last one is also a good visual of how quickly these sorts of voicebanks grow, especially if each expression has a symmetrical pitch distribution: the subbank count is then the number of pitches multiplied by the number of expressions.
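To put a number on that growth, here's a sketch that lists one subfolder per expression and pitch pairing. The expressions are the four from the example above; the five pitches and the `expression_pitch` folder naming are assumptions I've made purely for illustration.

```python
from itertools import product

# Hypothetical values: a C Major spread and the four example expressions.
pitches = ["C3", "E3", "G3", "C4", "E4"]
expressions = ["neutral", "power", "soft", "falsetto"]


def subbank_folders(expressions, pitches):
    """One folder per expression/pitch pair, all kept one level deep."""
    return [f"{expression}_{pitch}"
            for expression, pitch in product(expressions, pitches)]


folders = subbank_folders(expressions, pitches)
print(len(folders))  # 20
print(folders[:3])   # ['neutral_C3', 'neutral_E3', 'neutral_G3']
```

Five pitches across four expressions already means recording the reclist twenty times, and a sixth pitch would add four more recordings rather than one.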

Alias Affixes

While the file names will be distinguished by belonging to a particular subfolder, we still need to distinguish the actual samples that will be pulled by the synth engine. This can be done by adding prefixes and suffixes to each alias within a subbank's oto.

What this means is that every alias will have an identical series of characters either before or after its contents in order to differentiate it from other samples with the same alias. Like with the folder names, this could theoretically be anything as long as UTAU can read it, but it's better to use more intuitive naming. Suffixes are used more often than prefixes for readability, but either would work the same.

Samples recorded at specific pitches are usually given a pitch suffix (or prefix) like [kaC4] or [ka_C4]. Underscores can be used to make it more visually obvious which part of the alias is the suffix, but aren't necessary. Arrows or +/- signs can also be used, as in [ka↑] or [ka+].

Samples recorded in specific expressions can be given some kind of shorthand for the expression name. Usually the first letter will suffice, as in [kaP] or [ka_P] for a power expression. Alternatively, they might be numbered, or use specific symbols.

Note: Remember that UTAU aliases are case sensitive; the suffix [_P] is different from the suffix [_p].

If combining these suffix styles in a bank which is both multipitch and multiexpression, especially one where each expression has the same pitches, you'll want to put the expression suffix first, for example [ka_P_C4].

Doing it this way means that so long as the prefix map is set up with the pitch suffixes, you'd only have to enter the expression suffix onto the note in the piano roll, rather than having to specify both suffixes on every note. In other words, if you wanted a particular note to use the power expression, you could simply attach [_P] and UTAU would select the right pitch within that expression automatically.

You may also choose to leave a particular pitch or expression unmarked. Doing this means that UTAU would use those samples by default whenever no suffix is specified in either the note name or the prefix map.

This might be easiest to explain with a chart:

	Pitch does not have affix	Pitch has affix
Expression does not have affix	Both pitch and expression are default and neither needs to be specified	Expression is default, pitch must be specified, e.g. [_C4]
Expression has affix	Pitch is default, expression must be specified, e.g. [_P]	Both pitch and expression must be specified, e.g. [_P_C4]
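Adding an affix to every alias by hand gets tedious, so here is a rough sketch of doing it to a single oto line. It assumes the usual oto layout of `filename=alias,` followed by the numeric parameters, and that a blank alias means the file name (minus its extension) is what gets used; the function name is my own. Oto editors often have a batch tool for this, so treat it as an illustration of what the edit does.

```python
def add_alias_affix(oto_line, prefix="", suffix=""):
    """Return the oto line with a prefix and/or suffix attached to its alias."""
    filename, _, params = oto_line.partition("=")
    alias, comma, rest = params.partition(",")
    if not alias:
        # Assumption: a blank alias falls back to the file name without ".wav".
        alias = filename.rsplit(".", 1)[0]
    return f"{filename}={prefix}{alias}{suffix}{comma}{rest}"


print(add_alias_affix("ka.wav=ka,50,100,-200,80,20", suffix="_P_C4"))
# ka.wav=ka_P_C4,50,100,-200,80,20
```

Only the alias changes; the file name and the timing values are passed through untouched.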

Pitch Mapping in Classic UTAU

Luckily for us, UTAU has a built-in way of handling multipitch voicebanks via the use of prefix maps. These are text files that assign specific prefixes or suffixes to specific pitches on the piano roll.

Telling the prefix map to associate a pitch with a particular affix means that every time a note is placed on the piano roll, UTAU will look for that note's lyric as if the specified affix were attached. For example, if we have set C₄ to be associated with the suffix [_C4], then we place a note on C₄ on the piano roll with the lyric [ka], UTAU will look for the alias [ka_C4]. Setting both a prefix and a suffix to a particular note means UTAU will look for an alias that also has both.
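As a mental model, the lookup can be pictured like this. It's a simplified sketch, not UTAU's actual code, and the dictionary and function name are invented for the example.

```python
def resolve_alias(lyric, note, prefix_map):
    """Build the alias that would be searched for, given a note's lyric."""
    # Notes missing from the map get no affix at all.
    prefix, suffix = prefix_map.get(note, ("", ""))
    return f"{prefix}{lyric}{suffix}"


# note -> (prefix, suffix)
prefix_map = {
    "C4": ("", "_C4"),
    "C#4": ("", "_C4"),
    "E4": ("", "_E4"),
}

print(resolve_alias("ka", "C4", prefix_map))    # ka_C4
print(resolve_alias("ka_P", "E4", prefix_map))  # ka_P_E4
print(resolve_alias("ka", "G5", prefix_map))    # ka
```

The second call shows why the expression suffix goes first: typing [ka_P] on the note is enough, and the pitch suffix from the map lands after it.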

To edit the prefix map, load the voicebank in UTAU, and navigate to Tools(T) > Edit prefix.map(E).... This will open up the Prefix Map Editor window.

Select all of the pitches you want to associate with a particular affix, type that affix into either the Prefix or Suffix box, and then hit Set. When you're finished, click OK. Leaving a note blank will call the unmarked default pitch as discussed in the previous section.
[Image: prefix map example]

Tip: If naturalness is your goal, you'll almost certainly want the recording pitch to be mapped to the same note pitch, and often to the notes surrounding it on either side, too. If you have notes which are evenly spaced between two recording pitches, choose whichever pitch you feel sounds better. Typically UTAU is better at pitching notes higher than they were recorded at compared to pitching them lower, so it can be better to err on the side of pitching up.

For monopitch multiexpression banks, if all of your samples have affixes on their aliases, you can simply hit Select All in the Prefix Map Editor and enter in whichever affix you want to be the default for every note.

For multipitch multiexpression banks, you can create multiple prefix maps. This can be useful if your expressions have different pitch distributions, or if you want to be able to change which expression is default, but it isn't necessary.

To do this, you can set up the first map in UTAU, save it, then create a copy of the prefix.map file in the voicebank folder. Change the name of the copy to something that will help you remember which map it is. Now, you're safe to overwrite it in the Prefix Map Editor with a map for a different expression, repeating this process for each new map.

To choose which map to load along with the voicebank, simply copy the contents of one of the maps you created into the main prefix.map text file. You can also use this to mix and match pitches from different expressions, such as for creating a powerscale / kire effect.
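If you swap maps often, the copy step can be scripted. This is only a sketch: the saved map name (`prefix_power.map` in the usage line) is hypothetical, so use whatever you named your copies.

```python
import shutil
from pathlib import Path


def activate_prefix_map(voicebank_dir, saved_map_name):
    """Overwrite prefix.map with the contents of a saved copy."""
    voicebank = Path(voicebank_dir)
    source = voicebank / saved_map_name
    if not source.is_file():
        raise FileNotFoundError(source)
    # The saved copy itself is left untouched for next time.
    shutil.copyfile(source, voicebank / "prefix.map")
```

For example, `activate_prefix_map("path/to/voicebank", "prefix_power.map")` would make the power map the active one. Make sure the map currently in prefix.map has its own saved copy first, since this overwrites it.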

In the case that you want to edit the prefix.map file in a text editor, each line is formatted as:

note	prefix	suffix

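The whitespace is important here: the three fields are separated by tabs, and a field you aren't using is simply left empty between its tabs. So, for example, to set a range of notes to have the suffix [_C4] and no prefix, you'd enter: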

D4		_C4
C#4		_C4
C4		_C4
B3		_C4
A#3		_C4
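Lines in this format can also be generated rather than typed out. The sketch below assigns each note to its nearest recorded pitch and, following the tip about pitching up, gives ties to the lower recording. The function names are mine, and it only writes the range you ask for, not a complete map for the whole piano roll.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_number(name):
    """Convert a name like 'C4' to a semitone number (C4 = 60)."""
    return 12 * (int(name[-1]) + 1) + NOTE_NAMES.index(name[:-1])


def note_name(number):
    """Convert a semitone number back to a name like 'C4'."""
    return NOTE_NAMES[number % 12] + str(number // 12 - 1)


def build_prefix_map(recorded, lowest, highest):
    """recorded maps each recorded note name to the suffix of its subbank."""
    candidates = sorted((note_number(n), s) for n, s in recorded.items())
    lines = []
    for n in range(note_number(highest), note_number(lowest) - 1, -1):
        # min() keeps the first of equally near candidates; the list is
        # sorted low to high, so a tie goes to the lower recording.
        _, suffix = min(candidates, key=lambda item: abs(item[0] - n))
        lines.append(f"{note_name(n)}\t\t{suffix}")  # empty prefix field
    return lines


print("\n".join(build_prefix_map({"C4": "_C4", "E4": "_E4"}, "A#3", "F4")))
```

With those two recorded pitches, D₄ sits exactly halfway between them and ends up on [_C4], matching the lines above.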

Pitch Mapping and Voice Colors in OpenUTAU

OpenUTAU also has a built-in way to handle multipitch, with the additional capability of setting up multiple pitch maps, or voice colors, within the software itself. These settings are saved within the voicebank's character.yaml file. For this reason, multiexpression banks are generally much easier to use in OpenUTAU, as voice color can be selected in the note settings compared to having to deal with attaching affixes directly on the lyric.
[Image: voice color settings example]

To set up pitch mapping and voice colors, navigate to Tools > Singers.... Open the voicebank you want to edit in the Singers menu, then click Edit Subbanks to open the Subbank menu.

This menu functions more-or-less the same as the Prefix Map Editor in UTAU; select the notes you want to assign a specific affix to, type that affix in the appropriate box, and then hit Set.
[Image: pitch mapping in the subbank menu]

Similar to prefix maps, leaving a note blank will call whatever the unmarked sample is, if one exists. Unlike prefix maps, however, we don't need to do anything to the voicebank files to set up different pitch mapping for different expressions, as that is exactly what voice colors are.

To create a new voice color, click on Add Color, enter the name of the expression, and click OK. Here, you can set what subbank affixes should be used for which pitches whenever this color is selected in the note settings. You can also mix and match pitches from different recorded expressions under different color names.

In the case that you want to edit the subbanks in the character.yaml directly, the format is like this:

subbanks:
- color: ""
  prefix: ""
  suffix: _C4
  tone_ranges:
  - A#3-D4
- color: ""
  prefix: ""
  suffix: _E4
  tone_ranges:
  - D#4-F4
- color: power
  prefix: ""
  suffix: _P_C4
  tone_ranges:
  - A#3-D4
- color: power
  prefix: ""
  suffix: _P_E4
  tone_ranges:
  - D#4-F4

Whitespace is important here. Empty parameters should have quotes "" in them, but non-empty parameters should not have quotes around their text. The default color will always be unnamed, therefore this parameter will be empty for it.
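For banks with a lot of colors and pitches, the block above can be generated instead of typed. This sketch writes plain text in the same layout and follows the quoting rule just described; the function names are my own, and it assumes one tone range per entry and affixes made of simple characters that don't need any further quoting.

```python
def quote_if_empty(value):
    """Empty parameters are written as "", everything else as-is."""
    return value if value else '""'


def subbank_yaml(subbanks):
    """subbanks is a list of (color, prefix, suffix, tone_range) tuples."""
    lines = ["subbanks:"]
    for color, prefix, suffix, tone_range in subbanks:
        lines.append(f"- color: {quote_if_empty(color)}")
        lines.append(f"  prefix: {quote_if_empty(prefix)}")
        lines.append(f"  suffix: {quote_if_empty(suffix)}")
        lines.append("  tone_ranges:")
        lines.append(f"  - {tone_range}")
    return "\n".join(lines)


print(subbank_yaml([
    ("", "", "_C4", "A#3-D4"),
    ("power", "", "_P_C4", "A#3-D4"),
]))
```

Paste the output over the existing subbanks section rather than the whole file, since character.yaml holds other settings too.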

Hey, you made it to the end of this tutorial! Go forth and make the complicated voicebanks of your dreams.

I'm putting this footnote here specifically to complain about having to spell colour without the "u" in order to be in line with how OpenUTAU calls it. Do you know how many times I had to correct myself when writing this? Too many...