☢ How to Write Reclists ☢

A comprehensive guide on writing reclists for UTAU and other DIY vocal synths. This resource assumes at least a basic knowledge of UTAU. If you need help with terminology or linguistic concepts, check the vocabulary page.

☆ So You Want to Write a Reclist... ☆

First, consider your goals:

  • Are you dissatisfied with existing reclists in the language and/or style you want?
  • Do you want to cover a language you haven't seen done yet? Multiple languages?
  • Do you want to customize an existing list, or build one from scratch?
  • Do you plan to release this reclist for public use, or make it just for yourself? If it's just for you, do you plan to use it for more than one voicebank?
  • Is your reclist intended for a specific vocal style (e.g. pop, metal, jazz, opera, etc.)?
  • Do you just wanna try out a wacky idea and see what happens?

Knowing the purpose of your reclist is pretty important, because it'll impact a lot of your decisions.

If your reclist is just for you, or even just for one voicebank, then it being the most optimized or well-designed thing in the world isn't super important. But if you intend for other people to use it (or use whatever voicebanks you make with it), you'll want to consider things like efficiency, timesink, ease of voicebank creation, ease of use within UTAU, and, of course, resulting voicebank quality. Balancing all of these things is key to a good reclist.

If the reclist structure is weak, any voicebank made off of it will be weak, too. It might be missing key sounds, have oversights in configuration, lack organization, be too big to be practical or too small to be useful, and so on. A strong foundation, by contrast, will consistently and effectively synthesize its intended language, allowing multiple developers to create good quality voicebanks that can be used for multiple songs.

Before we jump into it, it's important to keep in mind that for most of us (read: almost all of us), vocal synths are ultimately just a hobby. As such, there aren't really any hard rules for voicebank creation, and you're free to disregard any of my advice and do whatever you want. For people like me, optimization is part of the fun, but if my text walls are too much of a slog, remember that you don't have to treat this super seriously; the important thing is that you are having fun and being creative!

That being said, a lot of common conventions exist for a reason, so I'll do my best to outline here why things tend to be done a certain way, speaking as someone who has done a lot of reclist experimentation over the years and discussed it at length with other users.

☆ Considering Your Language ☆

The most important part of reclist writing is understanding the language you're working with on a technical level, so you'll want to familiarize yourself with the language's phonology. A good entry point is to search up "[language name] phonology" on Wikipedia, which will usually give you a rundown of the language's phonemes and phonotactics. I'd recommend doing further research as well, potentially even with academic sources, because there are sometimes multiple ways of parsing a language's phonology and you'll likely find different perspectives on it. Still, Wikipedia is a great starting point and good for quick reference, and for more widely spoken or culturally dominant languages, there are usually a lot of other free and easy to find linguistic resources out there.

Research tip: avoid sources that rely on spelling convention to explain phonetic features; these are not usually accurate. Look for ones that use actual phonetic transcription and linguistics terminology.

If you're a native speaker of the language, this is a double-edged sword: on the one hand, you'll likely have a lot of intuition about how the language is "supposed" to sound, but on the other hand, your perception may be clouded, whether because of spelling not matching up with the actual pronunciation, or just because no one is perfectly aware of everything their mouth is doing while speaking or singing.

If you're a non-native speaker, or if you don't speak the language at all, then you'll have to do a lot more research to understand the ins-and-outs, and you're a lot less likely to pick up on the nuances that make it sound natural and understandable. Either way, it's a good idea to get feedback from multiple speakers of the language you're working in.


Multilingual Reclists

I want to touch briefly on something that a lot of people are interested in creating: multilingual reclists. While these are definitely possible to make, I'd kind of advise against them in favour of reclists geared towards specific languages, and I'll explain why.

Consider this: French and English have a lot of phonemes in common, but in a lot of cases there are subtle differences in how they're produced. The French /l/, for example, is actually produced with the tongue in a different position than the English /l/, and the French schwa is more fronted than its English counterpart. Therefore, if you were creating a voicebank that could sing both English and French, your /l/ and /ə/ and other phonemes would sound less natural or more accented for one language or the other.

Basically, the more languages you try to incorporate, the more muddied your voicebank's accent becomes in whatever it's trying to sing — not to mention that trying to accommodate the phonological rules of multiple languages can make the reclist get pretty big and unruly.

That's not to say that accented voicebanks are bad; if you're a non-native speaker, you're probably going to have a slight accent no matter what, but it's still good to make an effort to match the target language, right? My point here is that a multilingual reclist is either going to be heavily accented towards one of the intended languages, or have a weird mix of pronunciation that won't sound quite right in any of them.

In my opinion, limited multilingual compatibility is a much better option. What I mean by that is a voicebank which is optimized for one primary language, so all phonemes are produced as is natural for that language, but with extra phonemes added to support limited singing in other languages. For example, I often add English consonants to my Japanese voicebanks, so while the voicebank is optimized for Japanese and will consistently sing well in that language, it can also sing decently in English — just not nearly as well as a dedicated English bank. This can be useful for Japanese songs which have English words and phrases in them.

TL;DR: I feel that a voicebank which sings very well in one language is better than a voicebank which sings poorly or just okay in more than one, but you're free to disagree with me on this.


Transcription

As should be obvious if you've gotten this far, spelling =/= pronunciation. Languages with writing systems tend to have spelling rules which generally correspond to pronunciation, but there are almost always exceptions, and these rules vary from language to language. While it's possible to come up with a custom transcription system based on your language's orthography, if you want to minimize pronunciation ambiguity and allow your reclist to be accessible to a wider audience, it's usually better to use a more universal script like X-SAMPA. While on the surface it may seem less intuitive, you can typically get a lot more mileage out of it than you would a custom system, with the added benefit of users potentially already knowing it. In short, there's no need to reinvent the wheel.

X-SAMPA is based on IPA, so it is very flexible and can be used for many languages. You may need to modify it slightly to avoid characters that can't be used in file names or UTAU aliases, like [?] and [\], but it's a really good base.
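As a rough illustration of the kind of check this implies, the sketch below flags characters in an X-SAMPA symbol that Windows does not allow in file names. The function name `unsafe_filename_chars` is made up for this example, and it only covers file names (not any extra restrictions UTAU aliases may have); what you substitute for a flagged character is up to you.

```python
# Characters Windows forbids in file names.
WINDOWS_FORBIDDEN = set('<>:"/\\|?*')

def unsafe_filename_chars(symbol):
    """Return the characters in `symbol` that cannot appear in a Windows file name."""
    return sorted(set(symbol) & WINDOWS_FORBIDDEN)

if __name__ == "__main__":
    # A few X-SAMPA symbols; "r\\" is the two-character symbol r\ written as a Python string.
    for symbol in ["?", "r\\", "S", "@", "a:"]:
        print(repr(symbol), unsafe_filename_chars(symbol))
```

Running a whole phone list through a check like this before recording makes it easier to settle on replacements once, rather than discovering a bad file name halfway through a session.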

That being said, if a language has a standardized and more-or-less phonemic script, it can sometimes be better to base your list off of that. Spanish, for example, has very few irregularities, and scripts like romaja or pinyin will likely be more intuitive for users who are looking to work in Korean or Mandarin, respectively. However, a lot of languages (e.g. English, French, Russian, and many more) don't have this kind of standard, so it's better to use X-SAMPA. ARPAbet also exists for English, but it has a lot of limitations, and workarounds to these run into the same problem of inconsistency that comes with custom scripts (and I personally just dislike it).

For Japanese reclists, either kana or romaji is fine for the list itself, but I'd recommend using kana for your aliases when possible. It's not the most practical, but because UTAU is Japanese software made with Japanese voicebanks in mind, most Japanese USTs and voicebanks are kana-based and it actually ends up being more user-friendly in the long run. If you're on Windows, you can install a Japanese Input Method Editor (IME) to type in it. If you don't know how to read kana and want to work in Japanese, I'd suggest learning at least hiragana to make your life easier, but romaji ↔ hiragana conversion plugins exist, so it's not a big deal if you don't.


Selecting Your Phones

First, you'll want to include all of the commonly recognized phonemes of the language. These can vary from dialect to dialect, but there's usually enough overlap to figure out which ones you definitely need, and you can spend some time deliberating on the others depending on how common they are or how frequently they show up in singing. You may even decide to optimize your reclist for a particular dialect, in which case you only need to focus on what's relevant to that one.

When it comes to allophones, things can get a bit trickier. If an allophone is phonetically identical to a phoneme, like /s/ becoming [ʃ] in a language where /ʃ/ is also a phoneme, then it is already accounted for.

For allophones that are phonetically distinct, it depends on how your voicebank samples are broken down. If an English reclist, for example, relies on blending CCs and CVs to form onset clusters, it can be useful to distinguish between aspirated and unaspirated /k/ in your CV samples, e.g. [sk][ki] for "ski" and [k'i] for "key", but if the English reclist relies only on CCVs for onset clusters, the /k/s in "ski" and "key" no longer occur in the same environment within the voicebank configuration, so we can leave them as [ski] and [ki]. Allophones that are completely distinct from other phones in the reclist, like tapped /t, d/ [ɾ] in English, can simply be included as additional consonants or vowels.

All this being said, it's kind of impossible to cover every possible variation of every phoneme, and I would generally advise against trying to do that, since it will make your reclists less adaptable to different voices and your voicebanks harder to use. Instead, as with phoneme selection, focus on what is most common across speakers and/or most frequently occurring.


Phonotactics

This is less of a second step and more of something you'll do alongside phone selection: noting the environments where each sound can occur.

Consider:

  • Which consonants can occur in the onset? Which ones in the coda? Does your language have codas at all?
  • Are there any rules for determining which consonants can occur with which nuclei in a syllable?
  • Does your language have consonant clusters? What are the rules for building them? Are there any clusters which follow the rules but aren't found in any known words?
  • Does your language have diphthongs? Rhotic vowels? Syllabic consonants? Any other nucleus variants? If so, are they phonemic or phonetic?
  • How does your language handle two vowels touching? Two consonants?
  • Are there any allophones which occur only in initial position? Final position? Medial position? Only in clusters?
  • What are the other rules for determining which allophones occur where?

You will want to understand all of these relationships when moving on to the next step, because so much of the way a phoneme is realized depends on the environment it occurs in.
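One way to keep track of your answers to these questions is to write them down as data and let a script enumerate what's allowed. The sketch below uses a made-up toy language, and every name in it (`ONSETS`, `NUCLEI`, `CODAS`, `BANNED_CV`, `legal_cvs`, `legal_vcs`) is hypothetical; real phonotactics will usually need more rules than a single ban list.

```python
# Toy inventory for an invented language; swap in your own research.
ONSETS = ["p", "t", "k", "s", "j", "w"]
NUCLEI = ["a", "i", "u"]
CODAS = ["n", "s"]                     # an empty list would mean "no codas at all"
BANNED_CV = {("j", "i"), ("w", "u")}   # onset + nucleus pairs assumed never to occur

def legal_cvs():
    """Every onset + nucleus pair that the ban list allows."""
    return [c + v for c in ONSETS for v in NUCLEI if (c, v) not in BANNED_CV]

def legal_vcs():
    """Every nucleus + coda pair (no restrictions assumed in this toy example)."""
    return [v + c for v in NUCLEI for c in CODAS]

if __name__ == "__main__":
    print(len(legal_cvs()), "CVs:", " ".join(legal_cvs()))
    print(len(legal_vcs()), "VCs:", " ".join(legal_vcs()))
```

Having the legal combinations as a list makes it much harder to forget a sample later, since you can compare the finished reclist against it.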

☆ Reclist Structures ☆

By this point you might have thought to yourself something along the lines of, "I'm going to make this a CV reclist, because they're the easiest!" or "I'm going to make this a VCV reclist, because they're the most natural sounding!" or even "I'm going to make a VCCV reclist, because it worked for English, right?"

I want you to throw this line of thinking into the garbage. It's totally backwards! I would know, I used to fall into it myself.

It is far more useful to approach reclist writing from the perspective of what sample types you want to include, rather than any overarching structure. You are not here to cram a language into an existing template regardless of how well it fits; you are here to build a structure that will best support your language.

But before you jump to the opposite extreme of "I'm going to invent a totally new reclist type that's unlike anything else out there!", I'm afraid I'm going to have to shatter your dreams again.

The boring reality is this: a well-structured reclist in almost any language is most likely going to end up as some form of CVVC. If it's not obvious why, allow me to explain.

A CV reclist is named as such because it contains mostly CV samples. It has no codas, and no transitional samples. Likewise, a VCV reclist is named as such because it contains mostly VCV samples. It also has no codas, and the transitions between notes are captured wholly within the triphone.

Following suit, CVVC reclists are reclists which have both CVs and VCs. These VCs can be either transitional or final, but their presence makes it a CVVC reclist because those are the primary sample types being used. By the sheer nature of a language having codas, it will almost certainly need a CVVC reclist in order to be synthesized effectively. Though these reclists are sometimes labeled CV-VC or CVC or pseudo-VCV, these are all terms for the same idea. Even VCCV and ARPAsing are just CVVC reclists designed specifically for English (and I have my critiques of both of them).

It's not that I think coming up with names for specific methodologies is useless, just that I think it's best to label things straightforwardly when we're talking about language this abstractly; it helps alleviate a lot of the confusion and misconceptions.


Sample Types

Now that we've established why we're approaching our reclist from a sample-out rather than a structure-in perspective, let's go over all of the different types of samples we can include.

First, here are some terms I'm going to be using a lot:

  • Initial — occurring at the beginning of a phrase; preceded by silence.
  • Medial — occurring in the middle of a phrase; preceded and/or followed by another utterance.
  • Transitional — exists to capture the transition from one sample into another.
  • Final — occurring at the end of a phrase; followed by silence.
  • Isolate — occurring outside of a phrasal context; both preceded and followed by silence.
  • Blend — a diphone consisting of two vowels or consonants.
  • Sustained — a phone (typically a vowel) held for multiple beats.

Next, let's look at the five main sample types:

  • CVs can be either initial (+ sometimes isolate) or medial. We can consider CCVs to be a type of CV because they function the same in the software.
  • VCs can be final (+ sometimes isolate), medial, or transitional. Like the above, we can consider VCCs to be a type of VC.
    • I make a distinction here between medial and transitional because they function differently. A medial VC is configured and functions the same as a final VC, capturing the entire consonant while being sampled mid-phrase, while a transitional VC serves as a blending point from a previous vowel into a following consonant.
  • Vowels, or more broadly nuclei, can be initial, blended (also called VVs), or sustained. In a VCV or CVVC voicebank, the vowel ending is often also sampled and treated similarly to a VC.
  • Consonants can be initial, final, medial, or blended (also called CCs), and some of them can be sustained.
  • VCVs are always both medial and transitional, so we don't really need to distinguish different types. They can technically be isolates, but this is generally a bad idea to do for the entirety of a reclist for reasons stated below.

CV versus CVVC

For reasons mentioned earlier, CV reclists only work for a small number of languages. It might seem appealing to try and cram any language into this format due to its small size and relatively simple configuration, but it's almost never going to give you good results — not just on the level of sounding choppier due to the lack of transitional samples, but fundamentally lacking in a good way to handle codas, which are necessary for a lot of languages.

While you can technically finagle a CV into serving as a coda by clipping off the vowel, it's going to sound pretty unnatural, so I wouldn't really recommend it. Consonant clusters are especially difficult to synthesize this way. CVVC reclists, though about twice the oto length of a CV, are still pretty small, and while they take a little more effort to configure, you'll save yourself a lot of irritation when it comes to actually using the voicebank.

For a language that doesn't need codas, one solution to overall choppiness is the rentan technique, i.e. recording CVs in a continuous string with a leading vowel so that they are sampled medially rather than initially. When you think about it, most of the notes you're going to be synthesizing are mid-phrase, so it makes sense to prioritize those in the voicebank. But if you're going to be recording in continuous strings anyways, I'd say you might as well sample the transitional VCs; CVVC voicebanks can still be used as CV, after all.


The Problem with VCV

AKA why they are not always the best choice, and why CVVC still is often better.

While it's true that VCV samples tend to sound the most naturalistic due to preserving the entire transition from note to note, they also come with two major drawbacks: size and lack of flexibility. A complete VCV needs to provide a sample for every possible combination of V + CV that can occur in a language. Due to phonotactics and lexical gaps, this might not encompass literally every hypothetical triphone, but it's likely going to include most of them. As such, these voicebanks can get real big real fast, often impractically so, and not leave much room for samples which are arguably more useful to include.

Consider a language with 5 vowels. In order to include every possible vowel blend, the number of samples is equivalent to the number of phonemes times itself plus one (to account for the initial sample): 5 * (5 + 1) = 5 * 6 = 30. That's a minimum of 30 samples for every consonant, plus another 30 for vowel blends.

This number grows quadratically. By 7 vowels, we're at over 50 samples per consonant (56). By 10 vowels, we're at over 100 samples per consonant (110). English, which has a minimum of 15 vowels, is at 240 samples per consonant. To give you an idea of the entire reclist size, say we have 25 possible onsets plus a set of vowel blends; that's at least 780 samples for 5 vowels, 1456 samples for 7 vowels, 2860 samples for 10 vowels, and 6240 samples for 15 vowels. That's way too freakin' big to be practical! And this is not even taking things like consonant clusters and codas into consideration.
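This arithmetic is easy to check in a few lines. The sketch below only encodes the counting rule described here, n × (n + 1) samples per onset plus one more set for plain vowel blends; the function names are invented for the example, and like the estimate itself it ignores phonotactic gaps, clusters, and codas.

```python
def vcv_samples_per_onset(vowels):
    """Samples for one onset: one per preceding vowel, plus the initial sample, per vowel."""
    return vowels * (vowels + 1)

def vcv_reclist_size(vowels, onsets):
    """Total for `onsets` onsets plus one extra set for plain vowel blends."""
    return (onsets + 1) * vcv_samples_per_onset(vowels)

if __name__ == "__main__":
    for n in (5, 7, 10, 15):
        print(n, "vowels:", vcv_samples_per_onset(n), "per onset,",
              vcv_reclist_size(n, onsets=25), "total with 25 onsets")
```

Plugging in your own vowel and onset counts gives a quick upper bound before you commit to a full VCV list.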

Japanese has a small phoneme inventory and some simple yet restrictive phonotactics, resulting in fewer than 1000 samples for a typical VCV voicebank. This is considerably larger than CV or CVVC, but still a fairly manageable size. Languages with similarly small inventories and straightforward phonotactics, like Hawaiian and even Spanish, also lend themselves decently to VCV, but the reality is that a lot of languages simply don't.

The main appeal here is resulting voicebank quality, which comes from capturing the natural transitions, as mentioned, and also by the variation that comes from having duplicate CVs for each VCV environment. But, frankly, the quality difference between a VCV and a well-configured CVVC is going to be minimal, if not non-existent, and that same kind of naturalistic variance can be captured through multipitch or multiexpression — both of which I find more useful than duplicates of the exact same pitch and tone. Not to mention that all the time and space saved by going with CVVC over VCV can be put towards the inclusion of other more beneficial sample types, such as consonant blends for codas and clusters, and VC samples for more control over consonant timing than VCV allows.

All this being said, VCV voicebanks can definitely be easier to use compared to CVVC, and are typically easier to configure, so I'm not trying to argue that they're never worth it, just that they're not the end-all-be-all of quality reclists or voicebanks.

☆ Strings ☆

Once you have a solid idea of the phonemes and sample types you want to include, the next step is organizing them into strings, i.e. what will be included in each .wav file, and then organizing those strings into the reclist itself. This is likely going to take some trial and error, especially if you're dealing with a rather complex language.

Also keep in mind that each user is going to have different priorities and preferences for what should be included and how, and you are never going to please everybody. Instead, remember the original goals you set out for this reclist, and try to match them as best you can. If you are writing a general purpose reclist with the intention of other people using it, be open to the idea that they are going to alter whatever you come up with to suit their own wants and needs, and that's okay. Nonetheless, you can still attempt to provide a solid foundation that offers both thoughtful structure and user flexibility, or just say "fuck it" and write it however works best for you.

In this section, I'm going to outline common conventions, give some general tips, and explain how I came to have some of my own preferences.


String Length

This is one of the biggest considerations, as it will affect the rest of your reclist's structure. The shortest possible string length is, of course, one syllable *. As for maximum length, you could try to cram every target sample into a single string, and there are some gimmick reclists that do this, but I think it should be pretty obvious why this isn't practical for standard use. I don't have a scientific source on this, but from my observations, most amateur vocalists seem to be able to comfortably hold a note for about 5000 milliseconds, which is 10 beats at 120 BPM or a bit over 8 beats at 100 BPM. As such, I try not to write strings longer than 8 syllables, and definitely no longer than 10.

* I'm going to use the word "syllable" here for consistency's sake, but know that this also applies to morae.
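The beat math behind that limit can be sketched as follows. The 5000 ms figure is the rough observation from above, not a measured constant, and the function names are invented for this example.

```python
def beats_in_hold(hold_ms, bpm):
    """How many beats fit into a note held for `hold_ms` milliseconds."""
    ms_per_beat = 60_000 / bpm
    return hold_ms / ms_per_beat

def max_syllables(hold_ms, bpm, beats_per_syllable=1):
    """Whole syllables that fit: one beat each, or two each when recording in half-time."""
    return int(beats_in_hold(hold_ms, bpm) // beats_per_syllable)

if __name__ == "__main__":
    for bpm in (100, 120):
        print(bpm, "BPM:", round(beats_in_hold(5000, bpm), 2), "beats,",
              max_syllables(5000, bpm), "syllables,",
              max_syllables(5000, bpm, beats_per_syllable=2), "in half-time")
```

If you already know the tempo you'll record at, this gives a quick sanity check on whether a planned string length is realistic for one breath.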

At a glance:

Short Strings (1~5 syllables)

Pros:
  • Typically easier to record
  • Allows for recording in half-time (2 beats per syllable)
  • Clearer articulation
  • Better for harsh / strenuous vocals
  • Easier to keep pitch / tone consistent within a string itself

Cons:
  • Longer reclist overall
  • Less naturalistic utterances, esp. at fast tempos
  • Less efficient
  • Redundancy often unavoidable

Long Strings (6~10 syllables)

Pros:
  • Shorter reclist overall
  • Usually most efficient
  • Little to no redundancy
  • More naturalistic utterances
  • Easier to keep pitch / tone consistent from string-to-string

Cons:
  • Can be more difficult to record, esp. for amateurs
  • Articulation can be muddier
  • Usually too long for half-time

By efficiency, we mean hitting all desired samples as optimally as possible while keeping strings reasonably sized and not too difficult to pronounce. By redundancy, we mean unnecessary duplicates in the list overall; why waste time recording a separate sample when you can pull the same sequence from an existing string?

Some amount of redundancy is likely going to happen, like how all the [V ん] sequences in Japanese VCV that don't actually get oto'd are there because we still need [n CV]s, and it's more efficient to tack these samples onto the end of existing strings. Additionally, sometimes what appears to be redundancy is there for a reason, like overall reclist organization or something to do with the phonological environment. We also don't usually want strings of wildly different lengths one right after the other; keeping them consistent makes for a smoother recording session since you can get into the rhythm of it. As usual, it's a balancing act.


Isolates and Isolate Strings

Isolates are usually found in CV reclists, but in my own reclist development adventures I've found they also work great as an add-on for continuous CVVC. By their nature, they are both followed and preceded by silence, so any initial or final sample can be pulled from them. Of course, initial and final samples can also be pulled from longer strings, but depending on the nature of the reclist, isolates may actually be more optimal.

They can be handled one of two ways: completely separated into their own .wav files (like a typical CV bank), or placed in a string with at least one beat of silence between each syllable (usually more efficient to record, but not as easy to configure). An isolate string will look something like CV_CV_CV_CV or CVC_CVC_CVC_CVC.
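Writing isolate strings by hand gets tedious, so here is a minimal sketch of generating them. The function names and the four-per-string default are assumptions for the example; the underscore stands for one beat of silence, as in the patterns above.

```python
def isolate_string(syllables, rest="_"):
    """Join syllables with a rest marker; each marker is one beat of silence."""
    return rest.join(syllables)

def isolate_strings(syllables, per_string=4, rest="_"):
    """Split a syllable list into isolate strings of at most `per_string` syllables."""
    return [
        isolate_string(syllables[i:i + per_string], rest)
        for i in range(0, len(syllables), per_string)
    ]

if __name__ == "__main__":
    for line in isolate_strings(["ka", "ki", "ku", "ke", "ko", "kak", "kik", "kuk"]):
        print(line)
```

Setting `per_string=1` gives the fully separated layout, with one syllable per .wav file.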


Vowel Blends and VCV Strings

While technically you could record every vowel blend and triphone in its own .wav, it's not really recommended because it would be super inefficient. Instead, there's a pretty easy method of writing VCV strings that's guaranteed to hit all the necessary samples, as well as initial CVs and final VCs.

  1. Consider all the vowels (or nuclei) you need in your reclist.
  2. Create a separate string starting with each vowel.
  3. Follow an A-B-A-C-A... pattern on each string, where A is the starting vowel of the current line, B is the starting vowel of the next line, C is the starting vowel of the line after that, and so on.
  4. Continue until the string length is equivalent to the total number of vowels, e.g. 5 vowels = 5 syllables.
  5. Double up the vowel at the beginning, resulting in a final string length of the total number of vowels plus one, e.g. 5 vowels = 6 syllables.
  6. Repeat for every other line. When you hit the last vowel in the sequence, wrap back around to the beginning.
  7. If lines are longer than desired, split into as even segments as possible, making sure to include the last vowel of the split string at the beginning of the new string, e.g. a-i-aa-i / i-a

Now we've got a nice set of VV strings that we can also pull initial V, sustained V, and V ends from, and we simply duplicate this pattern as-needed for every onset. If we're just working with [a, e, i, o, u] and there's no restrictions on which ones can occur in sequence, the end result is this:

a-a-e-a-i-a
e-e-i-e-o-e
i-i-o-i-u-i
o-o-u-o-a-o
u-u-a-u-e-u

ka-ka-ke-ka-ki-ka
ke-ke-ki-ke-ko-ke
ki-ki-ko-ki-ku-ki
ko-ko-ku-ko-ka-ko
ku-ku-ka-ku-ke-ku
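
The strings above can also be generated rather than typed out. This is a minimal sketch of the A-B-A-C-A pattern with the doubled first vowel, assuming no restrictions on which vowels can occur in sequence; `vv_line` and `vcv_strings` are names invented for the example.

```python
def vv_line(vowels, start):
    """Build one line: A-B-A-C-A... for the vowel at index `start`, with A doubled up front."""
    n = len(vowels)
    a = vowels[start]
    # Even positions hold A; odd positions walk through the vowels that start the next lines.
    pattern = [
        a if i % 2 == 0 else vowels[(start + (i + 1) // 2) % n]
        for i in range(n)
    ]
    return [a] + pattern

def vcv_strings(vowels, onset=""):
    """One string per vowel; pass an onset to turn the VV set into a VCV set."""
    return ["-".join(onset + v for v in vv_line(vowels, s)) for s in range(len(vowels))]

if __name__ == "__main__":
    for onset in ("", "k"):
        print("\n".join(vcv_strings(list("aeiou"), onset)))
        print()
```

Because each line wraps around to the start of the vowel list, every ordered pair of vowels ends up in exactly one line, which is what makes the pattern complete without being bloated.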

If we're working with a vowel inventory like [@, a, e, i, o, u], and we want to also include final VCs, but we want to keep our strings capped out at four syllables, we may get something like this:

k@-k@-ka-k@k
k@-ke-k@-kik
ka-ka-ke-kak
ka-ki-ka-kok
ke-ke-ki-kek
ke-ko-ke-kuk
ki-ki-ko-kik
ki-ku-ki-k@k
ko-ko-ku-kok
ko-k@-ko-kak
ku-ku-k@-kuk
ku-ka-ku-kek
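
A capped set like this one can be sketched in code as well. The splitting rule below is one interpretation of step 7 (near-even segments, with each new segment repeating the last vowel of the previous one), and all function names are invented for the example.

```python
from math import ceil

def vv_line(vowels, start):
    """A-A-B-A-C-A... line for the vowel at index `start`."""
    n = len(vowels)
    a = vowels[start]
    return [a] + [a if i % 2 == 0 else vowels[(start + (i + 1) // 2) % n] for i in range(n)]

def split_overlapping(items, cap):
    """Split into near-even chunks of at most `cap` items (cap >= 2).

    Each new chunk starts with the last item of the previous chunk,
    so no transition between neighbours is lost.
    """
    transitions = len(items) - 1
    chunks = ceil(transitions / (cap - 1))
    base, extra = divmod(transitions, chunks)
    out, pos = [], 0
    for k in range(chunks):
        size = base + (1 if k < extra else 0)
        out.append(items[pos:pos + size + 1])
        pos += size
    return out

def capped_vcv_strings(vowels, onset, cap, coda=None):
    """VCV strings capped at `cap` syllables, optionally closing each with a final VC."""
    strings = []
    for s in range(len(vowels)):
        for chunk in split_overlapping(vv_line(vowels, s), cap):
            syllables = [onset + v for v in chunk]
            if coda is not None:
                syllables[-1] += coda   # close the last syllable to get a final VC
            strings.append("-".join(syllables))
    return strings

if __name__ == "__main__":
    print("\n".join(capped_vcv_strings(["@", "a", "e", "i", "o", "u"], "k", cap=4, coda="k")))
```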

Splitting VCV strings is going to lead to some redundancy. We could technically omit the first and final consonants of the split lines, but I find it's actually easier to record this way since the consistent pattern means you don't have to take that extra moment to process the different rhythm of each line. It doesn't add onto recording time, either, since we're going to be recording those syllables anyways; we simply don't include the duplicates in the oto.

I also want to mention here that it's totally fine to include VCVs for some onsets but not all; you'll sometimes see this done with semivowels, i.e. /j/ and /w/, and glottal consonants like the stop /ʔ/ and the fricative /h/. And, if you're doing anything beyond a basic CV reclist, you'll probably want to include vowel blends regardless of how the rest of the list is set up, and this is the most efficient way of doing so.


CVVC Method 1: Continuous / Rentan

This is my preferred way of handling CVVC because the samples tend to sound more natural, and the reclists are much shorter compared to the alternative method. The main drawback is that initial CVs and final VCs are not accounted for, but as mentioned earlier, these samples are going to be used with way less frequency, and either initial / final Cs can be included to make up for them, or initial CV / CVC strings can be recorded separately (and still make for a quicker recording sess).

These can be organized either by consonant or by vowel. Japanese I find easier to organize by consonant due to the small number of vowels, but for English and other languages with more complex vowel systems, I find organizing these strings by vowel helps a lot with vowel consistency, which in turn makes for smoother blending. Either way, we want to make sure that we're hitting every possible CV and VC.

For consonant-based strings, that's going to follow a pattern like A-A-B-C-D-E... or A-B-C-D-E...A. Like VCV, we want to account for the starting vowel's transitional VC, so we either double up the CV at the beginning, or tack on a duplicate CV at the end. The order of the CVs in these strings doesn't really matter since all of them will be sampled separately. Example for a hypothetical language:

pa-pe-pi-po-pu-pa
ba-be-bi-bo-bu-ba
ta-te-ti-to-tu-ta
da-de-di-do-du-da
etc.

Do note also that we should be sampling the second instance of the duplicate CV in our oto, in this case the one at the end of the string, since we want to capture how it sounds following another vowel.

For vowel-based strings, we simply organize every CV pair with the same vowel into strings of our desired length. Since we're not organizing these by consonant, and it won't affect the string pattern, we don't need to include an initial C in these strings. In fact, leaving the vowel open is probably better for organization. Example for the same hypothetical language:

a-pa-ba-ta-da-ka-ga
a-fa-va-sa-za-xa-ha
a-ma-na-ja-wa-la-ra

e-pe-be-te-de-ke-ge
e-fe-ve-se-ze-xe-he
e-me-ne-je-we-le-re

etc.
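
Both continuous layouts are simple enough to generate. The sketch below assumes every consonant can combine with every vowel, which real phonotactics may not allow; the function names and the six-per-string default are invented for the example.

```python
def consonant_based(consonants, vowels):
    """One string per consonant: every CV, then the first CV repeated at the end."""
    strings = []
    for c in consonants:
        cvs = [c + v for v in vowels]
        strings.append("-".join(cvs + cvs[:1]))
    return strings

def vowel_based(consonants, vowels, per_string=6):
    """Strings grouped by vowel: an open leading vowel, then up to `per_string` CVs."""
    strings = []
    for v in vowels:
        for i in range(0, len(consonants), per_string):
            group = consonants[i:i + per_string]
            strings.append("-".join([v] + [c + v for c in group]))
    return strings

if __name__ == "__main__":
    consonants = list("pbtdkgfvszxhmnjwlr")
    vowels = list("aeiou")
    print("\n".join(consonant_based(consonants[:4], vowels)))
    print()
    print("\n".join(vowel_based(consonants, vowels[:2])))
```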

CVVC Method 2: Segmented / 2-mora / VCCV

This method is less efficient, but is arguably easier to record and has initial CVs and final VCs already baked into it. Basically, just take every CV needed in the reclist, and double it into a two-syllable string. Bam. You're done. If you need a final VC, add one on the end. Example:

ka-kak
ke-kek
ki-kik
ko-kok
ku-kuk

PaintedCZ's VCCV reclist follows the same idea, though also includes medial VCs followed by a dummy consonant (one that is uttered but won't be included in the oto), like so:

ka-kak-tak
ke-kek-tek
ki-kik-tik
ko-kok-tok
ku-kuk-tuk
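
Both variants of the segmented pattern can be sketched with one small function. `segmented_string` and its parameters are invented for this example and are not taken from any existing reclist tool.

```python
def segmented_string(onset, vowel, coda=None, dummy=None):
    """Build a segmented string for one CV.

    coda:  consonant that closes the second syllable, giving a final VC
    dummy: onset of an extra syllable that is sung but left out of the oto
    """
    cv = onset + vowel
    closed = cv + (coda or "")
    parts = [cv, closed]
    if dummy is not None:
        parts.append(dummy + vowel + (coda or ""))
    return "-".join(parts)

if __name__ == "__main__":
    for v in "aeiou":
        print(segmented_string("k", v, coda="k"),
              segmented_string("k", v, coda="k", dummy="t"))
```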

In my opinion, though, these medial VCs aren't super useful and just kind of bloat the reclist. They're not terribly different from the final VCs, and their purpose is better served by the inclusion of consonant blends, which allow for more precise timing and smoother transitions. Speaking of...


Single Consonants and Consonant Blends

Unlike vowels, most consonants can't really be produced in isolation — at least, not in the same way they'd sound in a syllable. As such, even if it's going to be cropped out in the oto, the syllable being sampled from still needs a dummy vowel. The vowel used in the reclist can be whatever feels logical. The schwa, [@], or any other unrounded central vowel is usually a good choice if the language has one, since it's less likely to affect the consonant.

Including initial and final consonants, if they're not being sampled elsewhere, is as easy as including isolate strings like f@f_v@v_s@s_z@z.

As for consonant blends, I've found the easiest way to organize them is by coda. This is less likely to result in performance errors compared to organizing them by onset, likely due to how we process rhyme. Essentially, you want to start with the VC of the dummy vowel and a given consonant, and follow it up with a series of CVCs starting with every other consonant. Example:

@f-p@f-b@f-t@f-d@f-k@f-g@f
@f-v@f-s@f-z@f-x@f-h@f
@f-m@f-n@f-j@f-w@f-l@f-r@f

We do not usually need a sustained consonant, so the double can be omitted, but you can include it if you feel it's useful. Plosive + plosive CCs can also be omitted because plosives treat the break in airflow as the blending point, meaning that a plosive + plosive CC sample would just result in silence or amplified noise.

Additionally, you might consider splitting off any sibilant + sibilant CCs into isolated strings like @s-z@__@z-s@ since they can be particularly tricky to pronounce, sacrificing a little bit of efficiency for significant ease-of-use improvements.

Lastly, a lot of the time, a language will have consonants which are allowed to serve as onsets but not as codas (and less frequently, the opposite), so keep that in mind when writing your CC strings.

Note on splitting consonant blends: We could technically write a string like @C-C@_@C-C@_@C-C@ without going over eight beats, but since these splits are two-syllables a piece, if they're going to be strung, I find it easier to time them with two beats of silence rather than just one, hence the use of two underscores.
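
To make the coda-based organization in this section a bit more concrete, here's a hedged Python sketch that builds the blend strings for one coda, skipping the doubled consonant and any plosive + plosive pair. The consonant rows, the plosive set, and the function name are assumptions for the hypothetical language. It doesn't handle the sibilant splits or onset-only consonants mentioned above, so treat it as a starting point.

```python
# Hypothetical inventory; adjust to your language's phonotactics.
PLOSIVES = {"p", "b", "t", "d", "k", "g"}
ONSET_ROWS = [
    ["p", "b", "t", "d", "k", "g"],
    ["f", "v", "s", "z", "x", "h"],
    ["m", "n", "j", "w", "l", "r"],
]


def blend_strings(coda, onset_rows, dummy_vowel="@", plosives=PLOSIVES):
    """Return coda-organized blend strings such as '@f-p@f-b@f-...'."""
    vc = dummy_vowel + coda
    strings = []
    for row in onset_rows:
        syllables = []
        for onset in row:
            if onset == coda:
                continue  # skip the doubled (sustained) consonant
            if coda in plosives and onset in plosives:
                continue  # skip plosive + plosive blends
            syllables.append(onset + vc)
        if syllables:  # drop rows where every blend was skipped
            strings.append("-".join([vc] + syllables))
    return strings


for line in blend_strings("f", ONSET_ROWS):
    print(line)
```

Running it for a plosive coda like "p" drops the whole plosive row, which is the omission described earlier.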


Consonant Clusters

How these are handled is going to depend pretty heavily on the language's phonotactics, more so than most other sample types, because the rules for cluster formation, and even how clusters are analyzed in a language, vary a lot.

If there aren't a lot of them, and they occur pretty frequently, it might be best to handle them as CCV and VCC samples, especially in the case of semivowel onset clusters, i.e. C + [j]/[y] and C + [w]. You'll see this approach a lot in Japanese and Spanish reclists. The string type that works best for them is going to depend on the rest of the reclist.

However, CCVs and VCCs aren't always the best approach, especially if a language has a lot of consonant clusters, like English. CCVs and VCCs in this case can add potentially hundreds more strings to the reclist and even more lines to the oto, a lot of which are almost never going to actually be used. As such, it's often more productive to break them down into cluster blends that can be blended with medial CVs and transitional VCs, if not just fully construct them phone-by-phone out of the regular consonant blends (think ARPAsing).

We can include isolate strings for initial and final clusters like CC@_CC@_CC@_CC@ and @CC_@CC_@CC_@CC, and continuous strings for medial clusters like @-CC@-CC@-CC@... for onsets and @CC-C@CC-C@CC-C@... for 3+ coda blends. I find that treating the final consonant of the coda cluster as the onset of the following syllable is the easiest to read while recording, but they could just as easily be written like @CCC-@CCC-@CCC-@.... We don't need medial blends for codas with only two phonemes if we're also including normal consonant blends, since they'll be functionally identical. I will say, though, that depending on how many and how complicated the 3+ codas are, you may want to split them into isolated strings like @CC-C@__@CC-C@ to lessen performance errors.

Alternatively, we can take the 2-mora approach for clusters altogether, and handle them in strings like CC@-CC@ and @CC-C@CCC.
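
For what it's worth, the isolate and continuous onset patterns described in this section boil down to two small joins. The Python sketch below is just an illustration. The function names are made up, and the clusters in the usage lines are arbitrary examples, not a recommendation for any particular language.

```python
def initial_isolate_string(clusters, dummy_vowel="@"):
    """Isolate string for initial clusters, e.g. 'CC@_CC@_CC@_CC@'."""
    return "_".join(cluster + dummy_vowel for cluster in clusters)


def medial_onset_string(clusters, dummy_vowel="@"):
    """Continuous string for medial onset clusters, e.g. '@-CC@-CC@-CC@'."""
    # Start on the open dummy vowel, then chain each cluster + vowel syllable.
    return dummy_vowel + "".join(f"-{cluster}{dummy_vowel}" for cluster in clusters)


print(initial_isolate_string(["pr", "br", "tr", "dr"]))  # pr@_br@_tr@_dr@
print(medial_onset_string(["pr", "br", "tr"]))           # @-pr@-br@-tr@
```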

☆ Reclist Organization ☆

Finally, let's talk about organization. I mentioned earlier that an ideal reclist will allow the vocalist to get into a rhythm while recording and not have to think too hard about what they're singing — especially if natural vocals are the goal. This means that we probably want to group all the strings of one type together, and keep them as close to the same length as possible. If there are multiple strings of the same type and the same primary phoneme (e.g., all the [k] strings in a VCV), we probably want to group those together, too. This way, we only have to mentally switch gears every so often throughout the recording process, instead of having to do it like every other line. Our brains are pattern-oriented, after all.

The order of the string type groupings doesn't technically matter, but keep in mind that vocal performance is going to shift throughout the recording process. A freshly warmed-up voice at the beginning of the list is going to have a more relaxed formant compared to the end of the session, where the voice might be getting tired, or worse, strained (it's good to try to avoid this, obviously, but it still happens). For this reason, I like to present them like this:

  1. Whatever sample types are going to make up the majority of synthesis, usually medial CVs or VCVs, to be sure they're top quality, and any start-of-recording inconsistencies aren't going to be as noticeable
  2. Vs and VVs; I put these here so that the vocal quality is more consistent with the bulk of the main samples, but a lot of reclists put them first and that's fine, too
  3. Any lower frequency sample types that aren't included in the main strings but still have vowels, like initial CVs and final VCs
  4. Any samples where vowel quality doesn't matter, like Cs and CCs

Of course, we can also split each part of the reclist into a separate file, but this can be less convenient for both the developer to manage and the vocalist to use, so I typically don't. Even if I develop the sections separately (which can make it easier to implement changes to big reclists), I usually combine them into a single list for actual recording and distribution.

Phoneme order, both in-string and in the reclist overall, is mostly going to depend on what feels logical for the language you're working with. In most cases, I like to sort consonants based on their phonetics, for example, but vowels are easier for me to sort alphabetically. For Japanese reclists, though, I find it most intuitive to more-or-less stick to 五十音 (gojuu-on) order.


File Naming

Every string in the reclist is also going to be the name of the audio file, especially if the user is recording with OREMO, which names each one automatically. With this in mind, we may want to take measures to keep files grouped together not only in the reclist, but also in the voicebank itself. This isn't a necessary step by any means, as some users don't mind if their .wav files are all over the place inside the voicebank, but personally I prefer things to be more orderly.

I can't speak for Mac users, but Windows at least is going to automatically sort files alphabetically. This means that whatever character is at the start of the string is going to be right next to all the others starting with that same character, something we can use to our advantage.

Take, for instance, a segmented Japanese CVVC; we need to include [n] + C VCs, but if we write every n-CV string as just n-ka, n-ta, etc., they'll all be grouped together in the files and mixed in with strings like na-na. To circumvent this and keep these strings grouped with the rest of their own consonant, we can add a string prefix to the front, like C__n-CV.
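
Here's a small Python sketch of what that prefix does to the sort order. One big caveat: Python's sorted() compares strings by code point, while a file manager applies its own sorting rules, so this is only an approximation of what you'd see in a folder. The helper name is made up for the example.

```python
def prefix_by_onset(string, onset, separator="__"):
    """Prepend the onset consonant so the string sorts with that consonant's files."""
    return f"{onset}{separator}{string}"


unprefixed = ["ka-ka", "na-na", "ta-ta", "n-ka", "n-ta"]
prefixed = [
    "ka-ka", "na-na", "ta-ta",
    prefix_by_onset("n-ka", "k"),
    prefix_by_onset("n-ta", "t"),
]

print(sorted(unprefixed))  # the n-CV strings cluster together next to na-na
print(sorted(prefixed))    # each n-CV string now sits beside its own consonant
```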

We can do this with any series of characters, really. If I'm not just grouping by phoneme, I tend to stick to underscores [_] and plus signs [+], since they play nice with autosorting and aren't usually a part of the string transcription — just make sure to communicate their purpose to the reclist user if you include them. Keep in mind also that Windows and Mac OS both have some restrictions on what characters can be used in file names, so we can't use anything like [?] or [/].
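
If you want to double-check a finished list for troublesome characters, a quick script can flag them. The sketch below only tests against the characters Windows reserves in file names (the stricter of the two systems for this purpose). It doesn't cover other rules like reserved device names or trailing periods, and the function name is my own.

```python
# Characters Windows does not allow in file names.
WINDOWS_RESERVED = set('<>:"/\\|?*')


def unsafe_characters(string):
    """Return the reserved characters found in a reclist string, sorted."""
    return sorted(set(string) & WINDOWS_RESERVED)


for candidate in ["k__n-ka", "k+a", "a?/k"]:
    found = unsafe_characters(candidate)
    print(candidate, "->", found if found else "ok")
```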

Alternatively, if the reclist is particularly big, you can instruct users to sort certain string groups into separate subfolders within the voicebank, though personally I don't like this much because it can make sorting all the subfolders of a multipitch or multiexpression bank kind of convoluted.

Speaking of subfolders, however, now's a good time to mention that UTAU treats the subfolder name as a part of the file name; this is why it's good to alias your oto even if your reclist only consists of isolates, and also why it's not necessary to indicate what pitch or expression a string belongs to in the file name itself if they're already placed in separate folders. In other words, you can have two identical file names in separate folders, and UTAU won't get them confused, but if you want to lump multiple pitches into a single folder, you'll likely want to differentiate them either with prefixes or suffixes.


Sample Aliasing

Last but certainly not least, let's go over sample aliasing. This is really important, because it's going to determine how the voicebank actually functions in the software. I'm not going to talk about other parts of configuration here, because this tutorial is plenty long enough already, but this topic still feels relevant to discuss.

Formatting conventions:

Use case | Roman characters (my preferences) | Roman characters (alternatives) | Kana (if applicable)
Whatever CV type is included if only one present, OR medial CVs if initial CVs are present | [CV], [CCV] | [C V], [CC V] | [か], [きゃ]
Initial CVs if medial CVs are present | [- CV], [- CCV] | [-CV], [-CCV] | [- か], [- きゃ]
Whatever VC type is included if only one present | [V C], [V CC] | [VC], [VCC] |
Transitional VCs | [V C], [V CC] | |
Medial VCs | [VC], [VCC] | |
Final VCs | [V C-], [V CC-] | [VC-], [VCC-]; or [VC], [VCC] |
Initial vowel if not isolated | [- V] | [-V] | [- あ]
Isolated vowel | [V] | | [あ]
Sustained vowel | [V], [V V] | [* V], [*V] | [あ], [a あ], [* あ]
Vowel blend | [V V] | [VV] | [V あ]
Vowel end | [V -] | [V-], [V R] |
End breath | [V hh] | [V BRE], [Vhh] | [V 息]
Initial consonant | [- C] | [-C] |
Final consonant | [C -] | [C-] |
Consonant blend | [C C] | [CC] |
Medial onset cluster | [CC], [CCC] | |
Initial onset cluster | [- CC], [- CCC] | [-CC], [-CCC], [-C C], [-C CC], [-CC C] |
Medial 3+ coda cluster | [C CC] | [CC C], [CCC] |
Final coda cluster | [C C-], [C CC-] | [CC-], [CC C-], [CCC-] |

I tend to prefer the ones with spaces because they're a bit easier for me to read at a glance. I follow the basic rule of one phone (or dash) on the left, space, and then everything else on the right, indicating that the left phone is what's being crossfaded with the previous note. This is with the exception of CVs, simply because the unspaced format is much more widely used, so it feels more intuitive even if it doesn't follow the same pattern; the only reclist style I've seen that uses [C V] is ARPAsing. I similarly do not use spaces for medial onset clusters in order to distinguish them from the regular consonant blends.

Overall, whatever kind of formatting you choose to implement, just be consistent; if your initial CVs are written with a space like [- CV] but your initial clusters are written without a space like [-CC], that's not going to be very intuitive to anyone using your reclist or voicebank.
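
That kind of mismatch is easy to catch automatically. As a rough sketch, the Python below checks whether every alias that starts with a dash uses the same spacing style. It only looks at leading dashes, the function names are hypothetical, and how you'd pull the aliases out of your oto is left up to you.

```python
def initial_dash_styles(aliases):
    """Collect which leading-dash styles ('spaced' / 'unspaced') appear."""
    styles = set()
    for alias in aliases:
        if alias.startswith("- "):
            styles.add("spaced")      # e.g. [- ka]
        elif alias.startswith("-"):
            styles.add("unspaced")    # e.g. [-ka]
    return styles


def is_consistent(aliases):
    """True if initial aliases all use one dash style (or there are none)."""
    return len(initial_dash_styles(aliases)) <= 1


print(is_consistent(["- ka", "- kra", "a k"]))  # True
print(is_consistent(["- ka", "-kr", "a k"]))    # False
```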

Nonetheless, it helps to include a note about your aliasing in your voicebank's readme, even if you're using a more standard reclist. You can also add any alias prefixes or suffixes you like, but that's going to be more on a per-voicebank basis.

☆ What Now? ☆

Go forth and release your reclist into the wild! — or don't, it's up to you.

I know this was pretty wordy (perhaps too wordy...), but like I said up front, in the end, you can do whatever you want. Go crazy go stupid etc.

I'm also definitely not the only one worth listening to when it comes to reclist design; I highly recommend getting multiple perspectives and doing a lot of experimenting yourself. It's not the flashiest aspect of vocal synth, but it's definitely one of my favourite parts; there's something really satisfying about breaking down a language into its bare essence and then reconstructing it into its most optimal phonetic form, hearing your synthesis efforts sound more and more humanlike just by the virtue of hitting all the right sound combos... Or maybe I'm just insane.

Anyways, thanks for reading, and I hope you found at least something here helpful! Feel free to check out my other resources as well, though I'll be surprised if any of them end up longer than this one.