☢ Vocabulary ☢

Here are some helpful terms and concepts to know when working with UTAU and other speech/singing synthesizers.

Keep in mind that these are pretty complex topics, so my explanations are going to be simplified and focused on what will likely make sense for English speakers.

☆ UTAU ☆

Voicebank Terms

  • Voicebank — a complete folder of audio files, configuration settings, and frequency files that UTAU will read from and synthesize as one voice.
  • UTAU(loid) — the character associated with one or more specific voicebanks.
  • Root folder — the main folder of the voicebank; contains the character.txt and readme.txt files.
  • Subfolder — any folder inside of the main folder; often used to organize separate pitches or expressions within a voicebank.
  • Multipitch — a voicebank consisting of multiple completely recorded reclists at specified pitches, e.g. F3, C5, A#2, etc.
  • Multiexpression — a voicebank consisting of multiple completely recorded reclists in specified vocal tones.
  • Append — another name for expression popularized by VOCALOID; can refer to an expression within a voicebank or a separate voicebank for the same character.
  • Act — another name for voicebank version number popularized by VOCALOID, e.g. Act I, Act II, etc.
  • おまけ (omake) — Japanese for "bonus" or "extra". Additional vocal samples that are sometimes included with voicebanks; not typically meant to be synthesized.

Reclist Terms

  • Reclist — short for "recording list"; a list of everything that will be recorded and sampled from in a voicebank.
  • String — a series of syllables or morae recorded within the same .wav file.
  • Sample — an isolated segment of a string which is configured under a specific alias in the oto.
    • Sample Names — samples are named by the sequence of consonants and vowels they contain, e.g. CV = consonant+vowel, VCV = vowel+consonant+vowel, CC = consonant+consonant, etc.
    • Diphone — a sample containing a two-phone sequence; CVs, VCs, CCs, and VVs.
    • Triphone — a sample containing a three-phone sequence; VCVs, CCVs, and VCCs.
    • Initial — occurring at the beginning of a phrase; preceded by silence.
    • Medial — occurring in the middle of a phrase.
    • Transitional — exists to capture the transition from one sample into another.
    • Final — occurring at the end of a phrase; followed by silence.
    • Isolate — occurring outside of a phrasal context; both preceded and followed by silence.
  • CV / 単独音 (tandoku-on) — diphone-based reclist consisting almost entirely of CV samples.
    • れんたん (rentan) — a method of CV which involves recording a continuous string.
  • VCV / 連続音 (renzoku-on) — triphone-based reclist consisting almost entirely of VCV samples.
    • Lite VCV — reclist method which has VCV only for some morae and relies on CV for others.
  • CVVC — mostly diphone-based reclist which includes both CV and VC samples. Sometimes called CVC, CV-VC, or VCCV. Can also include VVs, CCs, and VCVs for certain phonemes.
    • VCCV — English CVVC method developed by PaintedCZ. Uses a custom transcription system.
    • ARPAsing — pure diphone-based English method developed by Kanru Hua. Uses ARPAbet.
  • VCVVC — reclist method which includes both VCV and VC samples. Can also include CCs.
  • Blend — a diphone consisting of two vowels (VVs) or consonants (CCs).
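At its core, a reclist is just systematic combinations over a phone inventory. As a rough illustration, here is a minimal Python sketch; the consonant and vowel lists are a toy subset for demonstration, not a real inventory, and the "V CV" spelling of VCV sample names follows common renzoku-on convention:

```python
# Toy phone inventory (illustrative romaji, not a complete set).
consonants = ["k", "s", "t", "n"]
vowels = ["a", "i", "u"]

# CV (tandoku-on style): every consonant+vowel pairing.
cv_list = [c + v for c in consonants for v in vowels]

# VCV (renzoku-on style): preceding vowel + space + consonant+vowel,
# e.g. "a ka", capturing the vowel-to-consonant transition.
vcv_list = [v1 + " " + c + v2
            for v1 in vowels
            for c in consonants
            for v2 in vowels]

print(cv_list[:3])    # first few CV samples
print(len(vcv_list))  # 3 vowels * 4 consonants * 3 vowels = 36
```

This is why VCV banks are so much larger than CV banks: the sample count grows multiplicatively with each added phone of context.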

Configuration Terms

  • Configuration — the process of creating audio samples and setting their parameters so that they synthesize correctly in a given program; a more general term for otoing.
  • oto.ini — the name of the main file UTAU uses to read voicebank configurations.「音」(oto) is the Japanese word for sound.
  • Waveform — a visual representation of an audio wave; useful for precise parameter timing.
  • Spectrogram — a visual representation of audio frequencies; useful for precise parameter timing.
  • Alias — the unique unicode name given to a specific sample in the oto.
    • Prefix / Suffix — one or more characters placed at the beginning or end of an alias to differentiate it from otherwise identical samples.
    • Prefix map — the file UTAU uses to map specific alias prefixes or suffixes to specific notes on the piano roll; used for automating multipitch.
  • Parameters — the numerical values UTAU uses for note timing and resampling.
    • Offset — the parameter which tells UTAU where to begin playing the audio of the sample.
    • Consonant (oto value) — the parameter which determines how much of the sample should not be stretched or looped by the resampler.
    • Cutoff — the parameter which tells UTAU where to stop playing the audio of the sample.
    • Preutterance — the parameter which determines the timing of the sample in the UST.
    • Overlap — the parameter which determines how much of the sample is crossfaded with the previous note.
  • Envelope — the visual shape of a given note on the piano roll; shows its intensity and how it crossfades with the other notes based on the length, preutterance, overlap, and crossfade settings.
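To tie the parameters together: each line of an oto.ini follows the format `filename=alias,offset,consonant,cutoff,preutterance,overlap`, with values in milliseconds (a negative cutoff is measured forward from the offset rather than backward from the end of the file). The numbers below are made up for illustration:

```ini
; filename=alias,offset,consonant,cutoff,preutterance,overlap
ka.wav=ka,120.0,85.0,-350.0,60.0,20.0
```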

☆ Phonetics ☆

Phonetics refers to the classification of speech sounds, called phones. Most spelling systems are inadequate to represent phones in writing, so we use various methods of phonetic transcription to represent them. The linguistic standard is the International Phonetic Alphabet (IPA), where every character represents a single phone. In speech synthesis, we usually use some version of X-SAMPA, which you can think of as an ASCII-friendly version of IPA. Other scripts you might see used are kana and romaji for Japanese voicebanks, other standardized language-specific scripts like ARPAbet for English or romaja for Korean, or reclist-specific custom scripts.

Following standard notation, we'll use <angle brackets> for written words, /IPA in slashes/ for phonemes, [IPA in square brackets] for individual phones, and [monospace brackets] for a script as used by UTAU, like romaji or X-SAMPA.
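To make the X-SAMPA/IPA relationship concrete, here is a tiny Python mapping covering a handful of common symbols; this is a small, hand-picked subset of the full alphabet, not a complete converter:

```python
# A small subset of X-SAMPA -> IPA correspondences (not exhaustive).
XSAMPA_TO_IPA = {
    "S": "ʃ",    # voiceless postalveolar fricative
    "Z": "ʒ",    # voiced postalveolar fricative
    "T": "θ",    # voiceless dental fricative
    "D": "ð",    # voiced dental fricative
    "N": "ŋ",    # velar nasal
    "I": "ɪ",    # near-high front vowel
    "E": "ɛ",    # mid front vowel
    "{": "æ",    # near-low front vowel
    "@": "ə",    # schwa
    "r\\": "ɹ",  # alveolar approximant
}

def to_ipa(xsampa_tokens):
    """Convert a list of X-SAMPA tokens to an IPA string.
    Tokens with no special mapping (e.g. "k", "t") pass through unchanged."""
    return "".join(XSAMPA_TO_IPA.get(t, t) for t in xsampa_tokens)

print(to_ipa(["k", "{", "t"]))  # kæt
print(to_ipa(["s", "I", "N"]))  # sɪŋ
```

Note that multi-character tokens like "r\" are why X-SAMPA input is usually handled token-by-token rather than character-by-character.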


Phones can be sorted into many different categories, but perhaps the biggest distinction is between vowels and consonants. In phonetics, we consider a vowel to be any sound with unobstructed airflow through the vocal tract (mouth, nose, and throat). We label them based on tongue height, tongue frontness, and lip rounding. For example /i/ as in <fleece> /flis/ is a high front unrounded vowel.

Helpful vowel terms:

  • Monophthongs and Diphthongs — monophthongs have a relatively fixed position from start to finish, while diphthongs start with the mouth in one position but move towards another position during utterance.
  • Vowel reduction — when a vowel is produced with a "weaker", more relaxed, more centralized sound than normal.
  • Monophthongization — when a diphthong is reduced to the point of being a monophthong.
  • Schwa — the common name for the mid central vowel [ə].
  • Tenseness — whether the throat muscles are tense or lax during production.
  • Nasalization — when the velum is lowered during production, resulting in a "nasally" sound.
  • Rhoticization — when a vowel is affected by a following rhotic consonant, also called R-colouring.
  • Velarization — when a vowel is produced with the back of the tongue raised. This can happen as a result of a following lateral consonant, informally called L-colouring.

Unlike vowels, consonants have partially or completely obstructed airflow caused by one or more articulators (tongue, lips, teeth, etc.). We label them based on voicing (whether or not the vocal folds are vibrating), place of articulation (where in the vocal tract the sound happens), and manner of articulation (how the sound is being produced). For example, [p] as in <pit> /pɪt/ is a voiceless bilabial plosive.

Place of Articulation:

  • Bilabial — produced with both lips.
  • Labiodental — upper teeth and lower lip.
  • Interdental — tongue blade between the teeth.
  • Dental — tongue tip against the upper teeth.
  • Alveolar — tongue tip or blade against the alveolar ridge.
  • Palatal — tongue body against the hard palate.
  • Velar — tongue body against the soft palate (velum).
  • Uvular — tongue body against the uvula.
  • Pharyngeal — tongue root against the pharynx.
  • Glottal — produced entirely in the glottis (aka vocal folds / vocal cords).

Manner of Articulation:

  • Obstruent — a consonant produced with complete closure in the mouth. (Strictly speaking, linguists reserve "obstruent" for plosives, fricatives, and affricates; nasals and taps are sonorants, but they're grouped here because they share that complete oral closure.)
    • Plosive — airflow is completely obstructed in the vocal tract. Also called a stop or oral stop.
    • Nasal — airflow is redirected through the nose. Also called a nasal stop.
    • Release — the release of air pressure at the end of a plosive.
    • Affricate — begins as a plosive but is released as a fricative.
    • Tap / flap — produced with a single muscle movement and no buildup of air pressure.
  • Continuant — a consonant with only partially obstructed airflow.
    • Fricative — airflow is turbulent.
    • Sibilant — a subclass of fricatives characterized by a "hissing" sound; includes [s, z, ʃ, ʒ].
    • Approximant — one articulator approaches another without actually touching.
    • Semivowel — an approximant produced somewhere between a consonant and a vowel. Also called a glide.
    • Lateral — air is released on the sides of the tongue, e.g. [l].
    • Rhotic — the tongue is curled towards the roof of the mouth, e.g. [ɹ].
    • Liquid — a grouping for lateral and rhotic approximants.
    • Trill — the articulator vibrates rapidly.
  • Aspiration — whether or not there is an extra release of air during production.

In vocal synth spaces you may have heard people use the terms "hard consonants" to refer to plosives and "soft consonants" to refer to nasals and continuants, but these are not terms used by linguists. It's more useful, in my opinion, to analyze by manner, because "hard" and "soft" are quite vague and can cause confusion with regard to affricates. While it's true that sounds like [l, n, s], etc. behave similarly in the oto, they all appear quite different in the waveform and spectrogram and may be treated differently phonotactically.

☆ Phonology ☆

Building off of phonetics, phonology is how all of these physical sounds work together to form a language. Every language is different, and some are more phonologically complex than others, but remember that complexity =/= sophistication; a language having fewer sounds and phonological rules usually makes up for it by being complex in another area. That being said, the more complex a language's phonology is, the more difficult it is to synthesize effectively; consider Japanese compared to Spanish compared to English.


Phonemes and Allophones

While a phone is simply any speech sound that exists, a phoneme is a speech sound which carries significance in a particular language. In English, our ears recognize that the words <cat> /kæt/ and <bat> /bæt/ are different words because we register /k/ and /b/ as different phonemes, same for other minimal pairs like <cat> /kæt/ and <cab> /kæb/, or <cat> /kæt/ and <kit> /kɪt/.

An allophone is a variation of a phoneme as it occurs in a predictable environment. Looking at <cat> again, the /k/ at the beginning is aspirated in many varieties, becoming [kʰæt], which is something that only happens when /k/ is in syllable-initial position. However, we still register this as the same consonant sound as the /k/ in <skit> /skɪt/, even though that one is unaspirated. Because /k/ and the other voiceless plosives /p, t/ are unaspirated in more environments, and because their voiced counterparts are almost always unaspirated, we consider /k/ the phoneme and [kʰ] its allophone. Notably, allophones are not universal; in other languages, /k/ and /kʰ/ can be separate phonemes, illustrated by minimal pairs like <기> /ki/ and <키> /kʰi/ in Korean.

Essentially, when we say phone, we mean a specific physical sound, when we say phoneme, we mean a sound that our brains perceive as being the same across multiple environments in a language, and when we say allophone, we mean a variation of a phoneme where it is physically produced as a different sound but registered as the same unit.
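Because allophones appear in predictable environments, they can be written as rewrite rules. Here is a minimal Python sketch of the English aspiration pattern; it's a simplification of the real rule (which also interacts with stress), and it treats a syllable as a plain list of phoneme strings:

```python
# Toy allophony rule: English voiceless plosives /p, t, k/ surface
# as aspirated when they begin the syllable. Simplified: the real
# rule also depends on stress.
VOICELESS_PLOSIVES = {"p", "t", "k"}

def apply_aspiration(syllable):
    """Take a syllable as a list of phonemes, return surface phones."""
    phones = list(syllable)
    if phones and phones[0] in VOICELESS_PLOSIVES:
        phones[0] = phones[0] + "ʰ"
    return phones

print(apply_aspiration(["k", "æ", "t"]))       # ['kʰ', 'æ', 't']
# In "skit" the /k/ is not syllable-initial (the /s/ is), so it
# correctly stays unaspirated:
print(apply_aspiration(["s", "k", "ɪ", "t"]))  # ['s', 'k', 'ɪ', 't']
```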


Notes on Transcription

It's important to know that transcribing <cat> as /kæt/ is not less accurate than writing it as, say, [kʰæt̚]; it is simply less precise. When we want to say "this is the general way these sounds are parsed", we use phonemic transcription since we are only concerned with the perceived phonemes, and when we want to say "this is the specific way these sounds are uttered by this speaker in this environment", we use phonetic transcription to convey what is being physically produced as opposed to what is being mentally processed. Phonetic transcription can also be broad or narrow, depending on how detailed it is.

When it comes to vocal synths, we tend to use both of these. For reclist writing and voicebank configuration, it's better to stick to phonemic transcription when possible and only use phonetic transcription when including additional phones (be they allophones or multilingual extras); this is because we (usually) want our voicebanks to be intuitive, so we want to avoid the needless complications that would come from narrowly transcribing every sample. If a distinction isn't useful for the voicebank, we don't need to account for it in the configuration. That being said, when it comes to lyrical input in the software, we create a phonetic transcript of the lyrics within the confines of whatever voicebank we're using.


Isochrony and Phonotactics

Isochrony refers to how the individual phonemes of a language are organized into rhythmically timed units. Most commonly, they are organized into syllables. Every syllable contains a nucleus, which is usually a vowel. In singing, this is the part that is held for the duration of the note. A syllable will not contain more than one nucleus, and a nucleus will not contain more than one phoneme, but that phoneme can be a monophthong, diphthong, rhotic vowel, syllabic consonant, etc.

A syllable also typically has an onset consisting of one or more consonants occurring before the nucleus. The most commonly occurring syllable structure across languages is CV: a single consonant followed by a nucleus. Lastly, syllables sometimes have codas, which consist of one or more consonants occurring after the nucleus. These are less common and often (but not always) have simpler constructions compared to onsets. The nucleus and coda together are referred to as the rime of the syllable.

Phones that occur at the very beginning of a syllable are said to be in syllable initial position and phones that occur at the very end are in syllable final position.

Rather than organizing by syllable, some languages organize instead by mora. A mora can be a single nucleus or a CV pair. This is well illustrated by Japanese, where every mora is represented by one kana: <あ> /a/, <か> /ka/, and <ん> /n/ are all one mora, but <あか> /aka/, <かあ> /ka:/, and <かん> /kan/ are all two morae, even though we might count the latter two as single syllables.
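The kana-to-mora correspondence makes mora counting easy to sketch in code. A minimal sketch, assuming plain hiragana input: each kana is one mora, except the small glides ゃ/ゅ/ょ, which combine with the preceding kana (きゃ = one mora). The sokuon っ and moraic nasal ん each count as their own mora; this simplification ignores katakana and the long-vowel mark ー.

```python
# Count morae in a hiragana string. Small glides attach to the
# previous kana and do not add a mora of their own.
SMALL_GLIDES = set("ゃゅょ")

def count_morae(kana):
    return sum(1 for ch in kana if ch not in SMALL_GLIDES)

print(count_morae("か"))     # 1
print(count_morae("かん"))   # 2 (か + ん)
print(count_morae("きゃく")) # 2 (きゃ + く)
```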

Phonotactics refers to the system of rules that dictate which phones are allowed in what environments, whether adjacent to other sounds or in certain positions within the syllable / mora. These are called phonotactic constraints. For example, English allows /ŋ/ as in <sing> /sɪŋ/ to exist in codas, but not in onsets, unlike the other nasals /m, n/ which can exist in either.

Finally, let's talk about consonant clusters; these are groups of two or more consonants which occur side-by-side in the onset or coda. Every language has different constraints for what sounds can be included in a cluster, if clusters are allowed at all, but commonly you will find onset clusters that consist of a consonant followed by an approximant, and coda clusters of an approximant or a nasal followed by another consonant. English is quite a maximal example, as it allows onset clusters with up to three consonants and coda clusters with up to three or four (depending on dialect), like in <strengths> /stɹεŋθs, stɹεŋkθs/.
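Phonotactic constraints like these can be expressed as simple positional checks. A minimal sketch, assuming syllables are given as (onset, nucleus, coda) with the onset and coda as lists of IPA phone strings; it encodes only a toy subset of English constraints:

```python
# Check a toy subset of English phonotactic constraints.
def violated_constraints(onset, nucleus, coda):
    """Return a list of violated constraints (empty list = OK)."""
    problems = []
    if "ŋ" in onset:
        problems.append("/ŋ/ is not allowed in onsets in English")
    if "h" in coda:
        problems.append("/h/ is not allowed in codas in English")
    if len(onset) > 3:
        problems.append("English onsets max out at three consonants")
    return problems

print(violated_constraints(["ŋ"], "ɪ", []))   # flags the onset
print(violated_constraints([], "ɪ", ["ŋ"]))   # [] — fine in the coda
print(violated_constraints(["s", "t", "ɹ"], "ε", ["ŋ", "θ", "s"]))  # []
```

A real implementation would need far more rules (sonority sequencing, legal cluster inventories, etc.), but the principle is the same: position within the syllable determines what is allowed.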

☆ Dialectal Variation ☆

Another useful term to know when working with vocal synths is lexicon, which refers to a complete set of meaningful units of a language — that is, all of the phonemes, words, phrases, etc. If something is lexical, that means it is significant in the lexicon.

I also want to briefly discuss dialects, also known as varieties of a language. These are often conflated with accents, but really, accent (how the language is pronounced) is just one part of dialect; dialects also include vocabulary, phrases, grammar rules, suprasegmental features, and other linguistic variations. Without getting into all of the complex sociopolitical implications, a dialect is simply the way a specific group of people use a language, whether by region, nationality, race, sexuality, age, or so on. Contrary to what certain folks may have you believe, every person speaks a dialect of some kind; "standard" language is kind of nonexistent outside of abstract averages or political ideologies. In fact, we all blend features of multiple dialects together into our unique idiolects, and may switch between certain linguistic features in different social situations without even realizing.

For voicebank creation, we have to balance our intuition (what sounds we think we're saying) with reality (what sounds we're actually saying). We want our voicebanks to be intuitive to use, which often means reffering to our ideas of standard language, while still accounting for some level of variation, since working only with pure phonemes often won't sound very natural. No two speakers will produce or interpret the language in the exact same way, so it can take a lot of trial and error!