☢ Salem-Style English CVVC ☢
☆ Introduction ☆
Salem-Style English CVVC, or Salem CVVC / S-CVVC for short, is not a single reclist, but rather a method and style guide for English DIY vocal synths. It's built for use in UTAU and OpenUTAU, but could reasonably be used in other software.
It's really nothing revolutionary; just a careful construction of voicebank techniques with thoughtful application of English phonology. I noticed a lot of room for improvement in existing English reclists, so sometime in 2017 I decided to put my linguistics background and organization skills to the test and attempt to create a better one.
I wanted to develop a reclist, which later evolved into a broader method, that could create English voicebanks of consistently good quality without having an excess of unnecessary or redundant samples. After a lot of experimental voicebanks and consulting with other vocal synth users, I arrived at a few core design principles:
- Flexibility — ensuring samples included are ones that users can get the most mileage out of.
- Efficiency — reclists are organized to minimize tediousness and redundancy.
- Frequency — prioritizing included phones and samples based on how likely they are to be used.
- Naturalness — sampling from phrases whenever possible, organizing samples by their transition point, and accounting for relevant allophones.
- Consistency — creating a standardized core reclist and a style guide for any additional samples included.
- Ease-of-Use — taking measures to make S-CVVC voicebanks straightforward to create and use.
- Time & Effort — keeping the size of voicebanks relatively manageable in terms of both string count and oto length.
This resource assumes at least a basic understanding of UTAU, but if you need help understanding any of the concepts or terminology discussed here, please refer to the vocabulary page.
☆ Demonstrations ☆
FULL List Demo Bank Download — KATSU English
See here for full voicebank and character information.
LITE List Demo Bank Download — HIRO English
See here for full voicebank and character information.
Demo UST Download
Daisy Bell for manual usage in UTAU demonstration. Twinkle Twinkle Little Star for OpenUTAU Phonemizer demonstration.
☆ Overview ☆
The core reclist consists of three sections: CVVC strings, vowel strings, and consonant strings. Different sections have different prefixes so that they won't get jumbled together in the voicebank files, but these can be removed. All strings have a max length of 8 syllables, and are compatible with OREMO and my guideBGMs. A short string variant of each list also exists.
There are two prewritten lists at the time of writing: the LITE list, which contains the smallest number of samples I felt still achieved good quality results, and the FULL list, which contains a number of additional samples for more natural pronunciation (optimized for American English but still compatible with other dialects, further discussed here).
For comparison with the tables below, a Japanese VCV is usually around 150 strings and 950 oto lines. The core LITE list is about the same size as an ARPAsing reclist (about 220~250 strings), but with more distinct samples and much less redundancy than the default list. Both lists are smaller than VCCV (1066 strings and 3429 oto lines), though the FULL List with all add-ons has a larger oto.
Time estimates do not include breaks or equipment setup/takedown.
I recommend taking breaks and/or splitting recording over multiple days for the longer lists.
LITE List Stats
List | String Count | Est. Time to Record | OTO Length | Max Pitches |
---|---|---|---|---|
LITE Core v1.0 | 230 | 1 hr | 1622 | 17 |
Cluster Add-On v1.0 | 67 | < 30 min | 228 | |
Breath Add-On v1.0 | 19 | < 10 min | 38 | |
Core + Cluster + Breath | 316 | 2 hr | 1888 | 15 |
CVC Add-On v0.9 | 116 | < 1 hr | 766 | |
Everything | 432 | 3 hr | 2654 | 11 |
FULL List Stats
List | String Count | Est. Time to Record | OTO Length | Max Pitches |
---|---|---|---|---|
FULL Core v1.1 | 357 | 2.5 hr | 2566 | 12 |
Cluster Add-On v1.1 | 110 | < 45 min | 389 | |
Breath Add-On v1.1 | 25 | < 10 min | 50 | |
Core + Cluster + Breath | 492 | 3.5 hr | 3005 | 10 |
CVC Add-On v0.10 | 209 | 1 hr | 1291 | |
Everything | 701 | 4.5 hr | 4296 | 7 |
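The combined rows in both tables are plain sums of their component lists, so the size of any other combination (say, Core plus the CVC Add-On without the clusters) can be worked out the same way. A minimal sketch, using only the string counts and oto lengths from the tables above; the dictionary keys and the `totals` helper are names made up for this example:

```python
# (string count, oto lines) per list, copied from the stats tables above.
# The keys and the helper name are illustrative, not part of S-CVVC itself.
LITE = {"core": (230, 1622), "cluster": (67, 228), "breath": (19, 38), "cvc": (116, 766)}
FULL = {"core": (357, 2566), "cluster": (110, 389), "breath": (25, 50), "cvc": (209, 1291)}


def totals(parts, names):
    """Return (total strings, total oto lines) for the chosen component lists."""
    strings = sum(parts[name][0] for name in names)
    oto = sum(parts[name][1] for name in names)
    return strings, oto


# Core plus the CVC Add-On only, skipping clusters and breaths.
print(totals(LITE, ["core", "cvc"]))
print(totals(FULL, ["core", "cvc"]))
```

Recording time and max pitches are left out on purpose: the time column is a rough estimate, and the tables don't say how the pitch limit is derived.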
☆ Frequently Asked Questions ☆
What does the 'S' in S-CVVC stand for?
Salem. It's just named after me, but I wanted a quick way of abbreviating it.
Is this a unique type of reclist?
Yes and no. It's just a specific method of CVVC.
What's the difference between Salem CVVC and Delta CVVC / VCCV / ARPAsing / etc.?
Across the board: string organization, phonemic transcription style, and sample prioritization.
Compared to Delta CVVC: both use X-SAMPA and have basically the same aliasing style, but there are minor differences in transcription, and S-CVVC includes more phonemes. Delta CVVC also doesn't typically include consonant blends.
Compared to VCCV: Initial CVs and final VCs are not prioritized, and medial VCs are not included at all. Advised OTOing technique is different, and there is much less redundancy. S-CVVC also offers more standardized phoneme customization.
Compared to ARPAsing: S-CVVC includes more phonemes and more sample types on account of not being limited to diphones, but functions similarly due to the emphasis on sample transitions rather than isolated syllables. S-CVVC is also organized phonetically and has much less redundancy.
Why use X-SAMPA over ARPAbet / other transcription style?
X-SAMPA is flexible, standardized, and easy to type in. ARPAbet is quite limited in the phones that it can represent (also, I just don't like it very much), CZ's notation style lacks standardization for additional phonemes, and I just didn't find it practical to come up with my own when X-SAMPA is right there and much closer to the linguistic standard IPA.
Is it possible to make a smaller English reclist than the LITE list?
Yes. While the LITE list is pretty compact, not all of its samples are strictly necessary to synthesize English; they're just the minimum that (in my opinion) still produces clear results.
It's definitely possible to pare down the reclist further; you can take out the [4] and the [@l], treat the diphones like vowel blends, conflate [V] with [@], and other such things. The vowel blends, consonant blends, and initial/final consonants can also technically be omitted. I experimented with all of these and more, but ultimately felt the resulting quality difference was worth not having a truly minimal list.
Hell, if your goal really is minimalism, you can get decent English out of what's essentially a Japanese voicebank with the inclusion of a couple extra consonants and maybe a schwa. But while the human ear is quite flexible in what it will accept as a phoneme, it is equally sensitive to minor differences in the acoustics, so this likely won't sound very natural.
Why CVVC over VCV(VC)?
English has a large phoneme inventory, particularly for vowels; English VCV(VC) is not practical. You spend all that time recording and configuring samples that will rarely be used for maybe a minor quality improvement. It's not really easier to use, either, because you still have to account for codas.
Why are there no CCVs / VCCs?
They add a lot of bulk to the reclist (we're talking hundreds of strings and thousands of samples), many of them will hardly (if ever) be used, and they often end up sounding worse than CC+CV or VC+CC due to having fixed consonant durations. They do make it a little easier to use, but I don't find them all that useful in comparison to the time & effort it takes to include them.
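To make the CC+CV idea a bit more concrete, here is a rough sketch of how a cluster-heavy word gets covered by short transition samples instead of dedicated CCV or VCC samples. It is only an illustration: the vowel set, the type labels, and the `transition_types` helper are assumptions made for this example, and it does not reproduce the actual S-CVVC aliasing rules.

```python
# Hypothetical vowel symbols, for illustration only (not the S-CVVC phoneme list).
VOWELS = {"{", "@", "i", "u", "A"}


def transition_types(phones):
    """Label each boundary in a phone sequence by the kind of sample covering it."""
    if not phones:
        return []

    def kind(phone):
        return "V" if phone in VOWELS else "C"

    # The word starts from silence, written here as "-".
    units = [("initial " + kind(phones[0]), ("-", phones[0]))]
    # Every adjacent pair of phones is one transition: CC, CV, VC, or VV.
    for left, right in zip(phones, phones[1:]):
        units.append((kind(left) + kind(right), (left, right)))
    # The word ends by releasing into silence.
    units.append(("final " + kind(phones[-1]), (phones[-1], "-")))
    return units


# "strand", loosely transcribed as s t r { n d
for label, pair in transition_types(["s", "t", "r", "{", "n", "d"]):
    print(label, pair)
```

Each consonant in the onset and coda sits in its own sample here, which is why its duration can be adjusted per note instead of being fixed inside one long CCV or VCC recording.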
Why are there no medial VCs like in VCCV?
The function they serve — creating more natural-sounding codas within larger phrases — is already filled by the inclusion of consonant blends, which are more flexible and offer smoother transitions.
What about non-American Englishes / why didn't you include [phoneme] in the main list?
Trying to write a reclist that accommodates all varieties of English is impossible, so I tried my best to come up with a reclist that is flexible enough to work for as many speakers as possible. I focused on which phonemes seemed to occur most frequently either across dialects or across total speakers, and which allophones seemed most useful for synthesizing more natural utterances.
But, of course, it's not going to be perfect for any one individual; it doesn't even 100% reflect my own accent. I recommend customizing the reclist if there's anything you really feel is missing for your own voice.
That being said, the reclist should still generally work for most native speakers. See the pronunciation page for more information.
If you're interested in working with me to develop a FULL-style S-CVVC list for a different dialect, feel free to reach out.
Are there any tools that can make using S-CVVC easier, like those for ARPAsing?
In vanilla UTAU, you can use a combination of presamp and autoCVVC to make lyric insertion a little easier, but it will still need manual editing.
In OpenUTAU, it works with the English X-SAMPA phonemizer. Eventually I'd like to make a custom dictionary for it as well.
Can this reclist be used for any other DIY synth engines?
In theory, yes. I don't have experience with making voicebanks in those other than UTAU and OpenUTAU, but it should be compatible with any that allow for the included sample types.
Can I change [x] about the reclist? / Can I make my own reclist based on S-CVVC?
If you still want it to be compatible with S-CVVC, see the customization page. But if there are some fundamental aspects that don't suit you, do whatever you want, I ain't gonna stop you.