☢ Character Encoding ☢

☆ What is Character Encoding? ☆

Character encoding refers to the way that text characters are stored as numerical values. These values tell the computer which characters to display in a text sequence.

UTF-8 is the most common encoding on the web and is often the default in text editors, but there are many other encodings, and each maps characters to numbers in its own way. As a result, they are not all mutually compatible, especially when dealing with characters outside of the Latin alphabet.
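To make that concrete, here is a small Python sketch (Python is my choice for illustration here, it isn't something UTAU itself uses). It stores the same two characters in UTF-8 and in Shift JIS and prints the resulting byte values. The variable names and the helper function are just placeholders.

```python
# The same character can be stored as different byte values
# depending on which encoding is used.

def hex_bytes(data):
    """Format raw bytes as space-separated hex pairs, e.g. 'e3 81 82'."""
    return " ".join(f"{b:02x}" for b in data)

latin = "A"   # a Latin letter
kana = "あ"   # hiragana "a"

# Latin letters are stored the same way in both encodings...
print(hex_bytes(latin.encode("utf-8")))      # 41
print(hex_bytes(latin.encode("shift_jis")))  # 41

# ...but kana are not.
utf8_kana = kana.encode("utf-8")
sjis_kana = kana.encode("shift_jis")
print(hex_bytes(utf8_kana))  # e3 81 82
print(hex_bytes(sjis_kana))  # 82 a0
```

This is why files containing only Latin letters and numbers tend to survive an encoding mix-up, while files with kana do not.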


Shift JIS and Mojibake

Classic UTAU is encoded in Shift JIS, an encoding designed for Japanese characters. UTAU also functions almost entirely by reading text files (TXTs, INIs, and even USTs). Given that many voicebanks and USTs are in Japanese and therefore use Japanese characters, you can imagine that a text file with the wrong encoding could cause problems.

Indeed, wrongly encoded text files are a common error encountered by UTAU users. If you've ever opened up an UTAU's files and seen the Japanese text replaced by strings of nonsense symbols, you've encountered the dreaded mojibake.

Mojibake is the term for the garbage characters that result from text being read in the wrong encoding; neither the software nor its users can read them correctly. This is also a problem when trying to use OREMO or setParam, as both of them also require Shift JIS encoding for Japanese characters.
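If you want to see how mojibake comes about, the following Python sketch reproduces it under one assumption: the file was saved as UTF-8 and is then read as Shift JIS, which is the situation this tutorial fixes. The sample text and variable names are only for illustration.

```python
# Reproduce mojibake: save text as UTF-8, then read the bytes as Shift JIS.

original = "あいうえお"

# What a UTF-8 text editor would write to disk.
raw = original.encode("utf-8")

# What a Shift JIS program sees when it reads those same bytes.
# errors="replace" swaps undecodable bytes for a placeholder instead of crashing.
garbled = raw.decode("shift_jis", errors="replace")
print(garbled)  # garbage characters, not the original kana

# The bytes themselves are undamaged; reading them with the
# right encoding gives the text back.
recovered = raw.decode("utf-8")
print(recovered)
```

The important takeaway is that nothing in the file is actually broken. The bytes are just being interpreted with the wrong numbering system.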

Luckily, there is an easy fix; we simply need to convert the text files to Shift JIS.

Note: This error can be avoided by using newer software for voicebank development and usage, namely RecStar, vLabeler, and/or OpenUTAU, but given that many users still use OREMO, setParam, and/or classic UTAU, this tutorial still seems useful.

☆ Encoding in Shift JIS ☆

These tutorials are written for Notepad++, a free text editor I highly recommend for its versatility and ease of use, but similar processes can be done in other text editors. There is (to my knowledge) no way to do this in the default Windows Notepad app, though.


Converting Existing Files in Notepad++

Step 1. Open the improperly encoded text file in Notepad++.

Step 2. Select all of the text in the file. This can be done manually or by hitting CTRL+A on your keyboard.

Step 3. Copy the text onto your clipboard by right-clicking and selecting Copy or by hitting CTRL+C. You can paste the text into a temporary Notepad window if you like, but it's unnecessary as long as you don't copy anything else during this process.

Step 4. On the menu bar, navigate to Encoding > Character sets > Japanese and click on Shift JIS. This will turn the characters into mojibake, which is why we have preserved the text elsewhere temporarily.

Step 5. With all of the text in the file still selected (or reselected), paste the original text back into the file by right-clicking and selecting Paste or by hitting CTRL+V. This overwrites the garbage characters.

Step 6. Verify the file is now correctly encoded by opening the Encoding menu and making sure that Shift JIS is selected.

Step 7. Save the file.

If everything was done correctly, this should resolve the problem.
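If you have a lot of files to fix, the same conversion can be scripted. Below is a minimal Python sketch, not an official tool. It assumes the file is currently UTF-8 (with or without a byte order mark), and the function name convert_to_shift_jis is made up for this example. Note that Python's "shift_jis" codec is stricter than the "cp932" variant that Windows uses, so trying "cp932" instead may help if a character gets rejected.

```python
from pathlib import Path


def convert_to_shift_jis(path, source_encoding="utf-8-sig"):
    """Re-encode a text file in place from source_encoding to Shift JIS.

    Assumption: the file really is in source_encoding. "utf-8-sig" reads
    UTF-8 and drops a leading byte order mark if one is present.
    """
    file_path = Path(path)

    # Work on raw bytes so line endings are left exactly as they were.
    text = file_path.read_bytes().decode(source_encoding)

    # Encode BEFORE writing: if a character has no Shift JIS form, this
    # raises UnicodeEncodeError and the original file is left untouched.
    converted = text.encode("shift_jis")

    file_path.write_bytes(converted)
```

Calling convert_to_shift_jis("oto.ini") from the voicebank folder would convert that one file. Keep a backup copy first, since the file is overwritten in place.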


Creating a New Shift JIS File in Notepad++

Use this if you are writing an oto.ini, readme.txt, or other such UTAU file and want to encode it in Shift JIS from the get-go to avoid encountering problems later. The process is very similar to the one above.

Files already encoded in Shift JIS should not typically encounter problems when you write or paste Japanese text into them, for example when copying and pasting a reclist or base oto with kana in it.

Step 1. Create a new file in Notepad++ by navigating to File > New or by hitting CTRL+N.

Step 2. On the menu bar, navigate to Encoding > Character sets > Japanese and click on Shift JIS.

And that's it. It's quite simple.
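For those who generate files with a script rather than by hand, here is a rough Python equivalent of creating a new Shift JIS file. The function name write_shift_jis and the sample kana lines are placeholders, and the Windows-style line endings are my assumption, chosen because classic UTAU is a Windows program.

```python
def write_shift_jis(path, lines):
    """Create (or overwrite) a text file encoded in Shift JIS.

    newline="\r\n" makes every "\n" written come out as a Windows
    line ending (an assumption; adjust if your tools expect otherwise).
    """
    with open(path, "w", encoding="shift_jis", newline="\r\n") as f:
        for line in lines:
            f.write(line + "\n")


# Hypothetical reclist-style content, one entry per line.
sample_lines = ["あ", "い", "う"]
```

Calling write_shift_jis("reclist.txt", sample_lines) would then produce a file that classic UTAU, OREMO, and setParam can read without mojibake, provided every character in it exists in Shift JIS.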