And Unicode came to this world. Most of us don't work in ANSI or ASCII anymore; most of the projects I work on are either UTF-8 or UTF-16.
The reason for this is that Unicode will save you a lot of headaches when displaying all these weird characters, such as è, à, ø and å. And when you feel the urge to write わたし は その ほん を よむ, believe me, Unicode is better.
It just works! The proof is that you can read the characters above. And you know why? Because the web page is UTF-8 encoded. Programs around the world agree on how they're going to interpret bits and everything is nice and sound.
In Windows, there is another incentive to write a native UTF-16 application: the NT kernel, as well as all its internal APIs, runs in UTF-16 (quite impressive for a kernel now more than 20 years old). A UTF-16 application saves you a text string conversion at each system call, such as WriteFile.
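To make that cost concrete, here is a sketch of the conversion a UTF-8 application has to perform before handing text to a UTF-16 API (using `std::wstring_convert`, which is deprecated since C++17 but short enough to illustrate the point; on Windows the idiomatic call is MultiByteToWideChar):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Convert a UTF-8 encoded string to UTF-16 code units. A UTF-8 application
// pays for a conversion like this before each call into the UTF-16 NT APIs;
// a native UTF-16 application skips it entirely.
std::wstring utf8_to_utf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    return converter.from_bytes(utf8);
}
```

A native UTF-16 application can hand its wchar_t buffers straight to the W-suffixed APIs without this round trip.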
When your application is UTF-16, there is not much to be done. Make sure you work with wchar_t and have the proper compiler flags (in Visual Studio the Unicode/MBCS choice is made from the project properties page) and, of course, your source files must be saved in an encoding the compiler recognizes as Unicode.
But what if you have to work in UTF-8?
There could be many reasons. The application communicates with a database whose content is UTF-8 encoded. It's a cross-platform application and you want to work with char * and UTF-8. You need the backward compatibility of UTF-8 with ASCII. Someone in the company decided that UTF-8 was the only true way and that the UTF-16 infidels will all burn in Hell. Who knows?
Guess what. Your first attempt will fail.
First step: make sure the source file is UTF-8 encoded with the byte order mark (BOM). The BOM is extremely important; without it, the C++ compiler will not behave correctly.
In Visual Studio 2008, this can be done directly from the IDE with the Advanced save command located in the File menu:
A dialog box will pop up. Select UTF-8 with signature.
If you are not using Visual Studio 2008, you can convert to UTF-8 with the free notepad++ application (I like this editor, you should give it a try).
If you compile and run a test program of the sort:
```cpp
#include <iostream>

using namespace std;

int main(int argc, char* argv[], char* envp[])
{
    wcout << L"わたし は その ほん を よむ" << endl;
    return 0;
}
```
You are going to get:
Not quite what we wanted. What happens is that, although your text is properly encoded in UTF-8, the C/C++ runtime is by default set to the "C" locale for compatibility reasons. This locale assumes that every character is 1 byte. Erm. Not quite the case with UTF-8, my dear!
You need to change the locale with the setlocale function so that the string is properly interpreted by the input/output stream processors. In our case, whatever locale the system is using is fine; you select it by passing "" as the second parameter.
```cpp
#include <iostream>
#include <clocale>

using namespace std;

int main(int argc, char* argv[], char* envp[])
{
    setlocale(LC_ALL, "");
    wcout << L"わたし は その ほん を よむ" << endl;
    return 0;
}
```
So what are we getting now?
Yay! Victory.
To be rigorous, you must check the return value of setlocale: if it returns 0, an error occurred. In multi-language applications you will need to use setlocale with more precision, explicitly supplying the locale you want to use (for example, you may want your application to display Russian text on a Japanese computer).
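A sketch of that check (the helper name is my own, and locale names are platform-specific assumptions: POSIX systems use names like "ru_RU.UTF-8" while Windows uses names like "Russian_Russia.1251"):

```cpp
#include <clocale>

// Try to switch the C/C++ runtime to the given locale.
// setlocale returns 0 (NULL) when the locale is unknown or not installed,
// so the caller can detect the failure and fall back to another locale.
bool try_set_locale(const char* name)
{
    return setlocale(LC_ALL, name) != 0;
}
```

For the Russian-text-on-a-Japanese-computer scenario, you would first try the explicit Russian locale and, when `try_set_locale` reports failure, fall back to "" (the system default) or handle the error.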