Xah Lee, 2010-06-20, 2010-08-25
Any file has to go through encoding and decoding in order to be properly displayed or written. Suppose your language is Chinese (or Japanese, Russian, Arabic, or even English). Your computer needs a way to translate the characters of your language's writing system into a sequence of 1s and 0s. This transformation is called character encoding.
There are many encoding systems. The most popular are ASCII, UTF-8 (used in Linux), UTF-16 (used by the file systems of Windows and OS X), the ISO/IEC 8859 series (used for most European languages), and GB 18030 (used in China; covers all Unicode chars).
An encoding system also defines a character set, implicitly or explicitly. A character set is a fixed collection of symbols. An encoding system has to define one, because it must specify which characters (symbols) it is designed to handle.
For example, ASCII is designed for the Latin alphabet as used in English, and its set also includes digits and punctuation symbols. ASCII cannot be used for the Arabic alphabet, the Cyrillic (Russian) alphabet, Chinese characters, etc. Nor can ASCII be used for European languages that have characters such as è é å ø ü.
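The limits of a character set are easy to check by trying to encode a character and seeing whether it fails. Here is a minimal sketch in Python. The helper name `can_encode` is made up for this illustration and is not part of any library; `"latin-1"` is Python's name for ISO/IEC 8859-1.

```python
# Hypothetical helper (not from any library): can this text be written
# in the given encoding at all?
def can_encode(text: str, encoding: str) -> bool:
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(can_encode("hello", "ascii"))   # True: plain English letters are in ASCII
print(can_encode("é", "ascii"))       # False: é is outside ASCII's 128 characters
print(can_encode("é", "latin-1"))     # True: ISO/IEC 8859-1 covers Western European letters
print(can_encode("中", "latin-1"))    # False: no Chinese characters in ISO/IEC 8859-1
print(can_encode("中", "utf-8"))      # True: UTF-8 can encode any Unicode character
```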
Unicode's character set is designed to include the written symbols of all human languages. That includes the tens of thousands of Chinese characters, as well as the scripts of dead or rarely used languages, such as Egyptian hieroglyphs.
Unicode defines a character set first, then gives each character a unique ID. This ID is just an integer, and is called the char's “code point”. For example, the code point for the Greek alpha “α” is 945. In hexadecimal it's “3b1”. In the standard Unicode notation it is written as “U+03B1”.
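You can look up a code point yourself. The following small Python sketch uses the built-in `ord` and `chr`. The function `unicode_notation` is a name invented here to format the result in the U+ style.

```python
def unicode_notation(char: str) -> str:
    """Format one character's code point as U+ followed by at least 4 hex digits."""
    return f"U+{ord(char):04X}"

alpha = "α"
print(ord(alpha))               # 945, the code point as a decimal integer
print(hex(ord(alpha)))          # 0x3b1, the same number in hexadecimal
print(unicode_notation(alpha))  # U+03B1
print(chr(945))                 # α, going from code point back to character
```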
Then, Unicode defines several encoding systems (ways to map a given code point into a binary number). UTF-8 and UTF-16 are the two most popular Unicode encoding systems. Each is suitable for different purposes. UTF-8 is suitable for texts that are mostly Latin alphabet letters, digits and punctuation symbols, for example, European languages. Most files on Linux are in UTF-8 by default. UTF-8 is backwards compatible with ASCII: if your text contains only characters in ASCII, then encoding it with UTF-8 results in the same byte sequence as encoding it with ASCII.
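The ASCII compatibility of UTF-8, and its variable length for other characters, can be seen in a few lines of Python. This is only an illustration; the sample strings are arbitrary.

```python
ascii_text = "Hello, world!"

# For ASCII-only text, the UTF-8 bytes and the ASCII bytes are identical.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)  # True

# Characters outside ASCII take more than one byte in UTF-8.
for char in "Aéα中":
    encoded = char.encode("utf-8")
    print(char, len(encoded), encoded.hex())
# A takes 1 byte, é and α take 2 bytes each, 中 takes 3 bytes
```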
UTF-16 is another encoding system designed for Unicode. With UTF-16, every char is encoded into at least 2 bytes. The commonly used characters in Unicode are exactly 2 bytes, and the rest take 4 bytes. For Asian languages such as Chinese, Japanese and Korean, whose characters mostly take 3 bytes in UTF-8, UTF-16 is more efficient and gives a smaller file size.
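To compare the two encodings on your own text, you can count the bytes each one produces. A rough sketch in Python follows. `encoded_size` is a made-up helper name, and `"utf-16-le"` is used so that no byte order mark is added to the count.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))

samples = {
    "English": "hello",
    "Chinese": "中文",
    "hieroglyph": "\U00013000",  # an Egyptian hieroglyph, a less commonly used char
}

for label, text in samples.items():
    print(label, encoded_size(text, "utf-8"), encoded_size(text, "utf-16-le"))
# English:    5 bytes in UTF-8, 10 in UTF-16
# Chinese:    6 bytes in UTF-8,  4 in UTF-16
# hieroglyph: 4 bytes in UTF-8,  4 in UTF-16
```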
There's also UTF-32, which always uses 4 bytes per character. It creates larger files, but is simpler to parse. Currently, UTF-32 is not used much.
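One way to see why a fixed width is simpler to parse: the n-th character always starts at byte 4×n, so no scanning is needed. A small Python sketch, assuming little-endian UTF-32 without a byte order mark. `nth_code_point` is a hypothetical helper written for this example.

```python
def nth_code_point(data: bytes, n: int) -> int:
    """Read the n-th code point of UTF-32 (little-endian) data by offset alone."""
    return int.from_bytes(data[4 * n : 4 * n + 4], "little")

text = "aα中\U00013000"            # four characters of very different kinds
data = text.encode("utf-32-le")   # "-le" variant: no byte order mark prepended

print(len(data))                     # 16, that is 4 characters at 4 bytes each
print(chr(nth_code_point(data, 2)))  # 中, found without looking at earlier bytes
```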
When an editor opens a file, it needs to know the encoding system used, in order to decode the byte stream and map it to fonts to display the original characters properly. In general, the info about the encoding system used for a file is not bundled with the file.
Before the internet, this was not much of a problem, because most of the English-speaking world used ASCII, and non-English regions used encoding schemes particular to their regions.
With the internet, files in different languages started to be exchanged a lot. When opening a file, Windows applications may try to guess the encoding system used, by some heuristics. On Linux and in Emacs, this is usually not done. When a file is opened in an app that assumes a wrong encoding, the result is typically gibberish. Usually, you can explicitly tell an app to use a particular encoding to open the file. (For example, web browsers usually have a menu for this. In Firefox, it's under View, Character Encoding.) Similarly, when writing to a file, there's usually an option for you to specify what encoding to use. For example, in Microsoft Notepad, when you save a file, there's an “Encoding” menu at the bottom of the Save dialog.
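The same thing can be reproduced in a program. The Python sketch below writes a file as UTF-8, then reads it back once with a wrong encoding and once with the right one. The file name `sample.txt` and the sample text are arbitrary choices for the illustration.

```python
import os
import tempfile

text = "café α"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Write the file, explicitly choosing UTF-8 as the encoding.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read it back while wrongly assuming ISO/IEC 8859-1 (latin-1): gibberish.
with open(path, encoding="latin-1") as f:
    wrong = f.read()

# Read it back with the encoding it was actually written in.
with open(path, encoding="utf-8") as f:
    right = f.read()

print(wrong)  # cafÃ© Î±
print(right)  # café α
```

Note that the wrong decoding raised no error. Every byte is a valid character in ISO/IEC 8859-1, so the app has no way to notice the mistake by itself.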
When a computer has decoded a file, it then needs to display the characters as glyphs on the screen. For our purposes, this set of glyphs is a font. So, your computer now needs to map the Unicode code points to a font.
For Asian languages, such as Chinese, Japanese, Korean, or languages using the Arabic alphabet as their writing system (Arabic, Persian), you also need the proper font to display the file correctly.
See: Best Fonts for Unicode.
For languages that are not based on an alphabet, such as Chinese, you need an input method to type them.