Xah Lee, 2005-01, 2011-01
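If your source code contains non-ASCII (Unicode) characters, you should declare the file's encoding in the first line. In Python 2 the declaration is required whenever the file contains non-ASCII characters (without it, Python raises a SyntaxError). In Python 3 the default source encoding is UTF-8, so the declaration is optional but still helpful.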
#-*- coding: utf-8 -*-
print "look chinese chars: 请你不要哭"
The #-*- coding: utf-8 -*- declaration in the first line is a convention
adopted from the text editor Emacs. It tells any program reading the file that
the file is encoded using a particular character set. For example, its
purpose is similar to HTML's
<META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=utf-8">.
(See: Character Sets and Encoding in HTML.)
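To see how a program can read this declaration, here is a minimal sketch, assuming Python 3 and its standard tokenize module. The helper name declared_encoding and the two sample byte strings are made up for illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding name Python would use to decode this source.

    Hypothetical helper for illustration; assumes Python 3.
    """
    # detect_encoding reads at most two lines, looking for a BOM
    # or a "coding:" comment, and falls back to utf-8.
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


with_declaration = b"# -*- coding: latin-1 -*-\nprint('hi')\n"
without_declaration = b"print('hi')\n"

print(declared_encoding(with_declaration))     # iso-8859-1
print(declared_encoding(without_declaration))  # utf-8
```

Note that the name comes back normalized: latin-1 is reported as iso-8859-1.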
If you are going to do any processing with Unicode strings, such as extracting substrings or matching string patterns, then you need to put u in front of the string literal. For example,
#-*- coding: utf-8 -*-
s = u"look Chinese chars: 请你不要哭"
Note, however, that in Python 2 identifiers cannot use Unicode chars. For example, variable names cannot contain Unicode chars. (Python 3 lifts this restriction.)
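Why does the u prefix matter for substring extraction? A Unicode string is indexed by character, while a byte string is indexed by byte. Here is a small sketch, assuming Python 3, where a plain "..." literal is already Unicode and .encode() produces the byte form. The variable names are made up for illustration.

```python
# Assumes Python 3: "..." is a Unicode string (like u"..." in Python 2).
text = "请你不要哭"
raw = text.encode("utf-8")   # the same text as UTF-8 bytes

print(len(text))   # 5  (characters)
print(len(raw))    # 15 (bytes; each of these characters takes 3 bytes)

# Slicing the Unicode string gives whole characters.
first_two_chars = text[:2]

# Slicing the bytes can cut a character's 3-byte sequence short,
# so the result is no longer valid UTF-8.
first_two_bytes = raw[:2]
try:
    first_two_bytes.decode("utf-8")
    sliced_bytes_valid = True
except UnicodeDecodeError:
    sliced_bytes_valid = False

print(sliced_bytes_valid)  # False
```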
Sometimes when you print Unicode strings, you may get an error like this:
# UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 16: ordinal not in range(128).
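The solution is to explicitly decode or encode the string with a particular encoding. When you read a file line by line, each line is just a sequence of bytes to Python, so you decode it to get a Unicode string, and you encode a Unicode string to turn it back into bytes before output. For example:
myString = myString.decode("utf-8")
or
myString = myString.encode("utf-8")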
#-*- coding: utf-8 -*-
# python
alpha = u'α'
# Bad
print u'Unicode alpha: ', alpha
# Good
print u'Unicode alpha: ', (alpha).encode('utf-8')
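The same error can be reproduced without depending on the terminal by encoding by hand. This is a sketch assuming Python 3; it shows the ASCII codec failing on α, and UTF-8 succeeding and round-tripping. The variable names are made up for illustration.

```python
# Assumes Python 3.
alpha = "\u03b1"   # Greek small letter alpha

# This is roughly what an ASCII-only output stream attempts.
try:
    alpha.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as err:
    ascii_ok = False
    print("encode failed:", err.reason)   # ordinal not in range(128)

# Encoding to UTF-8 turns the text into bytes ...
utf8_bytes = alpha.encode("utf-8")
print(utf8_bytes)                         # b'\xce\xb1'

# ... and decoding turns the bytes back into text.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == alpha)                # True
```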
When using regex, you need to add the Unicode flag “re.U” when calling regex functions. See: Python Regex Flags.
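Here is a sketch of what the flag changes, assuming Python 3. There, re.U is already the default for string patterns, so the contrast is shown against re.A (ASCII-only matching). The sample text is made up for illustration.

```python
import re

# Assumes Python 3. Three Greek letters, a space, three ASCII letters.
text = "\u03b1\u03b2\u03b3 abc"

# With re.U, \w matches any Unicode word character.
unicode_words = re.findall(r"\w+", text, re.U)

# With re.A, \w matches only [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.A)

print(len(unicode_words))  # 2
print(ascii_words)         # ['abc']
```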
use bytes; # Larry can take Unicode and shove it up his ass sideways.
# Perl 5.8.0 causes us to start getting incomprehensible
# errors about UTF-8 all over the place without this.
—from the source code of WebCollage (1998),
by Jamie W Zawinski (~b1971)
In Perl, dealing with Unicode is quite different from Python. Perl's Unicode support became somewhat usable with Perl 5.8. Perl provides the -C command-line option, which changes Perl's input and output behavior to work with UTF-8. It is unnecessarily complex, because it has been hacked together over the years since Perl 5.6.
Since Perl 5.8 (2002-07), variable names and function names can contain Unicode chars. You need to say use utf8; in your code. Example:
# perl
use utf8; # necessary if you want to use unicode in function or var names

# processing unicode string
$s = 'I ★ you';
$s =~ s/★/♥/;
print $s;

# variable with unicode char
$愛=4;
print $愛;

# function with unicode char
sub f愛 { return 2;}
print f愛();
Because the above code outputs UTF-8 encoded Unicode strings, you need to run it with the -C option. For example: perl -C7 myCode.pl.