
Unicode in Perl & Python


Xah Lee, 2005-01, 2011-01

Python

Source Code Interpretation

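If your source code contains non-ASCII characters, you should declare the file's encoding in the first or second line. Python 2 assumes ASCII by default and raises a SyntaxError when it finds non-ASCII bytes in a file with no declaration. Python 3 defaults to UTF-8, so there the declaration is optional but still helpful.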

#-*- coding: utf-8 -*-
print "look chinese chars: 请你不要哭"

The #-*- coding: utf-8 -*- declaration in the first line is a convention adopted from the text editor Emacs. It tells any program reading the file that the file is encoded using a particular character set. For example, its purpose is similar to HTML's <META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=utf-8">. (See: Character Sets and Encoding in HTML.)
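To make the convention concrete, here is a minimal sketch of how a tool might look for the declaration. The regular expression is the one given in PEP 263. The helper name declared_encoding and the sample inputs are made up for illustration, and the check is simplified compared to what the Python interpreter really does. It is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Pattern from PEP 263. Only the first two lines of a file are examined.
CODING_RE = re.compile(br"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def declared_encoding(source_bytes):
    """Return the declared encoding name, or None if there is none.

    Hypothetical helper, a simplified version of the interpreter's rule.
    """
    for line in source_bytes.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None

# Sample inputs, invented for this sketch.
with_decl = b"#-*- coding: utf-8 -*-\nprint('hi')\n"
shebang_first = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\n"
too_late = b"import os\nimport sys\n# -*- coding: utf-8 -*-\n"

print(declared_encoding(with_decl))
print(declared_encoding(shebang_first))
print(declared_encoding(too_late))
```

A declaration on the third line or later is ignored, which is why the last sample yields None.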

Text Processing with Unicode Strings

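If you are going to do any processing with Unicode strings in Python 2, such as substring extraction or pattern matching, then you need to put u in front of the string literal. Without it, the literal is a byte string, and indexes count bytes instead of characters. For example,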

#-*- coding: utf-8 -*-
s = u"look Chinese chars: 请你不要哭"

Note, however, that in Python 2 identifiers cannot contain non-ASCII characters. For example, a variable name cannot contain Chinese characters. (Python 3 lifts this restriction.)
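The practical difference between a Unicode string and its byte form can be seen in a short sketch. The variable names are chosen only for illustration, and the code is meant to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
text = u"请你不要哭"            # Unicode string: a sequence of characters
raw = text.encode("utf-8")     # the same text as UTF-8 bytes

char_count = len(text)         # counts characters
byte_count = len(raw)          # counts bytes (3 per character here)

first_two = text[:2]           # slicing by character is safe
broken = raw[:2]               # cuts through the middle of the first character

try:
    broken.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

print(char_count, byte_count, decodes_cleanly)
```

Slicing the byte form at an arbitrary index can leave an incomplete character, which is one reason to do text processing on the Unicode string.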

Sometimes when you print Unicode strings, you may get an error like this:

# UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 16: ordinal not in range(128).

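The solution is to convert explicitly: decode incoming bytes into a Unicode string, or encode a Unicode string into bytes before output. This is needed because, when reading a file as lines, to Python 2 a line is just a sequence of bytes. For example:

    myString = myString.decode("utf-8")  # bytes to Unicode string

or

    myString = myString.encode("utf-8")  # Unicode string to bytes

#-*- coding: utf-8 -*-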
# python

alpha=u'α'

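# Bad: raises UnicodeEncodeError when the output encoding is ASCII
print u'Unicode alpha: ', alpha

# Good: encode to UTF-8 bytes explicitly before printing
print u'Unicode alpha: ', alpha.encode('utf-8')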
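The decode-on-input, encode-on-output pattern can be sketched end to end. This example uses an in-memory io.BytesIO object in place of a real file, which is an assumption made to keep it self-contained, and it is meant to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io

# What a byte-oriented file read would hand back for one line.
line_bytes = u"Unicode alpha: α\n".encode("utf-8")

# Decode once on input, then work with a Unicode string.
line = line_bytes.decode("utf-8")

# Encoding to ASCII fails, which is the error shown above.
try:
    line.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encode once on output, with an explicit encoding.
out = io.BytesIO()               # stand-in for a file opened in binary mode
out.write(line.encode("utf-8"))

print(ascii_ok, out.getvalue() == line_bytes)
```

Keeping text as Unicode strings inside the program, and converting only at the edges, avoids most of these errors.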

Unicode in Regex

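When using regex in Python 2, you need to add the unicode flag “re.U” when calling regex functions, so that character classes such as \w and \d match according to Unicode character properties instead of ASCII only. See: Python Regex Flags.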
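Here is a small sketch of what the flag changes, using a made-up sample string. It compares a Unicode pattern with re.U against a byte pattern with no flag, and is meant to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

text = u"α β1 γ_2"

# Unicode pattern with re.U: \w matches Greek letters too.
unicode_words = re.findall(u"\\w+", text, re.U)

# Byte pattern without the flag: \w matches only ASCII [a-zA-Z0-9_].
byte_words = re.findall(br"\w+", text.encode("utf-8"))

print(len(unicode_words), len(byte_words))
```

On Python 3, re.U is already the default for str patterns, so the flag is redundant there but harmless.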

Perl

use bytes; # Larry can take Unicode and shove it up his ass sideways. 
            # Perl 5.8.0 causes us to start getting incomprehensible 
            # errors about UTF-8 all over the place without this.

               —from the source code of WebCollage (1998),
                by Jamie W Zawinski (~b1971) 

In Perl, dealing with Unicode is quite different from Python. Perl's Unicode support starts to be somewhat usable with Perl 5.8. Perl provides the -C command-line option, which changes Perl's input and output behavior to work with UTF-8. It is unnecessarily complex, because it has been hacked together over the years since Perl 5.6.

Perl 5.8 (2002-07) allows Unicode characters in variable names and function names. To use them, you need to say use utf8; in your code. Example:

# perl

use utf8; # necessary if you want to use unicode in function or var names

# processing unicode string
$s = 'I ★ you'; $s =~ s/★/♥/; # replace ★ with ♥ (the replacement char is illustrative)
print $s;

# variable with unicode char
$愛=4;  print $愛;

# function with unicode char
sub f愛 { return 2;}  print f愛();

Because the code above outputs UTF-8 strings, you need to run it with the -C option, for example: perl -C7 myCode.pl. (The value 7 sets STDIN, STDOUT, and STDERR to UTF-8.)
