Xah Lee, 2006-05
On 2008-08-15, someone wrote (paraphrased):
Sometimes i save documents to disk from the web.
I wish to embed the article title and url in the saved filename.
e.g. if article titled “News for the next Century” at http://www.example.com/news/something.html
i want to save it with a filename such as “News for the next Century http://www.example.com/news/something.html”. But the special chars there cause problems.
is there some general char transformation scheme, so that special chars in url and title of article are replaced by other chars and can be used as a filename?
Hmmm. Maybe “---” for “/”? What about “:”? And what about “~”? Plus other chars I've not thought of?
P.S.: Oh, I forgot. tar shouldn't barf on the name.
What you want to do is pretty hopeless. Chars in urls are confusing
enough, with percent-encoding (such as %20 for space and %7E for ~), and when a url is used in html as a link,
there's another layer of encoding for CDATA (such as &amp; for &). Depending on the browser, or whatever tool you are
using, the url you get may or may not have been processed to remove
some of these encodings. The encoding spec itself is not crystal
clear, and in practice lots of urls are actually invalid anyway.
(See: URL Percent Encoding and Unicode, and URL Percent Encoding and Ampersand Char.)
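To make the two layers concrete, here is a minimal Python sketch using only the standard library. The url is made up for illustration, and the set of chars kept unencoded (the `safe` argument) is my assumption, not something any spec hands you. It shows the same url in three spellings, and that getting the original back means undoing the layers in reverse order.

```python
from html import escape, unescape
from urllib.parse import quote, unquote

# Hypothetical url, for illustration only.
raw_url = "http://example.com/~user/my file.html?a=1&b=2"

# Layer 1: percent-encoding. Keep the url delimiters, encode the rest.
# The choice of safe chars here is an assumption for this example.
percent_encoded = quote(raw_url, safe=":/?&=~")

# Layer 2: html escaping, as needed inside an href attribute.
html_ready = escape(percent_encoded)

# Going back means undoing each layer in the reverse order.
decoded = unquote(unescape(html_ready))

print(percent_encoded)     # space has become %20
print(html_ready)          # & has become &amp;
print(decoded == raw_url)  # True
```

Which of these three spellings you end up with when you copy a url depends on where you copied it from, which is the problem.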
Chars in file names are also confusing. Different file systems
allow different char sets with different special char meanings, and
each generation of file system changes slightly. (e.g. Windows has C:\\ and
\, and if you are using cygwin you also get /. Mac has :
in OS 9 and / in OS X, with complex char transform magic
underneath. Unix is the worst: it pretty much just allows
alphanumerics and underscore “_”, and not even space. If you have anything like
= ( ) , ; ' " " # $ & - ~
etc, you can expect most shell tools to erase your disk.)
(See: What Characters Are Not Allowed in File Names?)
The best thing to do is just to create a file and name it 〔readme.txt〕, then in that file put in the url, date, or keywords and annotation. That's what i do.
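A minimal sketch of that habit in Python, in case you want to automate it. The function name `write_readme` and the field labels are hypothetical, made up for this example; the only fixed idea is that the url and notes go inside a plainly named file rather than into the file name.

```python
from datetime import date
from pathlib import Path


def write_readme(folder, url, title, keywords=(), note=""):
    """Write a readme.txt in `folder` recording where the saved page came from."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    lines = [
        f"title: {title}",
        f"url: {url}",
        f"saved: {date.today().isoformat()}",
        f"keywords: {', '.join(keywords)}",
    ]
    if note:
        lines.append(f"note: {note}")
    readme = folder / "readme.txt"
    # The url is stored as file content, so no char in it needs transforming.
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

Calling it with a folder name made of plain alphanumerics, the article title, and the url gives you a file that tar and every other tool will handle, with the url preserved exactly as you copied it.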
Nikolaj Schumacher wrote:
Actually unix systems allow pretty much every character except / and the null character.
To say that unix allows a much wider range of chars in file names is like saying mud is the best medium for sculpture.
Unix file names, for much of unix's history up to perhaps the mid 2000s, effectively allow just alphanumerics plus underscore “_” (hyphen “-” and space can occasionally be seen). By contrast, Mac file names often contain punctuation such as “ , $ # ! * ( )” and space, and also allow non-ascii chars such as:
Mac key maps from Keyboard Viewer (with Dvorak keyboard), showing you what characters you can type using the Option key. Those colored orange are prefix keys, allowing you to type combination characters such as “áéíóú”.
Some of these chars were widely used throughout the 1990s. For example, “ƒ” was widely used as the last char in the file name of a folder.
Ascii punctuation chars and non-ascii chars such as the above have also been allowed in Windows file names since about Microsoft Windows NT in the late 1990s or earlier. Tools on Mac OS (such as AppleScript) and Windows support, and expect, these chars in file names.
Sure, you can use many non-alphanumeric chars besides hyphen and underscore in unix, but the system is simply not designed for it. The majority of unix tools, including file name listing, will choke and break if your file name contains these chars. The choking doesn't give you a nice error message; things silently break, often resulting in unexpected and unpredictable behavior. In short, it's just not designed for it.
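Here is a small Python sketch of how that silent breakage happens. It does not run any real unix tool; the two helper functions are hypothetical stand-ins that mimic the usual conventions (one file name per line, or words split on whitespace). It assumes a unix-like file system, since a newline in a file name is not allowed on Windows.

```python
import os
import tempfile


def names_as_seen_by_line_tool(directory):
    """Mimic a tool that reads one file name per line of a listing."""
    listing = "\n".join(sorted(os.listdir(directory)))
    return listing.split("\n")


def names_as_seen_by_word_split(directory):
    """Mimic unquoted shell word splitting on whitespace."""
    listing = " ".join(sorted(os.listdir(directory)))
    return listing.split()


with tempfile.TemporaryDirectory() as d:
    # All three names are legal on a unix-like file system.
    for name in ["plain.txt", "two words.txt", "line\nbreak.txt"]:
        open(os.path.join(d, name), "w").close()
    actual = sorted(os.listdir(d))
    by_line = names_as_seen_by_line_tool(d)
    by_word = names_as_seen_by_word_split(d)

# Three real files, but the line-based view sees 4 and the word-based view 5.
print(len(actual), len(by_line), len(by_word))  # 3 4 5
```

No error is raised anywhere. The later steps of a script simply go on working with file names that do not exist.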
Issues like these often perpetuate the myth that unix is “powerful”, but in fact it's just raw and no-design.