|
A short introduction and a reference
|
|
The presentation page about Unicode in the web site of the Unicode Consortium starts with:
The Unicode Standard is the universal character encoding standard used for representation of text for computer processing.
For more information of any kind, visit the site of the Unicode Consortium.
A Unicode string can contain virtually any existing character. If you are using Mac OS X, your machine is Unicode-compliant: you can view, enter, store, and use Unicode text. Here we focus briefly on the scripting issues related to Unicode.
|
How text is stored in files
|
|
Files can store text in a variety of formats. We focus on four of them: ISO 8859-1 (the "PC" format), Mac-encoding, UTF-8 and UTF-16.
- ISO 8859-1 can be translated to and from Mac-encoding in Smile, both by menu and by script, so we do not address this format here.
- Mac-encoded files store one byte per character, in the range 0..255. The first 128 values are rendered according to the ASCII standard: for instance, ASCII character 37 is the percent sign %. The upper 128 values are rendered using a Macintosh encoding, the one that goes with the first language listed in your International preference pane. For most US and Western European users this is MacRoman, in which, for instance, ASCII character 150 is ñ. We refer to this encoding as the Mac-encoding.
- UTF-8 is a way of storing Unicode. UTF-8 files store each character as one or more bytes. The 128 characters of the (strict) ASCII set are encoded unchanged in UTF-8, as one byte each, so a file which contains only 7-bit ASCII characters is at the same time a UTF-8 file, a Mac-encoded file and an ISO 8859-1 file.
Note that some software adds an optional 3-byte header (hex EF BB BF) to UTF-8 files. Smile preserves that header.
- UTF-16 is another way of storing Unicode. UTF-16 files store each character as two or four bytes. Most often, the file begins with the two bytes 254 then 255 (hex FE FF), which display respectively as ogonek (˛) and caron (ˇ) when read as Mac-encoded text. A Unicode-aware editor can thus automatically tell UTF-16 files from UTF-8 files.
The FE FF header is called the BOM (Byte Order Mark). Little-endian systems write FF FE instead. Unicode-aware software generally accepts little-endian as well as big-endian files transparently.
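As an illustration of the four formats above, here is a minimal sketch in Python (not AppleScript, chosen only because it makes the stored bytes easy to inspect). The helper names `encode_all` and `sniff_bom` are hypothetical, and `mac_roman` is assumed to stand in for the Mac-encoding of a Western European system.

```python
import codecs

def encode_all(text):
    """Return the bytes each of the four formats would store for text."""
    return {
        "ISO-8859-1": text.encode("iso-8859-1"),
        "MacRoman": text.encode("mac_roman"),
        "UTF-8": text.encode("utf-8"),
        # Big-endian UTF-16 preceded by the FE FF byte order mark
        "UTF-16": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    }

def sniff_bom(data):
    """Name the Unicode format announced by a leading BOM, or None."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "UTF-16 big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "UTF-16 little-endian"
    return None

stored = encode_all("ñ%")
for name, data in stored.items():
    # Print each format's bytes in hex, with any BOM it announces
    print(name, " ".join(f"{b:02x}" for b in data), sniff_bom(data))
```

The same two characters take two bytes in the one-byte encodings, three in UTF-8, and six in UTF-16 with its BOM. Only the percent sign, a 7-bit ASCII character, is stored identically in the first three formats.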
AppleScript reads and writes the three latter formats. In the example below we write two characters, #%, to a (temporary) file, then read them back twice: first with the default (Mac) encoding, then as Unicode text. We use the standard form of read and write.
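on TempFile()
	-- path of a scratch file in the user's temporary items folder
	(path to temporary items from user domain as text) & "utf_smile_file"
end TempFile
set f to TempFile()
set n to open for access file f with write permission
set eof n to 0 -- discard anything left over from a previous run
write "#%" to n -- stores two bytes: 23 25 (hex)
close access n
-- read with the default (Mac) encoding: one character per byte
set x to read (f as alias)
display dialog x
-- will display: #%
-- read the same two bytes as one UTF-16 character
set x to read (f as alias) as Unicode text
display dialog x
-- will display: ⌥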
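To see why the second read shows a single ⌥, the sketch below (in Python, for byte-level inspection only) decodes the same two bytes both ways. It assumes, consistently with the result shown above, that bytes without a BOM are interpreted as big-endian UTF-16.

```python
import unicodedata

# The two bytes that the write command stored for "#%": 23 25 (hex)
data = "#%".encode("mac_roman")

# One byte per character: the original two characters
as_mac = data.decode("mac_roman")

# Two bytes per character, big-endian: the single code point U+2325
as_utf16 = data.decode("utf-16-be")

print(repr(as_mac))
print(repr(as_utf16), hex(ord(as_utf16)), unicodedata.name(as_utf16))
```

The bytes 23 and 25 pair up into the code point U+2325, which Unicode names OPTION KEY.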
- Prefer Smile's readtext and writetext commands to AppleScript's read and write. Both have an encoding parameter with which you can specify the encoding: "MACINTOSH", "UTF-8", "UTF-16", "ISO-8859-1" or another IANA name.
- If you use the read and write commands, specify the encoding with their as parameter: as «class utf8» for UTF-8, as Unicode text for UTF-16 (note that write does not write the BOM).
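The effect of naming the encoding on both the write and the read side can be sketched as follows. This is Python rather than AppleScript, the helpers `write_text`, `read_text` and `raw_bytes` are hypothetical, and the codec `utf-16-be` is used as an analogue of a UTF-16 write that emits no BOM.

```python
import os
import tempfile

def write_text(path, text, encoding):
    """Write text to path using an explicitly named encoding."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)

def read_text(path, encoding):
    """Read the whole file at path, decoding it with the named encoding."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()

def raw_bytes(path):
    """Return the file's bytes exactly as stored."""
    with open(path, "rb") as f:
        return f.read()

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "utf_smile_file")  # scratch file for the demo

    write_text(path, "été", "utf-8")
    utf8_raw = raw_bytes(path)
    utf8_back = read_text(path, "utf-8")        # same encoding: round trip
    wrong_back = read_text(path, "mac_roman")   # wrong encoding: garbled

    write_text(path, "été", "utf-16-be")        # big-endian, no BOM written
    utf16_raw = raw_bytes(path)
    utf16_back = read_text(path, "utf-16-be")

print(utf8_raw.hex(), utf8_back, wrong_back)
print(utf16_raw.hex(), utf16_back)
```

Reading with the encoding that was used for writing restores the text. Reading UTF-8 bytes as Mac-encoded text turns each accented character into two unrelated ones.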
|
How Smile opens text files
|
|
Since there is no tag which would specify whether a given file is ASCII or UTF-8, Smile bases its default choice on the file's type and the file's resource, two MacOS-only features.
- If the file's type is ut16, Smile opens it by default as Mac-encoded text, in a text window.
- If the file's extension is .applescript, Smile opens it by default as Mac-encoded text, in a text window.
- Otherwise, by default, Smile attempts to open the file as Unicode, in a Unicode window. Smile automatically selects UTF-8 or UTF-16.
- To override the default behavior, select the file in Finder, then in Smile use File ▸ Open Finder selection with the Shift key (⇧) or the Option key (⌥) pressed.
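One plausible way for an editor to choose between UTF-8, UTF-16 and a one-byte encoding is sketched below in Python. This is not Smile's actual logic, which the text does not describe. The names `guess_encoding` and `decode_text` are hypothetical, and the MacRoman fallback is an assumption.

```python
import codecs

def guess_encoding(data, fallback="mac_roman"):
    """Guess a Python codec name for data from its BOM or its content."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # UTF-8 with the EF BB BF header
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"      # this codec reads the BOM to pick the byte order
    try:
        data.decode("utf-8")  # pure ASCII also passes this check
        return "utf-8"
    except UnicodeDecodeError:
        return fallback       # not valid UTF-8: treat as one byte per character

def decode_text(data, fallback="mac_roman"):
    """Decode data with the guessed encoding, dropping any BOM."""
    return data.decode(guess_encoding(data, fallback))

samples = [
    b"\xef\xbb\xbf\xc3\xa9",   # UTF-8 with header
    b"\xfe\xff\x03\xa9",       # UTF-16 big-endian
    b"\xff\xfe\xa9\x03",       # UTF-16 little-endian
    bytes([150]),              # a lone Mac-encoded byte
]
for data in samples:
    print(guess_encoding(data), repr(decode_text(data)))
```

A heuristic of this kind can misjudge short Mac-encoded files whose bytes happen to form valid UTF-8, which is one reason an editor needs a manual override.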
|
unicode number s and unicode character n
|
|
Smile provides two commands that do for Unicode what the ASCII number and ASCII character commands do for ASCII characters: unicode number and unicode character.
set the_Omega to unicode character 937
-- "Ω"
unicode number of the_Omega
-- {937}
ASCII number of the_Omega
-- 189
unicode number of "¬"
-- {172}
ASCII number of "¬"
-- 194
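The two numbers returned for the same character are its Unicode code point and its byte value in the Mac-encoding. A minimal Python sketch, assuming MacRoman as the Mac-encoding and using a hypothetical helper `mac_number`, reproduces the values above.

```python
def mac_number(ch):
    """Byte value of a single character in MacRoman (assumed Mac-encoding)."""
    return ch.encode("mac_roman")[0]

for ch in ("Ω", "¬", "%"):
    # ord() gives the Unicode code point, independent of any file encoding
    print(ch, ord(ch), mac_number(ch))

the_omega = chr(937)  # from a code point back to a character
```

For 7-bit ASCII characters such as % the two numbers coincide. They differ for every character in the upper half of the Mac-encoding.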
Note two differences, however.
- unicode number works on strings of any length. It returns the list of the Unicode numbers of the characters of the string.
unicode number "hello"
-- {104, 101, 108, 108, 111}
Similarly, unicode character accepts lists of integers:
unicode character (reverse of {104, 101, 108, 108, 111})
-- "olleh"
- You can specify encoding «class utf8» to have unicode number return the list of UTF-8 bytes rather than the list of Unicode numbers.
unicode number "été"
-- {233, 116, 233}
unicode number "été" encoding «class utf8»
-- {195, 169, 116, 195, 169}
Similarly, specify unicode character encoding «class utf8» to convert a list of UTF-8 bytes into a Unicode text.
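The difference between the list of Unicode numbers and the list of UTF-8 bytes, in both directions, can be sketched in Python as follows. The four helper names are hypothetical and only mirror the behavior described above; they are not Smile's commands.

```python
def unicode_numbers(text):
    """List of Unicode code points, one per character."""
    return [ord(ch) for ch in text]

def utf8_bytes(text):
    """List of the bytes that UTF-8 uses to store text."""
    return list(text.encode("utf-8"))

def from_unicode_numbers(numbers):
    """Build a string from a list of Unicode code points."""
    return "".join(chr(n) for n in numbers)

def from_utf8_bytes(byte_values):
    """Build a string from a list of UTF-8 byte values."""
    return bytes(byte_values).decode("utf-8")

print(unicode_numbers("été"))   # one number per character
print(utf8_bytes("été"))        # two bytes for each é, one for t
print(from_unicode_numbers(list(reversed(unicode_numbers("hello")))))
print(from_utf8_bytes(utf8_bytes("été")))
```

Each é is a single Unicode number (233) but two UTF-8 bytes (195, 169), so the two lists differ in length as soon as the text leaves the ASCII range.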
|