Changeset 613

Timestamp: 02/22/08 10:41:24 (9 months ago)
Author: Janice Caron
Message:

Some functions renamed.
Documentation improved.
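
The documentation touched by this changeset turns on the difference between a codepoint and a codeunit, using the Euro sign (U+20AC) as its example. Below is a minimal sketch of that distinction. It assumes a current D compiler and uses only the built-in string types, not the candidate std.encoding module being modified here. The names euro, euroUtf8, euroUtf16, euroUtf32 and codepointCount are hypothetical and chosen only for illustration.

```d
// Sketch only: built-in D strings, not the candidate std.encoding API.

/// The Euro sign. Its codepoint is U+20AC no matter how it is encoded.
enum dchar euro = '\u20AC';

/// The same character stored as UTF-8, UTF-16 and UTF-32 codeunits.
immutable string  euroUtf8  = "\u20AC";
immutable wstring euroUtf16 = "\u20AC"w;
immutable dstring euroUtf32 = "\u20AC"d;

/// Hypothetical helper: counts codepoints in a UTF-8 string.
/// A foreach over dchar decodes the UTF-8 codeunits as it iterates.
size_t codepointCount(string s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

unittest
{
    // .length counts codeunits: three in UTF-8, one in UTF-16 and UTF-32.
    assert(euroUtf8.length == 3);
    assert(euroUtf16.length == 1);
    assert(euroUtf32.length == 1);

    // The UTF-8 codeunits of U+20AC.
    assert(euroUtf8[0] == 0xE2 && euroUtf8[1] == 0x82 && euroUtf8[2] == 0xAC);

    // However it is stored, it is one codepoint, and always the same one.
    assert(codepointCount(euroUtf8) == 1);
    assert(euroUtf32[0] == euro);

    // A codepoint above U+FFFF needs two codeunits even in UTF-16.
    wstring clef = "\U0001D11E"w;
    assert(clef.length == 2);
}
```

Compiling with unit tests enabled (for example, dmd -unittest -main) runs the checks. Only UTF-32 gives one codeunit per codepoint for every character, which matches the tutorial text in the diff.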

Files:

  • candidate/phobos/std/encoding.d

    r610 r613  
     1// Written in the D programming language. 
     2 
    13/** 
    24Classes and functions for handling and transcoding between various encodings. Encodings currently supported are 
     
    1618Date: 2006.02.21 
    1719 
    18 ($B A Brief Tutorial) 
     20License: Public Domain 
     21 
     22$(BIG $(B A Brief Tutorial)) 
    1923 
    2024There are many character sets (or more properly, character repertoires) on the planet. Unicode is the 
    21 superset of all other legacy character sets. Therefore, ($I every) character which exists in any 
     25superset of all other legacy character sets. Therefore, $(I every) character which exists in any 
    2226character repertoire, also exists in Unicode. Every character in Unicode has an integer associated 
    2327with it. That integer is called the character's codepoint. For example, the codepoint of the letter 
    2428'A' is 65, or in hex, 0x41. It is important to know that a character's codepoint is unchangeable. It 
    2529is a permanent property of the character, and it does not depend on how you encode it. The codepoint 
    26 of 'A' is 65, period. If you choose to encode the letter 'A' in EDCDIC, that will not change its 
    27 codepoint. (Its encoding in EBCDIC will be different, but that is irrelevant). 
    28  
    29 Most character repertoires consist of 256 characters or fewer. This is because it is convenience to use 
     30of 'A' is 65, period. 
     31 
     32Most character repertoires consist of 256 characters or fewer. This is because it is convenient to use 
    3033single-byte encoding schemes. In such repertoires, every character will have an integer in the range 0 
    3134to 255 associated with it, denoting its position within that repertoire. That number is called a 
    32 code unit. Note that, in general, code unit != codepoint. 
     35codeunit. Note that, in general, codeunit != codepoint. 
    3336 
    3437For example, the Euro currency symbol has codepoint 0x20AC. This is a permanent property of the character. 
    3538That character does not exist in the ASCII repertoire, and so cannot be encoded in ASCII. It also does 
    36 not exist in the Latin-1 character repertoire, and likewise cannot be encoded in Latin-1. It ($I does) 
     39not exist in the Latin-1 character repertoire, and likewise cannot be encoded in Latin-1. It $(I does) 
    3740exist in the Windows-1252 character repertoire though. In that encoding, it is represented by the byte 
    38 0x80. So in that encoding, its code UNIT is 0x80, but its code POINT is still 0x20AC. Codepoints are 
    39 ($I always) measured in Unicode. 
     410x80. So in that encoding, its codeUNIT is 0x80, but its codePOINT is still 0x20AC. Codepoints are 
     42$(I always) measured in Unicode. 
    4043 
    4144Some character repertoires contain more than 256 characters. Yet it is still desirable to be able to 
     
    4346encodings a single character may require more than one byte to represent it. 
    4447 
    45 The process of converting a single codepoint into one or more code units is called ENCODING. 
    46 The reverse process, that of converting multiple code units into a single codepoint is called DECODING. 
    47  
    48 Almost all encoding schemes use 8-bit bytes as the storage type for a single code unit - but there are 
    49 exceptions. UTF-16, for example, uses 16-bit wide code units. (The character repertoire which it 
     48The process of converting a single codepoint into one or more codeunits is called ENCODING. 
     49The reverse process, that of converting multiple codeunits into a single codepoint is called DECODING. 
     50 
     51Almost all encoding schemes use 8-bit bytes as the storage type for a single codeunit - but there are 
     52exceptions. UTF-16, for example, uses 16-bit wide codeunits. (The character repertoire which it 
    5053represents contains more than 2^16 characters, so some of those characters need to be expressed as 
    51 multiple code units, even in UTF-16). UTF-32 uses a 32-bit wide code unit, which means, just like 
    52 in the good old days of ASCII, one code unit == one codepoint. UTF-32, however, is the ($I only) 
     54multiple codeunits, even in UTF-16). UTF-32 uses a 32-bit wide codeunit, which means, just like 
     55in the good old days of ASCII, one codeunit == one codepoint. UTF-32, however, is the $(I only) 
    5356encoding for which this is true. 
    5457 
     
    312315// Unit tests over. Now for the code... 
    313316 
    314 template UTF(T) 
     317template Encoding(T) 
    315318{ 
    316319    static if (is(T==char)) 
     
    340343        ]; 
    341344 
    342         bool isValidCodeUnit(T c) 
     345        bool isValidCodeunit(T c) 
    343346        { 
    344347            return c < 0x80 || tails(c) >= 0; 
     
    482485        invariant(char)[] encodingName = "UTF-16"; 
    483486 
    484         bool isValidCodeUnit(T c) 
     487        bool isValidCodeunit(T c) 
    485488        { 
    486489            return true; 
     
    584587        invariant(char)[] encodingName = "UTF-32"; 
    585588 
    586         alias isValidCodepoint isValidCodeUnit; 
     589        alias isValidCodepoint isValidCodeunit; 
    587590 
    588591        alias isValidCodepoint isSingle; 
     
    655658        invariant(char)[] encodingName = "ASCII"; 
    656659 
    657         bool isValidCodeUnit(T c) 
     660        bool isValidCodeunit(T c) 
    658661        { 
    659662            return c < 0x80; 
    660663        } 
    661664 
    662         alias isValidCodeUnit isSingle; 
     665        alias isValidCodeunit isSingle; 
    663666 
    664667        int tails(T c) 
     
    729732        invariant(char)[] encodingName = "ISO-8859-1"; 
    730733 
    731         bool isValidCodeUnit(T c) 
     734        bool isValidCodeunit(T c) 
    732735        { 
    733736            return true; 
    734737        } 
    735738 
    736         alias isValidCodeUnit isSingle; 
     739        alias isValidCodeunit isSingle; 
    737740 
    738741        int tails(T c) 
     
    826829        } 
    827830 
    828         bool isValidCodeUnit(T c) 
     831        bool isValidCodeunit(T c) 
    829832        { 
    830833            return(win2uni(c) != 0xFFFD); 
     
    843846        bool isHead(T c) 
    844847        { 
    845             return isSingle(c) ? false : isValidCodeUnit(c); 
     848            return isSingle(c) ? false : isValidCodeunit(c); 
    846849        } 
    847850 
     
    923926        } 
    924927 
    925         bool isValidCodeUnit(T c) 
    926         { 
    927             return c.isValidCodeUnit; 
     928        bool isValidCodeunit(T c) 
     929        { 
     930            return c.isValidCodeunit; 
    928931        } 
    929932 
     
    936939            else 
    937940            { 
    938                 return c.isValidCodeUnit; 
     941                return c.isValidCodeunit; 
    939942            } 
    940943        } 
     
    10541057    { 
    10551058        assert(s.length != 0); 
    1056         if (!isValidCodeUnit(s[0])) return 1; 
     1059        if (!isValidCodeunit(s[0])) return 1; 
    10571060        int i = isHead(s[0]) ? 1 : 0; 
    10581061        for (; i < s.length; ++i) 
     
    10821085    } 
    10831086 
    1084     // find the first invalid code unit 
     1087    // find the first invalid codeunit 
    10851088    uint validatePartial(const(T)[] s) 
    10861089    { 
     
    10911094            if (isSingle(c)) continue; 
    10921095            uint n = tails(c); 
    1093             if (n <= 0) return i;                           // fail with illegal code units 
     1096            if (n <= 0) return i;                           // fail with illegal codeunits 
    10941097            if (i + n >= s.length) return i;                // fail if we exceed the length of the string 
    10951098            if (isInvalidHeadTail(c,s[i+1])) return i;      // fail with invalid head/tail combinations 
     
    11271130    } 
    11281131 
    1129     uint countCodepoints(string s) 
     1132    uint count(string s) 
    11301133    in 
    11311134    { 
     
    11421145    } 
    11431146 
    1144     int charIndex(string s, int n) 
     1147    int index(string s, int n) 
    11451148    in 
    11461149    { 
     
    11961199    } 
    11971200 
    1198     struct Dchars 
     1201    struct Codepoints 
    11991202    { 
    12001203        string s; 
     
    12541257    } 
    12551258 
    1256     Dchars dchars(string s) 
    1257     { 
    1258         Dchars ci; 
     1259    Codepoints codepoints(string s) 
     1260    { 
     1261        Codepoints ci; 
    12591262        ci.s = s; 
    12601263        return ci; 
    12611264    } 
    12621265 
    1263     struct Chars 
     1266    struct Codeunits 
    12641267    { 
    12651268        T[MAX_SEQUENCE_LENGTH] buffer; 
     
    12891292    } 
    12901293 
    1291     Chars chars(dchar d) 
     1294    Codeunits codeunits(dchar d) 
    12921295    in 
    12931296    { 
     
    12961299    body 
    12971300    { 
    1298         Chars chars; 
    1299         chars.len = encode(d,chars.buffer); 
    1300         return chars; 
     1301        Codeunits codeunits; 
     1302        codeunits.len = encode(d,codeunits.buffer); 
     1303        return codeunits; 
    13011304    } 
    13021305} 
     
    13051308alias wchar Utf16;          /// A type representing the UTF-16 encoding (an alias of wchar) 
    13061309alias dchar Utf32;          /// A type representing the UTF-32 encoding (an alias of dchar) 
    1307 typedef char Ascii;         /// A type representing the ASCII encoding 
    1308 typedef ubyte Latin1;       /// A type representing the ISO-8859-1 (aka Latin-1) encoding 
    1309 typedef ubyte Windows1252;  /// A type representing the WINDOWS-1252 encoding 
     1310typedef char Ascii;         /// A type representing the ASCII encoding (a typedef of char) 
     1311typedef ubyte Latin1;       /// A type representing the ISO-8859-1 (aka Latin-1) encoding (a typedef of ubyte) 
     1312typedef ubyte Windows1252;  /// A type representing the WINDOWS-1252 encoding (a typedef of ubyte) 
    13101313 
    13111314/** 
     
    13331336string encodingName(T)() 
    13341337{ 
    1335     return UTF!(T).encodingName; 
     1338    return Encoding!(T).encodingName; 
    13361339} 
    13371340 
     
    13461349 
    13471350/** 
    1348  * Returns true if the character is a valid codepoint 
     1351 * Returns true if c is a valid codepoint 
    13491352 * 
    13501353 * Note that this includes the non-character codepoints U+FFFE and U+FFFF, since these are 
     
    13651368 
    13661369/** 
    1367  * Returns true if the code unit is legal. For example, the byte 0x80 would not be 
    1368  * legal in ASCII, because ASCII code units must always be in the range 0x00 to 0x7F. 
    1369  * For another example, 
    1370  * the byte 0xFF is not legal in UTF-8, because 0xFF can never occur in valid UTF-8. 
    1371  * (That's why it was chosen as the .init value for char!) 
     1370 * Returns true if the codeunit is legal. For example, the byte 0x80 would not be 
     1371 * legal in ASCII, because ASCII codeunits must always be in the range 0x00 to 0x7F. 
    13721372 * 
    13731373 * Standards: Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252 
    13741374 * 
    13751375 * Params: 
    1376  *    c = the code unit to be tested 
     1376 *    c = the codeunit to be tested 
    13771377 */ 
    1378 bool isValidCodeUnit(T)(T c) 
    1379 { 
    1380     return UTF!(T).isValidCodeUnit(c); 
     1378bool isValidCodeunit(T)(T c) 
     1379{ 
     1380    return Encoding!(T).isValidCodeunit(c); 
    13811381} 
    13821382 
     
    13961396bool isValid(T)(const(T)[] s) 
    13971397{ 
    1398     return UTF!(T).isValid(s); 
     1398    return Encoding!(T).isValid(s); 
    13991399} 
    14001400 
    14011401/** 
    1402  * Sanitizes a string by replacing malformed code unit sequences with valid code unit sequences. 
     1402 * Sanitizes a string by replacing malformed codeunit sequences with valid codeunit sequences. 
    14031403 * The result is guaranteed to be valid for this encoding. 
    14041404 * 
    14051405 * If the input string is already valid, this function returns the original, otherwise 
    1406  * it constructs a new string by replacing all illegal code unit sequences with the 
     1406 * it constructs a new string by replacing all illegal codeunit sequences with the 
    14071407 * encoding's replacement character. Invalid sequences will be replaced with the 
    14081408 * Unicode replacement character (U+FFFD) if the character repertoire contains it, 
     
    14161416invariant(T)[] sanitize(T)(invariant(T)[] s) 
    14171417{ 
    1418     return UTF!(T).sanitize(s); 
     1418    return Encoding!(T).sanitize(s); 
    14191419} 
    14201420 
     
    14261426/** 
    14271427 * Returns the slice of the input string from the first character to the end of the 
    1428  * first encoded sequence. The resulting string may consist of multiple code units, but 
     1428 * first encoded sequence. The resulting string may consist of multiple codeunits, but 
    14291429 * it will always represent at most one character. If the input is the empty string, 
    14301430 * the return value will be the empty string 
     
    14401440invariant(T)[] firstSequence(T)(invariant(T)[] s) 
    14411441{ 
    1442     return UTF!(T).firstSequence(s); 
     1442    return Encoding!(T).firstSequence(s); 
    14431443} 
    14441444 
    14451445/** 
    14461446 * Returns the slice of the input string from the start of the last encoded sequence 
    1447  * to the end of the string. The resulting string may consist of multiple code units, 
     1447 * to the end of the string. The resulting string may consist of multiple codeunits, 
    14481448 * but it will always represent at most one character. If the input is the empty string, 
    14491449 * the return value will be the empty string 
     
    14591459invariant(T)[] lastSequence(T)(invariant(T)[] s) 
    14601460{ 
    1461     return UTF!(T).lastSequence(s); 
     1461    return Encoding!(T).lastSequence(s); 
    14621462} 
    14631463 
     
    14761476 *    s = the string to be counted 
    14771477 */ 
    1478 uint countCodepoints(T)(invariant(T)[] s) 
    1479 { 
    1480     return UTF!(T).countCodepoints(s); 
     1478uint count(T)(invariant(T)[] s) 
     1479{ 
     1480    return Encoding!(T).count(s); 
    14811481} 
    14821482 
     
    14951495 *    s = the string to be counted 
    14961496 */ 
    1497 int charIndex(T)(invariant(T)[] s,int n) 
    1498 { 
    1499     return UTF!(T).charIndex(s,n); 
     1497int index(T)(invariant(T)[] s,int n) 
     1498{ 
     1499    return Encoding!(T).index(s,n); 
    15001500} 
    15011501 
     
    15031503 * Decodes a single codepoint. 
    15041504 * 
    1505  * This function removes one or more code units from the start of a string, and 
    1506  * and returns the decoded codepoint which those code units represent. 
     1505 * This function removes one or more codeunits from the start of a string, and 
     1506 * returns the decoded codepoint which those codeunits represent. 
    15071507 * 
    15081508 * The input to this function MUST be validly encoded. 
     
    15111511 * Supercedes: 
    15121512 * This function supercedes std.utf.decode(), however, note that the 
    1513  * function dchars() supercedes it more conveniently! 
     1513 * function codepoints() supercedes it more conveniently. 
    15141514 * 
    15151515 * Standards: Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252 
     
    15201520dchar decode(T)(ref invariant(T)[] s) 
    15211521{ 
    1522     return UTF!(T).decode(s); 
     1522    return Encoding!(T).decode(s); 
    15231523} 
    15241524 
     
    15261526 * Encodes a single codepoint. 
    15271527 * 
    1528  * This function encodes a single codepoint into one or more code units. 
    1529  * It returns a string containing those code units. 
     1528 * This function encodes a single codepoint into one or more codeunits. 
     1529 * It returns a string containing those codeunits. 
    15301530 * 
    15311531 * The input to this function MUST be a valid codepoint. 
     
    15361536 * Supercedes: 
    15371537 * This function supercedes std.utf.encode(), however, note that the 
    1538  * function chars() supercedes it more conveniently! 
     1538 * function codeunits() supercedes it more conveniently. 
    15391539 * 
    15401540 * Standards: Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252 
     
    15611561 * Supercedes: 
    15621562 * This function supercedes std.utf.encode(), however, note that the 
    1563  * function chars() supercedes it more conveniently! 
     1563 * function codeunits() supercedes it more conveniently. 
    15641564 * 
    15651565 * Standards: Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252 
     
    15761576body 
    15771577{ 
    1578     return UTF!(T).encode(cast(uint)c,buffer); 
    1579 } 
    1580  
    1581 template Dchars(T) 
    1582 { 
    1583     alias UTF!(T).Dchars Dchars; 
    1584 } 
    1585  
    1586 template Chars(T) 
    1587 { 
    1588     alias UTF!(T).Chars Chars; 
     1578    return Encoding!(T).encode(cast(uint)c,buffer); 
     1579} 
     1580 
     1581template Codepoints(T) 
     1582{ 
     1583    alias Encoding!(T).Codepoints Codepoints; 
     1584} 
     1585 
     1586template Codeunits(T) 
     1587{ 
     1588    alias Encoding!(T).Codeunits Codeunits; 
    15891589} 
    15901590 
     
    16101610 * -------------------------------------------------------- 
    16111611 * string s = "hello world"; 
    1612  * foreach(c;dchars(s)) 
     1612 * foreach(c;codepoints(s)) 
    16131613 * { 
    16141614 *     // do something with c (which will always be a dchar) 
     
    16161616 * -------------------------------------------------------- 
    16171617 * 
    1618  * Note that, currently, foreach(c:dchars(s)) is superior to foreach(c;s) 
     1618 * Note that, currently, foreach(c;codepoints(s)) is superior to foreach(c;s) 
    16191619 * in that the latter will fall over on encountering U+FFFF. 
    16201620 */ 
    1621 Dchars!(T) dchars(T)(invariant(T)[] s) 
    1622 { 
    1623     return UTF!(T).dchars(s); 
     1621Codepoints!(T) codepoints(T)(invariant(T)[] s) 
     1622{ 
     1623    return Encoding!(T).codepoints(s); 
    16241624} 
    16251625 
    16261626/** 
    1627  * Returns a foreachable struct which can bidirectionally iterate over all code units in a codepoint. 
     1627 * Returns a foreachable struct which can bidirectionally iterate over all codeunits in a codepoint. 
    16281628 * 
    16291629 * The input to this function MUST be a valid codepoint. 
     
    16431643 * -------------------------------------------------------- 
    16441644 * dchar d = '\u20AC'; 
    1645  * foreach(c;chars!(Utf8)(d)) 
     1645 * foreach(c;codeunits!(Utf8)(d)) 
    16461646 * { 
    16471647 *     writefln("%X",c); 
     
    16531653 * -------------------------------------------------------- 
    16541654 */ 
    1655 Chars!(T) chars(T)(dchar d) 
    1656 { 
    1657     return UTF!(T).chars(d); 
     1655Codeunits!(T) codeunits(T)(dchar d) 
     1656{ 
     1657    return Encoding!(T).codeunits(d); 
    16581658} 
    16591659 
     
    16711671 * 
    16721672 * Params: 
     1673 *    s = the source string 
    16731674 *    r = the destination string 
    1674  *    s = the sorrce string 
    16751675 * 
    16761676 * Examples: 
     
    16971697    else 
    16981698    { 
    1699         foreach(d;dchars(s)) 
    1700         { 
    1701             foreach(c;chars!(U)(d)) 
     1699        foreach(d;codepoints(s)) 
     1700        { 
     1701            foreach(c;codeunits!(U)(d)) 
    17021702            { 
    17031703                r ~= c; 
     
    17201720 * Params: 
    17211721 *    U = the destination encoding type 
    1722  *    s = the sorrce string 
     1722 *    s = the source string 
    17231723 * 
    17241724 * Examples: 
     
    17501750        else 
    17511751        { 
    1752             foreach_reverse(d;dchars(s)) 
    1753             { 
    1754                 foreach_reverse(c;chars!(U)(d)) 
     1752            foreach_reverse(d;codepoints(s)) 
     1753            { 
     1754                foreach_reverse(c;codeunits!(U)(d)) 
    17551755                { 
    17561756                    r = c ~ r;