Changeset 613
- Timestamp: 02/22/08 10:41:24 (9 months ago)
- Files: candidate/phobos/std/encoding.d (modified) (48 diffs)
candidate/phobos/std/encoding.d
r610 r613
        1 + // Written in the D programming language.
        2 +
   1    3   /**
   2    4   Classes and functions for handling and transcoding between various encodings. Encodings currently supported are
   …    …
  16   18   Date: 2006.02.21
  17   19
  18      - ($B A Brief Tutorial)
       20 + License: Public Domain
       21 +
       22 + $(BIG $(B A Brief Tutorial))
  19   23
  20   24   There are many character sets (or more properly, character repertoires) on the planet. Unicode is the
  21      - superset of all other legacy character sets. Therefore, ($I every) character which exists in any
       25 + superset of all other legacy character sets. Therefore, $(I every) character which exists in any
  22   26   character repertoire, also exists in Unicode. Every character in Unicode has an integer associated
  23   27   with it. That integer is the called the character's codepoint. For example, the codepoint of the letter
  24   28   'A' is 65, or in hex, 0x41. It is important to know that a character's codepoint is unchangeable. It
  25   29   is a permanent property of the character, and it does not depend on how you encode it. The codepoint
  26      - of 'A' is 65, period. If you choose to encode the letter 'A' in EDCDIC, that will not change its
  27      - codepoint. (Its encoding in EBCDIC will be different, but that is irrelevant).
  28      -
  29      - Most character repertoires consist of 256 characters or fewer. This is because it is convenience to use
       30 + of 'A' is 65, period.
       31 +
       32 + Most character repertoires consist of 256 characters or fewer. This is because it is convenient to use
  30   33   single-byte encoding schemes. In such repertoires, every character will have an integer in the range 0
  31   34   to 255 associated with it, denoting its position within that repertoire. That number is called a
  32      - code unit. Note that, in general, codeunit != codepoint.
       35 + codeunit. Note that, in general, codeunit != codepoint.
  33   36
  34   37   For example, the Euro currency symbol has codepoint 0x20AC. This is a permanent property of the character.
  35   38   That character does not exist in the ASCII repertoire, and so cannot be encoded in ASCII. It also does
  36      - not exist in the Latin-1 character repertoire, and likewise cannot be encoded in Latin-1. It ($I does)
       39 + not exist in the Latin-1 character repertoire, and likewise cannot be encoded in Latin-1. It $(I does)
  37   40   exist in the Windows-1252 character repertoire though. In that encoding, it is represented by the byte
  38      - 0x80. So in that encoding, its code UNIT is 0x80, but its codePOINT is still 0x20AC. Codepoints are
  39      - ($I always) measured in Unicode.
       41 + 0x80. So in that encoding, its codeUNIT is 0x80, but its codePOINT is still 0x20AC. Codepoints are
       42 + $(I always) measured in Unicode.
  40   43
  41   44   Some character repertoires contain more than 256 characters. Yet it is still desirable to be able to
   …    …
  43   46   encodings a single character may require more than one byte to represent it.
  44   47
  45      - The process of converting a single codepoint into one or more code units is called ENCODING.
  46      - The reverse process, that of converting multiple code units into a single codepoint is called DECODING.
  47      -
  48      - Almost all encoding schemes use 8-bit bytes as the storage type for a single code unit - but there are
  49      - exceptions. UTF-16, for example, uses 16-bit wide code units. (The character repertoire which it
       48 + The process of converting a single codepoint into one or more codeunits is called ENCODING.
       49 + The reverse process, that of converting multiple codeunits into a single codepoint is called DECODING.
       50 +
       51 + Almost all encoding schemes use 8-bit bytes as the storage type for a single codeunit - but there are
       52 + exceptions. UTF-16, for example, uses 16-bit wide codeunits. (The character repertoire which it
  50   53   represents contains more than 2^16 characters, so some of those characters need to expressed as
  51      - multiple code units, even in UTF-16). UTF-32 uses a 32-bit wide codeunit, which means, just like
  52      - in the good old days of ASCII, one code unit == one codepoint. UTF-32, however, is the ($I only)
       54 + multiple codeunits, even in UTF-16). UTF-32 uses a 32-bit wide codeunit, which means, just like
       55 + in the good old days of ASCII, one codeunit == one codepoint. UTF-32, however, is the $(I only)
  53   56   encoding for which this is true.
  54   57
   …    …
 312  315   // Unit tests over. Now for the code...
 313  316
 314      - template UTF(T)
      317 + template Encoding(T)
 315  318   {
 316  319       static if (is(T==char))
   …    …
 340  343   ];
 341  344
 342      - bool isValidCodeUnit(T c)
      345 + bool isValidCodeunit(T c)
 343  346   {
 344  347       return c < 0x80 || tails(c) >= 0;
   …    …
 482  485   invariant(char)[] encodingName = "UTF-16";
 483  486
 484      - bool isValidCodeUnit(T c)
      487 + bool isValidCodeunit(T c)
 485  488   {
 486  489       return true;
   …    …
 584  587   invariant(char)[] encodingName = "UTF-32";
 585  588
 586      - alias isValidCodepoint isValidCodeUnit;
      589 + alias isValidCodepoint isValidCodeunit;
 587  590
 588  591   alias isValidCodepoint isSingle;
   …    …
 655  658   invariant(char)[] encodingName = "ASCII";
 656  659
 657      - bool isValidCodeUnit(T c)
      660 + bool isValidCodeunit(T c)
 658  661   {
 659  662       return c < 0x80;
 660  663   }
 661  664
 662      - alias isValidCodeUnit isSingle;
      665 + alias isValidCodeunit isSingle;
 663  666
 664  667   int tails(T c)
   …    …
 729  732   invariant(char)[] encodingName = "ISO-8859-1";
 730  733
 731      - bool isValidCodeUnit(T c)
      734 + bool isValidCodeunit(T c)
 732  735   {
 733  736       return true;
 734  737   }
 735  738
 736      - alias isValidCodeUnit isSingle;
      739 + alias isValidCodeunit isSingle;
 737  740
 738  741   int tails(T c)
   …    …
 826  829   }
 827  830
 828      - bool isValidCodeUnit(T c)
      831 + bool isValidCodeunit(T c)
 829  832   {
 830  833       return(win2uni(c) != 0xFFFD);
   …    …
 843  846   bool isHead(T c)
 844  847   {
 845      -     return isSingle(c) ? false : isValidCodeUnit(c);
      848 +     return isSingle(c) ? false : isValidCodeunit(c);
 846  849   }
 847  850
   …    …
 923  926   }
 924  927
 925      - bool isValidCodeUnit(T c)
 926      - {
 927      -     return c.isValidCodeUnit;
      928 + bool isValidCodeunit(T c)
      929 + {
      930 +     return c.isValidCodeunit;
 928  931   }
 929  932
   …    …
 936  939       else
 937  940       {
 938      -         return c.isValidCodeUnit;
      941 +         return c.isValidCodeunit;
 939  942       }
 940  943   }
   …    …
1054 1057   {
1055 1058       assert(s.length != 0);
1056      -     if (!isValidCodeUnit(s[0])) return 1;
     1059 +     if (!isValidCodeunit(s[0])) return 1;
1057 1060       int i = isHead(s[0]) ? 1 : 0;
1058 1061       for (; i < s.length; ++i)
   …    …
1082 1085   }
1083 1086
1084      - // find the first invalid code unit
     1087 + // find the first invalid codeunit
1085 1088   uint validatePartial(const(T)[] s)
1086 1089   {
   …    …
1091 1094   if (isSingle(c)) continue;
1092 1095   uint n = tails(c);
1093      - if (n <= 0) return i; // fail with illegal code units
     1096 + if (n <= 0) return i; // fail with illegal codeunits
1094 1097   if (i + n >= s.length) return i; // fail if we exceed the length of the string
1095 1098   if (isInvalidHeadTail(c,s[i+1])) return i; // fail with invalid head/tail combinations
   …    …
1127 1130   }
1128 1131
1129      - uint countCodepoints(string s)
     1132 + uint count(string s)
1130 1133   in
1131 1134   {
   …    …
1142 1145   }
1143 1146
1144      - int charIndex(string s, int n)
     1147 + int index(string s, int n)
1145 1148   in
1146 1149   {
   …    …
1196 1199   }
1197 1200
1198      - struct Dchars
     1201 + struct Codepoints
1199 1202   {
1200 1203       string s;
   …    …
1254 1257   }
1255 1258
1256      - Dchars dchars(string s)
1257      - {
1258      -     Dchars ci;
     1259 + Codepoints codepoints(string s)
     1260 + {
     1261 +     Codepoints ci;
1259 1262       ci.s = s;
1260 1263       return ci;
1261 1264   }
1262 1265
1263      - struct Chars
     1266 + struct Codeunits
1264 1267   {
1265 1268       T[MAX_SEQUENCE_LENGTH] buffer;
   …    …
1289 1292   }
1290 1293
1291      - Chars chars(dchar d)
     1294 + Codeunits codeunits(dchar d)
1292 1295   in
1293 1296   {
   …    …
1296 1299       body
1297 1300       {
1298      -         Chars chars;
1299      -         chars.len = encode(d,chars.buffer);
1300      -         return chars;
     1301 +         Codeunits codeunits;
     1302 +         codeunits.len = encode(d,codeunits.buffer);
     1303 +         return codeunits;
1301 1304       }
1302 1305   }
   …    …
1305 1308   alias wchar Utf16; /// A type representing the UTF-16 encoding (an alias of wchar)
1306 1309   alias dchar Utf32; /// A type representing the UTF-32 encoding (an alias of dchar)
1307      - typedef char Ascii; /// A type representing the ASCII encoding
1308      - typedef ubyte Latin1; /// A type representing the ISO-8859-1 (aka Latin-1) encoding
1309      - typedef ubyte Windows1252; /// A type representing the WINDOWS-1252 encoding
     1310 + typedef char Ascii; /// A type representing the ASCII encoding (a typedef of char)
     1311 + typedef ubyte Latin1; /// A type representing the ISO-8859-1 (aka Latin-1) encoding (a typedef of ubyte)
     1312 + typedef ubyte Windows1252; /// A type representing the WINDOWS-1252 encoding (a typedef of ubyte)
1310 1313
1311 1314   /**
   …    …
1333 1336   string encodingName(T)()
1334 1337   {
1335      -     return UTF!(T).encodingName;
     1338 +     return Encoding!(T).encodingName;
1336 1339   }
1337 1340
   …    …
1346 1349
1347 1350   /**
1348      -  * Returns true if the character is a valid codepoint
     1351 +  * Returns true if c is a valid codepoint
1349 1352    *
1350 1353    * Note that this includes the non-character codepoints U+FFFE and U+FFFF, since these are
   …    …
1365 1368
1366 1369   /**
1367      -  * Returns true if the code unit is legal. For example, the byte 0x80 would not be
1368      -  * legal in ASCII, because ASCII code units must always be in the range 0x00 to 0x7F.
1369      -  * For another example,
1370      -  * the byte 0xFF is not legal in UTF-8, because 0xFF can never occur in valid UTF-8.
1371      -  * (That's why it was chosen as the .init value for char!)
     1370 +  * Returns true if the codeunit is legal. For example, the byte 0x80 would not be
     1371 +  * legal in ASCII, because ASCII codeunits must always be in the range 0x00 to 0x7F.
1372 1372    *
1373 1373    * Standards: Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
1374 1374    *
1375 1375    * Params:
1376      -  * c = the code unit to be tested
     1376 +  * c = the codeunit to be tested
1377 1377    */
1378      - bool isValidCodeUnit(T)(T c)
1379      - {
1380      -     return UTF!(T).isValidCodeUnit(c);
     1378 + bool isValidCodeunit(T)(T c)
     1379 + {
     1380 +     return Encoding!(T).isValidCodeunit(c);
1381 1381   }
1382 1382
   …    …
1396 1396   bool isValid(T)(const(T)[] s)
1397 1397   {
1398      -     return UTF!(T).isValid(s);
     1398 +     return Encoding!(T).isValid(s);
1399 1399   }
1400 1400
1401 1401   /**
1402      -  * Sanitizes a string by replacing malformed code unit sequences with valid codeunit sequences.
     1402 +  * Sanitizes a string by replacing malformed codeunit sequences with valid codeunit sequences.
1403 1403    * The result is guaranteed to be valid for this encoding.
1404 1404    *
1405 1405    * If the input string is already valid, this function returns the original, otherwise
1406      -  * it constructs a new string by replacing all illegal code unit sequences with the
     1406 +  * it constructs a new string by replacing all illegal codeunit sequences with the
1407 1407    * encoding's replacement character, Invalid sequences will be replaced with the
1408 1408    * Unicode replacement character (U+FFFD) if the character repertoire contains it,
   …    …
1416 1416   invariant(T)[] sanitize(T)(invariant(T)[] s)
1417 1417   {
1418      -     return UTF!(T).sanitize(s);
     1418 +     return Encoding!(T).sanitize(s);
1419 1419   }
1420 1420
   …    …
1426 1426   /**
1427 1427    * Returns the slice of the input string from the first character to the end of the
1428      -  * first encoded sequence. The resulting string may consist of multiple code units, but
     1428 +  * first encoded sequence. The resulting string may consist of multiple codeunits, but
1429 1429    * it will always represent at most one character. If the input is the empty string,
1430 1430    * the return value will be the empty string
   …    …
1440 1440   invariant(T)[] firstSequence(T)(invariant(T)[] s)
1441 1441   {
1442      -     return UTF!(T).firstSequence(s);
     1442 +     return Encoding!(T).firstSequence(s);
1443 1443   }
1444 1444
1445 1445   /**
1446 1446    * Returns the slice of the input string from the start of the last encoded sequence
1447      -  * to the end of the string. The resulting string may consist of multiple code units,
     1447 +  * to the end of the string. The resulting string may consist of multiple codeunits,
1448 1448    * but it will always represent at most one character. If the input is the empty string,
1449 1449    * the return value will be the empty string
   …    …
1459 1459   invariant(T)[] lastSequence(T)(invariant(T)[] s)
1460 1460   {
1461      -     return UTF!(T).lastSequence(s);
     1461 +     return Encoding!(T).lastSequence(s);
1462 1462   }
1463 1463
   …    …
1476 1476    * s = the string to be counted
1477 1477    */
1478      - uint countCodepoints(T)(invariant(T)[] s)
1479      - {
1480      -     return UTF!(T).countCodepoints(s);
     1478 + uint count(T)(invariant(T)[] s)
     1479 + {
     1480 +     return Encoding!(T).count(s);
1481 1481   }
1482 1482
   …    …
1495 1495    * s = the string to be counted
1496 1496    */
1497      - int charIndex(T)(invariant(T)[] s,int n)
1498      - {
1499      -     return UTF!(T).charIndex(s,n);
     1497 + int index(T)(invariant(T)[] s,int n)
     1498 + {
     1499 +     return Encoding!(T).index(s,n);
1500 1500   }
1501 1501
   …    …
1503 1503    * Decodes a single codepoint.
1504 1504    *
1505      -  * This function removes one or more code units from the start of a string, and
1506      -  * and returns the decoded codepoint which those code units represent.
     1505 +  * This function removes one or more codeunits from the start of a string, and
     1506 +  * and returns the decoded codepoint which those codeunits represent.
1507 1507    *
1508 1508    * The input to this function MUST be validly encoded.
   …    …
1511 1511    * Supercedes:
1512 1512    * This function supercedes std.utf.decode(), however, note that the
1513      -  * function dchars() supercedes it more conveniently!
     1513 +  * function codepoints() supercedes it more conveniently.
1514 1514    *
1515 1515    * Standards: Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
   …    …
1520 1520   dchar decode(T)(ref invariant(T)[] s)
1521 1521   {
1522      -     return UTF!(T).decode(s);
     1522 +     return Encoding!(T).decode(s);
1523 1523   }
1524 1524
   …    …
1526 1526    * Encodes a single codepoint.
1527 1527    *
1528      -  * This function encodes a single codepoint into one or more code units.
1529      -  * It returns a string containing those code units.
     1528 +  * This function encodes a single codepoint into one or more codeunits.
     1529 +  * It returns a string containing those codeunits.
1530 1530    *
1531 1531    * The input to this function MUST be a valid codepoint.
   …    …
1536 1536    * Supercedes:
1537 1537    * This function supercedes std.utf.encode(), however, note that the
1538      -  * function chars() supercedes it more conveniently!
     1538 +  * function codeunits() supercedes it more conveniently.
1539 1539    *
1540 1540    * Standards: Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
   …    …
1561 1561    * Supercedes:
1562 1562    * This function supercedes std.utf.encode(), however, note that the
1563      -  * function chars() supercedes it more conveniently!
     1563 +  * function codeunits() supercedes it more conveniently.
1564 1564    *
1565 1565    * Standards: Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
   …    …
1576 1576   body
1577 1577   {
1578      -     return UTF!(T).encode(cast(uint)c,buffer);
1579      - }
1580      -
1581      - template Dchars(T)
1582      - {
1583      -     alias UTF!(T).Dchars Dchars;
1584      - }
1585      -
1586      - template Chars(T)
1587      - {
1588      -     alias UTF!(T).Chars Chars;
     1578 +     return Encoding!(T).encode(cast(uint)c,buffer);
     1579 + }
     1580 +
     1581 + template Codepoints(T)
     1582 + {
     1583 +     alias Encoding!(T).Codepoints Codepoints;
     1584 + }
     1585 +
     1586 + template Codeunits(T)
     1587 + {
     1588 +     alias Encoding!(T).Codeunits Codeunits;
1589 1589   }
1590 1590
   …    …
1610 1610    * --------------------------------------------------------
1611 1611    * string s = "hello world";
1612      -  * foreach(c;dchars(s))
     1612 +  * foreach(c;codepoints(s))
1613 1613    * {
1614 1614    *     // do something with c (which will always be a dchar)
   …    …
1616 1616    * --------------------------------------------------------
1617 1617    *
1618      -  * Note that, currently, foreach(c:dchars(s)) is superior to foreach(c;s)
     1618 +  * Note that, currently, foreach(c:codepoints(s)) is superior to foreach(c;s)
1619 1619    * in that the latter will fall over on encountering U+FFFF.
1620 1620    */
1621      - Dchars!(T) dchars(T)(invariant(T)[] s)
1622      - {
1623      -     return UTF!(T).dchars(s);
     1621 + Codepoints!(T) codepoints(T)(invariant(T)[] s)
     1622 + {
     1623 +     return Encoding!(T).codepoints(s);
1624 1624   }
1625 1625
1626 1626   /**
1627      -  * Returns a foreachable struct which can bidirectionally iterate over all code units in a codepoint.
     1627 +  * Returns a foreachable struct which can bidirectionally iterate over all codeunits in a codepoint.
1628 1628    *
1629 1629    * The input to this function MUST be a valid codepoint.
   …    …
1643 1643    * --------------------------------------------------------
1644 1644    * dchar d = '\u20AC';
1645      -  * foreach(c;chars!(Utf8)(d))
     1645 +  * foreach(c;codeunits!(Utf8)(d))
1646 1646    * {
1647 1647    *     writefln("%X",c)
   …    …
1653 1653    * --------------------------------------------------------
1654 1654    */
1655      - Chars!(T) chars(T)(dchar d)
1656      - {
1657      -     return UTF!(T).chars(d);
     1655 + Codeunits!(T) codeunits(T)(dchar d)
     1656 + {
     1657 +     return Encoding!(T).codeunits(d);
1658 1658   }
1659 1659
   …    …
1671 1671    *
1672 1672    * Params:
     1673 +  * s = the source string
1673 1674    * r = the destination string
1674      -  * s = the sorrce string
1675 1675    *
1676 1676    * Examples:
   …    …
1697 1697   else
1698 1698   {
1699      -     foreach(d;dchars(s))
1700      -     {
1701      -         foreach(c;chars!(U)(d))
     1699 +     foreach(d;codepoints(s))
     1700 +     {
     1701 +         foreach(c;codeunits!(U)(d))
1702 1702           {
1703 1703               r ~= c;
   …    …
1720 1720    * Params:
1721 1721    * U = the destination encoding type
1722      -  * s = the sorrce string
     1722 +  * s = the source string
1723 1723    *
1724 1724    * Examples:
   …    …
1750 1750   else
1751 1751   {
1752      -     foreach_reverse(d;dchars(s))
1753      -     {
1754      -         foreach_reverse(c;chars!(U)(d))
     1752 +     foreach_reverse(d;codepoints(s))
     1753 +     {
     1754 +         foreach_reverse(c;codeunits!(U)(d))
1755 1755           {
1756 1756               r = c ~ r;
