teqdruid
Joined: 11 May 2004 Posts: 390 Location: UMD
|
Posted: Fri Feb 10, 2006 9:09 pm Post subject: Unicode BOM |
|
|
I've got the first couple of bytes from a file. How do I decode the BOM (if it has one) simply? I see UnicodeBom, but that seems a whole lot more complicated than I need.
~John |
|
kris
Joined: 27 Mar 2004 Posts: 1494 Location: South Pacific
|
Posted: Fri Feb 10, 2006 10:00 pm Post subject: |
|
|
I just added a lookupEncoding() method to UnicodeBom. That might do the trick?
- Kris |
|
teqdruid
Joined: 11 May 2004 Posts: 390 Location: UMD
|
Posted: Fri Feb 10, 2006 10:44 pm Post subject: |
|
|
So I need to do...
Code: |
void[] bytes = ... ; // first 100 bytes of the file
auto ubt = new UnicodeBomT!(char)(Unicode.UTF_8);
int encoding = ubt.lookupEncoding(cast(char[])bytes);
|
?
Or do I need to limit the number of bytes I give it? This also looks kinda ugly, which is why I'm kinda doubtful it's right.
After the above code, I also have to use a switch case to convert from the Unicode.UTF_XXXX to one of the Type.UtfXX that the mango.convert stuff uses. Why are there two different enums? I know the Type.UtfXX doesn't include LE or BE, so how does this work right with the convert functions that use Type.UtfXX?
Thanks,
John |
|
kris
Joined: 27 Mar 2004 Posts: 1494 Location: South Pacific
|
Posted: Sat Feb 11, 2006 12:14 pm Post subject: |
|
|
If you're going to do all that, why not just use UnicodeBom to do the conversion for you? If you need an example of how to use it, take a look here: http://trac.dsource.org/projects/mango/browser/trunk/mango/io/UnicodeFile.d?rev=782
or examine this snippet, which converts an optionally prefixed void[] into a char[]:
Code: |
char[] convert (void[] content)
{
    auto decoder = new UnicodeBomT!(char)(Unicode.Unknown);
    return decoder.convert (content);
} |
If the content provided does not have a BOM, the decoder defaults to UTF8N; you can check which encoding it settled on via decoder.getEncoding().
Endian issues are handled by UnicodeBom. If you want the content converted to wchar[]/dchar[] instead, specify that as the Template argument (instead of char).
There's a description of how this operates within the file: http://trac.dsource.org/projects/mango/browser/trunk/mango/convert/UnicodeBom.d?rev=782 |
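For readers who only want the detection step from the original question (a few leading bytes in, an encoding out), here is a minimal standalone sketch in present-day D. It does not use Mango or UnicodeBom, and the names Bom, BomInfo and sniffBom are made up for illustration; only the byte patterns themselves come from the Unicode standard.

```d
/// Encodings a byte-order mark can identify; None means no BOM was found.
/// (Hypothetical names, not the Mango Unicode.* constants.)
enum Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE }

/// The detected encoding plus the number of leading BOM bytes to skip.
struct BomInfo
{
    Bom    encoding;
    size_t length;
}

/// Inspect the leading bytes of `b` for a Unicode byte-order mark.
BomInfo sniffBom (const(ubyte)[] b)
{
    // Test the 4-byte UTF-32 marks first: the UTF-32LE mark FF FE 00 00
    // begins with the UTF-16LE mark FF FE, so the longer pattern must win.
    if (b.length >= 4)
    {
        if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
            return BomInfo (Bom.Utf32BE, 4);
        if (b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
            return BomInfo (Bom.Utf32LE, 4);
    }

    if (b.length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return BomInfo (Bom.Utf8, 3);

    if (b.length >= 2)
    {
        if (b[0] == 0xFE && b[1] == 0xFF) return BomInfo (Bom.Utf16BE, 2);
        if (b[0] == 0xFF && b[1] == 0xFE) return BomInfo (Bom.Utf16LE, 2);
    }

    // No mark at all: the caller picks a default (UnicodeBom uses UTF8N).
    return BomInfo (Bom.None, 0);
}

unittest
{
    // "hi" in UTF-8, preceded by the UTF-8 mark EF BB BF
    ubyte[] utf8 = [0xEF, 0xBB, 0xBF, 0x68, 0x69];
    auto info = sniffBom (utf8);
    assert (info.encoding == Bom.Utf8);
    assert (utf8[info.length .. $] == [0x68, 0x69]);
}
```

One caveat: UTF-16LE text whose first character after the mark is U+0000 is indistinguishable from a UTF-32LE mark, so this sketch reports UTF-32LE for it.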
|
csauls
Joined: 27 Mar 2004 Posts: 278
|
Posted: Sat Feb 11, 2006 6:31 pm Post subject: |
|
|
I remember a post many moons ago in the NG that submitted a patch to DMD that would allow it to recognize non-UTF8 encodings without there being a BOM. Could we get such a creature into UnicodeBom, perchance? Maybe as an optional flag parameter (, determine = false) to .convert()?
This would open the door for building code/script parsers with good Unicode support. _________________ Chris Nicholson-Sauls |
|
Derek Parnell
Joined: 22 Apr 2004 Posts: 408 Location: Melbourne, Australia
|
Posted: Sat Feb 11, 2006 11:28 pm Post subject: |
|
|
csauls wrote: | I remember a post many moons ago in the NG that submitted a patch to DMD that would allow it to recognize non-UTF8 encodings without there being a BOM. Could we get such a creature into UnicodeBom, perchance? Maybe as an optional flag parameter (, determine = false) to .convert()?
This would open the door for building code/script parsers with good Unicode support. |
For what it's worth, I've already got an algorithm coded and it's being used in Build. Basically, it reads a text file in any UTF format and loads it into RAM as UTF-8. It handles BOMs, endianness, and missing BOMs too. _________________ --
Derek
skype name: derek.j.parnell |
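The algorithm used in Build is not shown in this thread, so the following is not that code. It is a sketch of one widely used heuristic for BOM-less text (the same zero-byte pattern test that RFC 4627 describes for JSON), written in present-day D with the made-up names Guess and guessEncoding. It assumes the text starts with two non-NUL ASCII characters, which is usually true of source code and scripts but is not guaranteed.

```d
/// Encodings the heuristic can guess (hypothetical names).
enum Guess { Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE }

/// Guess the encoding of BOM-less text from where the zero bytes fall in
/// its first four bytes. Assumes the first two characters are non-NUL
/// ASCII; this is a heuristic, not a proof.
Guess guessEncoding (const(ubyte)[] b)
{
    if (b.length >= 4)
    {
        // Record which of the first four bytes are zero as a 4-bit mask,
        // with the first byte in the highest bit.
        uint zeros = 0;
        foreach (i; 0 .. 4)
            zeros = (zeros << 1) | (b[i] == 0 ? 1 : 0);

        switch (zeros)
        {
            case 0b1110: return Guess.Utf32BE;  // 00 00 00 xx
            case 0b0111: return Guess.Utf32LE;  // xx 00 00 00
            case 0b1010: return Guess.Utf16BE;  // 00 xx 00 xx
            case 0b0101: return Guess.Utf16LE;  // xx 00 xx 00
            default:     break;
        }
    }

    // No telltale zero pattern (or too little data): assume UTF-8.
    return Guess.Utf8;
}

unittest
{
    // "hi" encoded as UTF-16LE with no BOM
    ubyte[] utf16le = [0x68, 0x00, 0x69, 0x00];
    assert (guessEncoding (utf16le) == Guess.Utf16LE);
}
```

A decoder would typically try the BOM first and fall back to a guess like this only when no mark is present, which matches the optional (determine = false) flag suggested above.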
|
csauls
Joined: 27 Mar 2004 Posts: 278
|
Posted: Sun Feb 12, 2006 1:16 am Post subject: |
|
|
Derek Parnell wrote: | For what it's worth, I've already got an algorithm coded and it's being used in Build. Basically, it reads a text file in any UTF format and loads it into RAM as UTF-8. It handles BOMs, endianness, and missing BOMs too. |
I'll definitely take a look at that! Although I'm leaning toward using UTF32 (perhaps with an endianness converter in-between?) or else being able to specify a wanted encoding. Not a big deal, but it would help with my particular goal. *goes to peek at your code* _________________ Chris Nicholson-Sauls |
|
kris
Joined: 27 Mar 2004 Posts: 1494 Location: South Pacific
|
Posted: Tue Feb 14, 2006 5:05 pm Post subject: |
|
|
Derek Parnell wrote: | csauls wrote: | I remember a post many moons ago in the NG that submitted a patch to DMD that would allow it to recognize non-UTF8 encodings without there being a BOM. Could we get such a creature into UnicodeBom, perchance? Maybe as an optional flag parameter (, determine = false) to .convert()?
This would open the door for building code/script parsers with good Unicode support. |
For what it's worth, I've already got an algorithm coded and it's being used in Build. Basically, it reads a text file in any UTF format and loads it into RAM as UTF-8. It handles BOMs, endianness, and missing BOMs too. |
Sounds good!
Would you be willing to donate the algorithm that identifies encoding sans BOM? |
|
teqdruid
Joined: 11 May 2004 Posts: 390 Location: UMD
|
Posted: Fri Feb 24, 2006 4:16 pm Post subject: |
|
|
I'm working on moving to UnicodeBomT, but I'd like to make the following changes to it:
Code: |
===================================================================
--- mango/convert/UnicodeBom.d (revision 782)
+++ mango/convert/UnicodeBom.d (working copy)
@@ -194,7 +194,7 @@
         ***********************************************************************/
-        final T[] decode (void[] content)
+        final T[] decode (void[] content, void[] dst=null, uint* ate=null)
         {
                 // look for a BOM
                 auto info = test (content);
@@ -221,7 +221,7 @@
                 Unicode.error ("UnicodeBom.decode :: explicit encoding does not permit BOM");
                 // convert it to internal representation
-                return cast(T[]) into.convert (swapBytes(content), settings.type);
+                return cast(T[]) into.convert (swapBytes(content), settings.type, dst, ate);
         }
         /***********************************************************************
@@ -303,7 +303,7 @@
         ***********************************************************************/
-        private final void setup (int encoding)
+        public final void setup (int encoding)
         {
                 assert (Unicode.isValid (encoding));
|
The first two hunks let me pass in my own output buffer (dst) and get back a count through ate. The third makes setup() public so that a UnicodeBomT instance can be reused, just as my parser is.
Is this code ok?
~John |
|
kris
Joined: 27 Mar 2004 Posts: 1494 Location: South Pacific
|
Posted: Sat Feb 25, 2006 11:36 am Post subject: |
|
|
Sure.
If you're gonna' change decode() then you should also change encode() to provide similar options (better to be symmetrical, I think). If you're not using that recent lookupEncoding() addition, then it should probably be removed.
- Kris |
|
teqdruid
Joined: 11 May 2004 Posts: 390 Location: UMD
|
Posted: Sat Feb 25, 2006 1:25 pm Post subject: |
|
|
kris wrote: | Sure.
If you're gonna' change decode() then you should also change encode() to provide similar options (better to be symmetrical, I think). If you're not using that recent lookupEncoding() addition, then it should probably be removed.
- Kris |
Sounds reasonable.
~John |
|