Unicode BOM

teqdruid · Joined: 11 May 2004 Posts: 390 Location: UMD

I've got the first couple of bytes from a file. How do I decode the BOM (if it has one) simply? I see UnicodeBom, but that seems a whole lot more complicated than I need.

~John

kris · Joined: 27 Mar 2004 Posts: 1494 Location: South Pacific

I just added a lookupEncoding() method to UnicodeBom. That might do the trick?

- Kris
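For reference, a BOM lookup on the first few bytes boils down to a prefix match against the five standard signatures. The sketch below is in Python rather than D and is not Mango's UnicodeBom or its lookupEncoding(); the name `lookup_bom` and its return shape are made up for illustration.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so UTF-32 has to be tested before UTF-16.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def lookup_bom(head: bytes):
    """Hypothetical helper: return (encoding, bom_length), or (None, 0) if no BOM."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0
```

Four bytes of input are enough to decide. One caveat: UTF-16 LE text whose first character after the BOM is U+0000 looks identical to a UTF-32 LE BOM, so this ordering would misreport it.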

teqdruid · Joined: 11 May 2004 Posts: 390 Location: UMD

So I need to do...

kris · Joined: 27 Mar 2004 Posts: 1494 Location: South Pacific

If you're going to do all that, why not just use UnicodeBom to do the conversion for you? If you need an example of how to use it, take a look here: http://trac.dsource.org/projects/mango/browser/trunk/mango/io/UnicodeFile.d?rev=782

or examine this snippet, which converts an optionally prefixed void[] into a char[]:
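As a rough, language-neutral illustration of that kind of conversion (this is not Kris's snippet and does not use Mango's API), the Python sketch below strips an optional BOM from a byte buffer and decodes the rest to text. The function name `to_text` and the UTF-8 default for unprefixed input are assumptions.

```python
import codecs

# UTF-32 signatures come first because FF FE 00 00 starts with FF FE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def to_text(data: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: decode optionally BOM-prefixed bytes to text."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            # Drop the signature, then decode with the encoding it names.
            return data[len(bom):].decode(name)
    # No BOM: fall back to the caller's default (assumed UTF-8 here).
    return data.decode(default)
```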

csauls · Joined: 27 Mar 2004 Posts: 278

I remember a post many moons ago in the NG that submitted a patch to DMD that would allow it to recognize non-UTF8 encodings without there being a BOM. Could we get such a creature into UnicodeBom, perchance? Maybe as an optional flag parameter (, determine = false) to .convert()?

This would open the door for building code/script parsers with good Unicode support.
_________________
Chris Nicholson-Sauls
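What that DMD patch actually did is not shown here. One common heuristic for BOM-less source text, assuming the first character is a non-NUL ASCII character (usually true for code and scripts), is to look at where the zero bytes fall in the first four bytes. A minimal Python sketch, with a made-up function name:

```python
def guess_encoding(head: bytes):
    """Hypothetical helper: guess UTF-16/32 from zero-byte layout, else None.

    Assumes the text starts with a non-NUL ASCII character and has no BOM.
    """
    if len(head) >= 4:
        a, b, c, d = head[:4]
        if a == 0 and b == 0 and c == 0 and d != 0:
            return "utf-32-be"   # 00 00 00 xx
        if a != 0 and b == 0 and c == 0 and d == 0:
            return "utf-32-le"   # xx 00 00 00
    if len(head) >= 2:
        a, b = head[:2]
        if a == 0 and b != 0:
            return "utf-16-be"   # 00 xx
        if a != 0 and b == 0:
            return "utf-16-le"   # xx 00
    return None                  # no pattern matched; caller picks a default
```

This is only a guess: it says nothing about 8-bit encodings, and it fails when the first character is outside ASCII.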

Derek Parnell · Posted: Sat Feb 11, 2006 11:28 pm

csauls · Joined: 27 Mar 2004 Posts: 278

kris · Joined: 27 Mar 2004 Posts: 1494 Location: South Pacific

teqdruid · Joined: 11 May 2004 Posts: 390 Location: UMD

I'm working on moving to UnicodeBomT, but I'd like to make the following changes to it:

kris · Joined: 27 Mar 2004 Posts: 1494 Location: South Pacific

Sure.

If you're gonna change decode(), then you should also change encode() to provide similar options (better to be symmetrical, I think). If you're not using that recent lookupEncoding() addition, then it should probably be removed.

- Kris

teqdruid · Joined: 11 May 2004 Posts: 390 Location: UMD