
Unicode BOM

 
Forum Index -> Mango
teqdruid



Joined: 11 May 2004
Posts: 390
Location: UMD

Posted: Fri Feb 10, 2006 9:09 pm    Post subject: Unicode BOM

I've got the first couple of bytes from a file. How do I decode the BOM (if it has one) simply? I see UnicodeBom, but that seems a whole lot more complicated than I need.

~John
kris



Joined: 27 Mar 2004
Posts: 1494
Location: South Pacific

Posted: Fri Feb 10, 2006 10:00 pm

I just added a lookupEncoding() method to UnicodeBom. That might do the trick?

- Kris
teqdruid



Joined: 11 May 2004
Posts: 390
Location: UMD

Posted: Fri Feb 10, 2006 10:44 pm

So I need to do...
Code:
void[] bytes = ... ; // first 100 bytes of the file
// UnicodeBomT is a class, so the instance has to be created with `new`
UnicodeBomT!(char) ubt = new UnicodeBomT!(char)(Unicode.UTF_8);
int encoding = ubt.lookupEncoding(cast(char[])bytes);


?

Or do I need to limit the number of bytes I give it? This also looks kinda ugly, which is why I'm kinda doubtful it's right.

After the above code, I also have to use a switch case to convert from the Unicode.UTF_XXXX to one of the Type.UtfXX that the mango.convert stuff uses. Why are there two different enums? I know the Type.UtfXX doesn't include LE or BE, so how does this work right with the convert functions that use Type.UtfXX?

Thanks,
John
kris



Joined: 27 Mar 2004
Posts: 1494
Location: South Pacific

Posted: Sat Feb 11, 2006 12:14 pm

If you're going to do all that, why not just use UnicodeBom to do the conversion for you? If you need an example of how to use it, take a look here: http://trac.dsource.org/projects/mango/browser/trunk/mango/io/UnicodeFile.d?rev=782

or examine this snippet, which converts an optionally prefixed void[] into a char[]:

Code:
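import mango.convert.UnicodeBom;

// Converts content, with or without a leading BOM, into UTF-8.
char[] convert (void[] content)
{
     // Unicode.Unknown: the encoding is not fixed up front, so the BOM decides
     auto decoder = new UnicodeBomT!(char)(Unicode.Unknown);
     return decoder.convert (content);
}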


If the content provided does not have a BOM, the encoding defaults to UTF8N, which you can check via decoder.getEncoding().
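For readers who only want the BOM check itself, here is a minimal sketch in present-day D that does not use Mango at all. The names Bom, BomInfo, startsWith and detectBom are made up for this illustration and are not part of UnicodeBom. It matches the leading bytes against the standard BOM signatures and reports Bom.none when nothing matches, which a caller could then treat as BOM-less UTF-8, in line with the default described above.

```d
// Hypothetical names throughout; this is not Mango's UnicodeBom API.
enum Bom { none, utf8, utf16LE, utf16BE, utf32LE, utf32BE }

struct BomInfo
{
    Bom    kind;  // Bom.none when no signature matched
    size_t skip;  // number of leading bytes occupied by the BOM
}

// True when data begins with the bytes in prefix.
bool startsWith(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

BomInfo detectBom(const(ubyte)[] head)
{
    static immutable ubyte[] sigUtf32BE = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] sigUtf32LE = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] sigUtf8    = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] sigUtf16BE = [0xFE, 0xFF];
    static immutable ubyte[] sigUtf16LE = [0xFF, 0xFE];

    // UTF-32LE is tested before UTF-16LE because FF FE 00 00 also starts
    // with FF FE. (UTF-16LE text that begins with U+0000 right after its
    // BOM is therefore reported as UTF-32LE; that ambiguity is inherent.)
    if (startsWith(head, sigUtf32BE)) return BomInfo(Bom.utf32BE, 4);
    if (startsWith(head, sigUtf32LE)) return BomInfo(Bom.utf32LE, 4);
    if (startsWith(head, sigUtf8))    return BomInfo(Bom.utf8,    3);
    if (startsWith(head, sigUtf16BE)) return BomInfo(Bom.utf16BE, 2);
    if (startsWith(head, sigUtf16LE)) return BomInfo(Bom.utf16LE, 2);
    return BomInfo(Bom.none, 0);
}
```

Four bytes are enough input for this check, so handing it the first 100 bytes of a file is harmless; only the prefix is inspected.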

Endian issues are handled by UnicodeBom. If you want the content converted to wchar[]/dchar[] instead, specify that as the template argument (instead of char).
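One way to picture how an encoding that carries byte order can feed converters that only know the code-unit width: split it into a width plus a "swap first" flag. The sketch below is present-day D under that assumption; Encoding, Width, Plan, planFor and swapUnits are hypothetical names for illustration and are not the Mango enums or functions.

```d
// Hypothetical types; not Mango's Unicode.* or Type.* enums.
enum Encoding { utf8, utf16LE, utf16BE, utf32LE, utf32BE } // carries byte order
enum Width    { utf8, utf16, utf32 }                        // width only

version (LittleEndian) enum bool hostIsLittleEndian = true;
else                   enum bool hostIsLittleEndian = false;

struct Plan
{
    Width width;  // which width-only converter to use
    bool  swap;   // whether bytes must be swapped to host order first
}

Plan planFor(Encoding e)
{
    final switch (e)
    {
        case Encoding.utf8:    return Plan(Width.utf8,  false);
        case Encoding.utf16LE: return Plan(Width.utf16, !hostIsLittleEndian);
        case Encoding.utf16BE: return Plan(Width.utf16,  hostIsLittleEndian);
        case Encoding.utf32LE: return Plan(Width.utf32, !hostIsLittleEndian);
        case Encoding.utf32BE: return Plan(Width.utf32,  hostIsLittleEndian);
    }
}

// Reverses the bytes of every complete unitSize-byte unit in place.
// A trailing partial unit is left untouched.
void swapUnits(ubyte[] data, size_t unitSize)
{
    for (size_t i = 0; i + unitSize <= data.length; i += unitSize)
    {
        foreach (j; 0 .. unitSize / 2)
        {
            immutable ubyte tmp = data[i + j];
            data[i + j] = data[i + unitSize - 1 - j];
            data[i + unitSize - 1 - j] = tmp;
        }
    }
}
```

With this split, the width-only converter never needs to know about LE or BE: by the time it sees the data, the units are already in host order.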

There's a description of how this operates within the file: http://trac.dsource.org/projects/mango/browser/trunk/mango/convert/UnicodeBom.d?rev=782
csauls



Joined: 27 Mar 2004
Posts: 278

Posted: Sat Feb 11, 2006 6:31 pm

I remember a post many moons ago in the NG that submitted a patch to DMD that would allow it to recognize non-UTF8 encodings without there being a BOM. Could we get such a creature into UnicodeBom, perchance? Maybe as an optional flag parameter (, determine = false) to .convert()?

This would open the door for building code/script parsers with good Unicode support.
_________________
Chris Nicholson-Sauls
Derek Parnell



Joined: 22 Apr 2004
Posts: 408
Location: Melbourne, Australia

Posted: Sat Feb 11, 2006 11:28 pm

csauls wrote:
I remember a post many moons ago in the NG that submitted a patch to DMD that would allow it to recognize non-UTF8 encodings without there being a BOM. Could we get such a creature into UnicodeBom, perchance? Maybe as an optional flag parameter (, determine = false) to .convert()?

This would open the door for building code/script parsers with good Unicode support.


For what it's worth, I've already got an algorithm coded and it's being used in Build. Basically, it reads a text file in any UTF format and loads it into RAM as UTF-8. It handles BOMs, endianness, and missing BOMs too.
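The algorithm used in Build is not shown in this thread. As a rough illustration of how an encoding can be guessed without a BOM, here is a minimal sketch in present-day D. It rests on one loudly stated assumption: the first character of the text is ASCII (U+0001 to U+007F), which is typical for source code and scripts. Under that assumption the position of the zero bytes in the first code unit gives away the width and byte order. Guess and guessWithoutBom are hypothetical names.

```d
// Hypothetical names; not the algorithm from Build and not a Mango API.
enum Guess { utf8, utf16LE, utf16BE, utf32LE, utf32BE }

// Assumes the text has no BOM and its first character is ASCII (non-zero).
Guess guessWithoutBom(const(ubyte)[] head)
{
    if (head.length >= 4)
    {
        // 00 00 00 xx : one ASCII character stored as big-endian UTF-32
        if (head[0] == 0 && head[1] == 0 && head[2] == 0 && head[3] != 0)
            return Guess.utf32BE;
        // xx 00 00 00 : one ASCII character stored as little-endian UTF-32
        if (head[0] != 0 && head[1] == 0 && head[2] == 0 && head[3] == 0)
            return Guess.utf32LE;
    }
    if (head.length >= 2)
    {
        // 00 xx : big-endian UTF-16
        if (head[0] == 0 && head[1] != 0)
            return Guess.utf16BE;
        // xx 00 : little-endian UTF-16
        if (head[0] != 0 && head[1] == 0)
            return Guess.utf16LE;
    }
    // No zero bytes in a telling position: fall back to UTF-8.
    return Guess.utf8;
}
```

This is only a guess: text that starts with a non-ASCII character defeats it, so a real implementation would want to look at more than the first four bytes.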
_________________
--
Derek
skype name: derek.j.parnell
csauls



Joined: 27 Mar 2004
Posts: 278

Posted: Sun Feb 12, 2006 1:16 am

Derek Parnell wrote:
For what it's worth, I've already got an algorithm coded and it's being used in Build. Basically, it reads a text file in any UTF format and loads it into RAM as UTF-8. It handles BOMs, endianness, and missing BOMs too.

I'll definitely take a look at that! Although I'm leaning toward using UTF32 (perhaps with an endianness converter in-between?) or else being able to specify a wanted encoding. Not a big deal, but it would help with my particular goal. *goes to peek at your code*
_________________
Chris Nicholson-Sauls
kris



Joined: 27 Mar 2004
Posts: 1494
Location: South Pacific

Posted: Tue Feb 14, 2006 5:05 pm

Derek Parnell wrote:
csauls wrote:
I remember a post many moons ago in the NG that submitted a patch to DMD that would allow it to recognize non-UTF8 encodings without there being a BOM. Could we get such a creature into UnicodeBom, perchance? Maybe as an optional flag parameter (, determine = false) to .convert()?

This would open the door for building code/script parsers with good Unicode support.


For what it's worth, I've already got an algorithm coded and it's being used in Build. Basically, it reads a text file in any UTF format and loads it into RAM as UTF-8. It handles BOMs, endianness, and missing BOMs too.

Sounds good!

Would you be willing to donate the algorithm that identifies encoding sans BOM?
teqdruid



Joined: 11 May 2004
Posts: 390
Location: UMD

Posted: Fri Feb 24, 2006 4:16 pm

I'm working on moving to UnicodeBomT, but I'd like to make the following changes to it:
Code:
===================================================================
--- mango/convert/UnicodeBom.d  (revision 782)
+++ mango/convert/UnicodeBom.d  (working copy)
@@ -194,7 +194,7 @@

         ***********************************************************************/

-        final T[] decode (void[] content)
+        final T[] decode (void[] content, void[] dst=null, uint* ate=null)
         {
                 // look for a BOM
                 auto info = test (content);
@@ -221,7 +221,7 @@
                        Unicode.error ("UnicodeBom.decode :: explicit encoding does not permit BOM");

                 // convert it to internal representation
-                return cast(T[]) into.convert (swapBytes(content), settings.type);
+                return cast(T[]) into.convert (swapBytes(content), settings.type, dst, ate);
         }

         /***********************************************************************
@@ -303,7 +303,7 @@

         ***********************************************************************/

-        private final void setup (int encoding)
+        public final void setup (int encoding)
         {
                 assert (Unicode.isValid (encoding));


The first change is so that I can use my own buffers. The second is to make UnicodeBomT reusable, like my parser is.
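To illustrate the "use my own buffers" idea outside of Mango, here is a minimal sketch in present-day D of a decoder that accepts an optional destination buffer and an optional consumed-bytes counter. The function decodeUtf16 is hypothetical, and the meaning given to ate here (number of input bytes consumed) is an assumption made for this example, not something the patch above states. It only reorders bytes into UTF-16 code units and does no validation.

```d
// Hypothetical helper; not Mango's decode() and not its converter API.
// Turns raw UTF-16 bytes into code units, reusing dst when it is big enough.
wchar[] decodeUtf16(const(ubyte)[] src, bool bigEndian,
                    wchar[] dst = null, size_t* ate = null)
{
    immutable size_t units = src.length / 2;  // whole 16-bit units available

    // Allocate only when the caller gave no buffer or one that is too small.
    if (dst.length < units)
        dst = new wchar[](units);

    foreach (i; 0 .. units)
    {
        immutable uint a = src[2 * i];
        immutable uint b = src[2 * i + 1];
        dst[i] = bigEndian ? cast(wchar)((a << 8) | b)
                           : cast(wchar)((b << 8) | a);
    }

    // Assumption: ate reports consumed input bytes, so a trailing odd byte
    // can be carried over by the caller into the next chunk.
    if (ate !is null)
        *ate = units * 2;

    return dst[0 .. units];
}
```

A reusable parser can then keep one buffer alive across calls and only pay for an allocation when a chunk is larger than anything seen before.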

Is this code ok?

~John
kris



Joined: 27 Mar 2004
Posts: 1494
Location: South Pacific

Posted: Sat Feb 25, 2006 11:36 am

Sure.

If you're gonna change decode() then you should also change encode() to provide similar options (better to be symmetrical, I think). If you're not using that recent lookupEncoding() addition, then it should probably be removed.

- Kris
teqdruid



Joined: 11 May 2004
Posts: 390
Location: UMD

Posted: Sat Feb 25, 2006 1:25 pm

kris wrote:
Sure.

If you're gonna change decode() then you should also change encode() to provide similar options (better to be symmetrical, I think). If you're not using that recent lookupEncoding() addition, then it should probably be removed.

- Kris


Sounds reasonable.

~John
All times are GMT - 6 Hours