Streaming Unicode from a file

Posted: 11/01/07 15:04:58

Hi,

I've been trying to write code that reads Unicode text from a conduit and decodes it into dchars, but I find it hard to juggle the various Tango functions involved and the need to buffer bytes that haven't been decoded yet. I keep running into subtle bugs such as lost characters.

Does anyone have working code for decoding UTF, ideally without the need to load the whole file into memory?

Thanks, Sebastian

Posted: 11/01/07 16:15:24 -- Modified: 11/01/07 17:10:23 by kris

Have you looked at UnicodeFile?

Posted: 11/01/07 19:23:14

Thanks - I'm going to use that method for now (reading in the whole file as one chunk), but I really want to be able to stream content, i.e. read in e.g. 8K blocks of the file at a time and decode them. If anyone has any code doing that I'd still appreciate it.
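
For reference, the block-at-a-time approach described above can be sketched roughly as follows. This is Python rather than D, and it uses Python's standard `codecs` module, not any Tango API; the function name `decode_stream` and the sample string are made up for illustration. The key point is that the decoder holds on to an incomplete trailing sequence between blocks, which is exactly the buffering that tends to lose characters when done by hand.

```python
import codecs
import io

def decode_stream(stream, encoding="utf-8", block_size=8192):
    """Yield decoded text from a binary stream, one block at a time.

    Hypothetical helper: an incremental decoder carries incomplete byte
    sequences over from one block to the next.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        block = stream.read(block_size)
        if not block:
            break
        # Bytes of a character split across the block boundary stay
        # inside the decoder until the rest of the character arrives.
        text = decoder.decode(block)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the input ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# A tiny block size forces multi-byte characters to straddle block boundaries.
sample = "naïve €uro 𝄞".encode("utf-8")
result = "".join(decode_stream(io.BytesIO(sample), block_size=3))
```

With a realistic block size such as 8192 the logic is unchanged; only the number of boundary cases per file goes down.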

Posted: 11/01/07 20:00:24

If it doesn't show up, I'll write one during the weekend. We could use one in tango.io.stream :)

Posted: 11/03/07 01:23:28

I noticed that part of my problem stems from the UnicodeBom.decode() function. It returns the number of code units decoded in the "ate" variable, but there is no indication whatsoever of whether a BOM was also read from the stream. So it's impossible to know how many bytes were actually consumed by the function call.
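
One way around the missing information is to detect the BOM in a separate step that reports its length explicitly. The sketch below is Python, not Tango, and `sniff_bom` is a hypothetical name; it only illustrates the idea of returning the BOM byte count alongside the detected encoding, so the caller always knows how many bytes belong to the signature.

```python
import codecs

# Longer signatures first: the UTF-32 LE BOM (FF FE 00 00) starts with
# the UTF-16 LE BOM (FF FE), so the order of the checks matters.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(head, default="utf-8"):
    """Return (encoding, bom_length) for the first bytes of a stream.

    bom_length is 0 when no BOM is present, in which case the
    caller-supplied default encoding is assumed.
    """
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return encoding, len(bom)
    return default, 0
```

Note the inherent ambiguity: UTF-16 LE text whose first character after the BOM is U+0000 looks identical to a UTF-32 LE BOM, and this sketch resolves it in favour of UTF-32.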

Posted: 11/03/07 02:13:58

I have a UtfStream which converts from one known encoding into another. You need BOM decoding there also? Yuck :p

Posted: 11/03/07 21:45:32

Ideally I'd also like to decode the BOM while reading, but I guess I could also find a workaround for the time being. Thanks, I'll try and switch my source code to UtfStream.

Posted: 11/04/07 00:02:00 -- Modified: 11/04/07 00:02:38 by kris

A BomInput stream would need to look for (and consume) a BOM when constructed, via the provided input stream, and then create an appropriate combination of UtfStream and EndianStream as additional inline filters.

Presumably a BomOutput would accept a type in the ctor, and construct a similar set of appropriate filters on the fly. The ctor would also have a flag indicating whether to write a BOM within the scope of the ctor.

Does that sound about right?
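
As a rough illustration of the input side of that design, here is a minimal sketch in Python (not D, and not the Tango API). The class name `BomInput` is borrowed from the post above, but everything else is an assumption: it uses a single incremental decoder from Python's `codecs` module where the post proposes a UtfStream plus EndianStream combination, and it assumes `read(4)` returns four bytes whenever that many are available.

```python
import codecs
import io

class BomInput:
    """Hypothetical wrapper: consume a BOM at construction, then decode."""

    # Longer signatures first: the UTF-32 LE BOM begins with the UTF-16 LE one.
    _BOMS = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]

    def __init__(self, stream, default="utf-8"):
        self._stream = stream
        head = stream.read(4)  # no BOM is longer than four bytes
        self.encoding = default
        skip = 0
        for bom, name in self._BOMS:
            if head.startswith(bom):
                self.encoding, skip = name, len(bom)
                break
        self._decoder = codecs.getincrementaldecoder(self.encoding)()
        # Bytes read past the BOM are payload, so they go through the decoder.
        self._pending = self._decoder.decode(head[skip:])

    def read(self, block_size=8192):
        """Return the next chunk of decoded text, or '' at end of input."""
        while not self._pending:
            block = self._stream.read(block_size)
            if not block:
                # Raises UnicodeDecodeError if the input ended mid-character.
                return self._decoder.decode(b"", final=True)
            self._pending = self._decoder.decode(block)
        text, self._pending = self._pending, ""
        return text

# Usage: a UTF-16 LE file with a BOM, read in deliberately awkward 3-byte blocks.
data = codecs.BOM_UTF16_LE + "héllo 𝄞".encode("utf-16-le")
source = BomInput(io.BytesIO(data))
chunks = []
while True:
    chunk = source.read(block_size=3)
    if not chunk:
        break
    chunks.append(chunk)
text = "".join(chunks)
```

Doing the detection in the constructor means callers of `read` never see the BOM and never have to account for its bytes themselves.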

Posted: 11/04/07 02:05:02

Slightly off-topic: in the UnicodeBom class, should the lookup array be declared static const instead of just const? Is it just const because a UnicodeBom is instantiated rarely enough that the overhead of having static data is unnecessary?

Posted: 11/04/07 22:31:48

kris: That's pretty much what would do all of my work for me. :)

Thanks for helping, I was getting kind of frustrated there ;)

Posted: 11/05/07 02:07:59

r.lph50 wrote:

Slightly off-topic: in the UnicodeBom class, should the lookup array be declared static const instead of just const? Is it just const because a UnicodeBom is instantiated rarely enough that the overhead of having static data is unnecessary?

in D1, const amounts to the same thing?

Posted: 11/05/07 05:23:21 -- Modified: 11/05/07 10:30:05 by r.lph50

kris wrote:

in D1, const amounts to the same thing?

Sorry for polluting the thread. I didn't really know, so I wrote some D1 code to check: if you initialise a const field in the class declaration, it is equivalent to static const, but if you initialise the same field in a constructor, the field is const per instance.

Posted: 11/05/07 07:17:25

that's interesting. Thanks!