View previous topic :: View next topic |
Author |
Message |
smjg
Joined: 29 Sep 2004 Posts: 41
|
Posted: Fri Aug 12, 2005 6:56 am Post subject: TextEncoder/TextDecoder |
|
|
I'm now starting on the TextEncoder and TextDecoder classes. I gather the point is (correct me if I'm wrong):
- it should make use of preallocated buffers, meaning that no memory allocation will take place during a translation unless it is needed
- it will remember what incomplete characters were at the end of each input string, and stick them on the beginning of the next, making it possible to split up input and not require that the split be at a character boundary
- each TextEncoder and TextDecoder object will have its own buffers, both to preserve the remaining bytes buffer and to make the conversion functions re-entrant.
I see that we'd need to give TextEncoder/Decoder direct access to convertFrom/ToUtf8/16. So these classes would be placed in indigo.i18n.textcodec. At the moment I expect that the code for these classes won't end up too long.
I've noticed some issues:
- At the moment, only convertToUtf8 and convertToUtf16 have the input parameter declared as inout to hold a remainder. Is there a reason this isn't in convertFromUtfXX?
- Qt has makeEncoder and makeDecoder. Is there supposed to be any difference between these and calling the TE/D constructor directly?
- How should the sizes of TE/D's buffers be controlled? Just keep growing to fit? (We could also provide buffer length properties making it possible to shrink them again manually....) Or is there a better idea? Moreover, convertFromUtf8 and convertToUtf8 provide no means of telling the caller that they needed to reallocate the intermediate UTF-16 buffer. Maybe make the utf16Buffer parameters inout?
Stewart. |
|
Back to top |
|
|
uwe
Joined: 05 Apr 2005 Posts: 34 Location: Stuttgart, Germany
|
Posted: Sat Aug 13, 2005 3:11 am Post subject: Re: TextEncoder/TextDecoder |
|
|
smjg wrote: | - it should make use of preallocated buffers, meaning that no memory allocation will take place during a translation unless it is needed
- it will remember what incomplete characters were at the end of each input string, and stick them on the beginning of the next, making it possible to split up input and not require that the split be at a character boundary
- each TextEncoder and TextDecoder object will have its own buffers, both to preserve the remaining bytes buffer and to make the conversion functions re-entrant.
|
Perfect. You hit the nail right on the face.
smjg wrote: | I see that we'd need to give TextEncoder/Decoder direct access to convertFrom/ToUtf8/16. So these classes would be placed in indigo.i18n.textcodec. At the moment I expect that the code for these classes won't end up too long. |
I think so, too. We could also make the functions "package", but your solution is better.
smjg wrote: | At the moment, only convertToUtf8 and convertToUtf16 have the input parameter declared as inout to hold a remainder. Is there a reason this isn't in convertFromUtfXX? |
I thought they would not be necessary. This is of course nonsense. They need a remainder, too.
smjg wrote: | Qt has makeEncoder and makeDecoder. Is there supposed to be any difference between these and calling the TE/D constructor directly? |
No, i do not think so. They are provided for convenience, and call the constructor. We can leave them out if you think they're ballast.
smjg wrote: | How should the sizes of TE/D's buffers be controlled? Just keep growing to fit? (We could also provide buffer length properties making it possible to shrink them again manually....) Or is there a better idea? Moreover, convertFromUtf8 and convertToUtf8 provide no means of telling the caller that they needed to reallocate the intermediate UTF-16 buffer. Maybe make the utf16Buffer parameters inout? |
Well, the newly allocated buffer is returned as the conversion result. But this is not optimal, you are right. I would say we should specify a certain buffer size (perhaps slightly less than a page, about 4000 bytes), and always feed the TextCodec functions with BufferSize/codec.maxCharSize() characters. The virtual protected function TextCodec.maxCharSize() or some similar name would return the maximum length of a single Unicode code point encoded in the target encoding. This way we can be sure there will be no reallocation. The TD/E will always split up their input data into chunks of this size.
Ciao
uwe |
|
Back to top |
|
|
smjg
Joined: 29 Sep 2004 Posts: 41
|
Posted: Sat Aug 13, 2005 4:14 am Post subject: Re: TextEncoder/TextDecoder |
|
|
uwe wrote: | smjg wrote: | I see that we'd need to give TextEncoder/Decoder direct access to convertFrom/ToUtf8/16. So these classes would be placed in indigo.i18n.textcodec. At the moment I expect that the code for these classes won't end up too long. |
I think so, too. We could also make the functions "package", but your solution is better. |
Is it possible to make a function both package and protected? AIUI package functions don't override. And even if they did, it would stop users of your library from creating their own codecs.
uwe wrote: | smjg wrote: | How should the sizes of TE/D's buffers be controlled? Just keep growing to fit? (We could also provide buffer length properties making it possible to shrink them again manually....) Or is there a better idea? Moreover, convertFromUtf8 and convertToUtf8 provide no means of telling the caller that they needed to reallocate the intermediate UTF-16 buffer. Maybe make the utf16Buffer parameters inout? |
Well, the newly allocated buffer is returned as the conversion result. |
Only the buffer for the final result of the translation, be it UTF-8, UTF-16 or encoded. But the convertTo/FromUtf8 functions don't in any way send back the reallocated buffer for the intermediate UTF-16. But if we're going to keep the TE/D buffers the same size, we probably don't really need it.
uwe wrote: | But this is not optimal, you are right. I would say we should specify a certain buffer size (perhaps slightly less than a page, about 4000 bytes), and always feed the TextCodec functions with BufferSize/codec.maxCharSize() characters. The virtual protected function TextCodec.maxCharSize() or some similar name would return the maximum length of a single Unicode code point encoded in the target encoding. This way we can be sure there will be no reallocation. The TD/E will always split up their input data into chunks of this size. |
Good idea I reckon. But should maxCharSize be in bytes or code units? Probably consistent with moreNeeded either way....
What should TE/D do if sent a string that is indeed too long for its buffers? Of course, Indigo won't do this, but a user might try to. The underlying conversion functions would of course reallocate the buffers, but should it just pass the result through? Or throw an exception?
Stewart. |
|
Back to top |
|
|
uwe
Joined: 05 Apr 2005 Posts: 34 Location: Stuttgart, Germany
|
Posted: Sat Aug 13, 2005 11:25 pm Post subject: |
|
|
You are right, we have to make them protected.
maxCharSize() would return the number of bytes, i think. But if you can think of a better name, come forward with it. I don't like the idea that TE/D has to multiply maxCharSize() and moreNeeded() with unitSize() all the time. This just adds another virtual function call, but no functionality. Do you think we should get rid of unitSize()? moreNeeded() could return the number of bytes, too...
If the string sent to TE/D is too long, it is split and converted chunk by chunk, i would say.
Ciao
uwe |
|
Back to top |
|
|
smjg
Joined: 29 Sep 2004 Posts: 41
|
Posted: Sun Aug 14, 2005 8:50 am Post subject: |
|
|
uwe wrote: | You are right, we have to make them protected.
maxCharSize() would return the number of bytes, i think. But if you can think of a better name, come forward with it. I don't like the idea that TE/D has to multiply maxCharSize() and moreNeeded() with unitSize() all the time. This just adds another virtual function call, but no functionality. Do you think we should get rid of unitSize()? moreNeeded() could return the number of bytes, too... |
Come to think of it, I suppose having maxCharSize and moreNeeded return a number of bytes would be a good idea. Using only moreNeeded and not unitSize as well would slightly simplify the code to read a character from a TextStream. moreNeeded(null) would by extension give the minimum byte length of a character, which would usually be the same as the unit size.
uwe wrote: | If the string sent to TE/D is too long, it is split and converted chunk by chunk, i would say. |
But how? If a function is only called once, it can only return once. So how would you return the translated string chunk by chunk?
Stewart. |
|
Back to top |
|
|
uwe
Joined: 05 Apr 2005 Posts: 34 Location: Stuttgart, Germany
|
Posted: Sun Aug 14, 2005 11:44 am Post subject: |
|
|
Well, i guess somehow i am out of sorts today. Of course we cannot translate it chunk by chunk. We have to reallocate the buffer. You can use indigo.core.allocmore.passiveAllocMoreSize() to get a good size for the buffer to be allocated. Pass to it the minimum size:
Code: | void[] fromUtf16(wchar[] input)
{
if (input.length * maxCharSize >= m_foreignBuffer.length)
m_foreignBuffer.length = passiveAllocMoreSize(input.length * maxCharSize + 1);
// Convert...
} |
(The names, parameters, variables etc. are not the suggestion, this is only an example for the usage of passiveAllocMoreSize.)
I like the moreNeeded() idea. Let us get rid of unitSize(). The only problem is that i do not like the name maxCharSize(). But i cannot think of a better one either.
By the way, moreNeeded() does another very useful job, because we do need the minimum unit size for foreign-to-Unicode conversion. The length of the UTF-16 buffer that is needed is 2 * input.length / moreNeeded(null).
With all these calculations, perhaps we should define virtual methods for the TextCodecs that do all that stuff:
Code: | size_t moreNeeded(void[] str);
size_t maxEncodedLength(size_t utf16Length); // Unicode -> foreign
size_t maxDecodedLength(size_t strLength); // foreign -> Unicode
|
The first one is only needed for dealing with leftovers. The second and third function would give the estimates for buffer allocation. For almost all of the codecs, using maxDecodedLength instead of 2 * strLength / moreNeeded(null) should save half of the memory. What do you think about this idea?
Sample implementations for ASCII and most of the codepages (could become the default implementation in TextCodec):
Code: | size_t moreNeeded(void[] str)
{
return 0;
}
size_t maxEncodedLength(size_t strLen)
{
return strLen;
}
size_t maxDecodedLength(size_t strLen)
{
return strLen;
}
|
Decoding should then look somehow like this:
Code: | wchar[] toUtf16(void[] str)
{
size_t neededLen = m_codec.maxEncodedLength(str.length + 1);
if (neededLen >= m_utf16Buffer.length)
m_utf16Buffer.length = passiveAllocMoreSize((neededLen + 1) * wchar.sizeof) / wchar.sizeof;
// Now do the leftover processing and str conversion.
}
|
This looks much more straightforward, doesn't it? The +1 is for the trailing 0 of TextCodec.convertXXX().
Ciao
uwe |
|
Back to top |
|
|
smjg
Joined: 29 Sep 2004 Posts: 41
|
Posted: Mon Aug 15, 2005 2:55 am Post subject: Is it worth it? |
|
|
To be honest, I'm not sure that all this buffer reallocation complexity is worth it. There would be a performance hit if we checked every input string to see if it's too long, in the cases where the input has already been split into chunks as TextStream will. Also as many of the reallocations would be just in case - the buffer might already be big enough even if the heuristic suggests otherwise.
This reallocation would happen only if the user of TE/D has passed in a string that is too long. Would it make more sense to call this practice less efficient, and just let the output reallocated by TextCodec pass through?
Then the remaining issue is if TextEncoder needs more room to hold the input string. This will only matter when remainingBytes is non-empty, as it needs somewhere to contatenate the new input with it. When no concatenation is necessary, it can look at the input string in place.
Meanwhile, I'll work on getting TE/D working. Then we can look at how it should work in more detail.
Stewart. |
|
Back to top |
|
|
uwe
Joined: 05 Apr 2005 Posts: 34 Location: Stuttgart, Germany
|
Posted: Wed Aug 17, 2005 12:31 pm Post subject: |
|
|
Hmm, perhaps you are right about the complexity issue. The problem is that there is no "too long" input string. The TE/D is created, and fed the first string. The next string might be a bit longer, who knows? I want to ensure that the reallocations are rare, and if they happen, they happen in TE/D, not in TextCodec, because buffers that are reallocated there "get lost", that means they are used once and discarded.
About the handling of remainingBytes: i thought it would be much better to not concatenate the old remainingBytes with the new input string (as that means copying a lot of data around), but instead asking the TextCodec how many bytes it needs and concatenating only them to the remainingBytes. The output is directed to the beginning of the buffer, and afterwards the rest of the input string is converted behind it. This is another reason why the allocation should take place in TE/D: it makes this algorithm much simpler.
Code: | size_t inputStart = 0;
size_t outputStart = 0;
if (m_leftoverLength != 0)
{
inputStart = m_codec.moreNeeded(m_leftover[0 .. m_leftoverLength]);
m_leftover[m_leftoverLength .. (m_leftoverLength+inputStart)] = input[0 .. inputStart];
outputStart = m_codec.convertToUnicode(m_leftover[0 .. (m_leftoverLength+inputStart)], m_buffer).length;
}
outputStart += m_codec.convertToUnicode(input[inputStart .. length], m_buffer[outputStart .. length]).length;
return m_buffer[0 .. outputStart];
|
Somehow like this? Handling of the new leftover is missing, buffer management too. And perhaps it can be expressed a little more efficient.
Ciao
uwe |
|
Back to top |
|
|
smjg
Joined: 29 Sep 2004 Posts: 41
|
Posted: Wed Aug 31, 2005 9:43 am Post subject: |
|
|
uwe wrote: | Hmm, perhaps you are right about the complexity issue. The problem is that there is no "too long" input string. The TE/D is created, and fed the first string. The next string might be a bit longer, who knows? I want to ensure that the reallocations are rare, and if they happen, they happen in TE/D, not in TextCodec, because buffers that are reallocated there "get lost", that means they are used once and discarded. |
Only if we choose to discard them. What's to stop us using the reallocated buffer returned by the TextCodec functions as the new buffer? We could also use passiveAllocMoreSize to give it a bit of growing space whenever this happens.
Another drawback with heuristic allocation in TE/D is that it will cause unnecessary reallocations not only when the translated string actually does fit into the current buffer, but also of the UTF-16 buffer when converting to/from UTF-8, if using a codec that bypasses UTF-16 in doing so.
uwe wrote: | About the handling of remainingBytes: i thought it would be much better to not concatenate the old remainingBytes with the new input string (as that means copying a lot of data around), but instead asking the TextCodec how many bytes it needs and concatenating only them to the remainingBytes. The output is directed to the beginning of the buffer, and afterwards the rest of the input string is converted behind it. This is another reason why the allocation should take place in TE/D: it makes this algorithm much simpler.
Code: | size_t inputStart = 0;
size_t outputStart = 0;
if (m_leftoverLength != 0)
{
inputStart = m_codec.moreNeeded(m_leftover[0 .. m_leftoverLength]);
m_leftover[m_leftoverLength .. (m_leftoverLength+inputStart)] = input[0 .. inputStart];
outputStart = m_codec.convertToUnicode(m_leftover[0 .. (m_leftoverLength+inputStart)], m_buffer).length;
}
outputStart += m_codec.convertToUnicode(input[inputStart .. length], m_buffer[outputStart .. length]).length;
return m_buffer[0 .. outputStart];
|
Somehow like this? Handling of the new leftover is missing, buffer management too. And perhaps it can be expressed a little more efficient. |
Maybe that's a good idea. Also missing is handling the case of negative moreNeeded. But that should be straightforward to do.
Stewart. |
|
Back to top |
|
|
uwe
Joined: 05 Apr 2005 Posts: 34 Location: Stuttgart, Germany
|
Posted: Tue Sep 06, 2005 9:12 am Post subject: Back from vacations |
|
|
Hi Stewart,
i have returned from my vacations and workshops. I am currently learning for my exams, but there should be plenty of time to work on Indigo besides.
Using the buffer returned by TextCodec is not as good as TE/D reallocating its own buffer, i guess. I think it is a very simple way to implement it, and at least as fast as the alternative. But you are free to implement it another way if you think it is a lot easier/faster.
Anyways, moreNeeded() currently returns a size_t, which will create a conflict if a derived class wants to return negative values. Is this needed, by the way? It could also return the maximum as a positive value, and TE/D could try to deal with possible "leftovers" of the conversion of the last leftover (this is a bad explanation, sorry).
Ciao
uwe |
|
Back to top |
|
|
smjg
Joined: 29 Sep 2004 Posts: 41
|
Posted: Tue Sep 06, 2005 10:06 am Post subject: Re: Back from vacations |
|
|
uwe wrote: | Hi Stewart,
i have returned from my vacations and workshops. I am currently learning for my exams, but there should be plenty of time to work on Indigo besides.
Using the buffer returned by TextCodec is not as good as TE/D reallocating its own buffer, i guess. I think it is a very simple way to implement it, and at least as fast as the alternative. But you are free to implement it another way if you think it is a lot easier/faster.
Anyways, moreNeeded() currently returns a size_t, which will create a conflict if a derived class wants to return negative values. Is this needed, by the way? It could also return the maximum as a positive value, and TE/D could try to deal with possible "leftovers" of the conversion of the last leftover (this is a bad explanation, sorry).
Ciao
uwe |
I invented moreNeeded to return an int. Why have you changed it to return a type that's incompatible with its semantics? You seem to be missing the point of moreNeeded.
The point is that it returns the minimum, and for a good reason: so that one can know how many bytes to read from a stream. It's useless to try and read more bytes than needed from some streams such as stdin.
If moreNeeded(null) returns positive, then TextStream can be fast, since the length of a character is constant. So after reading this number of bytes, it doesn't have to call moreNeeded again to determine if there's anything more to retrieve before we have a character. OTOH if moreNeeded(null) returns negative, then TextStream will first get this number of bytes and then call moreNeeded, and so on until moreNeeded returns non-negative IWC the character can be finished.
BTW I discovered a DMD bug while compiling the latest version you provided. Hopefully it'll be fixed soon....
But I have been able to do more work on it in the meantime....
Stewart. |
|
Back to top |
|
|
uwe
Joined: 05 Apr 2005 Posts: 34 Location: Stuttgart, Germany
|
Posted: Tue Sep 06, 2005 1:06 pm Post subject: |
|
|
Sorry then i must have overseen the semantics of moreNeeded(). Just change it back :oops: |
|
Back to top |
|
|
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
Powered by phpBB © 2001, 2005 phpBB Group
|