Download Reference Manual
The Developer's Library for D
About Wiki Forums Source Search Contact

Ticket #282 (assigned enhancement)

Opened 17 years ago

Last modified 14 years ago

tango.text.convert.Utf accepts some invalid UTF sequences

Reported by: Deewiant Assigned to: kris (accepted)
Priority: normal Milestone: 2.0
Component: Core Functionality Version:
Keywords: triage Cc: fawzi@gmx.ch

Description

Attached is a UTF conversion testing module and its output when used with Tango's tango.text.convert.Utf functions. What Tango fails at is that it's too liberal in what it accepts. To name a few: UTF-16 surrogates encoded in UTF-8 or UTF-32, overlong representations, UTF-8 continuation bytes that aren't part of a whole sequence.

The forum topic http://www.dsource.org/projects/tango/forums/topic/13 has a UTF-8 to UTF-32 converter which works for a single code point, returning a dchar. Fixing the toUtf32 functions, at least, should be trivial if it is used as a helper function. The toUtf16 ones I haven't looked at too much, but shouldn't be too difficult.

Attachments

utf_test.d (26.8 kB) - added by Deewiant on 02/17/07 14:51:42.
UTF conversion testing module
utf_test_tango.txt (8.5 kB) - added by Deewiant on 02/17/07 14:52:49.
Testing module results for Tango 0.95 beta 1

Change History

02/17/07 14:51:42 changed by Deewiant

  • attachment utf_test.d added.

UTF conversion testing module

02/17/07 14:52:49 changed by Deewiant

  • attachment utf_test_tango.txt added.

Testing module results for Tango 0.95 beta 1

02/26/07 12:59:22 changed by larsivi

  • milestone set to 0.96 Beta 2.

03/03/07 00:06:12 changed by kris

  • priority changed from major to normal.
  • status changed from new to assigned.
  • version deleted.
  • milestone changed from 0.96 Beta 2 to 1.0.

03/20/07 09:38:55 changed by Deewiant

  • cc set to deewiant@gmail.com.

05/10/07 04:42:26 changed by kris

haven't forgotten about this :)

05/24/08 19:09:39 changed by larsivi

  • keywords set to triage.

01/11/09 20:58:43 changed by JarrettBillingsley

  • status changed from assigned to closed.
  • resolution set to fixed.

(In [4242]) Closes #282, thanks Deewiant for a lot of the hard work. Also added some checks for things that Deewiant's tests didn't account for, like unmatched surrogate pairs in UTF-16. Also, dealing with surrogate pairs is now handled much more optimally, rather than transcoding the string twice.

01/11/09 21:19:22 changed by larsivi

  • milestone changed from 1.0 to 0.99.8.

01/12/09 10:53:54 changed by kris

  • status changed from closed to reopened.
  • resolution deleted.

have you done performance testing between pre and post changes? You may have noticed some commentary in the source file discussing throughput, and issues related to 32bit x86 compilers ...

Frankly, there's more to this than just adding some sanity checks. For example, in many cases one already knows the content is valid, and throughput is of primary concern instead. These particular functions were intended for the later cases only, and I feel we should be able to separate validation from transcoding where desired.

Suggestions?

01/12/09 17:22:09 changed by larsivi

I didn't look at the codepaths, but I assume the best alternatives would be either a validate argument or a validating wrapper.

01/15/09 02:26:00 changed by kris

yeah, probably a wrapper, or separate package. We didn't come up with a solid resolution in the past, but it's about time we did :)

01/16/09 11:17:28 changed by fawzi

I tend to agree that validation should not necessarily be performed, it should be possible to just convert as fast as possible

Invalid UTF sequences might make the function throw an exception, but should *never* make the function do unsafe things (like go ouside the bounds). These things should be guarded against (there was such an issue in the compilers reverse for example).

As doing a full validation is probably cheaper if done while converting the best solution in my opinion would be a compile time template switch (bool validate=false ?) that switches on the full validation (and might slow down the conversion a little).

With respect to UTF conversion I also found a template like this very useful in my code

template convertToString(T=char){
    T[]convertToString(S)(S[]src,T[]dest=null){
        static if(is(T==S)){
            return src;
        } else static if(is(T==char)){
            return Utf.toString(src,dest);
        } else static if(is(T==wchar)){
            return Utf.toString16(src,dest);
        } else static if(is(T==dchar)){
            return Utf.toString32(src,dest);
        } else {
            static assert(0,"unexpected char type "~T.stringof);
        }
    }
}

01/16/09 11:30:59 changed by fawzi

  • cc changed from deewiant@gmail.com to deewiant@gmail.com, fawzi@gmx.ch.

01/16/09 11:34:10 changed by Deewiant

  • cc changed from deewiant@gmail.com, fawzi@gmx.ch to fawzi@gmx.ch.

01/16/09 12:06:38 changed by Cyborg16

I suspect most users aren't going to care about validation though? Especially when the conversion might be done by one of the Utf streams or UnicodeBOM. I'm not sure what I'd want my application to do if an error came up anyway, except print an extra error message.

And those that do want to full validation may not want to convert the text? Unless they're fast checks intended to catch errors earlier.

01/16/09 17:23:52 changed by kris

I've reverted the additions until we have a plan in place to address concerns

01/17/09 22:36:23 changed by kris

  • milestone changed from 0.99.8 to 0.99.9.

11/09/09 09:39:29 changed by kris

  • milestone changed from 0.99.9 to 2.0.

11/13/09 23:14:12 changed by kris

  • status changed from reopened to new.

11/13/09 23:14:56 changed by kris

  • status changed from new to assigned.

08/31/10 10:22:31 changed by larsivi

  • type changed from defect to enhancement.