
Ticket #641 (closed enhancement: fixed)

Opened 10 months ago

Last modified 9 months ago

Unicode Upper and Lower Case Mapping

Reported by: ptriller
Assigned to: kris
Priority: normal
Milestone: 0.99.3
Component: Core Functionality
Version: 0.99.1 RC4 Keep
Keywords: Unicode Casemapping
Cc: larsivi, sean

Description

Unicode Casemapping is missing in Tango.

I wrote one.

Attachments

UnicodeData.d.gz (22.8 kB) - added by ptriller on 09/23/07 15:01:06.
UnicodeConverter.d (22.1 kB) - added by ptriller on 09/25/07 02:09:52.
reader.pl (8.5 kB) - added by ptriller on 09/26/07 14:01:27.
UnicodeConverter.2.d (26.9 kB) - added by ptriller on 09/26/07 14:01:52.
UnicodeData.d.bz2 (124.2 kB) - added by ptriller on 09/26/07 14:03:06.
UnicodeData.zip (207.7 kB) - added by ptriller on 09/29/07 18:01:05.
NewUnicodeStuff.zip (95.3 kB) - added by ptriller on 10/03/07 18:45:46.
All-in-one package of the new size-reduced code

Change History

09/22/07 12:19:00 changed by ptriller

A pregenerated UnicodeData can be downloaded from http://www.soapwars.de/peter/UnicodeData.d

If necessary, it could be reduced in size to include only the data that is actually used at the moment.

09/22/07 19:30:10 changed by kris

  • owner changed from sean to kris.

Two things that jump out immediately:

1) Could the special case be indicated by a flag in UnicodeData? That would halve the number of array lookups for the vast majority of characters.

2) I would imagine the object file for all that data is quite considerable in size?

09/22/07 19:47:44 changed by ptriller

ad 1) Sure, that's possible, and probably a good idea.

ad 2) Yep, it's quite big. I will comment out everything that is not needed for plain case handling. This should reduce the size considerably, since it removes all strings and leaves just 4 dchars and a bool per entry (the bool being the flag from 1).

09/22/07 20:55:33 changed by ptriller

UnicodeData.d now contains only the absolutely necessary data, not the complete mapping.

SpecialCase lookups are now done only when necessary.
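As a rough illustration of the flag idea discussed above, the following sketch keeps the simple mappings in one table and consults a second table only for entries whose flag is set. It is written in present-day D (D2) rather than the 2007 Tango/D1 code in the attachments, and every name in it (`CaseEntry`, `table`, `specialUpper`, `upperOf`) is hypothetical; the two entries are sample data, not the generated table.

```d
import std.stdio : writeln;

// Hypothetical entry layout: four dchars plus one flag, as described above.
struct CaseEntry
{
    dchar code;
    dchar upper;
    dchar lower;
    dchar title;
    bool  special; // true only if a multi-character mapping exists
}

CaseEntry[dchar] table;        // simple one-to-one mappings
dstring[dchar]   specialUpper; // consulted only when the flag is set

static this()
{
    // Two illustrative entries, not the generated UnicodeData table.
    table['a']      = CaseEntry('a', 'A', 'a', 'A', false);
    table['\u00DF'] = CaseEntry('\u00DF', '\u00DF', '\u00DF', '\u00DF', true);
    specialUpper['\u00DF'] = "SS"d; // U+00DF uppercases to "SS"
}

dstring upperOf(dchar c)
{
    auto e = c in table;
    if (e is null)
        return [c].idup;       // unknown code point: returned unchanged
    if (e.special)
    {
        if (auto s = c in specialUpper)
            return *s;         // second lookup, only for flagged entries
    }
    return [e.upper].idup;     // common path: a single lookup
}

void main()
{
    writeln(upperOf('a'));
    writeln(upperOf('\u00DF'));
}
```

For the vast majority of characters the flag is false, so only the first lookup is paid.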

09/23/07 00:20:30 changed by kris

  • status changed from new to assigned.

Will we need to retain the "ctype" flags, to support all the isSpace, isAlpha, isDigit, isBlahBlah? Could those be represented as bit-flags?
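One way the bit-flag representation asked about here could look is sketched below. This is only an assumption about a possible layout, written in D2; the flag names, the `flags` table, and the helper functions are hypothetical and are not taken from the attached files.

```d
import std.stdio : writeln;

// Hypothetical flag bits; a real table would derive them from the
// Unicode general category of each code point.
enum : uint
{
    flagAlpha   = 1u << 0,
    flagDigit   = 1u << 1,
    flagSpace   = 1u << 2,
    flagSpecial = 1u << 3, // entry has a special-case mapping
}

uint[dchar] flags;

static this()
{
    // A few sample entries for illustration only.
    flags['a']      = flagAlpha;
    flags['7']      = flagDigit;
    flags[' ']      = flagSpace;
    flags['\u00DF'] = flagAlpha | flagSpecial;
}

// Returns the flag word for c, or 0 if c is not in the table.
uint flagsOf(dchar c)
{
    auto p = c in flags;
    return p ? *p : 0;
}

bool isAlpha(dchar c) { return (flagsOf(c) & flagAlpha) != 0; }
bool isDigit(dchar c) { return (flagsOf(c) & flagDigit) != 0; }
bool isSpace(dchar c) { return (flagsOf(c) & flagSpace) != 0; }
bool hasSpecialCase(dchar c) { return (flagsOf(c) & flagSpecial) != 0; }

void main()
{
    writeln(isAlpha('a'), " ", isDigit('7'), " ", hasSpecialCase('\u00DF'));
}
```

Each predicate costs one lookup and one mask, and several properties share a single word per entry.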

The other thing I'm wondering is how to best deal with utf8 & utf16 strings, like you were a few days back. Assuming the user provided a working buffer, it may well be a whole lot faster to convert everything to dchar[] right away, and then convert back? Would be interesting to see the difference in performance :)

- Kris

09/23/07 05:14:46 changed by ptriller

Well, the isXXX checks map onto the Unicode GeneralCategory (I left the enum in the UnicodeData file), so I could generate a bitfield to make them possible. I would probably then also put the special-mapping flag in this bitfield. I have to read up a little in the Unicode documentation to see what the classes actually mean, because I am, for example, not really sure whether "Letter, Modifier" or "Letter, Other" should actually be considered a letter. The same problem arises with digits: what is a "Number, Letter" or a "Number, Other"? I have to read up on unicode.org before I can say how these map. Also, "isUpper" and "isLower" are a little hard to tackle. When is a letter uppercase? If the lowercase mapping differs from the checked character, or if the uppercase mapping is identical to the checked character?
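The two candidate definitions of "uppercase" in the question above do not agree for characters without case, which a short experiment makes visible. The sketch below uses `toLower` and `toUpper` from D2's `std.uni` as stand-ins for the ticket's own mapping tables; the two function names are hypothetical.

```d
import std.stdio : writeln;
import std.uni : toLower, toUpper;

// Definition A: uppercase if the lowercase mapping differs from the character.
bool upperByLowerMapping(dchar c) { return toLower(c) != c; }

// Definition B: uppercase if the uppercase mapping is identical to the character.
bool upperByUpperMapping(dchar c) { return toUpper(c) == c; }

void main()
{
    foreach (dchar c; "Aa1"d)
        writeln(c, ": A=", upperByLowerMapping(c), " B=", upperByUpperMapping(c));
}
```

Both definitions agree on 'A' and 'a', but definition B also reports the digit '1' as uppercase, because a caseless character maps to itself. This is one reason a category-based check is more reliable than either mapping test alone.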

As for the performance tests, I will run those. I asked in the #d channel what foreach (dchar x; ...) does, and it _sounded_ speedy :)

09/23/07 05:23:57 changed by ptriller

Forget my concerns about finding out what means what; I can just copy the behaviour from ICU4J.

09/23/07 05:32:48 changed by ptriller

About the buffer: if I only want to do en bloc conversions rather than char by char, I would need three buffers: one to convert the input string to a dchar[], one to store the dchar[] output, and one to recode it back to the desired encoding.

Decoding dchar by dchar, with the output encoding done in some working buffers, sounds doable.

If you really want single-sweep recodings only (and I am not missing anything), the best way to do this would probably be to make a converter class with internal buffers that are reused.
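A minimal sketch of such a converter class follows, assuming D2 and Phobos rather than the Tango API of the time. The class name `CaseConverter` and its method are hypothetical, and `std.uni.toUpper` stands in for the ticket's own mapping tables, so special cases are not handled here.

```d
import std.array : Appender;
import std.stdio : writeln;
import std.uni : toUpper;

final class CaseConverter
{
    // Output buffer kept between calls so its storage can be reused.
    private Appender!(char[]) buf;

    // Returns a slice of the internal buffer. The result is only valid
    // until the next call, which overwrites the same storage.
    const(char)[] toUpperUtf8(const(char)[] input)
    {
        buf.clear();                 // keep the allocation, drop the contents
        foreach (dchar c; input)     // decodes UTF-8 one code point at a time
            buf.put(toUpper(c));     // put() re-encodes the dchar as UTF-8
        return buf.data;
    }
}

void main()
{
    auto conv = new CaseConverter;
    writeln(conv.toUpperUtf8("h\u00E9llo"));
    writeln(conv.toUpperUtf8("abc"));
}
```

The trade-off is the usual one for reused buffers: fewer allocations, but callers must copy the result if they need it to outlive the next call.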

09/23/07 14:23:32 changed by kris

sounds like this is gonna be a really great package :)

09/23/07 14:55:21 changed by ptriller

Did a lot of performance testing.

The foreach (dchar; char[] ...) loop is faster than bulk decoding the input beforehand. I am still not sure what it does, but it does a good job, although the differences are pretty minor.

It seems the need to run over the array twice compensates for the overhead I produce by recoding it char by char. Curiously, doing it char by char is even a little bit faster (1.88 sec compared to 2.12 sec).
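For readers unfamiliar with the two strategies being compared, the sketch below shows both in D2 with Phobos. It is not the benchmark code from the ticket, makes no claim about relative speed, and uses `std.uni.toUpper` in place of the ticket's tables; the function names are hypothetical.

```d
import std.stdio : writeln;
import std.uni : toUpper;
import std.utf : toUTF8, toUTF32;

// Variant 1: decode on the fly. A foreach over a UTF-8 string with a
// dchar loop variable decodes one code point per iteration.
string upperStreaming(string s)
{
    string result;
    foreach (dchar c; s)
        result ~= toUpper(c); // appending a dchar re-encodes it as UTF-8
    return result;
}

// Variant 2: bulk-decode to dchar[], convert in place, encode back.
// This walks the data more than once.
string upperBulk(string s)
{
    dchar[] buf = toUTF32(s).dup;
    foreach (ref c; buf)
        c = toUpper(c);
    return toUTF8(buf);
}

void main()
{
    writeln(upperStreaming("na\u00EFve"));
    writeln(upperBulk("na\u00EFve"));
}
```

Both variants produce the same output; they differ only in how many passes and temporary buffers they need.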

I can think of only one way to optimize this further: rip the code out of toUtf8, copy it into the toUpper/toLower methods, and incorporate it into the toUpper process, which would save the overhead I have right now when calling the toUtf8 method. I am not sure I want to code this, since it will get pretty messy if done in an optimized way.

I switched to associative arrays for the lookup maps. This sped things up by a factor of 6, but the setup code is a little messy now. I suppose an associative array also uses more memory at runtime (top shows that my test program uses 10 MB without the data and 11 MB with it), but the speedup is very noticeable, so I guess it's worth it.

09/23/07 15:01:06 changed by ptriller

  • attachment UnicodeData.d.gz added.

09/23/07 15:02:13 changed by ptriller

The currently attached UnicodeConverter.d still holds the "blockToUpper" I used to test the speed of block recodings.

09/23/07 16:46:37 changed by ptriller

Rewrote everything to use associative arrays, and added some functions like "isUpper" and "isPrintable".

The UnicodeData.d file grew significantly in size because of the added information needed to implement those.

09/23/07 16:50:43 changed by ptriller

OK, I am more or less done with this. I will probably add toFold for case folding sometime next week, and I need to write unit tests for the isXXXX functions.

09/23/07 18:14:42 changed by kris

Looking good!

Would it be more effective to leave all the data as it was before, and inject a small loop to load the AA at module-start time? For example:

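static this()
{
    // Build the lookup table once, at module-load time.
    // 'ref' (spelled 'inout' in older D1 compilers) makes &entry point into list itself.
    foreach (ref entry; list)
        aa[entry.code] = &entry;
}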

Or similar?
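Filled out into a complete program, the suggestion might look like the sketch below. It is written in D2, and the `Entry` layout, the sample data, and the `upperOf` helper are assumptions added for illustration; only the `static this()` loop comes from the comment above.

```d
import std.stdio : writeln;

// Hypothetical entry type; the real generated table has more fields.
struct Entry
{
    dchar code;
    dchar upper;
    dchar lower;
}

// The data stays a plain array literal, as in the earlier version.
Entry[] list = [
    Entry('a', 'A', 'a'),
    Entry('B', 'B', 'b'),
];

// Index built at module start; values point into list.
Entry*[dchar] aa;

static this()
{
    foreach (ref entry; list)
        aa[entry.code] = &entry;
}

dchar upperOf(dchar c)
{
    if (auto p = c in aa)
        return (*p).upper; // p points at the stored Entry*
    return c;              // not in the table: unchanged
}

void main()
{
    writeln(upperOf('a'));
    writeln(upperOf('?'));
}
```

This keeps the data file readable and moves the messy setup into one short loop, at the cost of a little start-up time.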

09/25/07 02:09:52 changed by ptriller

  • attachment UnicodeConverter.d added.

09/25/07 02:13:27 changed by ptriller

Yeah, much better. Fixed it.

For more or less complete Unicode support, case folding is still missing, as are decomposition, composition and normalisation. The composition and normalisation will need some reading up on the specs on my part, so I am not sure when I will get around to doing them.

09/26/07 14:01:27 changed by ptriller

  • attachment reader.pl added.

09/26/07 14:01:52 changed by ptriller

  • attachment UnicodeConverter.2.d added.

09/26/07 14:03:06 changed by ptriller

  • attachment UnicodeData.d.bz2 added.

09/26/07 14:05:39 changed by ptriller

Case folding is now implemented.

Due to a misclick of mine, the current converter file is now called UnicodeConverter.2.d

That's all for now. Normalisation, composition and decomposition might take a week or more.

The only thing I could do now is make UnicodeData.d smaller in source form (though not in compiled form). It would make it less human-readable, and I am not sure whether making the source file smaller has any benefits...

09/26/07 14:09:55 changed by ptriller

Well, title case might be something to be desired, but since it only affects the first letter of a word, and I am not really sure how to go about splitting a string into words (maybe the first letter after non-letters?), I haven't implemented it yet.
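The "first letter after non-letters" heuristic floated here can be sketched as follows. This is only that heuristic, not the Unicode definition of title casing; it is written in D2, the function name is hypothetical, and `std.uni.toUpper` stands in for a real title-case mapping.

```d
import std.stdio : writeln;
import std.uni : isAlpha, toLower, toUpper;

string titleCaseSketch(string s)
{
    string result;
    bool atWordStart = true; // true until the first letter of a word is seen
    foreach (dchar c; s)
    {
        if (isAlpha(c))
        {
            // First letter of a run of letters is raised, the rest lowered.
            result ~= atWordStart ? toUpper(c) : toLower(c);
            atWordStart = false;
        }
        else
        {
            result ~= c;        // non-letters pass through unchanged
            atWordStart = true; // and start a new word
        }
    }
    return result;
}

void main()
{
    writeln(titleCaseSketch("hello wORLD"));
}
```

A known weakness of this heuristic is that any non-letter starts a new word, so the letter after an apostrophe or a digit is also capitalised.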

09/26/07 14:18:07 changed by ptriller

Found the definition for title case. Implementing it.

09/26/07 15:48:03 changed by kris

I'll take a closer look at this over the weekend, and find a home for it after chatting with Larsivi and Sean

09/29/07 15:54:42 changed by kris

  • cc set to larsivi, sean.

09/29/07 17:49:12 changed by kris

Can you attach UnicodeData.d as a zip file, please? Or something else that Windows can understand :)

09/29/07 18:01:05 changed by ptriller

  • attachment UnicodeData.zip added.

09/29/07 18:06:45 changed by ptriller

This should be readable

10/02/07 15:00:41 changed by kris

  • milestone changed from 0.99.2 RC5 to 0.99.3 RC6.

We'll add this in next week, and give it a good thrashing :)

Nice work, by the way ... ICU is a bit heavyweight for many people and this provides much of the needed functionality. Any joy on trying to reduce the obj size?

10/03/07 18:45:46 changed by ptriller

  • attachment NewUnicodeStuff.zip added.

All-in-one package of the new size-reduced code

10/03/07 18:48:18 changed by ptriller

This is the size-reduced implementation. It has only one drawback: in some cases the toTitle functions might slow down a bit. Those cases are attempts to title-case letters that have no lower/upper/title case, mostly in Eastern languages; there I have to do an additional lookup in an associative array.

11/02/07 17:39:04 changed by kris

  • status changed from assigned to closed.
  • resolution set to fixed.

Larsivi has some more ideas on this, so let's review over IRC when convenient?