Download Reference Manual
The Developer's Library for D
About Wiki Forums Source Search Contact
Version 3 (modified by Nietsnie, 16 years ago)
--

Collection Improvement Proposal

This page is intended to discuss and keep track of ideas on how to improve the collections and iterators on the current Tango collection package. The collection package is based on Doug Lea's collection package, which was released into the public domain in 1999, so it is substantially out of date.

Goals

  • Allow iterators that can remove elements while iterating over the collection
  • Allow bi-directional iterators when appropriate
  • Allow saving iterators into the middle of collections (such as for later removal or traversal)
  • Allow slicing
  • Explore collection concepts from other languages and how we can benefit from using their ideas.

Modifiable iterators

At the very least, an iterator should be able to remove the last iterated element. I have modified the current tree to allow this (see #965), but with the current iterator design, no other modifications make sense. For example, iterators provide a remaining() method, but this could potentially be a O(n) operation if the underlying container was modified.

Bi-directional iterators

This should be possible with the current collections package for containers that support it.

Copying of iterators to use in later operations

The current model is focused on only having one iterator into a container at a time, but there can be reasons to have multiple iterators. One example is to allow later removal of elements after the first search has found them in O(1) time.

Allow slicing

The rest of D is based on slicing, why can't slicing be available for collections too?

Explore other language collection constructs

There are plenty of good collection implementions to look at:

  • C#
  • C++ STL
  • Java (which was based on Doug Lea's I think)
  • Others?

References

Ticket on allowing iterator removal: #965 Ticket on allowing null keys/values in Hash: #626

Discussion

Comments
Author Message

Posted: 03/05/08 17:31:55

Add the ability for O(1) removals from double-linked lists please.

Posted: 03/05/08 18:29:00

How about adding support for custom allocators?

Posted: 03/06/08 06:16:46

If collections currently suffers from anything, it's from a lack of quick comprehension to anyone new to it. I think this might be partially due to terminology, but I have read comments on the NG and in IRC to the effect that users don't understand how to use the package. Perhaps a documentation issue but I don't think so, the docs themselves seem fine.

Whom I worry about are new adopters of Tango, and I think that increasing accessibility for them would be a noble goal. I'm thinking here more of the practical programmers who possibly aren't interested in the CS behind the scenes and are just looking for a List to use quickly for their project, and don't want to educate themselves on all the varieties of Sequence.

Quite possibly flexibility and simplicity are two very conflicting goals and it won't be possible, but I think it would be worth pondering if we could think up a way to support those who 'Just need a List' or 'I just want a Hash' with minimal pre-education requirements. Perhaps some sort of Aliases? Or perhaps even renaming the current collection members (HashMulti? instead of Bag? List instead of Seq? Just ideas). I don't have any particular ideas to support how to achieve this goal, I just thought I'd bring it up and see what anyone else had to say. In the meantime, I am going to try to dredge the NG and see if I can't find some of the complaints about tango collections, when I get a chance.

Posted: 03/06/08 06:20:31

Also, AFAIK, the current Tango collections are based on Doug Lea's work, but the current Java collections are not. I had a quick email conversation with him about this, as I recall, he mentioned something about losing the battle with the current Java collections or at least some aspect of them. Perhaps it's worth investigating why Java changed away from this particular implementation?

Posted: 03/17/08 11:24:15

Re: learning curve (darrylb) IMHO Examples offer by far the fastest learning method, when someone "just wants a simple list". Just my 2-cents anyway.

What I wanted to say, is that previously I've found the "insert" methods from the C++ STL maps useful, in that they check for the existence of a KEY, and only insert the new element if no old one with the same KEY was found. This is only actually useful if it gives a performance boost via not having to traverse the tree twice (something I never actually checked about the STL method so it may have merely been assumption on my part). Of course since hash-maps are primarily used over tree maps in D it may not be very relevent anyway.

Posted: 03/18/08 21:08:10

I created a new version of the collections which re-implements all the collection classes from Tango, attached as collection.tgz

The features of this collection package are:

  • struct-based iterators (easy to copy, no heap penalty), which act like pointers.
  • foreach-based removal (called a Purger)
  • easy to swap out Hash or Tree implementation
  • some slicing (I plan to add more, to the Tree-based implementations)
  • Interfaces where interfaces make sense, none where they don't (i.e. for iterators)
  • iterators don't become invalid on element addition/removal if the implementation supports it (Hash and Tree allow this)
  • renamed Bag to Multiset (as I was confused by the term Bag, and multiset makes more sense to me, as a set with multiple instances of elements).

Please criticize/comment on both the conceptual changes above and the implementation (if you get a chance to read it). I documented almost everything with comments (may not be ddoc-compliant). I plan to have this new version replace Tango's current version, as long as everyone agrees that it's an improvement. I'm not adverse to rewriting/rethinking whole pieces of it, so anything is fair game.

-Steve

Posted: 03/22/08 13:23:08 -- Modified: 03/22/08 13:23:50 by
yigal

The most comprehensive and "correct" implementation of a collections package (IMO) is in the Eiffel language. Eiffel supports multiple inheritance so they have a bunch of abstract base classes that define various attributes of a collection and each collection class is a combination of several such classes. that's a very flexible design.tango can provide interfaces (and mixins) to represent those attributes.

also, regarding iterators, we should provide with more smalltalk-ish interface for collections so it would be possible to use: collection.each( delegate literal ) and similar functions. D provides all the building blocks for that unlike java/c++ and so we need not limit ourself to the c++ iterator approach. 1 last thing, collections should provide polymorphism unlike c++ STL, and the package should use interfaces to separate the concept of a "list" for example from various implementations of it (single or double linked list...)

Eiffel collection package: http://www.gobosoft.com/eiffel/gobo/structure/index.html

Posted: 03/31/08 20:23:00

I thought I'd post here what Kris had objections to (discussed on IRC):

Basically, he had four gripes with my current implementation:

  • iterators are not consistent. I think this means the usage of iterators is not intuitive. I think he also means that there are other 'Iterator' types in the library, like LineIterator for instance, that are not consistent with what I have as iterators.
  • No read-only view support.
  • No custom allocator ability
  • No mutation tracking (in fact there is some in ArrayList, but he may have missed it), either manual or optional.

I'll respond to each of these points in turn:

iterator consistency

This is true that my iterators are not consistent to Kris' idea of what an iterator should be. For example, when he sees

iterator find(value v)

He expects the iterator to iterate over all the elements which contain that value. In my implementation, it returns a 'pointer' to the first instance of that value. The iterator can then be used to iterate over all the elements of the container, not just the ones that contain value.

I think the general problem is that I used the term 'iterator'. Really, my implementation uses the term 'iterator' like C++'s iterators. I think they can easily be renamed to something like 'cursor', which implies it denotes a position. This is, in fact, all my iterators do. They can do 4 things, move forwards, move backwards, get and optionally set value, and be used as a pointer to an element for methods in the collection. Basically, an iterator is a collection's 'pointer'. 'cursor' may describe this behavior better, and it is not inconsistent with already existing Iterators in the language, such as LineIterator.

Now, if we replace 'iterator' with 'cursor', the term 'Iterator' is fair game to be used where I had used 'Enumerable'. So what I propose to fix this issue is to change 'iterator' to 'cursor', and to change 'Enumerable' to 'Iterable' or 'Iterator'. Then 'Iterator' or 'Iterable' is a generic interface that can even be applied to LineIterator (basically, it implements a foreach method, and an optional length method).

read-only view support

Kris and I differ on this requirement. I believe read-only view support is only marginally useful without const, since a read-only version of a container can easily be upcasted to it's read/write version, and therefore, cannot be enforced by the compiler (upcasting is legal). So my opinion is to wait until D 2.0 is good enough to port Tango, and then const-ify the appropriate methods on the interfaces.

I may be wrong, but I would like to see an example of how this can be reliably solved without compiler enforcement, and without heap activity (which we all agree is bad).

In any case, if required this can be solved with a decorator class.

custom allocators

Yes, this should be implemented. I'm not super-familiar with how they work so I'm not sure how to use them, but I'll take a look at the current collections and see how it's done.

no mutation tracking

I only implemented this as needed, because I see no benefit to mutation tracking when the underlying structure itself can be mutated without affecting cursors. For example, if I remove a link-list node, and I have a cursor somewhere else pointing to a different node, why should that cursor be invalidated?

Posted: 04/17/08 20:37:41 -- Modified: 04/23/08 22:15:51 by
schveiguy

As this doesn't seem to be getting any traction, I think I'll just post what I've done so far to scrapple, and let whoever wants to use it use it.

I've changed the 'iterator' to 'cursor', changed 'Enumerable' to 'Iterator', fixed some bugs, and added HashMultiset?. I plan to add slicing to the Tree classes, but I haven't got around to it yet. I also need to fill out the docs, and then I'll be adding it to scrapple.

I don't think I'm going to implement allocators, because a custom allocator implies a custom destructor, and so the code for using a custom allocator is going to be separate from code that doesn't use it. Therefore, one can implement custom allocators by re-implementing the underlying implementations (Hash and RBTree), which is already supported. One thing that probably should be changed is LinkList? should allow for custom Link nodes.

UPDATE: I've created a new project on dsource called dcollections. I'll post something to D.announce when I think it's ready for general consumption. I keep thinking of new ideas to add to it :)

-Steve

Attachments

  • Cache.d (5.0 kB) -Example of generic functionality that could be added to collection. (As a Cache, but could also be added simply as a MRU/LRU or BoundedList?), added by darrylb on 03/05/08 20:54:11.
  • collection.tgz (20.4 kB) -New implementation of Collections to consider (re-implements lots of collection, using more D-style than Java style), added by schveiguy on 03/18/08 20:58:35.