
opCmp in Fiber crash?

Posted: 04/25/07 19:06:20

I was wondering if you had any insight into this crasher bug. I opened a ticket about it a while back (#391) but I included code that wasn't as simple as possible. I suspect that when you saw the huge test case you decided to look at it later. I was able to greatly simplify the code. At this point, removing any features will make the bug not happen.

I am trying to implement a reactor pattern and have a heap full of objects that each encapsulate a function/delegate and the time at which it is supposed to fire. The objects implement opCmp so that they sort in temporal order. This is part of a larger framework I wrote for asynchronous, event-driven servers. I found Tango to be excellent for this because of the included Fiber and Selector APIs. I was stunned at how simple it was to implement and how beautiful the code looked (thanks to D's excellent function templating facilities). However, my app crashed randomly under load, and it took me a while to figure out that this bug was causing at least some of the random crashes. This may or may not be the only problem I am experiencing, but I can't really be certain, or proceed at all, until this gets fixed. To be honest, I'm pretty frustrated at this point and am contemplating switching to Phobos and using StackThreads, writing my own interface to epoll, etc. Of course, I don't really want to do that.

I know you guys are volunteers and I greatly appreciate the hard work you put into Tango. However, I know from watching the checkins that you guys are usually extremely quick to fix any real problems in Tango and I am surprised that this bug has languished. So I thought I would respectfully point out that the test case has been greatly simplified and hope that you could take a look at it or suggest a workaround.

aTdHvAaNnKcSe

import tango.core.Thread;

interface IDelayedCall {
  int opCmp(IDelayedCall o);
}

class DelayedCall(Callable) : IDelayedCall {
  Callable f;

  this(Callable f){
    this.f = f;
  }

  int opCmp(IDelayedCall o) {
    return 0;
  }
}

void main() {
  auto f = delegate void() {};
  auto fun = delegate void() {
    auto m = new DelayedCall!(void delegate())(f);
    auto n = new DelayedCall!(void delegate())(f);
    auto a = m > n;
  };
  Fiber x;

  while(1) {
    x = new Fiber(fun);
    x.call();
  }
}

Posted: 04/25/07 20:08:40

Sorry about that. I dread tracking down bugs in the fiber code, and this didn't seem to be an issue I could even reproduce easily. I'll give it a go with the latest test case and see if I can figure it out.

Posted: 04/25/07 20:31:51 -- Modified: 04/25/07 20:34:16 by drox -- Modified 2 Times

Thanks! I'll admit it is really scary, which in my case is compounded since I know nothing about the rest of the Tango runtime either. BTW, I went back to my previous thread about Fiber issues and see there is another very similar test case using opCmp but that class isn't templated. I didn't force the issue at the time since I didn't have GDC working and didn't know if it was a compiler issue.

Since this crashes without a templated class, but removing the template from the above test case causes it to not fail, it sounds like some sort of alignment issue (memory or the planets maybe?). Perhaps the wrong offset into memory is being used (it is a segfault), but it won't be triggered unless the object is a certain size? The other weird thing is that it doesn't happen on the first shot, which is why both test cases have an infinite loop.

I suspect I'm one of the few, if not the only one, using Fiber... I guess I just come from a different background which makes my favorite design pattern monothreading. ;-)

import tango.core.Thread;

class boo {
  double x;
  this(double z) {
    x = z;
  }
  int opCmp(boo o){
    if (x < o.x)
      return -1;
    else if (x > o.x)
      return 1;
    return 0;
  }
}

void main() {
  while(1) {
    auto fun = delegate void() {
      auto x = new boo(1.0);
      auto y = new boo(2.0);

      auto n = x > y;
    };
    auto x = new Fiber(fun);
    x.call();
  }
}

Posted: 05/04/07 16:07:27

I haven't been able to reproduce this with DMD on Win32 or Linux. I suppose you're using GDC?

Posted: 05/04/07 20:24:38

Hrm, the above example isn't crashing for me either, but this one does crash for me with both dmd-1.014 and gdc-0.23 on Ubuntu Linux x86.

import tango.core.Thread;

interface IDelayedCall {
  int opCmp(IDelayedCall o);
}

class DelayedCall(Callable) : IDelayedCall {
  Callable f;

  this(Callable f){
    this.f = f;
  }

  int opCmp(IDelayedCall o) {
    return 0;
  }
}

void main() {
  auto f = delegate void() {};
  auto fun = delegate void() {
    auto m = new DelayedCall!(void delegate())(f);
    auto n = new DelayedCall!(void delegate())(f);
    auto a = m > n;
  };
  Fiber x;

  while(1) {
    x = new Fiber(fun);
    x.call();
  }
}

Posted: 05/21/07 15:23:50

Weird. On Win32, that app exits almost immediately with no error. Which it shouldn't. I'll look into it.

Posted: 05/21/07 20:01:25

For what it's worth, I've got the repro trimmed down to this:

import tango.core.Thread;

void fun()
{
    auto m = new Object;
    auto n = new Object;
    auto a = m > n;
}

void main()
{
    Fiber x = new Fiber(&fun);

    while(1)
    {
        x.call();
        x.reset();
    }
}

Interestingly, if I replace the objects with int pointers:

void fun()
{
    auto m = new int;
    auto n = new int;
    auto a = *m > *n;
}

The app works, so the problem is almost definitely in the Fiber code rather than the GC code. I don't understand how this error can occur intermittently though, since the same Fiber is being re-used across all calls. I'll keep experimenting, but this one may take me a while to figure out.

Posted: 05/22/07 04:23:59

After I logged off I realized that the only way for this sample to exhibit unpredictable behavior (which it does) is if the problem is somehow related to garbage collection. So I played with things a bit and figured out the problem. Basically, if a collection occurs while a Fiber is running but before that Fiber has called Fiber.yield() at least once, then any data referenced only by that Fiber will be cleaned up. I can't believe no one has seen this before now. You must be the only person using Fibers regularly :-) In any case, this is the result of a minor oversight on my part, and it's a one-line fix. I'll check in the change tomorrow.