
Tokenizer / buffer.next method

Moderators: larsivi kris

Posted: 08/16/07 15:53:15

Hi guys,

Currently I'm writing a tokenizer/lexer for a programming language. For that I use the next() method of my FileConduit's Buffer, which calls a user-defined parse function; that function either parses the input and returns the number of chars actually consumed for that token, or returns IConduit.Eof to ask for more data.
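To make that concrete, here is roughly how I wire it up (a sketch from memory, so the exact Buffer/next() signatures may be slightly off):

import tango.io.Buffer;
import tango.io.FileConduit;
import tango.io.model.IConduit;

void main ()
{
    auto file   = new FileConduit ("input.src");
    auto buffer = new Buffer (file);

    // the delegate reports how many chars it consumed for one token,
    // or IConduit.Eof to request more content from the conduit
    uint parse (void[] raw)
    {
        auto data = cast(char[]) raw;
        if (data.length is 0)
            return IConduit.Eof;
        // ... recognize one token in data ...
        return 1;
    }

    // keep scanning until the conduit is drained
    while (buffer.next (&parse)) {}
    file.close;
}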

First question: is this the intended way, or is there a better option in Tango?

If it is supposed to work that way, I have a problem: in a tokenizer you often decide whether the characters read so far form a complete token by looking at the next one. But what do you do if the next one is not in the buffer? You call for more by returning IConduit.Eof, and on the next run of your function you hopefully get enough data to decide. But what if the conduit has reached end of file? The buffer will just skip the readable content and be happy, although it might have been a valid token.

Example:

Buffer has space for 1024 bytes, currently there is only "+" inside and there is no further data in the attached conduit.

My parse function, which I pass to next(), does the following. Normally it works like a charm and, by the way, is very elegant thanks to this architecture!

if (data[0] == '+')
{
  // now we have to decide with the next char
  if (data.length <= 1)
  {
    // call for more data (*)
    return IConduit.Eof;
  }

  if (data[1] == '=')
  {
    // token is "+=", consume both chars
    return 2;
  }
  else
  {
    // token is "+", consume one char
    return 1;
  }
}

Is there a clean way to find out whether the buffer has reached Eof? If this were possible I would be really happy, as my tokenizer code looks beautiful and I don't want to change the architecture too much ;)

PS: In "normal" lexers this problem does not exist, because you always have your '\0' character, which is then different from '=' and thus validates the '+' token.
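For illustration, that sentinel trick would look something like this if the whole input sat in one array (hypothetical sketch; loadWholeInput is a made-up placeholder, not a Tango call):

// hypothetical sketch: whole source in memory, with a NUL sentinel appended
char[] loadWholeInput () { return "+".dup; }   // made-up placeholder

void lexAll ()
{
    char[] src = loadWholeInput ();
    src ~= '\0';                      // sentinel: one-char lookahead always valid

    size_t i = 0;
    if (src[i] == '+')
    {
        if (src[i + 1] == '=')        // safe even at the very end, thanks to '\0'
            i += 2;                   // token is "+="
        else
            i += 1;                   // token is "+"
    }
}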


Posted: 08/16/07 17:20:33

Hello.

I've tried the thing you describe (it was a parser/lexer, too). Through trial and error I found out that this is a bad way to do things: parsing like this takes ages on files that don't fit into a small buffer, because it keeps going back to disk (in my case it slowed down a lot)... You'd be better off with a simple wrapper on top of a char[] array (or whatever) -- or make your buffer big enough. Use UnicodeFile to get the text in your preferred encoding, and do the work on that. I used throw/catch to get out of the function passed to Buffer.next(). Unfortunately I've lost the code, so I'm telling you what I remember.
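The wrapper I mean is something like this (a reconstructed sketch, since my code is gone; the UnicodeFile call and the Encoding enum are from memory, so double-check them):

import tango.io.UnicodeFile;

struct Cursor
{
    char[] src;
    size_t pos;

    // returns '\0' past the end, so one-char lookahead is always safe
    char peek (size_t ahead = 0)
    {
        auto i = pos + ahead;
        return i < src.length ? src[i] : '\0';
    }

    void advance (size_t n) { pos += n; }
}

void lex (char[] path)
{
    // load the whole file in one go, then tokenize the array
    auto file = new UnicodeFile!(char) (path, Encoding.Unknown);

    Cursor cursor;
    cursor.src = file.read;

    if (cursor.peek(0) == '+')
        cursor.advance (cursor.peek(1) == '=' ? 2 : 1);
}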

Posted: 08/17/07 16:38:30

In general I agree with dima-san ... parsing tokens from a stream is bound to be slower than operating on an array directly. If performance were a priority, I'd read all the content into an array and go with that instead.

However, sometimes streaming is more appropriate. For example, parsing content from a socket can often be completed by the time all the data has arrived, rather than having to wait for all of it first. Or the file may be too big to fit in memory, or there may be too many big files to load, or whatever. That's why, for example, stream iterators exist in addition to the array-based token-parsing support in tango.text.Util.
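For instance, line-based stream iteration looks something like this (class and module names quoted from memory, so verify them against the docs):

import tango.io.FileConduit;
import tango.text.stream.LineIterator;

void consume ()
{
    // iterate a conduit line by line, without loading the whole file;
    // each 'line' is a transient slice into the iterator's buffer
    foreach (line; new LineIterator!(char) (new FileConduit ("input.src")))
    {
        // ... feed 'line' to the tokenizer ...
    }
}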

As for Eof testing, this would probably have to be maintained by the app ... perhaps by testing an Eof state set by the (custom) tokenizer itself? E.g. you could make the tokenizer part of a class or struct, and maintain some state therein?
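Something along these lines, perhaps (a sketch only; it reuses your '+' example, and the next() return semantics are assumed rather than verified):

import tango.io.Buffer;
import tango.io.model.IConduit;

class Tokenizer
{
    private bool eof;   // set once the conduit has been drained

    uint scan (void[] raw)
    {
        auto data = cast(char[]) raw;
        if (data.length is 0)
            return IConduit.Eof;

        if (data[0] == '+')
        {
            if (data.length <= 1)
            {
                if (eof)
                    return 1;           // end of input: '+' is a complete token
                return IConduit.Eof;    // otherwise ask the buffer for more
            }
            return (data[1] == '=') ? 2 : 1;
        }
        // ... handle other tokens here ...
        return 1;
    }

    void run (Buffer buffer)
    {
        // assumed: next() returns false once the conduit reports Eof
        while (buffer.next (&scan)) {}
        eof = true;
        buffer.next (&scan);            // final pass to flush a pending token
    }
}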