Ticket #1 (closed defect: fixed)

Opened 2 years ago

Last modified 2 years ago

Parser marks error on UTF-8 characters

Reported by: asterite Assigned to: somebody
Priority: minor Component: component1
Version: Keywords:
Cc:

Description

The following file:

# // holá

should give no error. However, an "invalid UTF-8 sequence" error is marked at the "á" character. The lexer (descent.core.dom.Lexer) must be corrected to treat UTF-8 characters correctly. This is probably because Java's char type is already Unicode (UTF-16) while the C++ char type isn't.
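A minimal sketch of the suspected cause, assuming the lexer applies byte-oriented UTF-8 decoding to Java chars (the class and method names here are hypothetical and not part of descent.core.dom.Lexer). In Java, "á" is already the single char 0x00E1, whereas its UTF-8 encoding is the two bytes 0xC3 0xA1. A decoder that reads 0xE1 as a UTF-8 byte would see the lead byte of a three-byte sequence and then fail on whatever follows.

```java
import java.nio.charset.StandardCharsets;

final class CharVersusByteDemo {
    /** Number of bytes needed to encode s as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if the value has the bit pattern 1110xxxx, i.e. a UTF-8 three-byte lead byte. */
    static boolean looksLikeThreeByteLead(int unit) {
        return (unit & 0xF0) == 0xE0;
    }

    public static void main(String[] args) {
        String s = "\u00E1"; // "á"
        System.out.println("Java chars:  " + s.length());
        System.out.println("UTF-8 bytes: " + utf8Length(s));
        System.out.println("0xE1 read as UTF-8 lead byte: " + looksLikeThreeByteLead(s.charAt(0)));
    }
}
```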

Attachments

decodeUTF.patch (2.7 kB) - added by keinfarbton on 01/04/07 09:32:16.
patch with the suggested changes

Change History

01/04/07 09:31:08 changed by keinfarbton

To add some info: the C++ char type has no specified encoding; it is just a single byte (whether it is signed is implementation-defined). Java uses UTF-16 for the char type and for Strings. I think the Eclipse framework does the conversion from UTF-8 (or any other supported encoding) to UTF-16 automatically. So the decoding can be done with the Character.codePointAt() method:

private int decodeUTF() {
	try {
		// Decode one code point, starting at index p.
		int result = Character.codePointAt(input, p);
		// Advance p by the number of chars the decoded code point occupies.
		p = Character.offsetByCodePoints(input, 0, input.length, p, 1);
		return result;
	} catch (Exception e) {
		// A problem occurred while decoding the code point => invalid input.
		error("invalid input sequence", IProblem.InvalidUtf8Sequence, linnum, p, 1);
		return 0;
	}
}
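The decodeUTF method above depends on lexer fields, so it cannot be run on its own. The following is a self-contained sketch of the same idea; the class CodePointDemo and its fields are hypothetical stand-ins for the lexer's input buffer and position p. It advances with Character.charCount instead of Character.offsetByCodePoints, which moves the index by the same amount for a code point that was just decoded.

```java
final class CodePointDemo {
    // Hypothetical stand-ins for the lexer's input buffer and read position.
    private final char[] input;
    private int p;

    CodePointDemo(String source) {
        this.input = source.toCharArray();
        this.p = 0;
    }

    boolean hasMore() {
        return p < input.length;
    }

    int position() {
        return p;
    }

    /** Decodes one code point starting at p and advances p past it. */
    int next() {
        int codePoint = Character.codePointAt(input, p);
        // 1 for BMP code points, 2 for a surrogate pair.
        p += Character.charCount(codePoint);
        return codePoint;
    }

    public static void main(String[] args) {
        // 'a', then U+00E1 (one char), then U+1D11E (a surrogate pair, two chars).
        CodePointDemo demo = new CodePointDemo("a\u00E1\uD834\uDD1E");
        while (demo.hasMore()) {
            int start = demo.position();
            int cp = demo.next();
            System.out.printf("index %d: U+%04X%n", start, cp);
        }
    }
}
```

Note that Character.codePointAt does not throw on an unpaired surrogate; it returns the surrogate value itself, so a lexer that wants to reject such input would need a separate check.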

For completeness: chars in Java are 16 bits wide, not 8. So the test that decides whether decodeUTF should be called is wrong in most cases: "(c & 0x80) != 0" only inspects bit 7, which is clear for many chars above 0x7F (for example 0x0100). Instead of "if((c & 0x80) != 0)" the condition "if (c >= 0x80)" is needed.
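The difference between the two conditions can be checked with a small sketch. The class and method names are hypothetical; only the two expressions come from the comment above.

```java
final class HighBitCheckDemo {
    /** The original test: looks only at bit 7 of the 16-bit char. */
    static boolean maskTest(char c) {
        return (c & 0x80) != 0;
    }

    /** The suggested test: true for every char outside the ASCII range. */
    static boolean rangeTest(char c) {
        return c >= 0x80;
    }

    public static void main(String[] args) {
        char[] samples = { 'a', '\u00E1', '\u0100', '\u4E2D' };
        for (char c : samples) {
            System.out.printf("U+%04X mask=%b range=%b%n", (int) c, maskTest(c), rangeTest(c));
        }
    }
}
```

For U+0100 and U+4E2D the mask test returns false even though both are non-ASCII, because bit 7 of their low byte happens to be clear.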

I will add a patch.

01/04/07 09:32:16 changed by keinfarbton

  • attachment decodeUTF.patch added.

patch with the suggested changes

01/04/07 20:09:07 changed by asterite

  • status changed from new to closed.
  • resolution set to fixed.