tango.io.UnicodeFile

License:

Version:

Initial release: December 2005

Author:

Kris

Read and write Unicode files

For our purposes, Unicode files are an encoding of textual material. The goal of this module is to interface that external encoding with a programmer-defined internal encoding. The internal encoding is declared via the template argument T, whilst the external encoding is either specified or derived.

Three internal encodings are supported: char, wchar, and dchar. The methods herein operate upon arrays of the chosen type. For example, read() returns an array of that type, whilst write() and append() expect an array of that type.
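As a minimal sketch of how the template argument selects the internal encoding, the helper below reads a file into wchar[]. It relies only on the D1-style signatures listed in this document. The name loadAsUtf16 is hypothetical, and the explicit import of tango.text.convert.UnicodeBom for the Encoding enum is an assumption; a D2 port may need const-qualified parameter types.

```d
import tango.io.UnicodeFile;
import tango.text.convert.UnicodeBom;   // assumed to declare the Encoding enum

// Hypothetical helper: T is wchar, so read() yields wchar[].
wchar[] loadAsUtf16 (char[] path)
{
    // External encoding: UTF-8 without a signature (an explicit encoding).
    auto file = new UnicodeFile!(wchar)(path, Encoding.UTF_8N);

    // The content is converted from the external to the internal encoding.
    return file.read();
}
```

Swapping wchar for char or dchar changes only the array type returned by read() and accepted by write() and append().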

Supported external encodings are as follows:

Encoding.Unknown
Encoding.UTF_8
Encoding.UTF_8N
Encoding.UTF_16
Encoding.UTF_16BE
Encoding.UTF_16LE
Encoding.UTF_32
Encoding.UTF_32BE
Encoding.UTF_32LE

These can be divided into implicit and explicit encodings. Here is the implicit subset:

Encoding.Unknown
Encoding.UTF_8
Encoding.UTF_16
Encoding.UTF_32

Implicit encodings may be used to 'discover' an unknown encoding, by examining the first few bytes of the file content for a signature. This signature is optional for all files, but is often written so that the content is self-describing. When the encoding is unknown, using one of the implicit encodings will cause the read() method to look for a signature and adjust itself accordingly. It is possible that a leading ZWNBSP character (U+FEFF) might be confused with the signature; today's files are supposed to use the WORD JOINER character (U+2060) instead.
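The signature check described above can be sketched as follows. This is an illustration built on the standard Unicode byte-order marks, not the code this module actually uses; the Sig enum and the sniff function are hypothetical names.

```d
// Hypothetical stand-in for the result of a signature check (not Tango's Encoding).
enum Sig { None, UTF_8, UTF_16BE, UTF_16LE, UTF_32BE, UTF_32LE }

// Inspect the first few bytes of the content for a known signature.
Sig sniff (ubyte[] head)
{
    // Check the four-byte UTF-32 marks first: FF FE 00 00 also begins with
    // the two-byte UTF-16LE mark FF FE.
    if (head.length >= 4 && head[0] == 0x00 && head[1] == 0x00
                         && head[2] == 0xFE && head[3] == 0xFF)
        return Sig.UTF_32BE;
    if (head.length >= 4 && head[0] == 0xFF && head[1] == 0xFE
                         && head[2] == 0x00 && head[3] == 0x00)
        return Sig.UTF_32LE;
    if (head.length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
        return Sig.UTF_8;
    if (head.length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
        return Sig.UTF_16BE;
    if (head.length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
        return Sig.UTF_16LE;

    // No signature: the content is not self-describing.
    return Sig.None;
}
```

Each mark is the character U+FEFF encoded in the corresponding form, which is why a leading ZWNBSP cannot be told apart from a signature.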

Explicit encodings are as follows:

Encoding.UTF_8N
Encoding.UTF_16BE
Encoding.UTF_16LE
Encoding.UTF_32BE
Encoding.UTF_32LE

This group of encodings is for use when the file encoding is known. These *must* be used when writing or appending, since written content must be in a known format. Note that, during a read operation, the presence of a signature conflicts with these explicit varieties and causes read() to throw an exception.

Method read() returns the current content of the file, whilst write() sets the file content, and file length, to the provided array. Method append() adds content to the tail of the file. When appending, it is your responsibility to ensure that the encoding of the existing file content matches the current encoding.
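A rough sketch of the three methods working together, assuming the D1-style signatures listed in this document. The helper name roundTrip is hypothetical, and the import of tango.text.convert.UnicodeBom for Encoding is an assumption. An explicit encoding is used because the file is written, and no signature is written so that the later read() does not conflict with it.

```d
import tango.io.UnicodeFile;
import tango.text.convert.UnicodeBom;   // assumed to declare the Encoding enum

// Hypothetical helper: write, append, then read back with one explicit encoding.
char[] roundTrip (char[] path)
{
    auto file = new UnicodeFile!(char)(path, Encoding.UTF_16LE);

    // Replaces any existing content; false means no signature is written.
    file.write("first line\n", false);

    // Same instance, so the appended text uses the same external encoding.
    file.append("second line\n");

    // Decodes the UTF-16LE bytes back into the internal char[] form.
    return file.read();
}
```

Appending through a second instance constructed with a different encoding would leave the file with mixed content, which is the mismatch the text above warns about.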

Methods to inspect the file system, check the status of a file or directory, and other facilities are made available via the FilePath superclass.


this(char[] path, Encoding encoding): Construct a UnicodeFile from the provided file path. The given encoding represents the external file encoding, and should be one of the Encoding.* types.
UnicodeFile opCall(char[] name, Encoding encoding) [static]: Call-site shortcut to create a UnicodeFile instance. This enables the same syntax as struct usage, so may expose a migration path.
char[] toString(): Return the associated file path.
Encoding encoding(): Return the current encoding. This is either the originally specified encoding, or a derived one obtained by inspecting the file content for a BOM. The latter is performed as part of the read() method.
UnicodeBom!(T) bom(): Return the associated BOM instance. Use this to find more information about the encoding status.
T[] read() [final]: Return the content of the file. The content is inspected for a BOM signature, which is stripped. An exception is thrown if a signature is present when, according to the encoding type, it should not be. Conversely, an exception is thrown if there is no known signature where the current encoding expects one to be present.
void write(T[] content, bool writeBom) [final]: Set the file content and length to reflect the given array. The content will be encoded accordingly.
void append(T[] content) [final]: Append content to the file; the content will be encoded accordingly. Note that it is your responsibility to ensure the existing and current encodings are correctly matched.
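
A short sketch of how encoding() might be used after read() to find out which external encoding was derived. It assumes the signatures listed above; the helper name discover is hypothetical, and the import of tango.text.convert.UnicodeBom for Encoding is an assumption.

```d
import tango.io.UnicodeFile;
import tango.text.convert.UnicodeBom;   // assumed to declare the Encoding enum

// Hypothetical helper: report which encoding read() settled on.
Encoding discover (char[] path)
{
    // Encoding.Unknown is implicit, so read() looks for a signature.
    auto file = new UnicodeFile!(char)(path, Encoding.Unknown);

    // The content is discarded here; only the inspection matters.
    file.read();

    // Either the originally specified encoding or the one derived by read().
    return file.encoding();
}
```

Calling encoding() before read() would only return the encoding passed to the constructor, since the inspection is performed as part of read().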