root/trunk/tango/io/UnicodeFile.d

Revision 3856, 9.8 kB (checked in by kris, 4 months ago)

Moved Conduit and friends into tango.io.device in an effort to bring further clarity into tango.io. This requires adjusting imports such that, for example, tango.io.FileConduit becomes tango.io.device.FileConduit

  • Property svn:mime-type set to text/x-dsrc
  • Property svn:eol-style set to native
1 /*******************************************************************************
2
3         copyright:      Copyright (c) 2005 Kris Bell. All rights reserved
4
5         license:        BSD style: $(LICENSE)
6
7         version:        Initial release: December 2005     
8         
9         author:         Kris
10
11 *******************************************************************************/
12
13 module tango.io.UnicodeFile;
14
15 private import  tango.io.FilePath;
16
17 private import  tango.io.device.FileConduit;
18
19 private import  tango.core.Exception;
20
21 public  import  tango.text.convert.UnicodeBom;
22
23 /*******************************************************************************
24
25         Read and write unicode files
26
27         For our purposes, unicode files are an encoding of textual material.
28         The goal of this module is to interface that external-encoding with
29         a programmer-defined internal-encoding. This internal encoding is
30         declared via the template argument T, whilst the external encoding
31         is either specified or derived.
32
33         Three internal encodings are supported: char, wchar, and dchar. The
34         methods herein operate upon arrays of this type. For example, read()
35         returns an array of the type, whilst write() and append() expect an
36         array of said type.
37
38         Supported external encodings are as follows:
39
40                 $(UL Encoding.Unknown)
41                 $(UL Encoding.UTF_8)
42                 $(UL Encoding.UTF_8N)
43                 $(UL Encoding.UTF_16)
44                 $(UL Encoding.UTF_16BE)
45                 $(UL Encoding.UTF_16LE)
46                 $(UL Encoding.UTF_32)
47                 $(UL Encoding.UTF_32BE)
48                 $(UL Encoding.UTF_32LE)
49
50         These can be divided into implicit and explicit encodings. Here are
51         the implicit subset:
52
53                 $(UL Encoding.Unknown)
54                 $(UL Encoding.UTF_8)
55                 $(UL Encoding.UTF_16)
56                 $(UL Encoding.UTF_32)
57
58         Implicit encodings may be used to 'discover' an unknown encoding
59         by examining the first few bytes of the file content for a
60         signature (a byte-order mark, or BOM). This signature is optional
61         for all files, but is often written so that the content is
62         self-describing. When the encoding is unknown, using one of the
63         implicit encodings causes the read() method to look for a signature
64         and adjust itself accordingly. A leading ZWNBSP character might be
65         confused with the signature; today's files are supposed to use the
66         WORD-JOINER character instead.
67
68         Explicit encodings are as follows:
69       
70                 $(UL Encoding.UTF_8N)
71                 $(UL Encoding.UTF_16BE)
72                 $(UL Encoding.UTF_16LE)
73                 $(UL Encoding.UTF_32BE)
74                 $(UL Encoding.UTF_32LE)
75         
76         This group of encodings is for use when the file encoding is
77         known. These *must* be used when writing or appending, since written
78         content must be in a known format. Note that, during a read
79         operation, the presence of a signature conflicts with these
80         explicit varieties and causes an exception to be thrown.
81
82         Method read() returns the current content of the file, whilst write()
83         sets the file content, and file length, to the provided array. Method
84         append() adds content to the tail of the file. When appending, it is
85         your responsibility to ensure the existing and current encodings are
86         correctly matched.
87
88         Methods to inspect the file system, check the status of a file or
89         directory, and other facilities are made available via the FilePath
90         class; note that UnicodeFile does not derive from FilePath.
91
92         See these links for more info:
93         $(UL $(LINK http://www.utf-8.com/))
94         $(UL $(LINK http://www.hackcraft.net/xmlUnicode/))
95         $(UL $(LINK http://www.unicode.org/faq/utf_bom.html/))
96         $(UL $(LINK http://www.azillionmonkeys.com/qed/unicode.html/))
97         $(UL $(LINK http://icu.sourceforge.net/docs/papers/forms_of_unicode/))
98
99 *******************************************************************************/
100
101 class UnicodeFile(T)
102 {
103         private UnicodeBom!(T)  bom;
104         private char[]          path_;
105
106         /***********************************************************************
107         
108                 Construct a UnicodeFile from the provided path string. The
109                 given encoding represents the external file encoding, and
110                 should be one of the Encoding.xx types
111
112         ***********************************************************************/
113                                  
114         this (char[] path, Encoding encoding)
115         {
116                 bom = new UnicodeBom!(T)(encoding);
117                 path_ = path;
118         }
119
120         /***********************************************************************
121         
122                 Construct a UnicodeFile from the provided FilePath. The
123                 given encoding represents the external file encoding, and
124                 should be one of the Encoding.xx types
125
126         ***********************************************************************/
127
128         this (FilePath path, Encoding encoding)
129         {
130                 this (path.toString, encoding);
131         }
132
133         /***********************************************************************
134
135                 Call-site shortcut to create a UnicodeFile instance. This
136                 enables the same syntax as struct usage, so may expose
137                 a migration path
138
139         ***********************************************************************/
140
141         static UnicodeFile opCall (char[] name, Encoding encoding)
142         {
143                 return new UnicodeFile (name, encoding);
144         }
145
146         /***********************************************************************
147
148                 Return a new FilePath instance for the associated path
149
150         ***********************************************************************/
151
152         deprecated PathView path ()
153         {
154                 return new FilePath (path_);
155         }
156        
157         /***********************************************************************
158
159                 Return the associated file path
160
161         ***********************************************************************/
162
163         char[] toString ()
164         {
165                 return path_;
166         }
167        
168         /***********************************************************************
169
170                 Return the current encoding. This is either the originally
171                 specified encoding, or a derived one obtained by inspecting
172                 the file content for a BOM. The latter is performed as part
173                 of the read() method.
174
175         ***********************************************************************/
176
177         Encoding encoding ()
178         {
179                 return bom.encoding();
180         }
181        
182         /***********************************************************************
183
184                 Return the content of the file. The content is inspected
185                 for a BOM signature, which is stripped. An exception is
186                 thrown if a signature is present when, according to the
187                 encoding type, it should not be. Conversely, an exception
188                 is thrown if there is no known signature where the current
189                 encoding expects one to be present.
190
191         ***********************************************************************/
192
193         T[] read ()
194         {
195                 scope conduit = new FileConduit (path_); 
196                 scope (exit)
197                        conduit.close;
198
199                 // allocate enough space for the entire file
200                 auto content = new ubyte [cast(size_t) conduit.length];
201
202                 // read the content
203                 if (conduit.read (content) != content.length)
204                     conduit.error ("unexpected eof");
205
206                 return bom.decode (content);
207         }
208
209         /***********************************************************************
210
211                 Set the file content and length to reflect the given array.
212                 The content is encoded accordingly, preceded by a BOM if writeBom is set.
213
214         ***********************************************************************/
215
216         UnicodeFile write (T[] content, bool writeBom = false)
217         {
218                 return write (content, FileConduit.ReadWriteCreate, writeBom); 
219         }
220
221         /***********************************************************************
222
223                 Append content to the file; the content will be encoded
224                 accordingly.
225
226                 Note that it is your responsibility to ensure the
227                 existing and current encodings are correctly matched.
228
229         ***********************************************************************/
230
231         UnicodeFile append (T[] content)
232         {
233                 return write (content, FileConduit.WriteAppending, false); 
234         }
235
236         /***********************************************************************
237
238                 Internal method to perform writing of content. Note that
239                 the encoding must be of the explicit variety by the time
240                 we get here.
241
242         ***********************************************************************/
243
244         private final UnicodeFile write (T[] content, FileConduit.Style style, bool writeBom)
245         {       
246                 // convert to external representation (may throw an exception)
247                 void[] converted = bom.encode (content);
248
249                 // open file after conversion ~ in case of exceptions
250                 scope conduit = new FileConduit (path_, style); 
251                 scope (exit)
252                        conduit.close;
253
254                 if (writeBom)
255                     conduit.write (bom.signature);
256
257                 // and write
258                 conduit.write (converted);
259                 return this;
260         }
261 }
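
Usage sketch: the module comment above describes writing with an explicit encoding, appending, and reading back with an implicit encoding so that read() can discover the signature. The following is a minimal illustration of that flow, not code from the Tango sources. It assumes D1 with Tango at roughly this revision; the file name "example.txt" is hypothetical, and tango.io.Stdout is assumed to be available for console output.

```d
// Sketch only: assumes D1 with Tango at roughly this revision.
// "example.txt" is a hypothetical file name.
import tango.io.UnicodeFile;    // publicly imports Encoding via UnicodeBom
import tango.io.Stdout;

void main ()
{
        // Writing requires an explicit encoding. Passing 'true' asks
        // write() to emit the signature first, so the file describes itself.
        auto output = new UnicodeFile!(char) ("example.txt", Encoding.UTF_16LE);
        output.write ("first line\n", true);

        // append() never writes a signature; the caller must make sure the
        // encoding matches what is already in the file.
        output.append ("second line\n");

        // Reading with an implicit encoding lets read() look for a
        // signature and adjust itself accordingly.
        auto input = new UnicodeFile!(char) ("example.txt", Encoding.Unknown);
        char[] text = input.read;

        // The internal encoding is char (UTF-8), as chosen by the
        // template argument, regardless of the external encoding.
        Stdout (text).flush;
}
```

Opening the same file for reading with Encoding.UTF_16LE instead of Encoding.Unknown would be expected to throw, since the module comment states that a signature conflicts with the explicit encodings during a read.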