brad Site Admin
Joined: 22 Feb 2004 Posts: 490 Location: Atlanta, GA USA
Posted: Thu May 06, 2004 3:58 pm Post subject: File Parsing Article
Looking back at this, it's fairly embarrassing and newbie-ish. Maybe the better way to look at it is that I've learned a great deal about programming, streams, etc. since I wrote it.
Introduction
It started like most friendly rivalries around the office. A development task needed to be completed, and we had no clear-cut choice of language for it. For previous tasks we had used Visual Basic, VBScript, and Java. This time, however, the volume of data to be crunched far exceeded anything we had tackled in the past. We needed speed: code that executed as fast as possible and processed many files in a small timeframe. We are a data warehousing firm for restaurant chains. The amount of data we process on a daily basis is fairly staggering, considering we only have a few hours to load data from thousands of restaurants.
The rivalry started as we began work with a new client who was sending us unstructured data in flat files. My co-worker Jeff was itching to use Perl for the task, continue the learning curve on his newfound toolset, and help the company. I had been hanging around the D newsgroup for a few months, and decided this was the perfect time to try the fledgling language out on a real business task. It also gave me an opportunity to take a look at the standard library (Phobos) and how far it had come.
We both realized that the task was quite mild - parsing text files is child's play - but we jumped at the opportunity to cram millisecond timings into each other's faces. In fact, the Official D Website says clearly what D is not for: "Very small programs - a scripting or interpreted language like Python, DMDScript, or Perl is likely more suitable." However, as newbies to two different languages, we wanted to upgrade our programming skills, and so the race was on.
Many people have written fantastically efficient code for parsing text files, notably the expat parser for XML, and I'm sure there are many more. However, I started from scratch, newbie status and all, diving into D headfirst.
Here is an example of the text file:
Code: |
Store Number: 0123
Store Name: Store #123 Main St.
Date: 01/01/2004 Time: 12:34:56
**PLU** Z1
-----------------------------------------------------------------
PLU DESCRIPT PROMO SOLD WAST TOTAL SALE?
-----------------------------------------------------------------
10 Hamburger 0 28 0 22.72 41.2
15 Cheeseburger 0 17 0 25.33 25.0
1020 Fries 0 8 0 10.32 11.8
1025 Soda 0 15 0 22.35 22.1
...
-----------------------------------------------------------------
0 68 0 155.26
|
Note: DMD compiler 0.77 used on Windows XP, compiled with:
Code: | dmd parsefiles.d -version=Win32 -L parsefiles.exe |
The goal was to generate a fixed-width file, but this file may become comma-separated with text qualifiers in the future. Also, we agreed that exception handling and logging would be omitted for the competition.
A Directory of Files
We would be receiving thousands of files each night. The directory containing the files must be traversed and the files transformed into one big text file for bulk-loading into our database servers. In order to compete against Perl, I was going to need file counts and timings. I searched through the std.date library module and found the getUTCtime() function:
Code: |
import std.date;
int main(char[][] args) {
    // declarations
    d_time lStartTime, lLastTime, lProcTime; // std.date.d_time is an alias for the long type
    int iFiles=0;
    // get start time
    lStartTime = getUTCtime();
    ...
    lProcTime = getUTCtime() - lStartTime;
    printf("%d files in %d ms\n", iFiles, msFromTime(lProcTime)); // msFromTime() gets milliseconds
    return 0;
}
|
I would need a place to put the results that would be ready to bulk-load into the database. I knew that functions existed in std.file that would read, write, and append files, wrapping the different ways to do this on Win32 and Linux in the version() statement. However, I had read about increased performance from using the std.stream module for file reads/writes. Looking through the std.stream module code, I also found the MemoryStream class, and decided to fill a buffer in memory and then write it all out to disk later.
Code: |
import std.stream;
MemoryStream mOut;
char[] sOutFilename;
...
// ------------------------------------------------
// set up memory stream for output
// ------------------------------------------------
mOut = new MemoryStream();
// ------------------------------------------------
// set up file stream for output
// ------------------------------------------------
File fOut = new File();
fOut.create(sOutFilename);
...
|
My next task was to somehow loop through all of the files in the working directory. (The working directory is supplied as a command-line argument.) After some poking around, I found some interesting functions in std.file and std.path.
Code: |
import std.file;
import std.path;
...
char[] sWorkPath;
...
// get the collection of files (fc)
// inside the current directory
char[][] fc = listdir(sWorkPath);
// loop through the file collection, processing
// each file if it is the correct type.
foreach(char[] f; fc) {
    // determine type and run appropriate processing
    if( getExt(f) == "100" ) {
        // call 'process_file' function
        if(process_file(f) == 0) {
            // Succeeded
            iFiles++;
        } else {
            // Failed
        }
    }
}
|
You can see that I used the listdir() function. This function returns an array of character arrays (char[][] is an array of strings). Each string holds a file name, and the '.' and '..' directories are left out. This saves a lot of nasty calls to Win32 APIs. Unfortunately, at this time before D 1.0, it is not implemented on Linux. Hmmm... next article.
Next, notice the use of foreach. Looping across an aggregate is one of the nicer recent additions to D. I decided to use this here, instead of a for statement.
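For anyone new to the syntax, here is a small sketch of the foreach loop next to the index-based for loop it replaces. The list_files() wrapper and the printf bodies are just placeholders of mine; only listdir() and the loop forms come from the snippet above.
Code: |
import std.file;

void list_files(char[] sWorkPath) {
    char[][] fc = listdir(sWorkPath);

    // foreach: the compiler handles the index and element for you
    foreach(char[] f; fc) {
        printf("%.*s\n", f);
    }

    // the equivalent for loop: manual index bookkeeping
    for(int i = 0; i < fc.length; i++) {
        char[] f = fc[i];
        printf("%.*s\n", f);
    }
}
|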
Finally, there is a call to process_file() to actually parse the file and fill the memory stream. This function is where we define business rules and gather useful information from different places in the flat file. It returns 0 on success, in which case we increment the file count, or 1 on failure.
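For reference, here is the shape I'm assuming for process_file() based on how it is called above; the exact signature isn't shown anywhere in the article, so take it as a sketch. The body gets filled in over the next sections.
Code: |
// assumed skeleton; the body is developed in the sections below
int process_file(char[] f) {
    // read the file, pick out the store number and date,
    // and write the valid detail lines to the memory stream
    ...
    return 0;   // 0 = success (caller increments iFiles), 1 = failure
}
|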
Parse the File
The filename f was sent to process_file(), so we need to open the file and take a look. I cast the result of std.file.read() to a character array, putting the entire contents of the file into the fInput string. The call is fully qualified to distinguish it from std.stream.read().
Code: |
import std.string;
...
// declarations
char[] fInput;
char[][] inlines;
// open the input file
fInput = cast(char[])std.file.read(f);
// split into lines on line breaks (CR/LF)
inlines=splitlines(fInput);
|
The noteworthy part here is the use of the splitlines() function found in the std.string module. Like listdir() earlier, it produces an array of character arrays: it takes the string fInput and uses the line breaks to split it into the inlines array of strings.
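To make that concrete, here is a tiny illustration with a made-up two-line string (the sample text and the throwaway demo() function are mine, not real report data):
Code: |
import std.string;

void demo() {
    char[] sample = "Store Number: 0123\r\nDate: 01/01/2004";
    char[][] lines = splitlines(sample);  // line breaks are stripped
    printf("%d lines\n", lines.length);   // prints: 2 lines
    printf("%.*s\n", lines[0]);           // prints: Store Number: 0123
    printf("%.*s\n", lines[1]);           // prints: Date: 01/01/2004
}
|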
Let's take a look at the data capture and output to the memory stream. We use a foreach again, looping through all of the strings in the inlines array.
Code: |
// loop through the lines
int gotStore=0;
int gotDate=0;
char[] strStore;
char[] strDate;
foreach (char[] strLine; inlines) {
    if(gotStore==0) {
        // look for 'Store Number:'
        if((find(strLine,"Store Number:")) >= 0) {
            gotStore=1;
            strStore=strLine[19..23];
        }
    } else {
        if(gotDate==0) {
            // look for 'Date:'
            if((find(strLine,"Date:")) >= 0) {
                gotDate=1;
                strDate=strLine[11..21] ~ " " ~ strLine[28..36];
            }
        } else {
            // test for valid output line (i.e. has PLU at beginning)
            try {
                // test first few characters to see if they're int
                if( atoi(strLine[0..10]) > 0 ) {
                    // is valid line, so output it
                    mOut.writeLine(strStore ~ " " ~ strDate ~ " " ~ strLine[6..strLine.length]);
                }
            } catch(Error e) {
                // line doesn't have PLU at beginning, so do not output.
            }
        }
    }
}
|
As you can see, we are trying to gather the Store Number and Date/Time so we can put this header information on each row. Once the flags are set, we don't attempt to use std.string.find() to look for Store Number or Date again -- we already have them. This also serves as a good example of D's string handling. Check out the slicing of strLine (the current line of the file). The slice strLine[19..23] - characters 19 through 22 - will hold the store number if "Store Number:" is found by the find() call. In the try/catch area, we use std.string.atoi() to turn the first ten characters - [0..10] - from string into integer. If we get an exception (a short line making the slice go out of bounds, for example), it is caught and nothing is done. If it works, we have a valid line with a menu item number (PLU) at the beginning. We want to ignore the column headers and ---- separators, and none of those lines start with an integer. Finally, we write the line to the memory stream with std.stream.writeLine().
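To see the string handling in isolation, here is a small standalone snippet with a made-up line; the column positions are only illustrative, and I'm assuming atoi() behaves like C's atoi() (skip leading whitespace, stop at the first non-digit).
Code: |
import std.string;

void demo() {
    char[] strLine = "   10 Hamburger        0   28   0   22.72  41.2";

    // slicing takes a half-open range [start..end) of characters
    char[] firstTen = strLine[0..10];            // "   10 Hamb"

    // a detail line starts with a PLU, so this comes back positive;
    // header and ----- lines would yield 0 instead
    int plu = cast(int) atoi(firstTen);          // 10

    // ~ concatenates arrays, as in the writeLine() call above
    char[] row = "0123" ~ " " ~ strLine[6..strLine.length];
    printf("PLU %d: %.*s\n", plu, row);
}
|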
The Expensive Part
Disk writes are infamously expensive as I/O operations go, especially when you are locked in a heated competition with Perl. Our decision to use the std.stream module was a good one. It takes only a few lines to write the memory stream to the file stream and clean up:
Code: |
// write output buffer to file
fOut.copyFrom(mOut);
// close Output file
fOut.close();
|
The stream operations performed much better than std.file.write(). And we end up with the eminently more bulk-loadable text file:
Code: |
0123 01/01/2004 23:32:18 10 Hamburger 0 28 0 22.72 41.2
0123 01/01/2004 23:32:18 15 Cheeseburger 0 17 0 25.33 25.0
0123 01/01/2004 23:32:18 1020 Fries 0 8 0 10.32 11.8
0123 01/01/2004 23:32:18 1025 Soda 0 15 0 22.35 22.1
|
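For comparison, one way the std.file.write() route (the one that performed worse for me) could have looked: accumulate everything in a big char[] and hand it to std.file.write() in a single call. This is only a rough sketch, reusing the variable names from the earlier snippets.
Code: |
import std.file;

char[] sBuffer;
...
// inside the parsing loop, append instead of mOut.writeLine()
sBuffer ~= strStore ~ " " ~ strDate ~ " " ~ strLine[6..strLine.length] ~ "\r\n";
...
// one shot at the end instead of fOut.copyFrom(mOut)
std.file.write(sOutFilename, sBuffer);
|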
And Finally
How did we do? Was Perl faster? Was its code more compact and readable? Do you think I would have written the article if I'd lost? Bah! In tests of about 100 files and 500 files, the D executable proved to be about twice as fast (100 files: 47ms vs 91ms). Code readability is a matter of preference, but the two programs had about the same number of lines.
Did we optimize everything possible in this small program? Assuredly, we did not. Can streams help us read the files more efficiently, as well as write them? Do we need the memory stream? I have a lot to learn about programming, let alone all the capabilities of D. However, what the Official D Site says is true: "It's a practical language for practical programmers who need to get the job done quickly and reliably." In the future, I may even attempt to repeat this exercise in an object-oriented style. As a newbie, I will continue to learn and use D quite a bit in the future, for larger and more demanding projects.
Brad Anderson - Jan 2004
Carlos
Joined: 19 Mar 2004 Posts: 396 Location: Canyon, TX
Posted: Thu May 06, 2004 6:26 pm
Very good, methinks!
A small note: are you sure you should use foreach to loop through the file contents? I'm not sure about this, but I think I read once that the order in which foreach iterates is not defined. So maybe at some point it won't iterate linearly but in some other way, and your code will fail. Can someone confirm or deny this?
Congrats, again.
JJR
Joined: 22 Feb 2004 Posts: 1104
Posted: Fri May 07, 2004 12:43 am
An entertaining and well-written piece.
Thanks, Brad.
jcc7
Joined: 22 Feb 2004 Posts: 657 Location: Muskogee, OK, USA
Posted: Fri May 07, 2004 1:07 am Post subject: Re: File Parsing Article
D: 1
Perl: 0
Good article.