
Best way to get the number of records in a file with millions of records


tbtcust

Programmer
Oct 26, 2004
Hi all,

I have searched the threads looking for a way to get the total number of records in a file in one shot.

It looks like using the Split function to load the file into an array and then using UBound is the answer.

However, I have a file that contains millions of records, and that may be a problem.

Question: Is there another way of getting the total number of records in a file in one shot? Note that each line is delimited by vbLf.

Thanks in advance for any help.
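
For reference, the Split/UBound approach mentioned above would look something like this (a minimal sketch; the path is illustrative, and with millions of records the single giant string and array are exactly where memory becomes the problem):
Code:
Dim buffer As String
Dim parts() As String
Dim total As Long

' One-shot read of the whole file into a string
Open "C:\data\bigfile.txt" For Binary As #1
buffer = Space$(LOF(1))   ' pre-size the string to the file length
Get #1, , buffer
Close #1

' Split on vbLf and count the pieces
parts = Split(buffer, vbLf)
total = UBound(parts) + 1 ' if the file ends with a trailing vbLf the last
                          ' element is empty, so use UBound(parts) instead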
 
You can use my method within a VBA module.

Note that the second part of my method (converting the file from LF to CRLF format) may not be required if you are using the Access import method.
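
Frederico's actual WC/unix2dos code isn't reproduced here, but if a Win32 build of wc is available, one way to drive such a tool from VB6 is via WScript.Shell (a sketch, not his posted method; the wc.exe location, the file path, and the "count filename" output format are all assumptions):
Code:
Dim wsh As Object
Dim output As String

Set wsh = CreateObject("WScript.Shell")
' Assumes wc.exe is on the PATH; "wc -l" prints the line count before the filename
output = wsh.Exec("wc -l C:\data\bigfile.txt").StdOut.ReadAll
MsgBox "Lines: " & Split(Trim$(output), " ")(0)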

Regards

Frederico Fonseca
SysSoft Integrated Ltd
 
Receiving Results ....

From the VB IDE:
Dilettante: 66150 records, 00:00:15
Trollacious: 66150 records, 00:00:04

With strongm's method something is wrong:
?err.Number
-2147467259

?err.Description
Selected collating sequence not supported by the operating system.

?err.Source
Microsoft JET Database Engine

From the compiled EXE:
Dilettante: 66150 records, 00:00:14
Trollacious: 66150 records, 00:00:04

The file imported using File > Get External Data (no code tricks) shows 66150 records; the import took less than 2 minutes.
 
>your approach ended up third

I know. As I said before, I wasn't expecting it to be faster than specialised routines - just simpler ...
 
strongm

About the simplicity I have to disagree; trollacious's way is based on knowledge dating back to (when?)

But the difference still exists (for the mdb module)!
Any thoughts about the error in VB? In Notepad I can read Greek characters perfectly. Is there something I can do to give it a try? On a daily basis I append this txt to an mdb using a schema.ini where CharacterSet = ANSI.

Frederico, the URL for WC is blocked for me from here... (and I am a bit lost; if you could simplify that, I could run the test on the same machine and file, just to test the time)

---------------------
I did that test out of curiosity, with the feeling that whatever ended up being the fastest was the worst!
 
The code I provided above is based on experience in 2004 with a program I wrote to read 120K+ records in a sequential file (not a file of 120K+ bytes) into an array. I experimented with ways to read a file in and create the array, and with a file this size, counting the number of records with the code I provided above, ReDimming the array, then reading the file back into the array a line at a time was far faster than the other algorithms I tried. With compiled code on an 833 MHz Compaq (with no more than 256 MB of memory), plus sorting and filtering duplicates of up to 1000 added records per run, the entire load time was around 17 seconds.
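
A sketch of that two-pass shape (count first, ReDim once, then load; filename is assumed to be assigned, and the variable names are illustrative):
Code:
Dim lineCount As Long, i As Long
Dim oneline As String
Dim records() As String

' Pass 1: count the records
Open filename For Input As #1
Do While Not EOF(1)
    Line Input #1, oneline
    lineCount = lineCount + 1
Loop
Close #1

' Size the array once, then pass 2: load it
ReDim records(1 To lineCount)
Open filename For Input As #1
For i = 1 To lineCount
    Line Input #1, records(i)
Next i
Close #1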

There are differences in speed with the various looping methods, but the Do/Loop variations seem to be the fastest and While/Wend close behind. With the benchmarks shown above, I don't think any speed differences would be significant.

Lee
 
Read it all rather quickly, but I think the only reason you want to know the total number of records is to keep the user informed about progress, right?

If that's the case, and whatever you're doing involves running through that file sequentially while processing, then why not do a byte count of the file with something like FileLen() and keep track of the number of bytes you've already read? That way you can report progress as a percentage instead of an exact number of records read; either way the user gets informed about the status.
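
A sketch of that byte-count progress idea (names illustrative; it assumes CRLF-terminated records, since as established later in this thread Line Input does not split on bare LF):
Code:
Dim totalBytes As Long, bytesRead As Long
Dim oneline As String, pct As Long

totalBytes = FileLen(filename)   ' total size in bytes, no read required
Open filename For Input As #1
Do While Not EOF(1)
    Line Input #1, oneline
    bytesRead = bytesRead + Len(oneline) + 2  ' +2 for the CRLF stripped by Line Input
    pct = CLng(100# * bytesRead / totalBytes)
    ' ... update a progress bar or label with pct here ...
Loop
Close #1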

Greetings,
Rick
 
Am I lost here, or is the trollacious method a simple loop until EOF doing Line Input to read then count each record?

I must be doing something wrong because that approach seems interminable on a long file... though after a couple of minutes I did just kill it.
 
It IS just a simple loop until EOF. I found that was the quickest way with VB6 to handle really large sequential files. The code I provided counted 1.2 million lines in a sequential file in 3 seconds, according to Timer, on a PIII 1.0 GHz with 384 MB of memory, in the IDE, not compiled.

I just tested the code again with a dictionary file of 246894 lines (4.1 MB), variable length, and it counted those in < 1 second. For counting records in sequential files that have a large number of records, I've found this algorithm to be the fastest.

Actual code, copied and pasted from the IDE, not including variable declarations and filename assignment:
Code:
starttime = Timer
Open filename For Input As #1

' Read each record with Line Input and count it
Do While Not EOF(1)
  Line Input #1, oneline
  linecount = linecount + 1
Loop

Close #1
Text1.Text = linecount            ' line count
endtime = Timer

Text3.Text = endtime - starttime  ' elapsed seconds

Lee
 
I see what is going on now...

Go back and read the original post here: the file has LF-terminated records, not CRLF.

If one sticks to the original requirements I believe you will see very different results. It appears that Line Input will read LF records properly, just VERY SLOWLY.

Even if I create a 1.2M-record file with CRLF line terminations, I see elapsed times for Line Input that are nearly 20 times as long as reading chunks and counting LFs.

So I guess I now have 3 questions:

1.) What results do others see when using a proper test file (LFs, not CRLFs)?

2.) Why am I seeing longer times for Line Input even with CRLF files?

3.) Was strongm really saying that the Jet Text method was twice as fast as chunk-and-scan?

Maybe this just shows how difficult benchmarking is, at least in terms of getting reliable results that hold up on different machines. I've tried both VB6 SP5 and VB6 SP6 with no noticeable difference.
 
Reading the 4.1 MB file one byte at a time and checking for vbLf took 16 seconds in the same environment specified above. Reading larger chunks would decrease the time, but files that contain odd numbers of bytes would have to be adjusted for, which wouldn't be difficult (like unrolling a for loop); a chunked sketch follows the code below. This could be optimized significantly, I'm sure.

Code:
Dim onechar As String * 1
' Scan the file one byte at a time, counting LF characters
Open filename For Binary As #1

Do While Not EOF(1)
  Get #1, , onechar
  If onechar = vbLf Then
    linecount = linecount + 1
  End If
Loop

Close #1
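
A chunked variant of the same scan, as suggested above, might look like this (a sketch; the 32K block size and the short-final-block handling are illustrative):
Code:
Dim buf() As Byte
Dim remaining As Long, chunk As Long
Dim i As Long, lfCount As Long

Open filename For Binary As #1
remaining = LOF(1)
Do While remaining > 0
    chunk = 32768
    If remaining < chunk Then chunk = remaining  ' odd-sized final block
    ReDim buf(0 To chunk - 1)
    Get #1, , buf
    For i = 0 To chunk - 1
        If buf(i) = 10 Then lfCount = lfCount + 1  ' 10 = LF
    Next i
    remaining = remaining - chunk
Loop
Close #1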

Lee
 
Yes, and I discovered an error in my testing of Line Input...

It turns out that Line Input will never read an LF-record file properly: I was erroneously using a short CRLF file when I was trying to verify that Line Input was worth anything at all here.


... but files that contain odd numbers of bytes would have to be adjusted for...

I allowed for ANSI files of any length already in the source I provided above. This isn't a factor at all.


The Jet Text method does seem to handle LF as well as CRLF. It still takes a long time on my machine... but it does work and is pretty concise in terms of source statements. I still get faster results than Line Input even with CRLF records - significantly so.

I think we've pretty much beaten the topic to death by now.


My tests with a 1.2M line file seem pretty conclusive.

Using the expensive "cheat" of recopying the input file to have CRLF line terminations I get:
Code:
Trial #1
  Byte Array Chunk:  0.7617188 seconds, 1200000 lines
  String Chunk:      1.480469 seconds, 1200000 lines
  Line Input:        9.863281 seconds, 1200000 lines
  Jet Text:         26.39844 seconds, 1200000 lines

Trial #2
  Byte Array Chunk:  0.5703125 seconds, 1200000 lines
  String Chunk:      1.183594 seconds, 1200000 lines
  Line Input:        9.101563 seconds, 1200000 lines
  Jet Text:         25.80859 seconds, 1200000 lines

Trial #3
  Byte Array Chunk:  0.5585938 seconds, 1200000 lines
  String Chunk:      1.183594 seconds, 1200000 lines
  Line Input:        9.070313 seconds, 1200000 lines
  Jet Text:         25.71875 seconds, 1200000 lines
Disk caching is clearly a factor. If you just need to determine the line count in one pass, there won't be a huge advantage in using Byte arrays over chunking into a String and using an InStr() loop to count the LFs (sketched below).

Reading a file with LF line terminations, the results are substantially the same, except Line Input can't be used at all. Cutting the chunk or block size to 1000 bytes substantially degraded performance but still produced results about 10 times as fast as Line Input.
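
The String-buffer version of that counter might look like this (a sketch; chunk size and names are illustrative, and it assumes an ANSI file so that Len matches the byte count):
Code:
Dim s As String
Dim remaining As Long, chunk As Long
Dim p As Long, lfCount As Long

Open filename For Binary As #1
remaining = LOF(1)
Do While remaining > 0
    chunk = 32768
    If remaining < chunk Then chunk = remaining
    s = Space$(chunk)      ' Get reads exactly Len(s) characters
    Get #1, , s
    p = InStr(s, vbLf)
    Do While p > 0         ' InStr loop counts the LFs in this chunk
        lfCount = lfCount + 1
        p = InStr(p + 1, s, vbLf)
    Loop
    remaining = remaining - chunk
Loop
Close #1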

I've already found two bone-headed errors of my own, so I won't claim there aren't others. The posted results differ wildly enough that I have to wonder what other dumb mistakes I must be making.

I don't think we're all testing the same things here.
 
Once again I refer to my example.

The WC program will read the file correctly regardless of whether it is a CR, a CRLF, or an LF file.

My original test was done following the OP's requirement of a file with LF as the terminator, hence the second program, unix2dos, used to convert to a format more agreeable to VB.

Regards

Frederico Fonseca
SysSoft Integrated Ltd
 
>Reading the 4.1M file one byte at a time and checking for vbLF took 16 seconds
But we're looking at a 125 MB file ...

>1.) What results do others see when using a proper test file (LFs, not CRLFs)?
Makes no difference to the speed of the Jet Text method

>3.) Was strongm really saying that the Jet Text method was twice as fast as chunk-and-scan?

Well, on three separate machines it runs about twice as fast in the IDE (which surprised me; as I have said before, I wasn't expecting the Jet Text method to be super quick). Both the chunk-and-scan and Line Input methods speed up dramatically and overtake it when compiled.
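
For reference, the "Jet Text method" being benchmarked here is essentially a COUNT(*) pushed through the Jet text ISAM, along these lines (a sketch; the folder, file name, and Extended Properties string are assumptions, and a schema.ini in the folder can pin down CharacterSet and the record terminator):
Code:
Dim cn As Object, rs As Object

Set cn = CreateObject("ADODB.Connection")
' Data Source is the folder; the file becomes the "table", with . replaced by #
cn.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\data\;" & _
        "Extended Properties=""text;HDR=No;FMT=Delimited"""
Set rs = cn.Execute("SELECT COUNT(*) FROM [bigfile#txt]")
Debug.Print rs(0)   ' record count
rs.Close: cn.Close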
 
I was really beginning to wonder though when I saw results floating all over the scale here. I decided to create a large file, with CRLFs to permit the testing of Line Input.

Then I defragged the volume to take fragmentation out of the picture as much as I could without using a clean partition. Finally I made sure I rebooted between tests in order to try to minimize differences due to read caching.

What I got was:

Code:
{defrag}

{reboot}
File bytes: 144,000,000
Byte Array Chunk: 9.722656 seconds, 4,800,000 lines

{reboot}
File bytes: 144,000,000
String Chunk: 9.414063 seconds, 4,800,000 lines

{reboot}
File bytes: 144,000,000
Line Input: 65.54297 seconds, 4,800,000 lines

{reboot}
File bytes: 144,000,000
Jet Text: 99.94531 seconds, 4,800,000 lines

Disk I/O time washes out any gain from the faster processing of the Byte array approach.

One thing I learned... I sure hate rebooting!

I suspect the real conclusion I've come to though is that chunk-reading is still the fastest way to go. Using Byte arrays might gain you something over String buffers, but not all that much.

I think if you wanted to actually process the data instead of just counting line-ends, the String buffer might be best. You could chunk in 32K String buffers, find the line-ends using InStr(), and then Mid$() the records out, being careful to handle lines spanning buffers. If you need to write processed data back out and you aren't substantially condensing it, buffering the data up into 32K blocks and writing those to disk might also help keep the elapsed time down. Of course there you'd also want to use a "fast strings" technique, such as keeping a large String variable prepadded to 32K, inserting data into it with the Mid$() statement, and writing out full buffers as they fill.
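
A sketch of that extract-records-from-chunks loop (buffer size and the carry-over handling for lines that span chunks are illustrative):
Code:
Dim s As String, carry As String, rec As String
Dim remaining As Long, chunk As Long, p As Long, q As Long

Open filename For Binary As #1
remaining = LOF(1)
Do While remaining > 0
    chunk = 32768
    If remaining < chunk Then chunk = remaining
    s = Space$(chunk)
    Get #1, , s
    s = carry & s              ' prepend the partial line left from the last chunk
    p = 1
    q = InStr(p, s, vbLf)
    Do While q > 0
        rec = Mid$(s, p, q - p)   ' one record (strip a trailing CR here if CRLF)
        ' ... process rec ...
        p = q + 1
        q = InStr(p, s, vbLf)
    Loop
    carry = Mid$(s, p)         ' save the tail for the next chunk
    remaining = remaining - chunk
Loop
Close #1
If Len(carry) > 0 Then
    ' final record with no trailing LF: process carry here
End If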
 
I tried that WC.EXE after a reboot as well. Eyeball timing against my watch showed it takes (maybe) just under 9 seconds itself, so I see little advantage in it aside from the valuable, if not obvious, advantage of not having to write and debug some VB to do the job.

Perhaps if somebody wanted to make those Unix applets truly useful to VB programmers they'd repackage a few as ActiveX libraries though. Firing off external EXEs and piping their input/output is a hack I could live without. Joe VB is going to spend more debugging time doing that than just coding up a chunk-o-rama function.

Recopying large files from LF to CRLF format doesn't save you any time either, plus Line Input is still pretty darned slow even once you have converted the record format. Unix2DOS plus Line Input would be about as slow as the more concise ADO/Jet Text approach.
 
One last thing while I'm boring everyone...

Yes, my tests were with compiled code. P-code is dog-slow at this chore, whether in the IDE or as an EXE.
 