
Best way to get the number of records in a file with millions of records


tbtcust (Programmer), Oct 26, 2004
Hi all,

I have searched the threads looking for a way to get the total number of records in a file in one shot.

It looks like splitting the file into an array with the Split function and using UBound is the answer.

However, I have a file that contains millions of records, and that may be a problem.

Question: Is there another way of getting the total number of records in a file in one shot? Note that each line is delimited by vbLf.

Thanks in advance for any help.
 
With stream files there is no "record count" attribute. By definition there are no "records" at all.

This means imposing some sort of structural abstraction on the data stream. It also means reading and counting the "records".
 
There are basically two ways to do this. The first replaces each two-character vbCrLf delimiter with a single "x" and subtracts the new length from the old (that difference, plus one, is the line count). The other uses Split() and UBound(). I'm not sure how well either would work with such big files, though. Couldn't you keep track of the records as you are writing the file? That way it'd be quick and easy to get the info.

Code:
Option Explicit

Private Sub Form_Load()
  Dim x As Long, st As String  ' Long, not Integer: line counts can exceed 32,767
  Dim ff As Integer
  Dim strBuff As String
  Dim str() As String
  ff = FreeFile
  Open App.Path & "\to do.txt" For Input As #ff
    strBuff = Input(LOF(ff), ff)
  Close #ff
  ' ----------------- two ways to skin a cat --------------
  MsgBox "Lines = " & Len(strBuff) - Len(Replace(strBuff, vbCrLf, "x")) + 1
  ' -------------------------------------------------------
  str() = Split(strBuff, vbCrLf)
  MsgBox "There are " & UBound(str) + 1 & " lines in the file"
  For x = 0 To UBound(str)
    st = st & str(x) & vbCrLf & vbCrLf
  Next x
  MsgBox st
End Sub

-David
2006 Microsoft Most Valuable Professional (MVP)
 
I worked with a sequential data file not too long ago that held over 120,000 records, and found it was significantly slower to use Split and Replace than to read the whole file line by line with a counter, redimension the array to the size needed, then read the whole file in again line by line, assigning each line to its array element. I'm not sure at what file size Split or Replace would become faster, but it's below 120,000 records.
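
A minimal sketch of that two-pass load (the file name is a placeholder, and it assumes line endings that Line Input understands):
Code:
Option Explicit

' Pass 1: count the lines. Pass 2: size the array once and fill it.
' This avoids both Split's memory churn and repeated ReDim Preserve.
Private Sub LoadFileTwoPass()
    Dim ff As Integer
    Dim lngCount As Long
    Dim lngIdx As Long
    Dim strLine As String
    Dim arrLines() As String

    ff = FreeFile
    Open "C:\data\big.txt" For Input As #ff
    Do While Not EOF(ff)
        Line Input #ff, strLine
        lngCount = lngCount + 1
    Loop
    Close #ff

    If lngCount = 0 Then Exit Sub
    ReDim arrLines(lngCount - 1)

    ff = FreeFile
    Open "C:\data\big.txt" For Input As #ff
    For lngIdx = 0 To lngCount - 1
        Line Input #ff, arrLines(lngIdx)
    Next lngIdx
    Close #ff

    MsgBox "There are " & CStr(lngCount) & " lines in the file"
End Sub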

If you just need to get the number of lines as fast as possible, you might consider using a language that produces faster code, like C or C++, then storing the value in a temporary file to read with VB.

Lee
 
Using 1.2 million lines in a file with each dummy record 28 random characters long, the VB6 interpreter took about 3 seconds to count all the records (using Timer to measure from before opening the file to after closing it):

Code:
Dim linecount As Long
Dim oneline As String

Open filename For Input As #1
linecount = 0
Do While Not EOF(1)
    Line Input #1, oneline
    linecount = linecount + 1
Loop
Close #1

I tried Split and Replace separately, and quit after about 25 seconds on each. Using a counter doesn't manipulate memory like Split and Replace do, and the size of the file makes those functions slower than disk access.

The kind of loop you use can make a difference in speed, too. I've found that the Do While loop is usually the fastest, though I haven't seen a significant difference between Do While and Loop While. Do Until is pretty close, either bottom or top testing. While/Wend comes next, and For/Next has been the slowest.
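
A rough harness for comparing those loop styles (the iteration count and empty loop bodies are illustrative; results vary by machine and by whether the code is compiled):
Code:
Option Explicit

' Crude loop-style comparison using Timer. The iteration count is
' arbitrary; raise it until the differences rise above Timer's
' resolution. Run compiled for the most meaningful numbers.
Private Sub CompareLoops()
    Const ITERATIONS As Long = 10000000
    Dim i As Long
    Dim sngStart As Single

    sngStart = Timer
    i = 0
    Do While i < ITERATIONS
        i = i + 1
    Loop
    Debug.Print "Do While:   "; Timer - sngStart

    sngStart = Timer
    i = 0
    Do Until i = ITERATIONS
        i = i + 1
    Loop
    Debug.Print "Do Until:   "; Timer - sngStart

    sngStart = Timer
    i = 0
    While i < ITERATIONS
        i = i + 1
    Wend
    Debug.Print "While/Wend: "; Timer - sngStart

    sngStart = Timer
    For i = 1 To ITERATIONS
    Next i
    Debug.Print "For/Next:   "; Timer - sngStart
End Sub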

Lee
 
If you just want the count, I would open the file in binary mode and count the number of &h0A values that you see. You might need to adjust by one if the last record doesn't have the linefeed as the last byte.

Pseudocode might be something like:
Code:
Open File #1 For Binary Read
While Not EOF(#1)
    Read up to 1000 bytes
    Count the number of vbLf you see
Wend
Close File
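
A runnable VB6 rendering of that pseudocode might look like this (the function name, the way chunks are sized, and the end-of-file adjustment are illustrative):
Code:
' Read the file in binary chunks and count &H0A (vbLf) bytes.
' The 1000-byte chunk size follows the pseudocode above.
Private Function CountLinefeeds(ByVal FileName As String) As Long
    Const CHUNKSIZE As Long = 1000
    Dim ff As Integer
    Dim bytBuf() As Byte
    Dim lngRemaining As Long
    Dim lngPos As Long

    ff = FreeFile
    Open FileName For Binary Access Read As #ff
    lngRemaining = LOF(ff)
    If lngRemaining = 0 Then Close #ff: Exit Function

    Do While lngRemaining > 0
        If lngRemaining < CHUNKSIZE Then
            ReDim bytBuf(lngRemaining - 1)
        Else
            ReDim bytBuf(CHUNKSIZE - 1)
        End If
        Get #ff, , bytBuf
        For lngPos = 0 To UBound(bytBuf)
            If bytBuf(lngPos) = &HA Then
                CountLinefeeds = CountLinefeeds + 1
            End If
        Next lngPos
        lngRemaining = lngRemaining - (UBound(bytBuf) + 1)
    Loop

    ' Adjust by one if the last record lacks a trailing linefeed.
    If bytBuf(UBound(bytBuf)) <> &HA Then CountLinefeeds = CountLinefeeds + 1
    Close #ff
End Function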
Chip H.


 
First let me say thank you to everyone for your participation.

dglienna wrote "the other uses Split() and UBound(). I'm not sure how well either would work with such big files." I tried it, and it brought the box to its knees.


dglienna wrote "Couldn't you keep track of the records as you are writing the file? That way it'd be quick and easy to get the info." I do that now; however, the user would like the total number of records up front and would like to see the progress over the life of the execution.


trollacious wrote "you might consider using a language that produces faster code, like C or C++, then storing the value in a temporary file to read with VB." I agree. I have said to the powers that be that this is not the correct language, platform, or data store. There are 600+ of these files to process per night. As a test, I FTPed the files to our mainframe, wrote some quick-and-dirty REXX code, and it ripped through the files like nobody's business.


strongm wrote "Are they fixed length records?" I wish. These are transaction files with 62 different layouts per file.


Overall, what I'm seeing is that there is no elegant way of getting the record count for a large file. So I'll be benchmarking chiph's and trollacious's solutions and will report back on which one produces the count the quickest.

Thanks everyone
 
Well, you could try using a Text ISAM driver, eg:
Code:
[blue]Dim cn As Connection
Dim recordset1 As Recordset

Set cn = New Connection
cn.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\;Extended Properties=""text;HDR=No"""
Set recordset1 = New Recordset
recordset1.CursorLocation = adUseClient
' Brackets keep Jet from choking on the dot in the file name
recordset1.Open "SELECT Count(*) AS TotalRecords FROM [data.txt]", cn
MsgBox recordset1("TotalRecords")[/blue]
 
This was addressed some time ago in the VBScript Forum.

The example below is probably close to optimal using straight VB6. It is based on the suggestion made by chiph but increases the file blocksize used. About 32K seems optimal, which is consistent with similar findings using VBScript and FSO (or ADODB.Stream) to accomplish much the same thing.

This example runs ~ twice as fast as similar code using a String buffer and the InStr() function to detect and count delimiter characters. No doubt the ANSI to Unicode conversions are a big part of this difference. I am seeing results under a second for a file of 1.2 million 28-character records/lines, but clearly disk caching is a factor when running repeated tests - something to keep in mind.

I compiled this to native code with several optimizations set:
[ul][li]Optimize for Fast Code[/li]
[li]Favor Pentium Pro[/li]
[li]Advanced: Remove Array Bounds Checks[/li]
[li]Advanced: Remove Integer Overflow Checks[/li][/ul]
This set of optimizations seemed to make a measurable difference in the timings for the Byte Array version listed below. Collapsing the loop termination tests in BufferLines() into a Boolean variable set within the inner loop actually degraded performance.

With my 1.2 M line, 28 char/line sample file on a PIII 533 with 256 MB RAM, a hard drive of modest performance specs, and XP Pro SP2, my results range from .5 to .9 seconds. As I said, I suspect disk caching makes a difference here. Running the program twice in a row gives a second result about half of the first.

This version only works properly for ANSI text files and single-byte delimiters.

modMain.bas
Code:
Option Explicit

Const BLOCKSIZE As Long = 32768

Function BufferLines(ByRef Buffer() As Byte, _
                     ByVal Delim As Byte) As Long
    Dim lngNext As Long
    Dim lngLast As Long

    BufferLines = 0
    lngNext = -1
    lngLast = UBound(Buffer)
    Do
        Do
            lngNext = lngNext + 1
            If Buffer(lngNext) = Delim Then
                BufferLines = BufferLines + 1
                Exit Do
            End If
        Loop Until lngNext = lngLast
    Loop Until lngNext = lngLast
End Function

Function FileLines(ByVal FileName As String, _
                   ByVal Delim As Byte) As Long
    Dim intF As Integer
    Dim lngFSize As Long
    Dim lngChunks As Long
    Dim lngRemain As Long
    Dim bytBuf() As Byte
    Dim lngChunk As Long

    If Len(Delim) <> 1 Then
        Err.Raise &H80044701, "FileLines", _
                  "Only single-char Delim values are supported"
    End If
    
    intF = FreeFile(0)
    Open FileName For Binary Access Read As #intF
    lngFSize = LOF(intF)
    lngChunks = lngFSize \ BLOCKSIZE
    lngRemain = lngFSize Mod BLOCKSIZE
    
    FileLines = 0
    
    ReDim bytBuf(BLOCKSIZE - 1)
    For lngChunk = 1 To lngChunks
        Get #1, , bytBuf
        FileLines = FileLines + BufferLines(bytBuf, Delim)
    Next
    
    If lngRemain > 0 Then
        ReDim bytBuf(lngRemain - 1)
        Get #1, , bytBuf
        FileLines = FileLines + BufferLines(bytBuf, Delim)
    End If
    
    Close #intF
    
    If bytBuf(UBound(bytBuf)) <> Delim Then FileLines = FileLines + 1
End Function

Sub Main()
    Dim strFileName As String
    Dim lngLines As Long
    Dim sngStart As Single
    Dim sngStop As Single
    
    strFileName = "..\big.txt"
    sngStart = Timer()
    lngLines = FileLines(strFileName, AscB(vbLf))
    sngStop = Timer()

    MsgBox CStr(sngStop - sngStart) & " seconds." & vbNewLine & _
           strFileName & " contains " & CStr(lngLines) & " lines", _
           vbOKOnly, "Byte Array"
End Sub
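
For comparison, a sketch of the String-buffer-plus-InStr() variant mentioned above (same assumptions as the Byte array version: ANSI text, single-byte vbLf delimiter, 32K blocks; the function name is illustrative):
Code:
' String-buffer variant: reading into a String forces an ANSI-to-
' Unicode conversion, then InStr() counts the delimiters. Per the
' timings above, expect roughly half the Byte array speed.
Private Function FileLinesInStr(ByVal FileName As String) As Long
    Const BLOCKSIZE As Long = 32768
    Dim intF As Integer
    Dim strBuf As String
    Dim lngRemain As Long
    Dim lngPos As Long

    intF = FreeFile(0)
    Open FileName For Binary Access Read As #intF
    lngRemain = LOF(intF)
    Do While lngRemain > 0
        If lngRemain < BLOCKSIZE Then
            strBuf = Input(lngRemain, intF)
        Else
            strBuf = Input(BLOCKSIZE, intF)
        End If
        lngRemain = lngRemain - Len(strBuf)
        lngPos = InStr(strBuf, vbLf)
        Do While lngPos > 0
            FileLinesInStr = FileLinesInStr + 1
            lngPos = InStr(lngPos + 1, strBuf, vbLf)
        Loop
    Loop
    Close #intF

    ' Count a final unterminated line, as in the Byte array version.
    If Len(strBuf) > 0 Then
        If Right$(strBuf, 1) <> vbLf Then FileLinesInStr = FileLinesInStr + 1
    End If
End Function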
 
Frightening bug I slipped into the sample code above. Freakishly it didn't turn up in testing!

In the two [tt]Get[/tt] statements above, [tt]#1[/tt] should really be [tt]#intF[/tt] instead.
 
Hi all. Once again thanks for everyone's participation in this thread.

The solution was a combination of your suggestions, which boiled down to: 1) read the file into a buffer, 2) replace vbLf with vbCrLf, 3) count the occurrences of vbCrLf and add 1, 4) write the changed buffered data to a temp file, and 5) open and process that temp file instead of the original buffered data. See the code fragment below. This gave me the record count and took the execution time for 200,000 records from about 1 1/2 hours down to 3 minutes.


Code:
Line Input #InputFileNumber, BufferedFile   ' file is LF-only, so this reads it all
BufferedFile = Replace(BufferedFile, vbLf, vbCrLf)
MsgBox "Lines = " & Len(BufferedFile) - Len(Replace(BufferedFile, vbCrLf, "x")) + 1
Open TempFileName For Output As #TempFileNameNumber
Print #TempFileNameNumber, BufferedFile
Close #TempFileNameNumber
Open TempFileName For Input As #TempFileNameNumber
 
tbtcust,

Although your solution works, it can get very slow depending on the file size.

Please have a look at the following as an option to your code.
It reduces the time to create a "dos" file to just 1 second on my machine, from the 300 seconds it took using your method.

Machine details
P4 3.0 HT, 2GB Ram 73GB SCSI HD U320
Windows XP PRO SP2

File size 11,867,857 bytes
Line count 270,865
Additional Software used
wc from UnxUtils.zip (wc will count the number of lines in any file supplied).

unix2dos from uddu.
unix2dos will convert any file with LF line endings to CRLF.
(Do not use it twice on the same file unless you change the source code to deal with existing CRLF files, e.g. to ignore them in this case.)



Main code taken from thread222-859655, namely the last post from Hypetia (many thanks for the code!!)

Code:
Option Explicit
Private Type STARTUPINFO
 cb As Long
 lpReserved As String
 lpDesktop As String
 lpTitle As String
 dwX As Long
 dwY As Long
 dwXSize As Long
 dwYSize As Long
 dwXCountChars As Long
 dwYCountChars As Long
 dwFillAttribute As Long
 dwFlags As Long
 wShowWindow As Integer
 cbReserved2 As Integer
 lpReserved2 As Long
 hStdInput As Long
 hStdOutput As Long
 hStdError As Long
End Type

Private Type PROCESS_INFORMATION
 hProcess As Long
 hThread As Long
 dwProcessID As Long
 dwThreadID As Long
End Type

Const STARTF_USESHOWWINDOW = &H1&
Const NORMAL_PRIORITY_CLASS = &H20&
Const INFINITE = -1&

Private Declare Function CreateProcess Lib "kernel32" Alias "CreateProcessA" (ByVal lpApplicationName As String, ByVal lpCommandLine As String, lpProcessAttributes As Any, lpThreadAttributes As Any, ByVal bInheritHandles As Long, ByVal dwCreationFlags As Long, lpEnvironment As Any, ByVal lpCurrentDirectory As String, lpStartupInfo As STARTUPINFO, lpProcessInformation As PROCESS_INFORMATION) As Long
Private Declare Function WaitForSingleObject Lib "kernel32" (ByVal hHandle As Long, ByVal dwMilliseconds As Long) As Long
Private Declare Function GetExitCodeProcess Lib "kernel32" (ByVal hProcess As Long, lpExitCode As Long) As Long
Private Declare Function CloseHandle Lib "kernel32" (ByVal hObject As Long) As Long

'--- Shells the passed command line and waits for the process to finish
'--- Returns the exit code of the shelled process
Function ShellWait(strCmdLine As String, Optional ShowCommand As VbAppWinStyle = vbNormalFocus) As Long
  Dim udtProc As PROCESS_INFORMATION, udtStart As STARTUPINFO

  'initialize the STARTUPINFO structure
  udtStart.cb = Len(udtStart) 'size
  udtStart.dwFlags = STARTF_USESHOWWINDOW 'uses show window command
  udtStart.wShowWindow = ShowCommand 'the show window command

  'Launch the application
  CreateProcess vbNullString, strCmdLine, ByVal 0&, ByVal 0&, 0, NORMAL_PRIORITY_CLASS, ByVal 0&, vbNullString, udtStart, udtProc

  'Wait for the shelled application to finish
  WaitForSingleObject udtProc.hProcess, INFINITE

  'get the exit code
  GetExitCodeProcess udtProc.hProcess, ShellWait

  'close handles
  CloseHandle udtProc.hThread
  CloseHandle udtProc.hProcess
End Function


Private Sub Form_Load()
    'test runs
Dim bufferedfile As String
Dim tempfilename As String
Dim t1 As Single   ' Timer returns a Single; Long would truncate the fractions
Dim t2 As Single
Dim t3 As Single

t1 = Timer

    ShellWait "cmd /c wc -l c:\temp\t\listax >c:\temp\t\wcount", vbHide
    ShellWait "unix2dos c:\temp\t\listax", vbHide
t2 = Timer
t3 = t2 - t1
Open "c:\temp\t\wcount" For Input As #1
Line Input #1, bufferedfile
Close #1
MsgBox "process took " & t3 & " seconds"

t1 = Timer

bufferedfile = "c:\temp\t\lista"
tempfilename = "c:\temp\t\lista.temp"
Open bufferedfile For Input As #1
Line Input #1, bufferedfile   ' variable reused: file name in, whole LF-only file contents out
Close #1
bufferedfile = Replace(bufferedfile, vbLf, vbCrLf)
'MsgBox "Lines = " & Len(bufferedfile) - Len(Replace(bufferedfile, vbCrLf, "x")) + 1
Open tempfilename For Output As #2
Print #2, bufferedfile
Close #2
t2 = Timer
t3 = t2 - t1
MsgBox "process took " & t3 & " seconds"
End
End Sub

Regards

Frederico Fonseca
SysSoft Integrated Ltd
 

Just a thought... If the only reason you want to know the number of records is to be able to display a progress bar, then if you can obtain a statistical average record length for each file type and divide that into the total number of bytes in the file, it should be good enough "for government work".

If the average is expected to vary over time, you could keep track of the exact average record length (for each file type) as you process the file and then store that for use the next night.
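
A sketch of that estimate (the stored average, file name, and progress reporting are hypothetical, and it assumes line endings Line Input understands):
Code:
' Rough progress estimate: divide the file size by a stored average
' record length for this file type. sngAvgRecLen and the file name
' are hypothetical placeholders.
Private Sub ShowEstimatedProgress()
    Dim ff As Integer
    Dim strLine As String
    Dim lngEstTotal As Long
    Dim lngDone As Long
    Dim sngAvgRecLen As Single

    sngAvgRecLen = 42.5               ' statistical average, tracked per file type
    ff = FreeFile
    Open "C:\data\tranfile.txt" For Input As #ff
    lngEstTotal = CLng(LOF(ff) / sngAvgRecLen)

    Do While Not EOF(ff)
        Line Input #ff, strLine
        lngDone = lngDone + 1
        ' update a progress bar here from lngDone / lngEstTotal
    Loop
    Close #ff

    Debug.Print "Processed " & lngDone & " of an estimated " & lngEstTotal
End Sub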

 
LOF = 131,690,360 bytes | .txt
---------------------
strongm

adUseClient, adOpenKeyset, Select Count(*)
Time 09 min 03 sec, Records 63,187

adUseClient, adOpenKeyset, Select *, .RecordCount
Time 10 min 05 sec, Records 63,187

adUseServer, adOpenStatic, Select Count(*)
Time 09 min 04 sec, Records 63,187

adUseServer, adOpenStatic, Select *, .MoveLast, .RecordCount
Time 10 min 02 sec, Records 63,187

adUseServer, adOpenForwardOnly, Select Count(*)
Time 09 min 09 sec, Records 63,187

---------------------
dilettante
Time 00 min 16 sec, Lines 66,150

---------------------
trollacious
Time 00 min 11 sec, Lines 66,150


Actual records imported to an Access mdb ... 66,150


Why this 2,963-record difference?
 
Maybe because some records were dropped, or because you have empty lines in the file. In that case Access will probably ignore the empty lines.


And did you try my method?

Regards

Frederico Fonseca
SysSoft Integrated Ltd
 
>09 min

!!!

Something very wrong there. I wasn't expecting it to be faster than the other methods (just shorter). Even on my slowest machine I get about 25 secs for a similarly sized data set in the IDE, versus about 50 seconds for dilettante's method and 15 seconds for trollacious's; compiled, it's 25 seconds versus 4.5 and 15 seconds respectively.

>Why this 2,963-record difference?

SQL's COUNT will not count NULL records
 
Something still seems very wrong here.

I think disk read caching may be throwing things off. The times I'm seeing rank quite differently.

How much RAM is in your test machines?

Processing the same file multiple times seems an unlikely real-world scenario to me. Thus a fair test should be performed with caching disabled - something I have no idea how to accomplish under Windows XP.

For the most valid testing it might be reasonable to allow caching to operate as it will, but reboot before each test is run!
 
<SQL's COUNT will not count NULL records

Try
Select Count(primarykeyfieldname) from tablename
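
Applied to strongm's earlier text-ISAM snippet, that might look like the line below. A raw text file has no true primary key, so F1 stands in for one (with HDR=No the Jet text driver names the columns F1, F2, and so on; an assumption here):
Code:
' Hypothetical adaptation of the earlier ADO snippet: count a
' concrete column (F1, Jet's default name for the first column
' of a headerless text file) instead of Count(*).
recordset1.Open "SELECT Count(F1) AS TotalRecords FROM [data.txt]", cn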

 
strongm

I'm sorry to say that your approach ended up third, followed by many variations of it.

-----------------------
dilettante

HP d330
CPU P4 3.0 GHz HT
RAM 512MB
HDD 80Gb
WinNT 4.0 sp 6

-----------------------
Everyone

Sorry to say that the txt file has NO empty lines. Imported as-is into an mdb it gives 66,150 records; xls couldn't read it all, only up to 65,536 rows (obvious why).

Maybe the wrong thing here is that I ran the test from within a module of an mdb (which is also the reason I didn't test Frederico's approach), and I cannot disable the antivirus or disk caching...

BUT, I'm willing to try through VB6.0 sp5.

Please hold on. We are processing the transaction ....
 