Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chriss Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Help - I need faster way to edit a LARGE text file 2

Status
Not open for further replies.

kskinne

Technical User
Oct 8, 2004
169
US
I have the following code in a VB project that loops through each record in a text file and edit each record as necessary, in a new text file:
Code:
  Open myOldFileName For Input As #1
  Open myNewFileName For Output As #2
  Do Until VBA.EOF(1)
    Line Input #1, str1
    If Left(str1, 2) = "11" Then
      If (Mid(str1, 30, 8) = "71277570" Or Mid(str1, 30, 8) = "71277571" Or Mid(str1, 30, 8) = "71277572" Or Mid(str1, 30, 8) = "71277573" Or Mid(str1, 30, 8) = "71277574") Then
        Print #2, Application.WorksheetFunction.Replace(str1, 41, 4, "9999")
      Else
        Print #2, str1
      End If
    Else
      Print #2, str1
    End If
  Loop

  Close #2
  Close #1

The code works as intended however it runs very slowly, because most of the text files are 300,000+ records in length. It usually takes over three minutes to run and I have a fast machine. And the problem is this has to be done to not just one but many files.

Is there a better way to accomplish the same thing, only faster?

Thanks,
Kevin
 
You have 5 mid functions that are very similar. You will probably get better performance if you get the mid just once.

Code:
      Select Case Mid(str1, 30, 8)
          Case "71277570","71277571","71277572","71277573","71277574"
            Print #2, Application.WorksheetFunction.Replace(str1, 41, 4, "9999")
          Case Else
            Print #2, str1
      End Select

Also, this does not look like VB6. I suspect this is vbscript instead. If this were VB6 I would encourage you to use the string versions of Left and Mid, which are Left$ and Mid$.

In VB6, left and mid (and several other functions) take variant parameters. Since you already have a string, there is no need to convert it to a variant, then to a string. By using the $ versions, there is no convert process, so they are slightly faster. When run in a loop like this, that time can add up quickly.

If I am right, and this is actually VBScript, I don't think you can use the $ versions.

-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
[tt]VBA.EOF(1)[/tt] looks bizarre enough, but [tt]Application.WorksheetFunction.Replace(str1, 41, 4, "9999")[/tt] takes the cake.

You are trying to do this using Excel VBA or something, right? Have you asked in the appropriate forum? VBA is not VB.


One thing might help a little: maybe just extract that "30 for 8" substring once for all of those comparisons?

Aside from that I think you're simply I/O bound and will have to get a lot fancier so you can read and write large data blocks.

The only other quick and dirty thing you might try is to add [tt]Len = 32767[/tt] to the end of those Open statements. This should cause VBA to try to use the largest buffers you can request on a sequential Open. It might not make a significant difference though.
 
No this is not vbscript or VBA - I'm in a VB6 exe project. But it sounds like there are some changes you've both suggested, that may speed things up a bit. Thank you both, I'll try these changes and post back with results.

Kevin
 
Please understand that there are often times when people post questions in the wrong forum. After all vb6, vbscript, vba and vb.net all start with "vb". I'm sure you can see how that would be confusing to some people.

Anyway, you should use the $ version of the mid and left functions.

Your replace function looks a little strange. The parameters appear to be different from the built-in replace function. So, using the built-in replace function will probably not work. BUT... Application appear to be an object with several properties and/or methods. You may get a further performance boost by using a with block for your replace function.

I would suggest something like this:

Code:
  Open myOldFileName For Input As #1
  Open myNewFileName For Output As #2
  
[red]  With Application.WorksheetFunction[/red]
  
    Do Until VBA.EOF(1)
      Line Input #1, str1
      If VBA.Left$(str1, 2) = "11" Then
        Select Case VBA.Mid$(str1, 30, 8)
            Case "71277570", "71277571", "71277572", "71277573", "71277574"
              Print #2, [red].Replace(str1, 41, 4, "9999")[/red]
            Case Else
              Print #2, str1
        End Select
      Else
        Print #2, str1
      End If
    Loop
  [red]End With[/red]
  
  Close #2
  Close #1

I'm curious to know how much improvement this gives. Can you please post back with some before and after execution times?

-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
I think we can do that in ... oh, about 3 seconds ... (at least that's what 300000 records that ALL need replacing takes on my PC) ...

Drop this in instead of your code:
Code:
[green]    ' late binding version[/green][blue]
    Dim fso As Object
    Set fso = CreateObject("scripting.FileSystemObject")
    With CreateObject("vbscript.regexp")
        .MultiLine = True
        .Global = True
        .Pattern = "(11.{27}7127757(0|1|2|3|4).{3}).{4}(.*)"
        fso.CreateTextFile(myNewFilename, True).Write .Replace(fso.OpenTextFile(myOldFilename).ReadAll, "$19999$3")
    End With[/blue]



 
gmmastros, I made those changes you recommended in your first post regarding left$ and mid$, and also nested those .replace statements in a with clause, but the difference was negligible. I took one of the larger text files at 678,722 rows and before and after both took about 7 minutes to execute.

strongm I'm definitely intrigued by your post, that's what I need is something that will change the records that quickly, however I tried inserting your code in place of my existing and when I ran it, it thoroughly locked up my machine for some reason, I had to end task on my project, any idea what could have caused that?

I'll have to pick this up tomorrow and post back again, in the meantime if you have anything that may help I'd really appreciate it.

Thanks
Kevin
 
Can't say. I just tested it against a 600000 line text file and it took about 5 seconds and no lockup of any sort. I guessed at a record length of 65 characters for the sake of the test,; maybe yours are longer. Hang on ...

... ok, bumped it up to 200 characters per record with 600000 records so now takes a little over 10 seconds. And still no lock up.
 
strongm,

Could you please test the method I suggested against your data file? I suspect it will be slower than your method, but I'm curious to know what the difference would be.



-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
George, yours takes about 18 seconds run against the same test file
 
thanks strongm.

Correct me if I'm wrong....

Your method loads the entire file in to memory. It's faster, but more memory intensive because it uses the file system object and does a read all. There's also a replace function operating on the entire (long) string, so it's possible that the amount of memory used could be as high as twice the size of the original data file.

The method I proposed takes almost no memory at all, but is slower.



-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Now that I have all the changes implemented that gmmastros recommended, including the with clause, I do see a noticable difference in the time it takes the compiled project to run - it is now taking a little over 1 minute 30 seconds, which is much better than before. That is on the file with over 678,000 rows, and that is running through the rows twice, because I am first writing the changes to a second, temporary text file. Then I am open this file, and make a second set of changes to a different piece of each record and writing those changes to a third file, which is my final version of the file. Then I am killing the temporary file that got created, that only has the first set of changes. I would prefer it to be fast yet, but it's a definite improvement.

This brings up another question - is there a way I can make a change to each record in two separate places of the record, at one time instead of having to loop through the text file twice? Here is the current version of my code:

Code:
With Application.WorksheetFunction
  Open myOldFileName For Input As #1
  Open myNewFileNameTemp For Output As #2

  Do Until VBA.EOF(1)
    Line Input #1, str1
    If Left$(str1, 2) = "11" Then
      If Mid$(str1, 46, 4) = txtOldCic Then
        Print #2, .Replace(str1, 46, 4, txtNewCic)
      Else
        Print #2, str1
      End If
    Else
      Print #2, str1
    End If
  Loop

  Close #2
  Close #1

  Open myNewFileNameTemp For Input As #1
  Open myNewFileName For Output As #2

  Do Until VBA.EOF(1)
    Line Input #1, str1
    If Left$(str1, 2) = "11" Then
      Select Case Mid$(str1, 30, 8)
        Case "71277570", "71277571", "71277572", "71277573", "71277574"
          Print #2, .Replace(str1, 41, 4, "9999")
        Case Else
          Print #2, str1
      End Select
    Else
      Print #2, str1
    End If
  Loop

  Close #2
  Close #1

  Kill (myNewFileNameTemp)
End With

also strongm I still don't understand why my machine locks up when I try using the code you supplied - I am running Windows XP pro SP3, with about 3.2gb of ram and a 2.53ghz intel core duo processor. I don't know if my system has something to do with it or if it is the code itself.

Thanks for your help,
Kevin
 
Here are a couple changes for you. At a minimum, this should cut your execution time in half because this code only goes through the data file once.

Code:
[!]Dim NewCic As String
Dim OldCic As String

OldCic = txtOldCic.Text
NewCic = txtNewCic.Text[/!]

With Application.WorksheetFunction
  Open myOldFileName For Input As #1
  Open myNewFileNameTemp For Output As #2

  Do Until VBA.EOF(1)
    Line Input #1, str1
    If Left$(str1, 2) = "11" Then
      If Mid$(str1, 46, 4) = [!]OldCic[/!] Then
        str1 = .Replace(str1, 46, 4, [!]NewCic[/!])
      End If
    
      Select Case Mid$(str1, 30, 8)
        Case "71277570", "71277571", "71277572", "71277573", "71277574"
          str1 = .Replace(str1, 41, 4, "9999")
        Case Else
      End Select
    End If
    
    Print #2, str1
  Loop

  Close #2
  Close #1

End With

Also notice the parts in red. I assume that txtNewCic is a text box control on your form. By setting a variable outside the loop with this value, and then using it within the loop, your code should speed up a bit.

When performance tuning your code, you should almost always pay attention to what happens inside loops. In this case, you are looping on rows from a text file. As such, it's important to look at what is happening inside the loop and optimize those operations. In this case, you are repetitively accessing the default property of a text box (which happens to be the .Text property). Accessing a property of ANY object is going to be slower than accessing the value in a locally declared variable.

-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
In the code I just posted, you should change:

Open myNewFileNameTemp For Output As #2

to

Open myNewFileName For Output As #2

-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Thanks gmmastros - I made that change and that does make more sense to do it that way, have it update the str1 first for both of those two separate changes, and then write that new line to the second text file, instead of the way I was doing it before. However, strangely enough it didn't make a really significant difference in the speed. It was a little faster, this time it took about 1 minute 15 sec, but I guess I was expecting it to cut the time in half. But maybe this is the fastest it will run, if that the case I can live with that.

Just in case, here is what I *think* will be the final version of this (at least for now), I added another select statement, so the user has the option to choose whether they want to make one or the other of the two changes to the text file, or both:
Code:
With Application.WorksheetFunction
  Open myOldFileName For Input As #1
  Open myNewFileName For Output As #2
  Do Until VBA.EOF(1)
    Line Input #1, str1
    If Left$(str1, 2) = "11" Then
      Select Case CIC_TG
        Case "01"
          Select Case Mid$(str1, 30, 8)
            Case "71277570", "71277571", "71277572", "71277573", "71277574"
              str1 = .Replace(str1, 41, 4, "9999")
            Case Else
          End Select
        Case "10"
          If Mid$(str1, 46, 4) = txtOldCic Then
            str1 = .Replace(str1, 46, 4, txtNewCic)
          End If
        Case "11"
          If Mid$(str1, 46, 4) = oldCIC Then
            str1 = .Replace(str1, 46, 4, newCIC)
          End If
          Select Case Mid$(str1, 30, 8)
            Case "71277570", "71277571", "71277572", "71277573", "71277574"
              str1 = .Replace(str1, 41, 4, "9999")
            Case Else
          End Select
        Case Else
      End Select
    End If
    Print #2, str1
  Loop
  Close #2
  Close #1
End With

If you see any problems with the way I set this up I'd appreciate the feedback, otherwise thank you very much for the help.

Kevin
 
one thing to note, disregard in the code where I showed I was still referring to txtoldcic.text instead of the oldcic variable, and the same with newcic, I did get those changed, just not in the copy I pasted here.

Also update: I replaced:

str1 = .Replace(str1, 41, 4, "9999")

with:

str1 = Replace$(str1, Mid(str1, 41, 4), "9999")

And now this completes in about 15 sec - I can live w/ that :)

If you see anything else in the code that could/should be changed, please let me know. Thanks again, I couldn't have done this otherwise.

Thanks,
Kevin
 
str1 = Replace$(str1, Mid(str1, 41, 4), "9999")

I was thinking about that. But you know.... I think it might be a problem. Obviously I don't know your data, so I can't say for sure.

I say this because the 2 replace functions do not do the same thing.

The replace function in the Application.WorkSheetFunctions thingy will not replace multiple occurrences of a string like the VB6 version of the replace function.

For example...

Code:
    Dim str1 As String
    
    str1 = "ABC123ABC123"
    
    Dim Application As Excel.Application
    Set Application = New Excel.Application
    
    With Application.WorksheetFunction
        MsgBox .Replace(str1, 10, 3, "999")
        MsgBox Replace$(str1, Mid$(str1, 10, 3), "999")
    End With

When you run this, you will see:

[tt][blue]
ABC[!]123[/!]ABC999
ABC[!]999[/!]ABC999
[/blue][/tt]

As you can see, the output is different.


-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
There are some optional parameters to the VB6 Replace function that may be helpful.

According to the help files...

The return value of the Replace function is a string, with substitutions made, [!]that begins at the position specified by start[/!] and and concludes at the end of the expression string. It is not a copy of the original string from start to finish.

so, you could try this:

Code:
' str1 = .Replace(str1, 41, 4, "9999")
str1 = Left$(str1, 40) & Replace$(str1, Mid(str1, 41, 4), "9999", 41, 1)

I don't know if this will be faster than using the other replace function or not. Basically we are using 3 string functions here to replace the one call to the excel object's replace function.



-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Thanks George - the vb6 replace I am using does correctly edit the two specific places in each record that need to be edited, so I don't see a problem with how that is working.

For the your last reply, it is much slower when I use the .replace function, even if I am only replacing the one little section of the record, and then concatenating it with the rest of the record to the left and to the right of the changed portion. I think calling the using the excel worksheetfunction is just slower.

Thanks,
Kevin
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top