parsing large data files using CF

Status
Not open for further replies.

Bammy217

Programmer
Sep 28, 2002
74
BE
hey folks,

Right, here's a challenge... I need to parse an 18 MB txt file, totally unformatted, though I know the section I want is labelled [start_(date)] .... [end_(date)], with the .... being the text I need to parse. What I've done so far is:

Turn the file into a list (CFFILE Action="read", then translate to a list) and loop over it! That worked fine until the file started getting massive (end of the month!!). So is there anyone out there who has a solution that allows me to parse large files in CF?

Cheers,
D,

We never fail, we just find that the path to success is never quite what we thought...
 
Well... my first inclination is to say that 18 MB is really too massive for a web app. Particularly if it's a public or semi-private customer-facing app where you're going to have multiple people hitting it at once (think about it... each hit would load a new instance of that 18 MB file into server memory).

Though, technically, there really aren't any roadblocks to ColdFusion reading/loading such a large file, if you insist on doing so (the only limitation being the amount of server memory available)... what I imagine you're running into is a request timeout... which could easily be remedied by overriding the timeout for that request... such as calling the page with a RequestTimeout parameter in the URL (somepage.cfm?requesttimeout=9999).
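
For what it's worth, if the server is running ColdFusion MX, my understanding is that the URL parameter is no longer honored and the timeout is set inside the page with cfsetting instead. A minimal sketch (the 9999 is just carried over from the URL example, and is in seconds):

```cfml
<!--- Assumes ColdFusion MX or later. Raises the timeout for this request only. --->
<cfsetting requestTimeout="9999">
```

Put it near the top of the page that does the parsing, so it applies before the long-running work starts.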

That being said... there are several things that confuse me about your question. You say the text in the file is "totally unformatted"... yet you mention that there are sections ("[start_(date)]")... which would indicate that it has some formatting. Also, you mention that you load up the file and then treat it as a list... which wouldn't be possible unless the file had some formatting (some sort of delimiter that you use in the list functions to denote chunks of data).

If there's enough formatting in the file to treat the text as a list, there's probably enough formatting to use it as a query. Create a DSN using the Merant ODBC Text Driver (available in versions prior to MX) with that file as a data source. You can then grab the specific data you need using SQL and standard CFQUERY tags. It's much more efficient for larger files... because the system only loads the file once, and automatically handles swapping segments into RAM.


-Carl
 
Carl,

Cheers for the message. The 18 MB file is indeed a bastard... though it is truly unformatted... well, it is formatted, unfortunately: the client has given us compiled log files (I think it's about 12 logfiles in a row), so there isn't much we can do about it...

Had a look at the Merant driver, and would greatly appreciate some more help on this one...

The file format I'm dealing with is:
[start_(date)]
Value01 Value01 Value01 Value01
Value01 Value01 Value01 Value01
Value01 Value01 Value01 Value01
Value01 Value01 Value01 Value01
etc...
[end_(date)]
[start_(files)]
Value02 Value02 Value02
Value02 Value02 Value02
Value02 Value02 Value02
Value02 Value02 Value02
Value02 Value02 Value02
etc...
[end_(files)]

and so on... leading to a grand total of a few hundred thousand lines of junk. Now the idea is to parse out all relevant information and stick it into a DB (then at least we can do some serious querying on there!) and allow users to play with the data once it has been sucked out of the original files...

The pain is that, as it stands, I turn the file into a list with all spaces, line breaks and paragraphs replaced by '/#/', then loop over that list with a separator of '/#/'. That works okay for the first 15 to 25 days of the month, but towards the end it just craps out on me (runs for over 23.59 hours), and as I can't assign it higher priority or stick it on a better server, I need to get it solved... PLEASE HELP!!!!

Cheers,
D.

We never fail, we just find that the path to success is never quite what we thought...
 
Yuck!
Okay... my first suggestion would be... don't process the file every time a user hits the page. I would recommend that you have a pre-processor: a page that sits out somewhere, parses through the file, turns it into something that's better formatted and more efficient, and saves that result to other txt files. Those are the files that your customer-facing page(s) read.

This pre-processor page can be run manually, or set up an entry in the CF scheduler to do it every night or whatever.
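
If you go the scheduler route, the task can also be registered from code rather than through the Administrator. A rough sketch, where the task name, URL, start date and timeout are all made up for illustration:

```cfml
<!--- Hypothetical task name, URL and start date; adjust for your server. --->
<!--- Requests the pre-processor page once a day at 2:00 AM. --->
<cfschedule action="update"
    task="PreprocessLogFile"
    operation="HTTPRequest"
    url="http://127.0.0.1/admin/preprocess_log.cfm"
    startDate="12/15/2003"
    startTime="2:00 AM"
    interval="daily"
    requestTimeOut="3600">
```

The requestTimeOut here is in seconds, and is there because the pre-processor is exactly the kind of page that will run long.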

If appropriate, it could save separate txt files for each date (that way you're dealing with much smaller files) or whatever. That's up to you.


Second suggestion would be to use regular expressions rather than loop through a list (since your list delimiter is giving you fits anyway).

Example:

Let's say you loaded up the file into a variable called sLogFileContent, and it contained something like:
Code:
[start_2003-12-10] 
Value01 Value01 Value01 Value01 
Value01 Value01 Value01 Value01 
Value01 Value01 Value01 Value01 
Value01 Value01 Value01 Value01 
etc...
[end_2003-12-10]
[start_2003-12-11]
Value02 Value02 Value02
Value02 Value02 Value02
Value02 Value02 Value02 
Value02 Value02 Value02 
Value02 Value02 Value02 
etc...
[end_2003-12-11]
[start_2003-12-12]
Value03 Value03 Value03
Value03 Value03 Value03
Value03 Value03 Value03 
Value03 Value03 Value03 
Value03 Value03 Value03 
etc...
[end_2003-12-12]

you could get at the individual days values like so:
Code:
<!--- Date of the section to extract --->
<CFSET dtTargetDate = "2003-12-11">

<!--- With the fourth argument set, the result is a struct of pos/len arrays; element 2 is the text between the tags --->
<CFSET stTargetData = ReFindNoCase("\[start_#dtTargetDate#\]([^\[]*?)\[end_#dtTargetDate#\]",sLogFileContent,1,"true")>

<!--- Guard against the no-match case before reading element 2 --->
<CFIF IsStruct(stTargetData) AND StructKeyExists(stTargetData,"pos") AND IsArray(stTargetData.pos) AND ArrayLen(stTargetData.pos) GT 1 AND stTargetData.pos[2] GT 0>
    <CFOUTPUT>#Mid(sLogFileContent,stTargetData.pos[2],stTargetData.len[2])#</CFOUTPUT>
</CFIF>
This would output the data between the start and end tags for December 11, or whichever date you set.

Instead of outputting the data, you could save it to a 2003-12-11.txt file, etc.
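
To save every section rather than a single date, the same regular expression idea can be run in a loop. This is only a sketch under a few assumptions: the two paths are hypothetical, the output folder already exists, section bodies never contain a "[" character, and whatever follows "start_" is safe to use as a file name.

```cfml
<!--- Hypothetical paths; adjust for your server. --->
<cfset sSourceFile = "C:\logs\compiled.txt">
<cfset sOutputDir = "C:\logs\split\">

<cffile action="read" file="#sSourceFile#" variable="sLogFileContent">

<cfset nStart = 1>
<cfloop condition="true">
    <!--- Subexpression 1 is the label after "start_", subexpression 2 is the section body. --->
    <cfset stMatch = REFindNoCase("\[start_([^\]]+)\]([^\[]+)\[end_[^\]]+\]", sLogFileContent, nStart, true)>

    <!--- pos[1] is 0 once no further section is found. --->
    <cfif stMatch.pos[1] EQ 0>
        <cfbreak>
    </cfif>

    <cfset sLabel = Mid(sLogFileContent, stMatch.pos[2], stMatch.len[2])>
    <cfset sBody = Mid(sLogFileContent, stMatch.pos[3], stMatch.len[3])>

    <!--- One output file per section, named after its label. --->
    <cffile action="write" file="#sOutputDir##sLabel#.txt" output="#Trim(sBody)#">

    <!--- Resume searching just past the section that was just written. --->
    <cfset nStart = stMatch.pos[1] + stMatch.len[1]>
</cfloop>
```

Each pass starts the search where the previous match ended, so the file is only read once. Note that this version does not check that the end label matches the start label; it simply takes the first end tag that follows.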




-Carl
 