
simple, but UGLY 2


hughed00

Technical User
Jan 15, 2001
I have a huge (500MB and growing) text file which is fixed field length with no delimiters.

I need to determine if characters 28-77 are ever repeated in a subsequent row/record, and if so, spit them and the entire record containing them out to another file, leaving only one occurrence of a record with the first occurrence of that 50-character string.

It seems like a very tedious process to grab the string and search the entire file for another occurrence of a record containing it, but maybe this is the only way?

Basically, this is data cleanup prior to use in a data warehouse application.

I had been told I could do all kinds of magical tricks with SQL temp files and such, but all of it has left a bad taste. I was hoping someone over here who works with text files could point me toward a cleaner, better solution.

Thanks.
 
Here's a start.

[tt]
#!/usr/bin/perl

#
# To process the text file as it is written.
# This uses the UNIX program "tail" to read the log file and
# creates two new files. One is a "cleaned" version of the log file
# containing only "unique" data (unique as specified in *this*
# case anyway) and another of the duplicate lines detected.
#
# Stop the script with ^C
#
# Read the log file as it stands and then each new line as it is written
# If the chars 28->77 of the current line match chars 28->77 of the previous line
# print the line to the DUPES file
# else
# print the line to the cleaned up file
# fi
#
# This script relies on two UNIX utilities: tail and wc
#
# tail -- used to read the log file and (-f)ollow new lines
# as they are added
# wc -- used to count how many lines are in the logfile so
# far, so that the whole file is processed.
#
# BUGS: Finding these is left as an exercise for the reader :)
# But one is certainly that if the log file is written to between
# counting the lines in it and starting to process them then
# some lines at the beginning of the file will not be processed.
# Start the script when the log file is quiet, if possible.
# Another is that I haven't actually tested it, so there might be
# one or two typos......
#

use strict;

my $LogFile     = "logfile.txt";
my $CleanedFile = "cleaned.txt";
my $Dupes       = "dupes.txt";

open(CF,    ">$CleanedFile") || die "Can't create $CleanedFile\n$!";
open(DUPES, ">$Dupes")       || die "Can't create $Dupes\n$!";

my $prev_line = '';

# how many lines in the logfile so far then?
# (wc -l prints the count followed by the file name, so take the first field)
my ($lines) = split ' ', `wc -l $LogFile`;

open(TAIL, "tail -$lines -f $LogFile|") || die "Can't run tail -f on $LogFile\n$!";
while (<TAIL>) {
    if ($prev_line eq '') {
        # first line: nothing to compare against yet
        print CF $_;
        $prev_line = $_;
        next;
    }
    # columns 28-77 (1-based, inclusive) are offset 27, length 50
    if (substr($_, 27, 50) eq substr($prev_line, 27, 50)) {
        # a match, so write the dupes file
        print DUPES $_;
    } else {
        # no match, so write the cleaned up version of the log file
        print CF $_;
    }
    # save the current line to compare with the next one.
    $prev_line = $_;
}
[/tt]
Mike
michael.j.lacey@ntlworld.com
Email welcome if you're in a hurry or something -- but post in tek-tips as well please, and I will post my reply here as well.
 
Thanks Mike. Unfortunately, I am using a Win32 machine, so I don't have access to the tail program.
 
oh, right then ..... <wince>

Mike
michael.j.lacey@ntlworld.com
Email welcome if you're in a hurry or something -- but post in tek-tips as well please, and I will post my reply here as well.
 
Ok -- well the idea was that you would be able to leave that script running and it would automatically keep on creating the two files; never mind I guess....

The approach is still applicable, but you would have to run it against the log file every so often, like this:

Move (mv) the log file to another name, "newfile.txt" for instance, and let the application carry on writing to the old filename.

Run the script below on newfile.txt

[tt]
use strict;

my $LogFile     = "newfile.txt";
my $CleanedFile = "cleaned.txt";
my $Dupes       = "dupes.txt";

open(CF,    ">$CleanedFile") || die "Can't create $CleanedFile\n$!";
open(DUPES, ">$Dupes")       || die "Can't create $Dupes\n$!";

my $prev_line = '';

open(F, $LogFile) || die "Can't open $LogFile\n$!";
while (<F>) {
    if ($prev_line eq '') {
        # first line: nothing to compare against yet
        print CF $_;
        $prev_line = $_;
        next;
    }
    # columns 28-77 (1-based, inclusive) are offset 27, length 50
    if (substr($_, 27, 50) eq substr($prev_line, 27, 50)) {
        # a match, so write the dupes file
        print DUPES $_;
    } else {
        # no match, so write the cleaned up version of the log file
        print CF $_;
    }
    # save the current line to compare with the next one.
    $prev_line = $_;
}
[/tt]

Mike
michael.j.lacey@ntlworld.com
Email welcome if you're in a hurry or something -- but post in tek-tips as well please, and I will post my reply here as well.
 
The code samples above will only find duplicates when the duplicates are in adjacent records - one following immediately after the next. They also keep the first occurrence and write later occurrences to the duplicates file.

I interpret the original question differently... I think that the duplicates could occur anywhere in the file, and that if a duplicate is found, the last occurrence is the one you want to keep. The whole record is not duplicated, just the 50 bytes in the range 28-77, so which record to keep is important.

I can envision a solution that involves using bytes 28-77 as the key to a hash that stores the entire record. As each record is read the hash is checked for a previous occurrence. If found, the previous occurrence is written to the duplicates file and then the hash value is replaced by the later occurrence. After the final record is processed the hash values containing the last entries for each key are written out. This code would not preserve record order - if that is important the solution would have to be modified.

With a 500MB file, memory usage would also be an issue. This solution would require enough memory to hold one record for each unique occurrence of the 28-77 bytes, plus 50 bytes for the key, plus other hash overhead. Unless you have a large percentage of duplicates, this could be a problem.
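
A minimal sketch of that hash-based approach (column numbers as described above; the file names, and keeping the last occurrence of each key, are assumptions taken from this thread):

[tt]
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: keys are bytes 28-77 of each record (substr offset 27,
# length 50); the last record seen for each key survives, earlier
# occurrences go to the dupes file. File names are placeholders.
my %last_seen;

open(my $in,    '<', 'bigfile.txt') or die "Can't open bigfile.txt: $!";
open(my $dupes, '>', 'dupes.txt')   or die "Can't create dupes.txt: $!";
open(my $clean, '>', 'cleaned.txt') or die "Can't create cleaned.txt: $!";

while (my $record = <$in>) {
    my $key = substr($record, 27, 50);
    if (exists $last_seen{$key}) {
        # an earlier occurrence loses; write it out as a duplicate
        print $dupes $last_seen{$key};
    }
    $last_seen{$key} = $record;    # keep the latest occurrence
}

# write the surviving records (note: original record order is lost)
print $clean $_ for values %last_seen;

close $in;
close $dupes;
close $clean;
[/tt]

As noted above, the whole %last_seen hash lives in memory, so this only works if the number of distinct keys is manageable.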
 
Sackyhack,

Yes, good points there; comments hughed00?

The memory-usage problem could possibly be gotten around by using NDBM files (they look the same as hashes from the Perl code, but save the data to disk).
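
For what it's worth, a rough sketch of the tie (the database file name 'seen_keys' is just a placeholder; some NDBM implementations limit how much data can be stored per key, so whole records may not fit and DB_File could be a safer choice):

[tt]
use strict;
use Fcntl;
use NDBM_File;

# Tie a hash to an on-disk NDBM database so the keys/records
# don't all have to be held in memory at once.
tie my %last_seen, 'NDBM_File', 'seen_keys', O_RDWR | O_CREAT, 0644
    or die "Can't tie NDBM file: $!";

# ... use %last_seen exactly as you would an ordinary hash ...

untie %last_seen;
[/tt]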
Mike
michael.j.lacey@ntlworld.com
Email welcome if you're in a hurry or something -- but post in tek-tips as well please, and I will post my reply here as well.
 
Exactly!! I was going to get back to you all on this today, but we had a power meltdown and I have been futzing about with UPSs all morning.

sackyhack's reading of my question was right on as duplicates can occur anywhere in the source file.

I didn't even get a chance to look at Mike's code until after lunch last Friday (after beers no less).

Thanks all once again. I am trying hard to learn Perl by working my way through several books a couple hours a day, but other things keep getting in the way. Your guidance is more appreciated than you know.

-- Dave
 
Well, I have an idea that won't take up as much memory as the hash idea, but will take MUCH longer.

Basically, it's like this: read in a line, then go through the file, checking for duplicates and deleting them, or printing them to the new file if they aren't duplicates. The biggest problem with this method, however, is that after you're done checking the whole file against the first line and printing out those that didn't match, you have to move on to the next line, checking the whole file (minus what has already been checked) for duplicates. This would mean you'd have to copy the file once for every line. You may be able to get some speed boost by initially reading in lots of lines (a couple hundred), then comparing this buffer against the whole file, and then moving on to the next bunch of lines. I started to play with some code, but realized it would be more than just UGLY. I did finish a subroutine to compare a line buffer to itself, which I include below.

However, as a note, I have to say you would probably (s/probably/definitely/) have an easier time of this if the file were in some sort of real database. There's got to be a faster mathematical algorithm, maybe even based on my general idea, that is already built into the database handler of a decently sophisticated database. (Well, I don't actually know this, but I do know that just doing this would be a lot simpler with SQL...)


[tt]
sub compare_buffer_to_self
{
    my @buffer = @{ $_[0] };    # dereference the array ref passed in

    for (my $i = 0; $i < @buffer; $i++)
    {
        # columns 28-77 (1-based, inclusive) are offset 27, length 50
        my $str = substr($buffer[$i], 27, 50);
        my $counter = 0;
        # look for the same key anywhere else in the buffer
        foreach (@buffer[0 .. ($i - 1), ($i + 1) .. $#buffer])
        {
            if ($_ =~ m/^.{27}\Q$str\E/) { $counter++; last; }
        }

        if (!$counter)
        {
            print OUTFILE $buffer[$i];
        }
        else
        {
            # drop this copy; a later occurrence will survive
            splice(@buffer, $i, 1);
            $i--;    # re-check the element that just shifted into slot $i
        }
    }

    return wantarray ? @buffer : \@buffer;
}
[/tt]

Note: I tested this a little, and it does work in simple cases...

"If you think you're too small to make a difference, try spending a night in a closed tent with a mosquito."
 
Full circle I guess. My first inclination was to do this in SQL with a DISTINCT command.
 
Hughe -- have a look at NDBM files; they could be just what you need here.

Mike
michael.j.lacey@ntlworld.com
Email welcome if you're in a hurry or something -- but post in tek-tips as well please, and I will post my reply here as well.
 
This is a SORT problem. You could use a hash function on the first pass to determine which records have no "possible" duplicates, and then on the second pass create a file of only those that could possibly be duplicates. Sort that output file (in core?) and do a "standard" sequential read, checking the previous key.
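
A rough two-pass sketch of that idea (file names are placeholders; the candidate records are sorted in memory here for brevity, though an external sort would scale better, and record order is not preserved for the duplicated keys):

[tt]
#!/usr/bin/perl
use strict;
use warnings;

# Pass 1: count how often each 50-byte key (columns 28-77) appears.
# Pass 2: records with a unique key go straight to the cleaned file;
#         the rest are sorted by key and only the last one per key is kept.
my (%count, @maybe_dupes);

open(my $in, '<', 'bigfile.txt') or die "Can't open bigfile.txt: $!";
while (<$in>) {
    $count{ substr($_, 27, 50) }++;
}

open(my $clean, '>', 'cleaned.txt') or die "Can't create cleaned.txt: $!";
open(my $dupes, '>', 'dupes.txt')   or die "Can't create dupes.txt: $!";

seek($in, 0, 0) or die "Can't rewind: $!";
while (<$in>) {
    my $key = substr($_, 27, 50);
    if ($count{$key} == 1) {
        print $clean $_;        # no possible duplicate
    } else {
        push @maybe_dupes, $_;  # held back for the sort step
    }
}

# Sort the (hopefully much smaller) candidate list by key, then keep
# only the last record in each run of equal keys.
@maybe_dupes = sort { substr($a, 27, 50) cmp substr($b, 27, 50) } @maybe_dupes;
for my $i (0 .. $#maybe_dupes) {
    if ($i < $#maybe_dupes
        and substr($maybe_dupes[$i + 1], 27, 50) eq substr($maybe_dupes[$i], 27, 50)) {
        print $dupes $maybe_dupes[$i];   # a later occurrence exists
    } else {
        print $clean $maybe_dupes[$i];   # last occurrence of this key
    }
}
[/tt]

Note that %count still holds one entry per distinct key, so the first pass isn't free either; it just avoids keeping whole records in memory.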
 