simple, but UGLY
(OP)
I have a huge (500MB and growing) text file with fixed-length fields and no delimiters.

I need to determine whether characters 28-77 are ever repeated in a subsequent row/record, and if so, spit those records out to another file, leaving only the record containing the first occurrence of that 28-77 string.

It seems like a very tedious process to grab the string and search the entire file for another record containing it, but maybe this is the only way?

Basically, this is data cleanup prior to use in a data warehouse application.

I had been told I could do all kinds of magical tricks with SQL temp files and such, but all of it has left a bad taste. I was hoping someone over here who works with text files could point me to a cleaner and better solution.

Thanks.

RE: simple, but UGLY

Here's a start.


#!/usr/bin/perl

#
# To process the text file as it is written.
# This uses the UNIX program "tail" to read the log file and
# creates two new files. One is a "cleaned" version of the log file
# containing only "unique" data (unique as specified in *this*
# case anyway) and another of the duplicate lines detected.
#
# Stop the script with ^C
#
# Read the log file as it stands and then each new line as it is written
# If the chars 28->77 of the current line match chars 28->77 of the previous line
#     print the line to the DUPES file
# else
#     print the line to the cleaned up file
# fi
#
# This script relies on two UNIX utilities: tail and wc
#
#  tail -- used to read the log file and (-f)ollow new lines
#          as they are added
#  wc   -- used to count how many lines are in the logfile so
#          far, so that the whole file is processed.
#
# BUGS: Finding these is left as an exercise for the reader
#        But one is certainly that if the log file is written to between
#        counting the lines in it and starting to process them then
#        some lines at the beginning of the file will not be processed.
#        Start the script when the log file is quiet, if possible.
#        Another is that I haven't actually tested it, so there might be
#        one or two typos......
#

use strict;

my $LogFile = "logfile.txt";
my $CleanedFile = "cleaned.txt";
my $Dupes = "dupes.txt";

open(CF,">$CleanedFile")        || die "Can't create $CleanedFile\n$!";
open(DUPES,">$Dupes")            || die "Can't create $Dupes\n$!";

my $prev_line = '';

# how many lines in the logfile so far then?
my ($lines) = `wc -l $LogFile` =~ /(\d+)/;   # wc prints "<count> <filename>", keep just the count

open(TAIL,"tail -$lines -f $LogFile|")    || die "Can't run tail -f on $LogFile\n$!";
while(<TAIL>){
    # first line: nothing to compare it to yet, so keep it and move on
    if($prev_line eq ''){ print CF $_; $prev_line = $_; next; }
    # characters 28-77 (counting from 1) = offset 27, length 50
    if(substr($_,27,50) eq substr($prev_line,27,50)){
        # a match, so write the dupes file
        print DUPES $_;
    } else {
        # no match, so write the cleaned up version of the log file
        print CF $_;
    }
    # save the current line to compare with the next one.
    $prev_line = $_;
}

Mike
michael.j.lacey@ntlworld.com
Email welcome if you're in a hurry or something -- but post in tek-tips as well please, and I will post my reply here as well.

RE: simple, but UGLY

(OP)
Thanks Mike. Unfortunately, I am using a WIN32 machine, so I wouldn't have access to the tail program.

RE: simple, but UGLY

oh, right then ..... <wince>

Mike
michael.j.lacey@ntlworld.com
Email welcome if you're in a hurry or something -- but post in tek-tips as well please, and I will post my reply here as well.

RE: simple, but UGLY

Ok -- well the idea was that you would be able to leave that script running and it would automatically keep on creating the two files; never mind I guess....

The approach is still applicable, but you would have to run it against the log file every so often, like this:

Move (mv) the log file to another name, "newfile.txt" for instance, and let the application carry on writing to the old filename.

Run the script below on newfile.txt


use strict;

my $LogFile = "newfile.txt";    # the renamed copy of the log file
my $CleanedFile = "cleaned.txt";
my $Dupes = "dupes.txt";

open(CF,">$CleanedFile") ||
    die "Can't create $CleanedFile\n$!";
open(DUPES,">$Dupes") ||
    die "Can't create $Dupes\n$!";

my $prev_line = '';

open(F,$LogFile)
    || die "Can't open $LogFile\n$!";
while(<F>){
    # first line: nothing to compare it to yet, so keep it and move on
    if($prev_line eq ''){ print CF $_; $prev_line = $_; next; }
    # characters 28-77 (counting from 1) = offset 27, length 50
    if(substr($_,27,50) eq substr($prev_line,27,50)){
        # a match, so write the dupes file
        print DUPES $_;
    } else {
        # no match, so write the cleaned up version of the log file
        print CF $_;
    }
    # save the current line to compare with the next one.
    $prev_line = $_;
}

Mike
michael.j.lacey@ntlworld.com
Email welcome if you're in a hurry or something -- but post in tek-tips as well please, and I will post my reply here as well.

RE: simple, but UGLY

The code samples above will only find duplicates when the duplicates are in adjacent records - one following immediately after the next.  They also keep the first occurrence and write later occurrences to the duplicates file.

I interpret the original question differently...  I think that the duplicates could occur anywhere in the file, and that if a duplicate is found, the last occurrence is the one you want to keep.  The whole record is not duplicated, just the 50 bytes in the range 28-77 are duplicated so which record to keep is important.

I can envision a solution that involves using bytes 28-77 as the key to a hash that stores the entire record.  As each record is read the hash is checked for a previous occurrence.  If found, the previous occurrence is written to the duplicates file and then the hash value is replaced by the later occurrence.  After the final record is processed the hash values containing the last entries for each key are written out.  This code would not preserve record order - if that is important the solution would have to be modified.

With a 500MB file memory usage would also be an issue.  This solution would require enough memory to hold one record for each unique occurrence of the 28-77 bytes + 50 bytes for the key + other hash overhead.  Unless you have a large percentage of duplicates this could be a problem.
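A rough sketch of that hash approach in Perl (untested; the file names are placeholders, and columns 28-77 are assumed to be counted from 1, i.e. substr offset 27, length 50):


#!/usr/bin/perl
use strict;
use warnings;

open(IN,    "logfile.txt")  || die "Can't open logfile.txt\n$!";
open(DUPES, ">dupes.txt")   || die "Can't create dupes.txt\n$!";
open(CLEAN, ">cleaned.txt") || die "Can't create cleaned.txt\n$!";

my %records;                            # key (chars 28-77) => last record seen with that key
while (<IN>) {
    my $key = substr($_, 27, 50);
    if (exists $records{$key}) {
        # the earlier occurrence goes to the dupes file,
        # the later record replaces it in the hash
        print DUPES $records{$key};
    }
    $records{$key} = $_;
}

# write out the surviving (last) occurrence of each key.
# NOTE: record order is not preserved here.
print CLEAN $_ for values %records;

close(IN); close(DUPES); close(CLEAN);


As noted above, record order is lost because the survivors come out in hash order; keeping a separate array of keys in input order would fix that, at the cost of more memory.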

RE: simple, but UGLY

Sackyhack,

Yes, good points there; comments hughed00?

The memory usage problem could possibly be gotten around by using NDBM files (they look the same as hashes from the Perl code, but save things to disk).
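Something along these lines, perhaps (just a sketch, untested; it keeps the first occurrence of each key rather than the last, and on Win32 the core SDBM_File module can be tied the same way in place of NDBM_File):


#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use NDBM_File;

# the tied hash lives on disk, so the 50-byte keys don't all
# have to fit in memory. File names are placeholders.
tie my %seen, 'NDBM_File', 'seen_keys', O_RDWR|O_CREAT, 0644
    or die "Can't tie NDBM file seen_keys\n$!";

open(IN,    "logfile.txt")  || die "Can't open logfile.txt\n$!";
open(DUPES, ">dupes.txt")   || die "Can't create dupes.txt\n$!";
open(CLEAN, ">cleaned.txt") || die "Can't create cleaned.txt\n$!";

while (<IN>) {
    my $key = substr($_, 27, 50);       # columns 28-77, counting from 1
    if ($seen{$key}++) {                # seen before -> duplicate
        print DUPES $_;
    } else {                            # first time -> keep it
        print CLEAN $_;
    }
}

untie %seen;
close(IN); close(DUPES); close(CLEAN);


Keeping the last occurrence instead would mean storing the whole record as the tied value and dumping the file at the end, along the lines of sackyhack's outline.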

Mike
michael.j.lacey@ntlworld.com
Email welcome if you're in a hurry or something -- but post in tek-tips as well please, and I will post my reply here as well.

RE: simple, but UGLY

(OP)
Exactly!! I was going to get back to you all on this today, but we had a power meltdown and I have been futzing about with UPSs all morning.

sackyhack's reading of my question was right on as duplicates can occur anywhere in the source file.

I didn't even get a chance to look at Mike's code until after lunch last Friday (after beers no less).

Thanks all once again. I am trying hard to learn Perl by working my way through several books a couple hours a day, but other things keep getting in the way. Your guidance is more appreciated than you know.

-- Dave

RE: simple, but UGLY

Well, I have an idea that won't take up as much memory as the hash idea, but will take MUCH longer.
Basically, it's like this: read in a line, then go through the file checking for duplicates, deleting them, or printing them to the new file if they aren't duplicates. The biggest problem with this method is that after you're done checking the whole file against the first line, and printing out those that didn't match, you have to move on to the next line and check the whole file (minus what has already been checked) for duplicates. This means you'd have to copy the file once for every line. You may be able to get some speed boost by initially reading in lots of lines (a couple hundred), comparing that buffer against the whole file, and then moving on to the next bunch of lines. I started to play with some code, but realized it would be more than just UGLY. I did finish a subroutine to compare a line buffer to itself, which I include below.
However, as a note, I have to say you would probably have an easier time of this if the file were in some sort of real database. s/probably/definitely/ There's got to be a faster algorithm, maybe even based on my general idea, which is already built into the database handler of a decently sophisticated database. (Well, I don't actually know this, but I do know that just doing this would be a lot simpler with SQL...)



sub compare_buffer_to_self
{
  # expects a reference to an array of lines; OUTFILE must already be open
  my @buffer = @{ $_[0] };

  for (my $i = 0; $i < @buffer; $i++)
  {
    # columns 28-77 (counting from 1) = offset 27, length 50
    my $str = quotemeta substr($buffer[$i], 27, 50);
    my $counter = 0;
    # look for the same key in every OTHER line of the buffer
    foreach (@buffer[0..($i-1), ($i+1)..$#buffer])
    {
      if ($_ =~ m/^.{27}$str/) {$counter++; last;}
    }

    if (!$counter)
    {
      # no other line shares this key -- keep it
      print OUTFILE $buffer[$i];
    }
    else
    {
      # a later line shares this key -- drop this one, then back up
      # so we re-check the line that just slid into slot $i
      splice(@buffer, $i, 1);
      $i--;
    }
  }

  return wantarray ? @buffer : \@buffer;
}

Note: I tested this a little, and it does work in simple cases...

"If you think you're too small to make a difference, try spending a night in a closed tent with a mosquito."

RE: simple, but UGLY

(OP)
Full circle I guess. My first inclination was to do this in SQL with a DISTINCT command.

RE: simple, but UGLY

Hughe -- have a look at NDBM files; they could be just what you need here.

Mike
michael.j.lacey@ntlworld.com
Email welcome if you're in a hurry or something -- but post in tek-tips as well please, and I will post my reply here as well.

RE: simple, but UGLY

This is a SORT problem. You could use a hash function on the first pass to determine which records have no "possible" duplicates, and then on the second pass create a file of only those that could possibly be duplicates. Sort that output file (in core?) and do a "standard" sequential read, checking each key against the previous one.
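A rough sketch of the two passes, reading the "hash function" loosely as a Perl hash of key counts (untested; file names are placeholders, and columns 28-77 are assumed to be counted from 1):


#!/usr/bin/perl
use strict;
use warnings;

# Pass 1: count how many times each key (chars 28-77) appears.
# Only the keys are held in memory, not whole records.
my %count;
open(IN, "logfile.txt") || die "Can't open logfile.txt\n$!";
while (<IN>) {
    $count{ substr($_, 27, 50) }++;
}
close(IN);

# Pass 2: records whose key appears once go straight to the cleaned file;
# the rest go to a "possible duplicates" file to be sorted and resolved
# with a sequential read.
open(IN,    "logfile.txt")      || die "Can't reopen logfile.txt\n$!";
open(CLEAN, ">cleaned.txt")     || die "Can't create cleaned.txt\n$!";
open(MAYBE, ">maybe_dupes.txt") || die "Can't create maybe_dupes.txt\n$!";
while (<IN>) {
    if ($count{ substr($_, 27, 50) } == 1) {
        print CLEAN $_;         # key occurs once -- definitely unique
    } else {
        print MAYBE $_;         # key repeats -- sort these on the key later
    }
}
close(IN); close(CLEAN); close(MAYBE);


The sort then only has to handle the records whose keys actually repeat, which should be a much smaller file than the original.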

WWW.VBCompare.Com
