comparing two files

naiveuser · Aug 9, 2001

I have a task I need to do daily on a RedHat Linux box, which is:

Compare two files which are in the format:

Code:

fielda,fieldb,fieldc,fieldd,fielde
fielda,fieldb,fieldc,fieldd,fielde
fielda,fieldb,fieldc,fieldd,fielde

etc.

Each row is a record. You can't get duplicate records in either file, but you can have records that match across the 2 files. fielda is the primary field, i.e. uniquely identifies each record. However, with each day, some details may change. Specifically:

1. Fields b-e for a given fielda may change.
2. There may be a record with a fielda that has not yet been encountered.

e.g.

File1:

one,apple,tony,cat
two,orange,sarah,dog
four,apple,jack,dog

File2:

one,apple,tony,dog
two,orange,sarah,dog
three,lemon,tom,horse

In File2 the first record changes, the second record is the same, the third record is new.

In reality, File1 will be a 'master' file, and File2 will be a daily log which I'll compare to File1. If there are identical records, I'll ignore them. If there are new records in File2, I'll add them to File1. If a record in File1 and a record in File2 have the same fielda but differ in any other field, I want both records removed from their files and written to File3.temp, which I will (for now) deal with manually.

So to continue the example, now:

File1:

two,orange,sarah,dog
four,apple,jack,dog
three,lemon,tom,horse

File3.temp:

one,apple,tony,dog
one,apple,tony,cat

I hope this is clear. Thanks for your help!!!!

Mark.

sackyhack · Aug 10, 2001

Code:

# Assumptions:
# Each 'fielda' value appears at most ONCE in the master
# file and ONCE in the log file.  You may want to add
# checks to this program to ensure that these assumptions
# are true.
#
# The ordering of the master file will be mixed up
# by the hash.  This is assumed to be unimportant.
# 
# The master file could be lost if the program is stopped
# before completion.  Maybe should write to a temp file
# instead.

my $masterfile = "master.txt";
my $logfile = "log.txt";
my $tempfile = "temp.txt";

# Copy input files to arrays (one record per line)
open (MASTER, "<$masterfile") || die "can't open $masterfile\n";
my @master_recs = <MASTER>;
close MASTER;

open (LOG, "<$logfile") || die "can't open $logfile\n";
my @log_recs = <LOG>;
close LOG;

# Make hash of first fields of @master_recs to avoid
# splitting these records once per log record.  Assumes
# log is big enough for this to make a difference
my %master;
foreach my $rec (@master_recs) {
   my @fields = split(/,/, $rec);
   my $fielda = $fields[0];
   $master{$fielda} = $rec;
}

open (TEMP, ">$tempfile") || die "can't open $tempfile";

# Check each LOG record against MASTER
foreach my $log (@log_recs) {
   my @fields = split(/,/, $log);
   my $fielda = $fields[0];
   # is there a matching master record (fielda)?
   if (defined ($master{$fielda})) {
      # is there a difference in the entire record ?
      if ($master{$fielda} ne $log) {
         print TEMP $log;
         print TEMP $master{$fielda};
         # delete removes the key itself; undef would leave an
         # empty entry behind for the final write loop
         delete $master{$fielda};
      }
      # if no difference, ignore log record
   }
   # no matching master record
   else {
      # append new log record to master
      $master{$fielda} = $log;
   }
}

close TEMP;

open (MASTER, ">$masterfile") || die "can't open $masterfile for writing\n";
foreach my $k (keys %master) {
   print MASTER $master{$k};
}
close MASTER;
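
The comments at the top of the script mention two possible follow-ups: checking that each fielda really is unique, and writing to a temp file so the master is not lost if the program stops partway. Here is a rough sketch of both, not a tested replacement for the script above. The helper names `load_records` and `write_records` and the `.new` suffix are made up for illustration. It assumes the temp file sits in the same directory as the master, so that `rename` stays on one filesystem.

```perl
use strict;
use warnings;

# Hypothetical helper: read a record file into a hash keyed on fielda,
# dying if the same fielda appears more than once.
sub load_records {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "can't open $path: $!\n";
    my %recs;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';                 # skip blank lines
        my ($fielda) = split /,/, $line, 2;  # only the key is needed here
        die "duplicate key '$fielda' in $path (line $.)\n"
            if exists $recs{$fielda};
        $recs{$fielda} = "$line\n";          # every record ends in a newline
    }
    close $fh;
    return %recs;
}

# Hypothetical helper: write the records to "$path.new" first, then
# rename it over $path, so a crash mid-write leaves the old file intact.
sub write_records {
    my ($path, %recs) = @_;
    my $tmp = "$path.new";                   # assumed naming convention
    open(my $out, '>', $tmp) or die "can't open $tmp: $!\n";
    print {$out} $recs{$_} for sort keys %recs;
    close $out or die "can't close $tmp: $!\n";
    rename $tmp, $path or die "can't rename $tmp to $path: $!\n";
}
```

With these, the slurp-and-hash section could become `my %master = load_records($masterfile);` and the final loop `write_records($masterfile, %master);`. Sorting the keys is a choice made here to keep the output stable between runs; the original script does not care about order.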
