Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations TouchToneTommy on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

File Comparison Algorithm 1

Status
Not open for further replies.

dirtyholmes

Programmer
Mar 17, 2004
43
AU
Hi Guys,

I am trying to come up with a suitable file comparison algorithm, and hoped someone here would have done something similar in the past. I am trying to compare 2 files to get an idea of how much they match each other. The files consist of a number of columns of numbers and text ( tel numbers , account numbers etc ) , each one representing a record.

However the records in each file may or may not be in the same order, so matching line by line is out of the question. Some files may also have additional lines or records and some records may 99% match a record in the other file ( eg 4 of the 5 columns may be identical but the 5th column may mismatch the equivalent record in the other file.

Can anyone provide a watertight algorithm to quantify how much these files match.
Data example

193 03989337 000060003+00060009.69003858
194 03989337 000060003+00060009.69003858
195 989337 000060003+00060009.69003858

 
Depends what you want to achieve. Do you want to select the matching records, the non-matching records, or just get some statistics on how well they match? (NB. on the subject of statistics, 4 out of 5 fields matching is 80% :))

In any case, you probably need to pre-process the files to normalise them first to make the later comparisons simpler. Then sort them both in the same order and compare them, doing whatever you need to do with matches, mismatches etc. This works with any size of file.

An alternative if the files aren't very big would be to read the first into a hash of arrays, and then work through the second file and look the records up in the hash.
 
surely this is very difficult to achieve - due to the varying column widths

for example, this section hardly matches at all...

Code:
194    03989337  000060003+00060009.69003858
       ||||||
195    98[red]9[/red]337    000060003+00060009.69003858

but this matches 6 digits

Code:
194    03989337  000060003+00060009.69003858
         ||||||
195      [red]989337[/red]  000060003+00060009.69003858


Kind Regards
Duncan
 
i've quickly put this together...

Code:
$line1 = "193    03989337  000060003+00060009.69003858";
$line2 = "195    989337    000060003+00060009.69003858";

@line1split = split (//, $line1);
@line2split = split (//, $line2);

$line1Length = @line1split;

print "number of characters: $line1Length\n\n";

$valuePerChar = 100 / $line1Length;

print "score per character match: $valuePerChar %\n\n";

for ($x=0; $x<=$#line1split; $x++) {

  print "$line1split[$x] | ";
  print "$line2split[$x] ... ";
  
  if ($line1split[$x] eq $line2split[$x]) {
    $matched = $matched + $valuePerChar;
  }
  
  print "$matched %\n"

}

it displays output like this...

Code:
number of characters: 44

score per character match: 2.27272727272727 %

1 | 1 ... 2.27272727272727 %
9 | 9 ... 4.54545454545455 %
3 | 5 ... 4.54545454545455 %
  |   ... 6.81818181818182 %
  |   ... 9.09090909090909 %
  |   ... 11.3636363636364 %
  |   ... 13.6363636363636 %
0 | 9 ... 13.6363636363636 %
3 | 8 ... 13.6363636363636 %
9 | 9 ... 15.9090909090909 %
8 | 3 ... 15.9090909090909 %
9 | 3 ... 15.9090909090909 %
3 | 7 ... 15.9090909090909 %
3 |   ... 15.9090909090909 %
7 |   ... 15.9090909090909 %
  |   ... 18.1818181818182 %
  |   ... 20.4545454545455 %
0 | 0 ... 22.7272727272727 %
0 | 0 ... 25 %
0 | 0 ... 27.2727272727273 %
0 | 0 ... 29.5454545454546 %
6 | 6 ... 31.8181818181818 %
0 | 0 ... 34.0909090909091 %
0 | 0 ... 36.3636363636364 %
0 | 0 ... 38.6363636363636 %
3 | 3 ... 40.9090909090909 %
+ | + ... 43.1818181818182 %
0 | 0 ... 45.4545454545455 %
0 | 0 ... 47.7272727272727 %
0 | 0 ... 50 %
6 | 6 ... 52.2727272727273 %
0 | 0 ... 54.5454545454546 %
0 | 0 ... 56.8181818181818 %
0 | 0 ... 59.0909090909091 %
9 | 9 ... 61.3636363636364 %
. | . ... 63.6363636363636 %
6 | 6 ... 65.9090909090909 %
9 | 9 ... 68.1818181818182 %
0 | 0 ... 70.4545454545455 %
0 | 0 ... 72.7272727272727 %
3 | 3 ... 75 %
8 | 8 ... 77.2727272727273 %
5 | 5 ... 79.5454545454545 %
8 | 8 ... 81.8181818181818 %


Kind Regards
Duncan
 
Thanks all for the tips. In response to some of your queries, basically i want to quantify the match in terms of saying

File 1 - Number of lines ( records ) 546
FILE 2 - Number of Lines ( records ) 565

Identical matches ( i.e lines in both files are same ) 450

Close Matches ( 4 of the 5 fields match ) 67

Lines in File 1 with no match 96
Lines in File 2 with no match 87


With regards to varying size fields Dunc in the example data i provided

194 03989337 000060003+00060009.69003858

195 989337 000060003+00060009.69003858

This is an example output from one file. So in the file i am trying to compare this with their should hopefully be an identical line in that file.

My thoughts on the algorithm would be:

1: Read each line of file 1 into an array
2: Read each line of file 2 into an array.
3. For each record in file 1 array, search for in file 2 array. If identical match found remove items from both arrays and write them to a file.

Following this i will have a file with all matching records, and 2 arrays left for file 1 and file 2 with either mismatching records or very closely matched records, eg 2 fields of a record may be joined in one file ( no whitespace ) and equivalent record exists in other file with whitespace seperator.

i.e
File 1
194 03989337 000060003+00060009.69003858

File2
194 03989337000060003+00060009.69003858

Therefore these 2 records are matching and still remain in
my array after the first sweep through, as they are not identical.


However the challenge now is to look at the two remaining arrays and sort them out into something meaningful.

If anyone has any comments or suggestions i would be greatful.

Thanks for your help.

DirtyHolmes
 
The following may help you...
I am not very familiar with perl, it is my first script of more than 15 lines. All remarks are welcome.
Code:
#!/usr/bin/perl -w
use strict;
use English;

#
# Array where input files are stored
#

my (@File1, @File2);

#
# SortFileToArray ( $file, $ref2array) - Store file in an array and sort array
# $file = Input file name
# $ref2array = Reference to array in which file is stored
#
# Input fields are in the form "f1 f2 f3+f4.f5" xhere fn is firld n.
# Records are stored in $SUBSEP delimited form (fields are delimited by $SUBSEP)
# The number of input records is displayed.
#

sub SortFileToArray ($$) {
   my ($file, $ref2array) = @_;
 
   open(F2A, $file) or die "Unable toopen file $file\n$!";
   while (<F2A>) {
      chomp; 
      next if (/^\s*$/);
      push @$ref2array,
           join($SUBSEP, /^([^ ]+)\s+([^ ]+)\s+([^+]+)\+([^.]+)\.([^ ]+)$/);
   }
   close(F2A);

   @$ref2array = sort @$ref2array;
   print "File: $file - Number of lines ( records ) ".(@$ref2array+0)."\n" ;
}


#
# OutRecord ($record) - Convert $SUBSEP delimited record for display
# $record = $SUBSEP delimited 
#
# Return record in the form "f1 f2 f3+f4.f5" xhere fn is field n.
#

sub OutRecord($) {
   my ($record) = @_;
   my @fields = split($SUBSEP,$record);
   return($fields[0]." ".$fields[1]." ".$fields[2]."+".
          $fields[3].".".$fields[4]."\n");
}


#
# IdenticalMatches($file) - Creates file with identical records
# $file = Resulting file.
#
# Identicals records from @File1 and @File2 are removed and stored to $file
# The identical record count is displayed.
#

sub IdenticalMatches ($) {
   my ($file) = @_;
   my ($count, $indx1, $indx2) = (0, 0, 0);
   my ($val1, $val2) = ($File1[0], $File2[0]);

   open(OUT, '>'.$file) or die "Unable to create file $file\n$!";
   while ($indx1 <= $#File1 and $indx2 <= $#File2) {
       
       if ($val1 eq $val2) {
          $count++;
          $File1[$indx1] = "";
          $File2[$indx2] = "";
          print OUT OutRecord($val1); 
          $val1 = $File1[++$indx1];
          $val2 = $File2[++$indx2];
       } elsif ($val1 lt $val2) {
          $val1 = $File1[++$indx1];
       } else {
          $val2 = $File2[++$indx2];
       }

   }
   close(OUT);

   print "\nIdentical matches ( i.e lines in both files are same ) $count\n";
} 

#
# BuildCloseArray ($ref2full, $ref2close, $ignored) - Build array of records
# from testing close matches
# $ref2full = Reference to array containing records
# $ref2close = reference to array to fill
# $ignored = Field to ignore
#
# The @$ref2close is filled with records of @$ref2full whith field $ignored 
# not copied.
# 

sub BuildCloseArray($$$) {
  my ($ref2full, $ref2close, $ignored) = @_;
  my ($indx, $elemf, @fields, $elemc);
 
  @$ref2close = ();
  $ignored -= 1;
  foreach $elemf (@$ref2full) {
     next if $elemf eq '';
     @fields = split($SUBSEP, $elemf);
     $elemc = '';
     for ($indx=0; $indx <= $#fields; $indx++) {
        next if $indx == $ignored;
        $elemc .= ($elemc eq '' ? '' : $SUBSEP) . $fields[$indx];
     }
     push(@$ref2close, $elemc);
  }
  @$ref2close = sort @$ref2close;

}

#
# SearchCloseMatches ($ignored) - Search for close matching records (ignore 
# one field).
# $ignore = Field to ignore when comparing records
#
# The records of the arrays @File1 and @File2 are compared for matching, 
# the field $ignored not considered.
# 
# return the close matching count.
#

sub SearchCloseMatches($) {
   my ($ignored) = @_;
   my (@close1, @close2);

   BuildCloseArray(\@File1, \@close1, $ignored);
   BuildCloseArray(\@File2, \@close2, $ignored);

   return 0 if $#close1 == 0;
   return 0 if $#close2 == 0;

   my ($count, $indx1, $indx2) = (0, 0, 0);
   my ($val1, $val2) = ($close1[0], $close2[0]);

   while ($indx1 <= $#close1 and $indx2 <= $#close2) {
       
       if ($val1 eq $val2) {
          $count++;
          $val1 = $close1[++$indx1];
          $val2 = $close2[++$indx2];
       } elsif ($val1 lt $val2) {
          $val1 = $close1[++$indx1];
       } else {
          $val2 = $close2[++$indx2];
       }

   }

   return $count;
}

#
# NoMatch ($file, $ref2array, $matches) - Compute non matching record count 
# for a file
# $File = File name (only used for display)
# $ref2array = Reference to array containing non natching records for $file
# $matches = Close matching count
#
# Non matching records in @$ref2array are non empty elements.
# Close matching count is substracted.
# The number of non matching records is displayed.
#

sub NoMatch($$$) {
   my ($file, $ref2array, $matches) = @_;
   my $nomatch = -$matches;

   foreach my $elem (@$ref2array) {
      $nomatch++ unless $elem eq '';
   }

   print "Lines in file $file with no match $nomatch\n";
}  

#
# CloseMatches ($file1, $file2) - Count close and non matching records
# $file1 = Input file 1 name
# $file1 = Input file 2 name
#
# Close matching count is the sum of matching records with one field ignored.
# Close and non matching counts are displayed
# 


sub CloseMatches($$) {
   my ($file1, $file2) = @_;
   my $count = 0;

   $count += SearchCloseMatches(1);
   $count += SearchCloseMatches(2);
   $count += SearchCloseMatches(3);
   $count += SearchCloseMatches(4);
   $count += SearchCloseMatches(5);

   print "\nClose Matches ( 4 of the 5 fields match ) $count\n\n";

   NoMatch($file1, \@File1, $count);
   NoMatch($file2, \@File2, $count);

}

#
# Main . . .
#

SortFileToArray('file1.txt', \@File1);
SortFileToArray('file2.txt', \@File2);

IdenticalMatches('file12.txt');
CloseMatches('file1.txt', 'file2.txt');

Jean Pierre.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top