Compare 2 files for matching lines, output non matches 2

CoryMa2c · May 31, 2005

Hello,
I am running into a little issue. What I am trying to accomplish is to compare two files and find which lines don't match. The problem is that the lines may not be in the same order. And the files may be too big for an array (not sure).

Basically:

## Some command outputs FileListing.txt
## listing files and directories, one per line

## Some other command outputs BackedUpFiles.txt
## listing files and directories, one per line

## I need to compare the two files and have the output
## be files in FileListing.txt and not in BackedUpFiles.txt

## I tried using File::Compare, but there didn't seem to
## be a way to output the differences.
## I tried using Text::Diff, but it compares the two text
## files line by line, in order. I need to check each line
## against the whole of the other file.

## I just saw a grep post; that might be the ticket.
## However, 1 question, file size is around 33 MB / file.
## This translates to roughly 550,000 lines per file.
## is this going to be a problem for the array?

# open(F,"yourfile");
# @list=<F>;close F;
# $this="String I want";
# @f=grep /$this/,@list;
#

open (F, "BackedUpFiles.txt");
@compare = <F>; close F;

open (F, "FileListing.txt");
@against = <F>; close F;

foreach (@against) {
@missing = grep -v /$_/,@compare;
}
print "These files did not get backed up: \n";
print "@missing \n";

## This looks like it might work. But can I pass -v to grep?
## Is there another way to pass the -v option?
##
## Will this work? Is there a better way to do this?

## Thanks to all who answer.

## Cory M.
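
On the `-v` question: Perl's built-in `grep` takes no command-line style flags, so the usual way to get the effect of `grep -v` is to negate the test inside the block. Below is a minimal sketch with made-up sample data; the `lines_not_matching` name is hypothetical. The `\Q...\E` is there because file paths contain regex metacharacters such as backslashes and dots.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that do NOT contain $pattern,
# which is the effect of "grep -v" on the command line.
sub lines_not_matching {
    my ($pattern, @lines) = @_;
    # \Q...\E quotes regex metacharacters such as "\" and "." in paths
    return grep { !/\Q$pattern\E/ } @lines;
}

# Made-up sample data for illustration only
my @compare = ('c:\\data\\a.txt', 'c:\\data\\b.txt', 'c:\\data\\c.doc');
my @kept    = lines_not_matching('.txt', @compare);
print "$_\n" for @kept;    # only the .doc entry survives
```

Calling this once per line of the other file would mean roughly 550,000 passes over a 550,000-element array, so it suits small lists better than files of the size described here.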

TrojanWarBlade · May 31, 2005

I guess that your best bet then is to add a line number to the start of each one and sort them (ignoring the line number!).
Then you should be able to compare them a little more easily and when you report the different lines, you'll have line numbers to report with them.

You don't say what OS you're using. That makes a big difference for an operation like this. Unix and Linux have tools that will do most of this work for you.

Trojan.

CoryMa2c · May 31, 2005

Thanks for the response. Yes, the OS is my big issue; I can think about how to do this in Linux or Solaris in a heartbeat, however, this is Windows 2000.

So, I guess I didn't state it, but I would have to use the grep module:

use File::Grep;

So, yes, if I can sort both text files so they list the files and folders consistently, then it may be easier. However, if one file/line is missing, then every line below it would be offset and reported as out of place by the Text::Diff module.

Thanks,
- C

TrojanWarBlade · May 31, 2005

Absolutely right.

Firstly, is this a one-off test?
If so, you might like to consider using a LiveCD version of Linux to do this (Knoppix for example).
If not, is it possible to install Cygwin? This might give you enough Unix-like tools to make life easy.
This kinda matching is not gonna be fun or easy if you have to hand code the whole thing yourself.

Trojan.

KevinADC · May 31, 2005

Maybe something like this will work. I am not sure it fits your situation, but it should be faster than what you are doing, as it only goes through each file once:

Code:

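my %compare = ();
# Load every backed-up path into a hash for fast lookups
open (F1, "BackedUpFiles.txt") or die "Cannot open BackedUpFiles.txt: $!";
while(<F1>){
   chomp;
   $compare{$_}++;
}
close(F1);
# Remove every path that also appears in the file listing
open (F2,  "FileListing.txt") or die "Cannot open FileListing.txt: $!";
while(<F2>){
   chomp;
   delete $compare{$_};
}
close(F2);
# Whatever is left exists only in BackedUpFiles.txt
foreach my $key (keys %compare) {
   print "$key\n";
}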

This should print the lines in F1 that are not in F2, regardless of order. If that is not what you want, switch the files.

rharsh · May 31, 2005

I haven't used it but take a look at the Algorithm::Diff module. It looks like it might work a little more like the *nix diff.

KevinADC · May 31, 2005

This is just a variation on the same thing, but it might be a little faster. It prints the lines of FileListing.txt that are not in BackedUpFiles.txt:

Code:

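my %compare = ();
# Load every backed-up path into a hash for fast lookups
open (F1, "BackedUpFiles.txt") or die "Cannot open BackedUpFiles.txt: $!";
while(<F1>){
   chomp;
   $compare{$_}++;
}
close(F1);
# Print each listed path that was never seen in the backup list
open (F2,  "FileListing.txt") or die "Cannot open FileListing.txt: $!";
while(<F2>){
   chomp;
   print "$_\n" unless (exists($compare{$_}));
}
close(F2);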

ishnid · Jun 1, 2005

Using Kevin's solutions, you'll have at most one copy of one file in memory at any one time (stored in your %compare hash). Particularly if it's a one-off task, 33 MB of memory shouldn't be a problem for a modern machine. If it's going to be run in parallel by a number of processes, you might need to consider other alternatives. I like TrojanWarBlade's suggestion of pre-sorting the files.

stevexff · Jun 1, 2005

If you can do it in memory using Kevin's solution, it will be the fastest way.

The alternative, pre-sorting the files and then reading records from each and comparing them, is way more complex. Particularly the bits where you have to read from one or the other depending on the results of the comparison, deal with premature EOF on one or the other, etc. etc. It may be the only solution when you have really large files, but don't try it unless you have to.
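
To show where that complexity comes from, here is a minimal sketch of the pre-sorted approach. It assumes both inputs are already sorted in plain string order (the same order `lt` and `gt` use); the `only_in_first` name is hypothetical, and in-memory filehandles with made-up data stand in for the real files.

```perl
use strict;
use warnings;

# Hypothetical helper: given two filehandles whose lines are sorted in
# the same string order, return the lines found only in the first.
sub only_in_first {
    my ($fh_a, $fh_b) = @_;
    my @only;
    my $x = <$fh_a>;
    my $y = <$fh_b>;
    chomp $x if defined $x;
    chomp $y if defined $y;
    while (defined $x) {
        if (!defined $y || $x lt $y) {
            # Second file is exhausted or already past $x: $x is missing
            push @only, $x;
            $x = <$fh_a>;
            chomp $x if defined $x;
        }
        elsif ($x gt $y) {
            # Second file is behind: advance it
            $y = <$fh_b>;
            chomp $y if defined $y;
        }
        else {
            # Same line in both: advance the first file only, so that
            # repeated lines in the first file still find their match
            $x = <$fh_a>;
            chomp $x if defined $x;
        }
    }
    return @only;
}

# In-memory filehandles with made-up data stand in for the sorted files
my $listing = join '', map { "$_\n" } sort qw(a.txt b.txt c.txt d.txt);
my $backed  = join '', map { "$_\n" } sort qw(a.txt c.txt);
open my $fh_a, '<', \$listing or die "Cannot open in-memory listing: $!";
open my $fh_b, '<', \$backed  or die "Cannot open in-memory backup list: $!";
print "$_\n" for only_in_first($fh_a, $fh_b);    # b.txt and d.txt
```

Only one line from each file is held at a time, which is the whole attraction. The catch is that the sort step has to use the same ordering as the comparison: a case-insensitive external sort, for example, would need a matching comparison in place of `lt` and `gt`.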

CoryMa2c · Jun 3, 2005

KevinADC,
Thanks! That worked really well.

One last question on this part. In the second while loop I would like to save the output to another variable or even a file. However, every time I try, it either outputs only the last entry that didn't exist or nothing at all. I have been trying to change the print statement to use $missing or @missing, to no avail:
...
open (F2, "FileListing.txt") or die "$!";
while(<F2>){
chomp;
$missing = "$_\n" unless (exists($compare{$_}));
print "$missing \n";
}
close(F2);
print "$missing \n";
...
So, the issue is that the variable, $missing, holds each missing entry in turn while inside the while loop, but as soon as the while loop finishes, it only holds the last missing entry. hmm.. that is starting to make sense. I guess the real question is how do I append or add to the variable instead of writing over the top of it? I tried

$missing{$_}++;

and got an error.. Suggestions? maybe shift?
Thanks,
- C

KevinADC · Jun 3, 2005

to an array:

Code:

my @missing = ();
open (F2,  "FileListing.txt") or die "$!";
while(<F2>){
   chomp;
   push @missing,$_ unless (exists($compare{$_}));
#   print "$missing \n";
}
close(F2);
print "$_\n" for @missing;

you could print the array to a file:

Code:

open (FILE, ">yourfile.txt") or die "$!";   # ">" opens the file for writing
print FILE "$_\n" for @missing;
close (FILE);

CoryMa2c · Jun 6, 2005

Thanks Kevin!!! Perfect! I guess I need to look at "my" a little closer and get a better understanding of it. It seems that has been the solution to my prayers.

- C

CoryMa2c · Jun 22, 2005

Hey Kevin,
I am having issues printing to a file from @missing. Here is what I have, what am I missing?

Code:

$workingdir = "c:\\tmp";
open (FILE, "$workingdir\\MissingFiles.txt") or die "$!";
print FILE "$_\n" for @missing;
close (FILE);

Thanks,
- C
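
One likely cause, offered as a guess from the snippet alone: `open` with a bare file name opens the file for reading, so `print FILE` has nothing to write to. Below is a minimal sketch using the three-argument form with `>`. The `write_lines` helper name is hypothetical, the data is made up, and `File::Spec->tmpdir` stands in for `c:\tmp` so the example runs anywhere.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: write each element of @lines to $path, one per line.
sub write_lines {
    my ($path, @lines) = @_;
    # '>' opens for writing (creating or truncating); '>>' would append
    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} "$_\n" for @lines;
    close $out or die "Cannot close $path: $!";
    return scalar @lines;
}

# Made-up data; in the thread this would be the @missing array
my @missing = ('c:\\data\\b.txt', 'c:\\data\\d.txt');
my $file = File::Spec->catfile(File::Spec->tmpdir, 'MissingFiles.txt');
write_lines($file, @missing);
```

Putting the file name in the `die` message makes it easier to tell a wrong path from a permissions problem.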

duncdude · Jun 22, 2005

Code:

#!/usr/bin/perl

open (FILE1, "< file1.txt");
chomp (@array1 = <FILE1>);
close FILE1;

open (FILE2, "< file2.txt");
chomp (@array2 = <FILE2>);
close FILE2;

foreach $a1 (@array1) {
  foreach $a2 (@array2) {
    if ($a1 eq $a2) {
      $a1 = ();
      $a2 = ();
    }
  }
}

print "FILE 1\n------\n";
foreach $a1 (@array1) {
  print "$a1\n" if $a1;
}

print "\n";

print "FILE 2\n------\n";
foreach $a2 (@array2) {
  print "$a2\n" if $a2;
}

Kind Regards
Duncan

duncdude · Jun 22, 2005

file1.txt
-------
line number 1
line number 2
line number 3
line number 4
line number 5
line number 6
line number 8
line number 9
line number 10
line number 21
line number 21
line number 21
line number 22
line number 23
line number 24
line number 25
line number 26
line number 27
line number 28
line number 29
line number 30

file2.txt
-------
line number 1
line number 2
line number 3
line number 4
line number 5
line number 6
line number 8
line number 9
line number 10
line number 21
line number 21
line number 21
line number 22
line number 23
line number 24
line number 25
line number 26
line number 27
line number 28
line number 29
line number 30

Kind Regards
Duncan

duncdude · Jun 22, 2005

Code:

FILE 1
------
line number 21
line number 21

FILE 2
------
line number 7
line number 3

Kind Regards
Duncan

stevexff · Jun 22, 2005

A solution from left field, to keep memory occupancy down...

Read the backups file, use the Digest module to create a binary signature for each record. Save it as the key of a hash.

Read the second file, digest the records, and see if the digest exists in the hash. If not, print the record.
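
The two steps above might look like the following sketch. It assumes the core Digest::MD5 module as the digest (the post only says "the Digest module"), the helper names are hypothetical, and in-memory filehandles with made-up data stand in for the two files.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: index a file by the 16-byte binary MD5 of each line
sub digest_index {
    my ($fh) = @_;
    my %seen;
    while (my $line = <$fh>) {
        chomp $line;
        $seen{ md5($line) } = 1;
    }
    return \%seen;
}

# Hypothetical helper: return lines whose digest is absent from the index
sub missing_from_index {
    my ($fh, $seen) = @_;
    my @missing;
    while (my $line = <$fh>) {
        chomp $line;
        push @missing, $line unless exists $seen->{ md5($line) };
    }
    return @missing;
}

# In-memory filehandles with made-up data stand in for the two files
my $backed_up = "a.txt\nc.txt\n";
my $listing   = "a.txt\nb.txt\nc.txt\nd.txt\n";
open my $bfh, '<', \$backed_up or die "Cannot open backup list: $!";
open my $lfh, '<', \$listing   or die "Cannot open file listing: $!";
my $seen = digest_index($bfh);
print "$_\n" for missing_from_index($lfh, $seen);    # b.txt and d.txt
```

Each key is 16 bytes regardless of path length, so the saving depends on how long the lines are; at roughly 33 MB for 550,000 lines, the average is about 60 bytes per line. Two different lines could in principle share a digest, in which case a missing file would go unreported, though with MD5 that is very unlikely for data of this size.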

chazoid · Jun 23, 2005

another option for working with large files is Tie::File which lets you work with a file as an array without actually loading the entire file into memory. It's a bit slow, but if you have a script that's already set up to use arrays, it only requires a few changes to the existing script.
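
A minimal sketch of what that looks like with the core Tie::File module. The temporary file and its contents are made up so the example is self-contained; in the thread the tied file would be one of the two listings.

```perl
use strict;
use warnings;
use Fcntl 'O_RDONLY';
use Tie::File;
use File::Temp qw(tempfile);

# Made-up sample file so the sketch is self-contained
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "a.txt\nb.txt\nc.txt\n";
close $tmp_fh or die "Cannot close $tmp_name: $!";

# Tie the file to an array; records are fetched from disk on demand
tie my @listing, 'Tie::File', $tmp_name, mode => O_RDONLY
    or die "Cannot tie $tmp_name: $!";

my $count  = scalar @listing;   # counting the lines scans the file
my $second = $listing[1];       # line endings are already stripped
print "lines: $count\n";
print "second line: $second\n";

untie @listing;
```

This mainly helps where an existing script indexes into an array. A lookup hash built from the other file would still live in memory.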

duncdude · Jun 23, 2005

Could something like this be used? Will need help from the other programmers!

Code:

#!/usr/bin/perl

chdir "/Users/duncancarr/Desktop/compare";

$f1_lines = 21;

for ($x=1; $x<=$f1_lines; $x++) {
  chomp ($f1_line = `sed -n '${x}p' file1.txt`);  
  print `sed 's/^$f1_line/$1/g' file2.txt`;
}

Kind Regards
Duncan
