Tek-Tips Forums

comparing images

Status
Not open for further replies.

SCelia · Programmer · Feb 27, 2002 · 82 · CA
I have 6000 images, most of which I believe to be duplicates. I want to use Perl to copy all the unique images into a new directory. Some distinct images have the same file size, so I cannot sort by size alone. I wrote this program to do the job, but it finds only two unique images, and I know there are more. Where am I going wrong?

#!/usr/bin/perl

use strict;

## init vars
my $doc        = '';
my $undoc      = '';
my $newdoc     = '';
my $newundoc   = '';
my $forcounter = 0;
my $skip       = 0;

my $dir = 'D:\Earth\digits\\';

opendir (DIR, $dir) or die "cannot opendir $dir";
foreach my $file (readdir(DIR)) {
    unless ($file eq '.' or $file eq '..') {
        $forcounter = 0;
        $skip       = 0;
        process_file($file);
    }
}
closedir (DIR);

sub process_file {
    $_ = shift;
    open (IMAGE, "$dir$_") or die "Cannot open file $dir$_: $!";
    binmode IMAGE;
    read (IMAGE, $doc, 10000);
    close IMAGE;

    my $uniquedir = 'D:\Earth\unique\\';

    opendir (UNIQ, $uniquedir) or die "cannot opendir $uniquedir";
    foreach my $unfile (readdir(UNIQ)) {
        unless ($unfile eq '.' or $unfile eq '..') {
            $forcounter++;
            open (IMAGE, "$uniquedir$unfile") or die "Cannot open file $uniquedir$unfile: $!";
            binmode IMAGE;
            read (IMAGE, $undoc, 10000);
            close IMAGE;

            my @chars   = split (//, $doc);
            my @unchars = split (//, $undoc);

            while (@chars) {
                if (shift(@chars) == shift(@unchars)) {
                    # bytes match: keep comparing
                }
                else {
                    $skip++;
                    last;
                }
                if ($skip) {
                    last;
                }
            }
        }
    }

    if ($skip == $forcounter) {
        open (IMAGE, ">D:\\Earth\\unique\\$_") || die "$_.jpg: $!";
        binmode IMAGE;
        print IMAGE $doc;
        close IMAGE;

        print "$dir$_ is unique and is saved away\n";
    }
    else {
        print "$dir$_ is NOT unique\n";
    }

    closedir (UNIQ);
}

Thanks, Celia
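[Editor's note, not part of the original post: one thing worth checking in the byte-comparison loop above is that `shift(@chars) == shift(@unchars)` uses Perl's numeric `==`, which converts both bytes to numbers first, so most non-digit bytes compare as 0 == 0 and look identical. A minimal sketch of the difference:]

```perl
use strict;
use warnings;
no warnings 'numeric';   # silence the "isn't numeric" warnings for the demo

# Two different bytes, as might come from split(//, $doc):
my ($x, $y) = ('G', 'P');

print "numeric ==: ", ($x == $y ? "same" : "different"), "\n";   # same (both become 0)
print "string  eq: ", ($x eq $y ? "same" : "different"), "\n";   # different
```

[Swapping `==` for `eq`, or simply testing `$doc ne $undoc` without splitting at all, would make the comparison byte-accurate.]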
 
How about calculating a digest for each file?

This code reads all the files in a directory, calculates the MD5 digest of each, and uses the digest as a key in a hash. Each hash value is an array holding all the filenames that share that digest.

It then walks the hash, and for every element whose array holds more than one filename, it prints those filenames: files with the same digest are byte-for-byte duplicates.

use strict;
use Digest::MD5;

my $dir = "/path/to/images";

opendir (DIR, $dir);
my @files = grep { /\.jpg/ } readdir (DIR);
closedir (DIR);

my %digest_hash;

foreach my $file (@files)
{
    my $filename = $dir."/".$file;
    open (FILE, $filename);
    binmode (FILE);
    my $digest = Digest::MD5->new->addfile(*FILE)->hexdigest;
    close (FILE);

    if (exists ($digest_hash{$digest}))
    {
        push @{ $digest_hash{$digest} }, $filename;
    }
    else
    {
        $digest_hash{$digest} = [ $filename ];
    }
}

foreach my $temp (keys %digest_hash)
{
    if (scalar (@{ $digest_hash{$temp} }) > 1)
    {
        foreach my $temp2 (@{ $digest_hash{$temp} })
        {
            print $temp2."\t";
        }
        print "\n";
    }
}

______________________________________________________________________
Perfection in engineering does not happen when there is nothing more to add.
Rather it happens when there is nothing more to take away.
 
I modified your code a little to try to write all the unique file names to a log file. It generated far too many names. There must be something I don't understand fully.

#!/usr/bin/perl

use strict;
use Digest::MD5;

my $dir = 'D:\Earth\digits\\';

opendir (DIR, $dir);
my @files = grep { /\.jpg/ } readdir (DIR);
closedir (DIR);

my @filenames = @files;

my %digest_hash;

foreach my $file (@files)
{
    my $filename = $dir."/".$file;
    open (FILE, $filename);
    binmode (FILE);
    my $digest = Digest::MD5->new->addfile(*FILE)->hexdigest;
    close (FILE);

    if (exists ($digest_hash{$digest}))
    {
        push @{ $digest_hash{$digest} }, $filename;
    }
    else
    {
        $digest_hash{$digest} = [ $filename ];
    }
}

open (LOG, ">D:\\Earth\\unique\\match.txt") || die "Could not create file $!";

foreach my $temp (keys %digest_hash)
{
    if (scalar (@{ $digest_hash{$temp} }) > 1)
    {
        foreach my $temp2 (@{ $digest_hash{$temp} })
        {
            $temp2 =~ /(\d+\.jpg)/i;
            foreach my $index (0 .. $#filenames) {
                if ($1 == $index) {
                    delete $filenames[$index];
                }
            }

            #print LOG $temp2."\t";
        }
        #print LOG "\n\n\n";
    }
}

#print LOG "\n\n\n\n\n\n";

foreach my $temp4 (@filenames) {
    print LOG $temp4."\n";
}

close (LOG);

print "Done";

I checked some of the files it thought were unique and they most definitely are not.
Celia
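[Editor's note, not part of the original post: the cross-off loop above has a subtle trap. `$1` captures a string like `123.jpg`, and comparing it to an array index with `==` coerces both sides to numbers, so the wrong slots get deleted from `@filenames`. A simpler sketch that skips the index bookkeeping entirely and keeps one representative filename per digest; the sample data below is hypothetical, while `%digest_hash` has the same shape as in the posted code:]

```perl
use strict;
use warnings;

# Hypothetical sample: %digest_hash maps digest => [filenames], exactly
# as built by the Digest::MD5 loop above.
my %digest_hash = (
    'd41d8cd9' => ['100.jpg', '250.jpg'],   # two copies of one image
    '9e107d9d' => ['300.jpg'],              # a genuinely unique image
);

# The first filename in each group is one representative per distinct
# image -- no index arithmetic needed.
my @uniques = sort map { $digest_hash{$_}[0] } keys %digest_hash;
print "$_\n" for @uniques;   # 100.jpg and 300.jpg
```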
 
Sorry. My code does not print out unique files. It prints out duplicated ones.

It can be used as I posted it, with one change. The line that reads:

if (scalar (@{ $digest_hash{$temp} }) > 1)

should be changed to read:

if (scalar (@{ $digest_hash{$temp} }) == 1)
 
Are you certain that works? I just tried it and it found only one unique file... I know there must be a few dozen at least. Celia
 
It works for me, finding both uniques (with '== 1') and duplicates (with '> 1').

The only thing I can think of is that Perl on Win32 can't handle certain characters in filenames, such as vowels with diacritics. You don't have that in your filenames, do you?

Let me think about it. Or maybe someone else can give it a go.
 
I can send the files in a zip if you give me an e-mail address. There are only numbers in the file names, but for some reason it found only one unique file. Weird stuff. Celia
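[Editor's note, not part of the original thread: to close the loop, here is a sketch of the whole job with the digest approach wired up to actually copy files, one file per unique MD5 digest, into a separate directory. The sub name and directory handling are my own; Digest::MD5 and File::Copy are standard modules.]

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Copy;

# Copy one file per unique MD5 digest from $src_dir to $dst_dir and
# return the list of copied filenames. A sketch, not tested against
# the original 6000-image set.
sub copy_uniques {
    my ($src_dir, $dst_dir) = @_;
    my (%seen, @copied);
    opendir my $dh, $src_dir or die "cannot opendir $src_dir: $!";
    for my $file (sort grep { !/^\.\.?$/ } readdir $dh) {
        my $path = "$src_dir/$file";
        next unless -f $path;                   # skip subdirectories etc.
        open my $fh, '<', $path or die "cannot open $path: $!";
        binmode $fh;
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        next if $seen{$digest}++;               # content already seen: skip
        copy($path, "$dst_dir/$file") or die "copy $path failed: $!";
        push @copied, $file;
    }
    closedir $dh;
    return @copied;
}
```

[Called as, say, `copy_uniques('D:/Earth/digits', 'D:/Earth/unique')`, this hashes whole files rather than only the first 10000 bytes, so two images that differ only past that point are still told apart.]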
 