Tek-Tips Forums

comparing images

Status
Not open for further replies.

SCelia · Programmer · Feb 27, 2002 · 82 · CA
I have 6000 images, most of which I believe to be duplicates. I want to use Perl to copy all the unique images into a new directory. Some distinct images have the same file size, so I cannot sort by size alone. I wrote this program to do the job, but it finds only two unique images, and I know there are more. Where am I going wrong?

#!/usr/bin/perl

use strict;

## init vars
my $doc        = '';
my $undoc      = '';
my $newdoc     = '';
my $newundoc   = '';
my $forcounter = 0;
my $skip       = 0;

my $dir = 'D:\Earth\digits\\';

opendir (DIR, $dir) or die "cannot opendir $dir";
foreach my $file (readdir(DIR)) {
    unless ($file eq '.' or $file eq '..') {
        $forcounter = 0;
        $skip       = 0;
        process_file($file);
    }
}
closedir (DIR);

sub process_file {
    $_ = shift;
    open (IMAGE, "$dir$_") or die "Cannot open file $dir$_: $!";
    binmode IMAGE;
    read (IMAGE, $doc, 10000);
    close IMAGE;

    my $uniquedir = 'D:\Earth\unique\\';

    opendir (UNIQ, $uniquedir) or die "cannot opendir $uniquedir";
    foreach my $unfile (readdir(UNIQ)) {
        unless ($unfile eq '.' or $unfile eq '..') {
            $forcounter++;
            open (IMAGE, "$uniquedir$unfile") or die "Cannot open file $uniquedir$unfile: $!";
            binmode IMAGE;
            read (IMAGE, $undoc, 10000);
            close IMAGE;

            my @chars   = split (//, $doc);
            my @unchars = split (//, $undoc);

            while (@chars) {
                if (shift(@chars) == shift(@unchars)) {
                    # bytes match: keep comparing
                }
                else {
                    $skip++;
                    last;
                }
                if ($skip) {
                    last;
                }
            }
        }
    }

    if ($skip == $forcounter) {
        open (IMAGE, ">D:\\Earth\\unique\\$_") || die "$_.jpg: $!";
        binmode IMAGE;
        print IMAGE $doc;
        close IMAGE;

        print "$dir$_ is unique and is saved away\n";
    }
    else {
        print "$dir$_ is NOT unique\n";
    }

    closedir (UNIQ);
}

Thanks, Celia
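[Editor's note, not part of the original post: one thing worth checking in the byte-comparison loop above is that `shift(@chars) == shift(@unchars)` uses Perl's numeric `==`, which converts both bytes to numbers first, so most non-digit bytes compare as 0 == 0 and look identical. A minimal sketch of the difference:]

```perl
use strict;
use warnings;
no warnings 'numeric';   # silence the "isn't numeric" warnings for the demo

# Two different bytes, as might come from split(//, $doc):
my ($x, $y) = ('G', 'P');

print "numeric ==: ", ($x == $y ? "same" : "different"), "\n";   # same (both become 0)
print "string  eq: ", ($x eq $y ? "same" : "different"), "\n";   # different
```

[Swapping `==` for `eq`, or simply testing `$doc ne $undoc` without splitting at all, would make the comparison byte-accurate.]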
 
How about calculating a digest for each file?

This code reads all the files in a directory, calculates the MD5 digest of each, and uses the digest as a key in a hash. Each hash value is an array holding all the filenames that share that digest.

It then walks the hash, and for every element whose array holds more than one filename, it prints those filenames: files with the same digest are byte-for-byte duplicates.

use strict;
use Digest::MD5;

my $dir = "/path/to/images";

opendir (DIR, $dir);
my @files = grep { /\.jpg/ } readdir (DIR);
closedir (DIR);

my %digest_hash;

foreach my $file (@files)
{
    my $filename = $dir."/".$file;
    open (FILE, $filename);
    binmode (FILE);
    my $digest = Digest::MD5->new->addfile(*FILE)->hexdigest;
    close (FILE);

    if (exists ($digest_hash{$digest}))
    {
        push @{ $digest_hash{$digest} }, $filename;
    }
    else
    {
        $digest_hash{$digest} = [ $filename ];
    }
}

foreach my $temp (keys %digest_hash)
{
    if (scalar (@{ $digest_hash{$temp} }) > 1)
    {
        foreach my $temp2 (@{ $digest_hash{$temp} })
        {
            print $temp2."\t";
        }
        print "\n";
    }
}

______________________________________________________________________
Perfection in engineering does not happen when there is nothing more to add.
Rather it happens when there is nothing more to take away.
 
I modified your code a little to try to write all the unique file names to a log file. It generated far too many names. There must be something I don't understand fully.

#!/usr/bin/perl

use strict;
use Digest::MD5;

my $dir = 'D:\Earth\digits\\';

opendir (DIR, $dir);
my @files = grep { /\.jpg/ } readdir (DIR);
closedir (DIR);

my @filenames = @files;

my %digest_hash;

foreach my $file (@files)
{
    my $filename = $dir."/".$file;
    open (FILE, $filename);
    binmode (FILE);
    my $digest = Digest::MD5->new->addfile(*FILE)->hexdigest;
    close (FILE);

    if (exists ($digest_hash{$digest}))
    {
        push @{ $digest_hash{$digest} }, $filename;
    }
    else
    {
        $digest_hash{$digest} = [ $filename ];
    }
}

open (LOG, ">D:\\Earth\\unique\\match.txt") || die "Could not create file $!";

foreach my $temp (keys %digest_hash)
{
    if (scalar (@{ $digest_hash{$temp} }) > 1)
    {
        foreach my $temp2 (@{ $digest_hash{$temp} })
        {
            $temp2 =~ /(\d+\.jpg)/i;
            foreach my $index (0 .. $#filenames) {
                if ($1 == $index) {
                    delete $filenames[$index];
                }
            }

            #print LOG $temp2."\t";
        }
        #print LOG "\n\n\n";
    }
}

#print LOG "\n\n\n\n\n\n";

foreach my $temp4 (@filenames) {
    print LOG $temp4."\n";
}

close (LOG);

print "Done";

I checked some of the files it thought were unique and they most definitely are not.
Celia
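[Editor's note, not part of the original post: the cross-off loop above has a subtle trap. `$1` captures a string like `123.jpg`, and comparing it to an array index with `==` coerces both sides to numbers, so the wrong slots get deleted from `@filenames`. A simpler sketch that skips the index bookkeeping entirely and keeps one representative filename per digest; the sample data below is hypothetical, while `%digest_hash` has the same shape as in the posted code:]

```perl
use strict;
use warnings;

# Hypothetical sample: %digest_hash maps digest => [filenames], exactly
# as built by the Digest::MD5 loop above.
my %digest_hash = (
    'd41d8cd9' => ['100.jpg', '250.jpg'],   # two copies of one image
    '9e107d9d' => ['300.jpg'],              # a genuinely unique image
);

# The first filename in each group is one representative per distinct
# image -- no index arithmetic needed.
my @uniques = sort map { $digest_hash{$_}[0] } keys %digest_hash;
print "$_\n" for @uniques;   # 100.jpg and 300.jpg
```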
 
Sorry. My code does not print out unique files. It prints out duplicated ones.

It can be used as I posted it, with one change. The line that reads:

if (scalar (@{ $digest_hash{$temp} }) > 1)

should be changed to read:

if (scalar (@{ $digest_hash{$temp} }) == 1)
 
Are you certain that works? I just tried it and it found only one unique file... I know there must be a few dozen at least. Celia
 
It works for me, finding both uniques (with '== 1') and duplicates (with '> 1').

The only thing I can think of is that Perl on Win32 can't handle certain characters in filenames, such as vowels with diacritics. You don't have that in your filenames, do you?

Let me think about it. Or maybe someone else can give it a go.
 
I can send the files in a zip if you give me an e-mail address. There are only numbers in the file names, but for some reason it found only one unique file. Weird stuff. Celia
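[Editor's note, not part of the original thread: to close the loop, here is a sketch of the whole job with the digest approach wired up to actually copy files, one file per unique MD5 digest, into a separate directory. The sub name and directory handling are my own; Digest::MD5 and File::Copy are standard modules.]

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Copy;

# Copy one file per unique MD5 digest from $src_dir to $dst_dir and
# return the list of copied filenames. A sketch, not tested against
# the original 6000-image set.
sub copy_uniques {
    my ($src_dir, $dst_dir) = @_;
    my (%seen, @copied);
    opendir my $dh, $src_dir or die "cannot opendir $src_dir: $!";
    for my $file (sort grep { !/^\.\.?$/ } readdir $dh) {
        my $path = "$src_dir/$file";
        next unless -f $path;                   # skip subdirectories etc.
        open my $fh, '<', $path or die "cannot open $path: $!";
        binmode $fh;
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        next if $seen{$digest}++;               # content already seen: skip
        copy($path, "$dst_dir/$file") or die "copy $path failed: $!";
        push @copied, $file;
    }
    closedir $dh;
    return @copied;
}
```

[Called as, say, `copy_uniques('D:/Earth/digits', 'D:/Earth/unique')`, this hashes whole files rather than only the first 10000 bytes, so two images that differ only past that point are still told apart.]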
 