
File I/O problem


ionstorm101 (Programmer) · Jun 18, 2005
Hi, I'm writing a script to create lists of words from an input file. The script takes three arguments: an input file, a list of words not to be included in the dictionary, and an output file. The problem is I want no repeated words in the dictionary. What I'm trying to do is check the file I'm writing (the output file) to see if the word is already in it, and if it isn't, then add it.

My code never enters this check though, i.e. the while(<OUT>) loop, and I can't tell why. I have the file opened for both read and write. Here is my script so far. If anyone could help that would be great. I'm confused because it has no problem checking the word against anything in the file of words not to allow, i.e. the while(<EXCLUDE>) part.

Code:
#!/usr/bin/env perl
use warnings;  # a multi-argument env shebang like "env perl -w" fails on many systems

if(scalar(@ARGV) != 3){
  die "Usage: dc inputfile.txt excludefile.txt outputfile.txt \n" ;
}

open(INPUT, "$ARGV[0]") or die "Error opening input: $!\n";

#And now we start to go through the file

while(<INPUT>){ #copy in lines as strings 
  @sentance = split(/\s+/); #split at any whitespace, tab, newline etc
  foreach $word (@sentance){
    @count = split(//, $word);
    $word=~s/\W//g;
    if((scalar(@count)) >= 4){
      $allow = 1;
      open(EXCLUDE, "$ARGV[1]") or die "Error opening exclusion file: $!\n";
      open(OUT, "+>>$ARGV[2]") or die "Error opening output file: $!\n";
      while(<EXCLUDE>){
	chomp $_;
	if($word eq $_){
	  print "match \n";
	  $allow = 0;
	}
      }
      while(<OUT>){ #getting in here is the problem
	print "in the zone \n"; #print this if we get in here
	chomp $_;
	if($word eq $_){
	  $allow = 0;
	}
      }
      if($allow == 1){
	print "$word\n";
      }else{
	print "allow is $allow\n";
      }
      print OUT "$word\n" if $allow == 1;

      close(EXCLUDE);
      close(OUT);
    }
  }
}
 
Have you tried using ">>" instead of "+>>" to open the $ARGV[2] file? With "+>>" the file is opened for reading and appending, but the read position starts at the end of the file, so the very first read from OUT hits end-of-file and the loop body never runs. You should probably also open all your files once instead of reopening and closing them inside the first "while" loop. And I would probably build a hash from the exclude file so you can check for words easily without having to loop through the file over and over. Something along these lines maybe:

Code:
#!/usr/bin/env perl
use strict;
use warnings;

if(scalar(@ARGV) != 3){
  die "Usage: dc inputfile.txt excludefile.txt outputfile.txt \n" ;
}

my %exclude = ();
open(INPUT, "$ARGV[0]") or die "Error opening input: $!\n";
open(EXCLUDE, "$ARGV[1]") or die "Error opening exclusion file: $!\n";
while(<EXCLUDE>) {
   chomp;
   $exclude{$_} = $_;
}
open(OUT, ">>$ARGV[2]") or die "Error opening output file: $!\n";

while(<INPUT>){
  chomp;
  my @sentance = split(/\s+/);
  for (@sentance){
    $_ =~ s/\W//g;
    next if (length($_) < 4);
    next if exists($exclude{$_});
    print "'$_' is being added to the dictionary\n";
    print OUT "$_\n";
   }
}
close(INPUT);
close(OUT);
 
The reason I used the +>> instead of >> (it was that originally) was that I wanted to open the output file for both reading and writing.

Once I've checked everything else, I want to run through the output file to see if the word is already in it so I can avoid duplicates. Any ideas?

Thanks for the help by the way, it's been a while since I've used Perl.
 
And two other things, if you feel like answering : )

Why did you use a hash for exclude instead of an array, and why do you use next in those two statements instead of last? Don't you want to break if those are true? Is next the equivalent of a continue statement? So wouldn't it be last?
 
I used the hash to avoid looping through the exclude file more than once. If the file is big, this could be a considerable time saver. I used 'next' to jump to the next element of the @sentance array instead of 'last', which would break out of the loop entirely.
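The difference in a tiny standalone example:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# 'next' skips just the current element and keeps looping.
for my $w (qw(dog cat fish)) {
    next if $w eq 'cat';
    print "$w\n";          # prints dog and fish
}

# 'last' breaks out of the loop entirely.
for my $w (qw(dog cat fish)) {
    last if $w eq 'cat';
    print "$w\n";          # prints only dog
}
```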

You could do the same with the output file: load it into a hash and check for words that are already in the file to avoid duplicating them. It will depend on how the file is structured; I assume one word per line.
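A minimal sketch of that idea (the filename 'outputfile.txt' is just a placeholder; assumes one word per line):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Pre-load every word already in the output file into a hash.
my %seen;
if (open(my $old, '<', 'outputfile.txt')) {   # the file may not exist yet
    while (<$old>) {
        chomp;
        $seen{$_} = 1;
    }
    close($old);
}

# Later, when deciding whether to write a word:
#   print OUT "$word\n" unless $seen{$word}++;
# $seen{$word}++ returns the old count, so only the first
# occurrence prints, and the hash is updated as you go.
```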

Post some sample lines of all three files.
 
the input file is any text file, e.g. a book
the output file is one word per line, as is the exclusion file

Thanks for explaining that stuff, man. If I have 20,000 words in the output file and I'm still taking in words, would loading them into a hash not be a bit too much?

Btw, I still don't get why you're using hashes instead of arrays? I mean, from what I can see, all you're using is the keys anyway and not assigning any values to them.

Thanks
 
If I have 20,000 words in the output file and I'm still taking in words, would loading them into a hash not be a bit too much?

20,000 should be no problem. Even 200,000 should be no problem. If the exclude and output files get to be big, you might want to look into Tie::File or start using a database like MySQL.
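For reference, a rough Tie::File sketch (it ships with Perl as a core module). The tied array mirrors the file line by line without loading it all into memory, though searching it still scans every element:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Tie::File;

# Tie the output file to an array; reads and writes go through to disk.
tie my @words, 'Tie::File', 'outputfile.txt'
    or die "Cannot tie outputfile.txt: $!\n";

# Append a word only if it is not already in the file.
my $new = 'elephant';
push @words, $new unless grep { $_ eq $new } @words;

untie @words;
```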

Btw, I still don't get why you're using hashes instead of arrays? I mean, from what I can see, all you're using is the keys anyway and not assigning any values to them.

The keys are assigned the same value, literally like this:

%exclude = (
dog => 'dog',
cat => 'cat',
fish => 'fish'
);

Now you can check if the word 'cat' is in the hash without looping:

print "cat is there" if exists $exclude{'cat'};

With an array you would have to loop through to find it:

Code:
@exclude = qw(dog cat fish);
for (@exclude) {
   print "cat is there" if ($_ eq 'cat');
}


Not a big deal for short lists, and you can use "next" and "last" to make the loop more efficient for longer lists. But if you are going through the list many times and the lists are long, a hash should be more efficient, and considerably faster than actually reading the files line by line from disk on each pass through the input file.
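Putting the whole thread together, here is a rough sketch of the complete script: the exclude list goes into one hash, and a second 'seen' hash is pre-loaded from the existing output file so reruns don't duplicate words. This is one possible shape, not the only one:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

die "Usage: dc inputfile.txt excludefile.txt outputfile.txt\n" unless @ARGV == 3;
my ($in_file, $ex_file, $out_file) = @ARGV;

# Words that must never enter the dictionary.
my %exclude;
open(my $ex, '<', $ex_file) or die "Error opening exclusion file: $!\n";
while (<$ex>) { chomp; $exclude{$_} = 1 }
close($ex);

# Words already written on a previous run (the file may not exist yet).
my %seen;
if (open(my $old, '<', $out_file)) {
    while (<$old>) { chomp; $seen{$_} = 1 }
    close($old);
}

open(my $in,  '<',  $in_file)  or die "Error opening input: $!\n";
open(my $out, '>>', $out_file) or die "Error opening output file: $!\n";
while (<$in>) {
    for my $word (split /\s+/) {
        $word =~ s/\W//g;                         # strip punctuation
        next if length($word) < 4;                # keep only words of 4+ chars
        next if $exclude{$word} or $seen{$word}++; # skip excluded and repeats
        print $out "$word\n";
    }
}
close($in);
close($out);
```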
 