Remove word from phrase, logical OR problem?

Status
Not open for further replies.

Neomalfoy

Programmer
Jan 2, 2004
14
US
The script is partially working. It should read a list of phrases in the form word_word (eventually word_word_word, etc.), split each one on the underscore, and leave a phrase out of the new list if a stopword is found in either position. This works for stopwords in the first position, but phrases with a stopword in the second position still show up in the new list. It looks as though the logical OR is not working, and I don't understand why.

Help appreciated

Code:

#!/usr/bin/perl
use strict;

my (@stopW, $freq, $word, $tuple, $word, $sw2, $c, $d);

######################################### Reading Stopwords ###
open(STOPfile,"<./stopwords.txt");
while (<STOPfile>)
{
chop;
push(@stopW, $_);
}
close(STOPfile);
###################################################################

my $inputDIR="./";

my ($freq,$word);

for ($tuple=1; $tuple<=2; $tuple++)
{

open (INfile,"<".$inputDIR."file".$tuple.".txt");
open (OUTfile,">cleanup".$tuple.".txt");

while(<INfile>){
chop;
($freq,$word)=split(/:/,$_);

# leaving single words alone
if ($tuple==1)
{ foreach (@stopW) {
if (length($word)<=1 ) { next; }
if ($word eq $_) {
$word=$word."*"; last; }
}
}

# $tuple=2, eliminate phrases with a stopword
# or if any of the words is <=1 (0,1)
if ($tuple==2)
{ ($a,$b) = split("_",$word);
if ( (length($a)<=1) || (length($b)<=1) ) { next; }
foreach (@stopW)
{ if ( ($a eq $_) || ($b eq $_))
{ $sw2=1; last; }
}
}
if ($sw2==1) { $sw2=0; next; }

#print "$word\n";

#print $freq.":".$word."\n";
printf OUTfile "%s:%s\n", $freq,$word ;
} # end while

close INfile;
close OUTfile;
} # end for tuple

exit(0);
 
The easiest way to do stopword removal in Perl is to use a hash (see the example in the Lingua::EN::StopWords documentation).

Assuming your list of stopwords is one-per-line in stopwords.txt, this is how I'd do it:
Code:
my %stopwords;
open STOP, '<', 'stopwords.txt' or die "Cannot open stopwords.txt: $!";
while(<STOP>) {
   chomp; #don't use chop - too dangerous
   $stopwords{ $_ } = 1;
}
close STOP;

while(my $line = <DATA>) {
   chomp $line; #otherwise the last word keeps its newline and never matches
   print join('_', grep { !$stopwords{ $_ } } split /_/, $line), "\n";
}
__DATA__
some_sentence_that_contains_a_stopword_or_two
here_is_another_such_sentence
and_yet_another
 
Hi Ishnid,

Thanks for the tip. It will help when I want to remove all stopwords, but that is not what I want to do in this case. I should have explained better.

I am examining each word and determining whether it's a stopword because, after processing, the phrases must follow a format that allows stopwords only in certain positions, i.e.
one word phrases -> no change
two word phrases -> nonstopword_nonstopword
three word phrases -> nonstopword_stopword?_nonstopword
four word phrases -> nonstopword_stopword?_stopword?_nonstopword
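
The format rules above amount to checking only the first and last word of each phrase. A minimal sketch of that check, assuming the stopwords are held in a hash; the sub name keep_phrase and the short stopword list are hypothetical and only there for illustration:

```perl
use strict;
use warnings;

# Hypothetical stopword set; the real script would load it from stopwords.txt.
my %stopwords = map { $_ => 1 } qw(of the in on to and from with);

# Returns true if the phrase should be kept. Single words pass unchanged;
# longer phrases pass only when the first and last words are not stopwords.
# Stopwords in the middle positions are allowed.
sub keep_phrase {
    my ($phrase) = @_;
    my @words = split /_/, $phrase;
    return 1 if @words <= 1;
    return 0 if $stopwords{ $words[0] } || $stopwords{ $words[-1] };
    return 1;
}

for my $phrase (qw(magnetic_field portion_of region_of_the surface_of_sun)) {
    printf "%-16s %s\n", $phrase, keep_phrase($phrase) ? 'keep' : 'drop';
}
```

Because the check indexes the first and last elements of the split result, the same sub covers two, three and four word phrases without a separate branch per length.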



 
By any chance are you running this on a windows system?

If so, using chomp instead of chop should solve your problem. chomp removes newlines from the end of strings, whereas chop removes the last character of the string. On Windows systems, newlines are marked by "\r\n" and chop will only remove the "\n", leaving an extra "\r" character at the end of your second word, causing it not to match.

I notice you're using $a and $b as variable names. This is generally not a good idea as these are special variables used in sorting. They'll probably behave as expected most of the time but when they don't, it can take ages to find what the problem is.
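
To see what a stray "\r" does to the comparison, here is a small sketch. The sample line is made up, and whether a "\r" actually survives depends on the platform and on how the file was opened. Note that chomp with the default $/ removes only "\n", so a leftover "\r" may need to be stripped explicitly; the sketch also uses lexical names instead of $a and $b:

```perl
use strict;
use warnings;

# A line as it might look when a file with CRLF line endings is read as-is.
my $line = "74:the_solar\r\n";

my $chopped = $line; chop $chopped;           # removes the last character, the "\n"
my $chomped = $line; chomp $chomped;          # removes the trailing "\n" (default $/)
my $cleaned = $line; $cleaned =~ s/\s+\z//;   # removes "\r\n" and any other trailing whitespace

# Lexical names rather than $a and $b, which sort() uses internally.
my ($freq, $word)    = split /:/, $cleaned;
my ($first, $second) = split /_/, $word;

# Make control characters visible when printing.
sub show { my ($s) = @_; $s =~ s/\r/\\r/g; $s =~ s/\n/\\n/g; return $s }

print "chop:    ", show($chopped), "\n";
print "chomp:   ", show($chomped), "\n";
print "cleaned: ", show($cleaned), "\n";
print "second word matches 'solar'\n" if $second eq 'solar';
```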
 
I'm running it on both Windows and Unix. I get the same results in my output files even after changing to chomp and renaming the variables.

Sample output:

As I said earlier, I can filter out stopwords in the first position but can't get it to filter out from the second/third/etc.

original list                ----> new list with phrases filtered out

two word phrase
191:of_the                   ----> 23:magnetic_field
74:the_solar                 ----> 22:solar_wind
68:in_the                    ----> 22:sunspot_group
60:the_sun                   ----> 21:solar_eclipse
42:the_earth                 ----> 19:white_light
36:from_the                  ----> 18:earth_s *error
35:on_the                    ----> 17:portion_of *error
24:the_moon                  ----> 17:solar_system
23:magnetic_field            ----> 17:associated_with *error
21:to_the                    ----> 12:frequency_spectrum
20:of_solar                  ----> 12:geomagnetic_storm
19:h_alpha                   ----> 12:radio_frequency
19:white_light               ----> 12:unit_of *error
18:earth_s                   ----> 11:universal_time
18:and_the                   ----> 11:equal_to *error

three word phrase
31:of_the_solar              ----> 17:portion_of_the *error
29:of_the_sun                ----> 12:radio_frequency_spectrum
17:the_earth_s               ----> 11:total_solar_eclipse
17:portion_of_the            ----> 10:region_of_the *error
16:of_the_earth              ----> 10:bipolar_sunspot_group
13:the_solar_atmosphere      ----> 10:sunspot_group_with *error
12:of_the_moon               ----> 9:layer_of_the *error
12:of_the_radio              ----> 9:surface_of_the *error
12:the_solar_wind            ----> 9:earth_s_surface
12:that_portion_of           ----> 8:frequency_spectrum_from *error
12:radio_frequency_spectrum  ----> 8:group_with_penumbra

The code I'm using to check three word phrase:

if ($tuple==3)
{ ($as,$bs,$cs) = split("_",$word);
if (length($as)<=1 || length($bs)==0 || length($cs)<=1 ) { next; }
foreach (@stopW){
my $stopWord=$_;
if ( ($as eq $stopWord) || ($cs eq $stopWord) )
{ $sw3=1; last; }
}
}
if ($sw3==1) { $sw3=0; next; }

 
I gave up using split and used a regular expression instead, and it's working now.
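
The regular expression itself is not shown in the thread. Purely as an illustration of how a regex-based check might look, and not the poster's actual code, the sketch below pulls the phrase out with \S+ so that trailing whitespace such as "\r" never becomes part of the last word. The sub name has_edge_stopword and the stopword list are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical stopword set; the real script would load it from stopwords.txt.
my %stopwords = map { $_ => 1 } qw(of the in on to and from with);

# Returns 1 when a "freq:phrase" line has a stopword in its first or last position.
sub has_edge_stopword {
    my ($line) = @_;
    # \S+ stops at the first whitespace character, so "\r" and "\n" are excluded.
    my ($phrase) = $line =~ /^\d+:(\S+)/ or return 0;
    my @words = $phrase =~ /[^_]+/g;      # every run of non-underscore characters
    return 0 if @words < 2;               # single words are left alone
    return ( $stopwords{ $words[0] } || $stopwords{ $words[-1] } ) ? 1 : 0;
}

for my $line ("23:magnetic_field\n", "17:portion_of\r\n", "17:portion_of_the\r\n") {
    my ($shown) = $line =~ /^(\S+)/;
    printf "%-20s %s\n", $shown, has_edge_stopword($line) ? 'drop' : 'keep';
}
```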

 