similar to UNIX GREP 2

3inen · Oct 23, 2006

Hi! I have a list of terms in the file "search.txt" that I am checking against a multi-column file, "sentence.txt". Here is the Linux grep command that works:

grep -f search.txt sentence.txt >matches.txt

It prints the lines that match the terms in the first file, which is what I want.

If "search.txt" is too big, it does not run on Linux. How can we do this on Windows/Linux for big files?

thanks in advance

BrianAtWork · Oct 23, 2006

What format is the "sentence.txt" file in?

Is it just a big file of paragraphs/plain text, or is it truly structured in columns as you say?

And when you say "how can we do this in windows/linux for big files", do you mean in Perl?

3inen · Oct 23, 2006

Sorry, here is the sample data. I need to print the lines that match "search" and "looking". grep works on a small set, but my search.txt has 40000 terms and my sentence.txt file has 30000 lines.

search.txt
search
looking

sentence.txt (with tabs)
entry1	10/23/06	i am trying to search for this word
entry2	10/22/06	we are looking for this term

thanks

3inen · Oct 23, 2006

Here is what I wrote, but it is not running through the entire search.txt list.

Please suggest modifications.

#!/usr/bin/perl -w
use strict;
my ($item);
my (@array, @A, @B);
@A = ();
open(DATAA, "<search.txt ") or die "Couldn't read from datafile: $!\n";
while (<DATAA>) {
	chomp;
	push(@A, $_);
}

@B = ();
open(DATAB, "<sentence.txt") or die "Couldn't read from datafile: $!\n";
foreach $item (@A) {
	while (<DATAB>) {
		chomp;
		if (/$item/) {
			print FILEW1"$_\t";
		}
	}
}

MillerH · Oct 23, 2006

Here is your code edited so that it will actually work. This isn't the most efficient method, but it should at least scan all the terms now.

Code:

#!/usr/bin/perl -w
use strict;

my @A = ();
open(DATAA, "<search.txt ")       or die "Couldn't read from datafile: $!\n";
while (<DATAA>) {
	chomp;
	push(@A, $_);
}
close(DATAA);


open(DATAB, "<sentence.txt")       or die "Couldn't read from datafile: $!\n";
while (<DATAB>) {
	chomp;
	foreach my $item (@A) {
		if (/$item/) {
			print "$_\n";
			last;
		}
	}
}
close(DATAB);
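
For anyone wondering why the original version stopped after the first term: the inner `while (<DATAB>)` reads the file to the end for the first term, so every later term finds the handle already at end-of-file. Swapping the loops, as above, avoids that. Below is a minimal sketch of the effect; the in-memory filehandle and the helper name `lines_seen_per_term` are illustrative assumptions, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data standing in for sentence.txt (tab-separated).
my $data = "entry1\t10/23/06\ti am trying to search for this word\n"
         . "entry2\t10/22/06\twe are looking for this term\n";

# Count how many lines each term gets to see when the file loop is nested
# inside the term loop. With $rewind false this mimics the original script.
sub lines_seen_per_term {
    my ($rewind, @terms) = @_;
    open(my $fh, '<', \$data) or die "Couldn't open in-memory file: $!\n";
    my @seen;
    foreach my $term (@terms) {
        seek($fh, 0, 0) if $rewind;    # go back to the start of the file
        my $count = 0;
        $count++ while <$fh>;          # reads until end-of-file
        push @seen, $count;
    }
    close($fh);
    return @seen;
}

print join(',', lines_seen_per_term(0, 'search', 'looking')), "\n";  # 2,0
print join(',', lines_seen_per_term(1, 'search', 'looking')), "\n";  # 2,2
```

Rewinding with `seek` would also work, but it rereads the whole file once per term, which is why reading each line once and looping over the terms is the cheaper arrangement.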

rharsh · Oct 23, 2006

Assuming memory on the computer isn't a problem, you'll want to use a hash to do the lookups. Something like this would probably work for you, or at least get you started:

Code:

my %lookup;
open WORDS, "< search.txt" or die "Bad stuff happened.\n$!";
while (<WORDS>) {
    chomp;
    $lookup{$_}++;
}
close WORDS;

open SENTENCES, "< sentence.txt" or die "More bad stuff happened\n$!";
while (<SENTENCES>) {
    my ($entry, $date, $sentence) = split /\t/, $_;
    my @temp = split ' ', $sentence;
    foreach my $word (@temp) {
        print $_ if $lookup{$word};
    }
}

I'm running out the door, so I didn't have time to test the code.

3inen · Oct 23, 2006

MillerH's modification does the work for me. Thanks to rharsh for showing another approach; I will have to look at it more closely later.

3inen · Oct 24, 2006

Sorry folks, I came to the conclusion too early. With MillerH's code, it prints all the lines in the sentence.txt file whether there is a matching term in search.txt or not.

I tried rharsh's code and I get error messages.

Can you help me out here?

thanks
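
The thread never pins down why every line was printed, so the following is only a guess. `/$item/` matches a term anywhere in the line as a regular expression, so a very short term such as `a` hits almost every line, and a blank entry in search.txt is interpolated as an empty pattern, which Perl treats specially and which can match everything. A minimal sketch that guards against both, assuming whole-word, case-insensitive matching is what is wanted (the helper name `matching_lines` and the term list are hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the lines that contain at least one term as a whole word.
# Empty terms are skipped and regex metacharacters are escaped with \Q...\E.
sub matching_lines {
    my ($terms, $lines) = @_;
    my @clean = grep { length } @$terms;     # drop blank entries
    my @hits;
    LINE: foreach my $line (@$lines) {
        foreach my $term (@clean) {
            if ($line =~ /\b\Q$term\E\b/i) {
                push @hits, $line;
                next LINE;                   # report each line once
            }
        }
    }
    return @hits;
}

my @terms = ('search', '', 'a');             # hypothetical list with a blank entry
my @lines = (
    "entry1\t10/23/06\ti am trying to search for this word",
    "entry2\t10/22/06\twe are looking for this term",
);
print "$_\n" for matching_lines(\@terms, \@lines);
```

The `\b` anchors only behave as word boundaries when a term starts and ends with a word character, so terms with leading or trailing punctuation would need different handling.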

stevexff · Oct 24, 2006

Code:

my %lookup;
open WORDS, "< search.txt" or die "Bad stuff happened.\n$!";
while (<WORDS>) {
    chomp;
    $lookup{lc($_)}++;
}
close WORDS;

open SENTENCES, "< sentence.txt" or die "More bad stuff happened\n$!";
LINE: while (<SENTENCES>) {
    my ($entry, $date, $sentence) = split /\t/, $_;
    my @temp = split ' ', $sentence;
    foreach my $word (@temp) {
        if (exists $lookup{lc($word)} {
           print $_;
           next LINE;
        }
    }
}

Lower-cases the lookup to do case-insensitive matching, should stop the error messages by using exists, and only prints each line once. I suspect that rharsh's original actually worked, though.

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::PerlDesignPatterns)

stevexff · Oct 24, 2006

Oops. Missed a closing paren

Code:

if (exists $lookup{lc($word)}) {

Steve
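
With that paren fix folded in, the whole hash-lookup approach could look roughly like the sketch below. It is not a tested drop-in replacement: the helper names `load_terms` and `grep_terms` are made up, the files are replaced by in-memory strings so it runs on its own, and the `entry3` line is invented to show a non-match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a lower-cased lookup hash from a filehandle holding one term per line.
sub load_terms {
    my ($fh) = @_;
    my %lookup;
    while (my $term = <$fh>) {
        chomp $term;
        $lookup{lc $term} = 1 if length $term;   # ignore blank lines
    }
    return \%lookup;
}

# Return every tab-separated line whose third column contains a known term.
sub grep_terms {
    my ($lookup, $fh) = @_;
    my @hits;
    LINE: while (my $line = <$fh>) {
        chomp $line;
        my (undef, undef, $sentence) = split /\t/, $line;
        next LINE unless defined $sentence;      # skip lines with fewer columns
        foreach my $word (split ' ', $sentence) {
            if (exists $lookup->{lc $word}) {
                push @hits, $line;
                next LINE;                       # print each line only once
            }
        }
    }
    return @hits;
}

# In-memory stand-ins for search.txt and sentence.txt.
my $search_data   = "search\nlooking\n";
my $sentence_data = "entry1\t10/23/06\ti am trying to search for this word\n"
                  . "entry2\t10/22/06\twe are looking for this term\n"
                  . "entry3\t10/21/06\tnothing relevant here\n";

open(my $words, '<', \$search_data)   or die "Couldn't open terms: $!\n";
open(my $lines, '<', \$sentence_data) or die "Couldn't open sentences: $!\n";
my $lookup = load_terms($words);
print "$_\n" for grep_terms($lookup, $lines);
```

Each word costs one hash lookup, so the 40000 terms are not rescanned for every line. The trade-off is that only exact whitespace-delimited words match, so "search," with a trailing comma would be missed.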


rharsh · Oct 24, 2006

Steve, good catch on the case problem and using 'next', I missed both of those.

3inen · Oct 24, 2006

Thanks to stevexff for the improvement.

my @temp = split, $sentence;

is enough to get my work done.

Thank you all for helping me here.
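
A note on that last line, as far as I can tell: `my @temp = split, $sentence;` parses as `(my @temp = split), $sentence;`, so the bare `split` splits the whole line in `$_` on whitespace and `$sentence` is evaluated and then discarded. The entry and date columns therefore land in `@temp` too, which is harmless for this lookup. A small sketch of the explicit equivalent (the helper name `words_in_line` is hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a whole tab-separated line on whitespace, the way a bare `split` treats $_.
sub words_in_line {
    my ($line) = @_;
    return split ' ', $line;   # ' ' means "any run of whitespace", tabs included
}

my @words = words_in_line("entry1\t10/23/06\ti am trying to search for this word");
print scalar(@words), " words, first is '$words[0]'\n";   # 10 words, first is 'entry1'
```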
