regex for duplicates

Status
Not open for further replies.

tonykent

IS-IT--Management
Jun 13, 2002
251
GB
This is entirely an academic question. A guy asked me if it was possible to produce a regex to find duplicate words in a sentence. "Easy," I said... and I've now spent about 3 hours getting nowhere!

Can anyone help?

Using this sentence:

Code:
this is some text with some duplicate words this is

We are looking for a regex to output:

Code:
some this is

Any ideas?

 
Code:
s/(\w+)/$hash{$1}++?$1:''/ge;
It has a few extra spaces in it but it does the job.
I'm sure you can remove those if you want with another regex.


Trojan.
 
F R E A K !!!

That is quite amazing! A star for you!!!


Kind Regards
Duncan
 
And one from me.

I think I may be beginning to understand that...
 
hehehe
You'd be surprised what you can do with regexes.
I did cheat a little, though: it's not pure regex. The /e switch causes the replacement half to be eval'd as Perl code, and I used that with a hash to filter the words down to the repeats (and remove the unique ones). So you have a combination of regex, eval and hash to get this result.



Trojan.
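
For anyone following along, here is a rough unrolled sketch of the mechanism described above: the /e switch, the hash, and the post-increment test. The sub name `repeats_unrolled` and the hash name `%seen` are made up for illustration, and the two tidy-up substitutions at the end are only one assumed way of dealing with the leftover spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: the one-liner s/(\w+)/$hash{$1}++?$1:''/ge spelled out.
sub repeats_unrolled {
    my ($text) = @_;
    my %seen;    # word => number of times seen so far

    # With /e the replacement part is run as Perl code for every match.
    # Post-increment yields the old count, so the test is false (0) the
    # first time a word is seen and true on every later occurrence.
    $text =~ s{(\w+)}{
        my $word = $1;
        $seen{$word}++ ? $word : '';
    }ge;

    return $text;
}

my $out = repeats_unrolled('this is some text with some duplicate words this is');
$out =~ s/^\s+|\s+$//g;    # trim both ends
$out =~ s/\s+/ /g;         # collapse runs of leftover spaces
print "$out\n";            # prints "some this is"
```

Only words are replaced, so every space between them survives the substitution. That is where the extra spaces come from.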
 
Nice job, Trojan. If I may make a slight modification, though: the original will output multiple copies of words that appear more than twice in the string (e.g. for the string `this is some text with some duplicate words this is is', the word `is' is printed twice). Here's a fix:
Code:
$text =~ s/(\w+)/$hash{$1}++==1?$1:''/ge;
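
A small side-by-side sketch of the two tests on the three-`is` string may make the difference easier to see. The variable names are invented for this example, and the surviving words are pulled back out with a list-context match purely to avoid the leftover spaces.

```perl
use strict;
use warnings;

my $text = 'this is some text with some duplicate words this is is';
my (%seen_any, %seen_once);

# Original test: true on every occurrence after the first.
(my $every_repeat = $text) =~ s/(\w+)/$seen_any{$1}++ ? $1 : ''/ge;

# Modified test: true only when the old count is exactly 1,
# i.e. on the second occurrence of a word.
(my $second_only = $text) =~ s/(\w+)/$seen_once{$1}++ == 1 ? $1 : ''/ge;

# Extract the words that survived each substitution.
my @all_repeats = $every_repeat =~ /\w+/g;
my @once_each   = $second_only  =~ /\w+/g;

print "@all_repeats\n";    # prints "some this is is"
print "@once_each\n";      # prints "some this is"
```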
 
And this sorts out the strange(ish) space problem, if you don't mind the output on separate lines:

Code:
#!/usr/bin/perl

$_ = 'this is some text with some duplicate words this is';

s/(\w+)/$hash{$1}++==1?print"$1\n":''/ge;


Kind Regards
Duncan
 
sorry!

It's just that this is going on behind the scenes, and I can't see how it is possible to 'clean' the output any more easily than by outputting a newline:

0 0 0 0 0 some 0 0 this is


Kind Regards
Duncan
 
This will handle the spaces too:
Code:
s/(\w+)(\s+|\Z)/$hash{$1}++==1?"$1$2":''/ge;
Ishnid: Excellent job spotting the repetition issue. That kind of bug is very important to spot. A star for you.



Trojan.
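
If a substitution is not strictly required, the space problem can be sidestepped altogether. The sketch below is an alternative technique, not the approach used in the posts above: it collects the words with a list-context match and filters them with grep, using the same `++ == 1` test. The sub name `duplicate_words` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns each repeated word once, in the order
# in which its second occurrence appears in the text.
sub duplicate_words {
    my ($text) = @_;
    my %seen;    # word => number of times seen so far
    # /\w+/g in list context yields every word; grep keeps a word only
    # at the moment its old count is exactly 1.
    return grep { $seen{$_}++ == 1 } $text =~ /\w+/g;
}

my @dups = duplicate_words('this is some text with some duplicate words this is');
print join(' ', @dups), "\n";    # prints "some this is"
```

Because nothing is edited in place, no stray whitespace is left to clean up. The comparison is case-sensitive, so `The` and `the` would count as different words.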
 
That is an outrageously powerful regex. Nice.


Kind Regards
Duncan
 
LOL.... I got booted down three positions in the MVP list in one day! Ah well.... you guys deserve to be on the top... of the MVP list that is!
 
I don't know why I am at the top... I'm certainly not the most able by any means!!!


Kind Regards
Duncan
 
...and thanks to all who contributed. I'll store this one away for the future.
 
Aren't MVPs supposed to be bad for the environment or something?

f

"As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs."
--Maur
 