Best way to add a space between touching Upper & Lowercase Characters?

PerlNewUser · Apr 30, 2008

Using something like this:

Code:

(pseudocode) 

$line=~s/lcUC/lc  uc/g;

I want to add one or more spaces (and perhaps a period) between upper and lower case characters that touch each other, i.e.:

This is a blue widgetThis is a red widgetThis is a brown widget

should be:

This is a blue widget. This is a red widget. This is a brown widget

I have an Excel file that was exported as delimited text. One of the record fields contains product descriptions whose sentences run together, because much of the information was originally in bulleted lists that the export collapsed into single run-on sentences.

If there were periods or other common punctuation at the end of each sentence, I would be able to easily use them in the above pseudocode to accomplish what I need; but with just lc and uc characters touching, I am not sure of the best way to proceed, since Perl offers multiple options.

Thanks.

brigmar · Apr 30, 2008

Code:

$_ = "This is a brown widgetThis is a red widgetThis is a green widget";
s/([a-z])([A-Z])/\1\. \2/g;
print;

PerlNewUser · Apr 30, 2008

Brigmar, thanks.

What do the 1 and 2 represent? So far, this is not working for me when used as:

$line=s/([a-z])([A-Z])/\1\. \2/g;

brigmar · Apr 30, 2008

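The parentheses in a regex capture their contents to $1, $2, etc.
In the replacement part of a substitution they can also be written as \1, \2, etc., although $1 and $2 are the preferred form there (\1 is really meant for backreferences inside the pattern itself).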

As for not working, you're not applying the substitution to the $line value. Strange, as you had the syntax in your original post.

Code:

$line = "This is a brown widgetThis is a red widgetThis is a green widget";
$line =~ s/([a-z])([A-Z])/\1\. \2/g;   # note the =~ binding operator, not =
print $line;
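To make the difference concrete, here is a small sketch of what plain = does compared with =~. The sample string and variable names are made up for illustration, and the replacement uses $1 and $2 rather than \1 and \2. With plain =, the substitution runs against $_ and the variable on the left only receives the return value of s///, which is the number of substitutions made.

```perl
use strict;
use warnings;

# Sample text, invented for this example.
my $text = "This is a brown widgetThis is a red widget";

# Wrong: plain assignment. s/// operates on $_ and $wrong receives
# its return value (the substitution count), not the modified text.
$_ = $text;
my $wrong = s/([a-z])([A-Z])/$1. $2/g;

# Right: =~ binds the substitution to $right and edits it in place.
my $right = $text;
$right =~ s/([a-z])([A-Z])/$1. $2/g;

print "wrong: $wrong\n";
print "right: $right\n";
```

If $_ happens to be empty or undefined, as it often is in a real script, the plain = version quietly leaves $line holding an empty value, which looks a lot like "nothing happened".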

PerlNewUser · Apr 30, 2008

You mean this will work as-is?

$line=~s/lcUC/lc uc/g;

stevexff · Apr 30, 2008

No, he means that in your original post you were using the correct =~ operator, but in your second post you had the assignment operator = instead...

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::PerlDesignPatterns)

PerlNewUser · May 1, 2008

That was included in the code, but omitted when I copied it to the above reply.

This is what I have:

Code:

 $line =~ s/([a-z])([A-Z])/\1\. \2/g;

I also applied it with alternate characters, i.e.

Code:

 $line =~ s/([a-z])([A-Z])/\1AAA \2/g;

to see if I was simply missing its application, but still no results. Since similar substitution routines work fine (adding spaces where periods already exist, newlines after colons, etc.), I am wondering if data quality is the issue, and not the code (as I am applying it).

The converted Excel file (tab-delimited) contains numerous "junk" characters -- some visible, some not -- so I will re-save it in MS-DOS format and then try this again.

brigmar · May 1, 2008

I'm going to take a guess that there are Carriage Returns and/or your bullet character in there, considering that the original was a bulleted list.

PerlNewUser · May 1, 2008

The code definitely works, just not on that particular field. (It does a terrific job on a URL field but, of course, I don't want to use it there.)

I have an additional chance to apply it separately to the target field and, if that does not work, the data is probably full of invisible carriage returns. That happens frequently, and I will need to re-save the data file as a pure DOS text file to eliminate them.

Thanks for your help.

brigmar · May 1, 2008

Once the file is saved as text, those characters are not invisible, and can be included in the regex.
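As a rough sketch of that idea, and assuming the junk sitting between sentences is some mix of carriage returns, line feeds, and a bullet character (U+2022 here is only a guess; the real file may use something else), you can first dump the field with the non-printable characters made visible, then allow those characters between the two letters in the regex. The sample data and variable names are hypothetical.

```perl
use strict;
use warnings;

# Made-up field with hidden characters between the sentences.
my $field = "This is a blue widget\r\n\x{2022}This is a red widget\rThis is a brown widget";

# Diagnostic copy: show anything outside printable ASCII as <HEX>.
(my $visible = $field) =~ s/([^\x20-\x7E])/sprintf("<%02X>", ord $1)/ge;
print "$visible\n";

# Allow an optional run of CR, LF, or bullet characters between a
# lowercase letter and the uppercase letter that follows it, and
# replace that run (or the empty gap) with a period and a space.
$field =~ s/(?<=[a-z])[\r\n\x{2022}]*(?=[A-Z])/. /g;
print "$field\n";
```

Once the diagnostic line shows which characters are really in the field, the character class can be adjusted to match them.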

Upload the file (to something like box.net) and enter the URL as step 3 (attachment) of the reply section.

PerlNewUser · May 1, 2008

Actually, without even re-saving as DOS text, the code seems to be working just fine on the target field, after adding all fields to an array and applying the substitution to just the target, i.e.:

Code:

$string = $data[6];   ## $data[6] is the target field in the array.
$string =~ s/([a-z])([A-Z])/\1\. \2/g;
$data[6]=$string;

So far, so good...
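For anyone following along, a minimal sketch of the whole split, substitute, and join round trip might look like the following. The sample record is invented; only the index 6 comes from the snippet above. It also shows that the substitution can be applied to $data[6] directly, without the temporary $string.

```perl
use strict;
use warnings;

# Hypothetical tab-delimited record; the description is field 6.
my $record = join "\t",
    'A100', 'Widget', '9.99', 'USD', '12',
    'http://example.com/RedWidget',
    'This is a blue widgetThis is a red widget';

# Split on tabs; the -1 limit keeps trailing empty fields.
my @data = split /\t/, $record, -1;

# Edit only the description. The URL in field 5 is left alone.
$data[6] =~ s/([a-z])([A-Z])/$1. $2/g;

my $fixed = join "\t", @data;
print "$fixed\n";
```

Restricting the substitution to one field is what keeps it from mangling things like the URL field.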

ishnid · May 1, 2008

Personally, I'd do it with a lookahead/lookbehind rather than bothering with capturing groups that you don't really need:

Code:

$string =~ s/(?<=[a-z])(?=[A-Z])/. /g;

Incidentally, there's no need to escape the dot in the replacement string with a backslash. It's a common mistake.
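Here is a quick side-by-side, as a sketch with an invented sample string, showing that the capturing version and the lookaround version give the same output. The lookaround version matches the empty position between the two letters, so there is nothing to put back in the replacement.

```perl
use strict;
use warnings;

my $original = "This is a blue widgetThis is a red widgetThis is a brown widget";

# Capturing groups: both letters are consumed by the match, so the
# replacement has to put them back with $1 and $2.
(my $captured = $original) =~ s/([a-z])([A-Z])/$1. $2/g;

# Lookarounds: only the zero-width gap between the letters matches.
(my $lookaround = $original) =~ s/(?<=[a-z])(?=[A-Z])/. /g;

print $captured eq $lookaround ? "same result\n" : "different result\n";
print "$lookaround\n";
```

Note that the dot in the replacement is a plain character in both versions, so it needs no backslash.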
