Find and Replace Regex question please

(OP)
Hi all

I'm after a bit of help again, please.

Every day I have to go through a 300-400 MB XML file correcting errors. Some of these can be fixed with a standard find/replace; however,
I have a set of issues where I guess regex is better placed to help, because I need to find and remove /'s, and a / is of course also part of every closing tag.
The problem is that I'm very new to regex and need some pointers/advice, please.

Being XML, the file has start/end tags, and the end tags contain the same / character I need to remove. However, some of the fields will only accept A-Z0-9 as valid input,

e.g. <Ref>1234/567890</Ref>. The other obstacle is that the numbers are random.

I first need to identify the tags that contain this error, which I've been able to do with a find for "<Ref>[0-9]{4}/", and I then correct the
data manually. That's fine if you find 3 or 4, but with 100+ it's a bit of a b****r.

Knowing that it's always inside the <Ref> tags, what I need to be able to do is remove the /'s between the tags, leaving just the numbers,
i.e. <Ref>1234567890</Ref>

Can anyone help or provide any pointers on how, or indeed whether, this can be done?

Many thanks. (Apologies if this is the wrong forum, but this Perl forum has provided a good deal of regex help.)

PaulSc

RE: Find and Replace Regex question please

Hi

Are you sure there will always be exactly one slash?

Anyway, here are 3 regular expressions; comment them out in turn to see which one you need:

CODE --> Perl

use strict;
use warnings;

my $xml = do { local $/; <DATA> };

# assume there is at most 1 slash there
$xml =~ s:(<Ref>\d*)/:$1:g;

# remove any amount of slashes
$xml =~ s:(<Ref>)(.+?)(</Ref>):$1.$2=~y!/!!dr.$3:ge;

# remove any amount of non-digit characters
$xml =~ s:(?<=<Ref>).+?(?=</Ref>):$&=~s!\D!!gr:ge;

print $xml;

__DATA__
<foo>
    <Ref>1234/567890</Ref>
    <bar>
        <Ref>1234/567890</Ref><Ref>1234/567890</Ref>
    </bar>
    <nah>1234/567890</nah>
    <Ref>1234/5/6/7890</Ref>
    <Ref>1234:5/6-7890</Ref>
</foo> 
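
To see what each of the three substitutions does by itself, here is a minimal sketch that wraps each one in a small sub and runs it on values from the test data. The sub names first_slash, all_slashes and digits_only are made up for illustration, and the sketch assumes Perl 5.14 or later for the /r flag, which the code above already relies on.

```perl
use strict;
use warnings;

# Each sub applies one of the three substitutions on its own.
# /r returns the modified copy and leaves the input untouched (Perl 5.14+).
# The sub names are hypothetical, chosen only for this illustration.
sub first_slash { return $_[0] =~ s:(<Ref>\d*)/:$1:gr }
sub all_slashes { return $_[0] =~ s:(<Ref>)(.+?)(</Ref>):$1.$2=~y!/!!dr.$3:ger }
sub digits_only { return $_[0] =~ s:(?<=<Ref>).+?(?=</Ref>):$&=~s!\D!!gr:ger }

# Sample values taken from the __DATA__ section above.
my @samples = (
    '<Ref>1234/567890</Ref>',
    '<Ref>1234/5/6/7890</Ref>',
    '<Ref>1234:5/6-7890</Ref>',
    '<nah>1234/567890</nah>',
);

for my $in (@samples) {
    print "$in\n";
    print "  first slash : ", first_slash($in), "\n";
    print "  all slashes : ", all_slashes($in), "\n";
    print "  digits only : ", digits_only($in), "\n";
}
```

On this data, only the third variant cleans up <Ref>1234:5/6-7890</Ref>, the first variant removes just one slash from <Ref>1234/5/6/7890</Ref>, and none of them touches <nah>. One caveat: .+? needs at least one character, so an empty <Ref></Ref> could make the second and third patterns run on into a following element on the same line. If empty elements can occur, [^<]* in place of .+? is probably the safer choice, though that is an assumption about the data and not something tested against the real file.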

Feherke.
feherke.github.io

RE: Find and Replace Regex question please

(OP)
Feherke, thank you for your detailed answer, much appreciated.

Unfortunately, I've discovered we're not going to be allowed to install or run Perl.

We do have Notepad++ (in its basic install form, i.e. no XML Tools plugin etc.), so can anyone suggest how we could use regex to remove any /'s (or -'s or *'s etc.) that appear between <Ref></Ref> tags, whilst still maintaining the standard XML tags and keeping the rest of the data?

i.e. <Ref>1234/567890</Ref>
should become <Ref>1234567890</Ref>
or
<Ref>000*000001234</Ref>
should become <Ref>000000001234</Ref>

We've seen 4 numbers/then 8+ numbers, or 3 numbers/numbers, so there is no fixed format apart from the (so far) single / between the <Ref> tags that we don't want to be there.

We know we could first change </ to <#, then change every / to "", then change <# back to </, but that's not really safe or feasible on what's now a 1.7-million-row XML file, and it makes a massive assumption that there are no /'s elsewhere.
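
The risk with that three-step swap can be shown on a small scale. The following is only a sketch: the sub name placeholder_swap and the <Date> and <Empty/> elements are invented for illustration and are not taken from the real file.

```perl
use strict;
use warnings;

# Hypothetical helper reproducing the three-step placeholder swap.
sub placeholder_swap {
    my ($text) = @_;
    $text =~ s{</}{<#}g;   # step 1: protect closing tags
    $text =~ s{/}{}g;      # step 2: delete every remaining slash
    $text =~ s{<#}{</}g;   # step 3: restore closing tags
    return $text;
}

# Made-up sample: a date field and a self-closing element next to the Ref.
my $sample = '<Ref>1234/567890</Ref><Date>01/02/2020</Date><Empty/>';
print placeholder_swap($sample), "\n";
```

The <Ref> value comes out as wanted, but the slashes in the invented <Date> value are removed too, and the self-closing <Empty/> loses its slash and is no longer well-formed. Any "<#" already present in the data would also be turned into "</" in step 3.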

Cheers and Thanks again

RE: Find and Replace Regex question please

Hi

I have never used Notepad++ before, but this works for my test data:
Find what : (<Ref>\d*)[/*]
Replace with : $1
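
For anyone who wants to check the pattern away from the editor first, here is a rough Perl sketch of the same find/replace. It removes one separator per <Ref> on each pass, so the sketch repeats the substitution until nothing changes, which corresponds to pressing Replace All again until no more matches are found. The sub name strip_separators is made up, and the hyphen in the character class is an assumed extra separator based on the "-'s" mentioned in the question.

```perl
use strict;
use warnings;

# Same idea as the "Find what" field; the '-' is an assumed extra separator.
my $find = qr{(<Ref>\d*)[/*-]};

# Hypothetical helper: repeat the replace until a pass changes nothing.
sub strip_separators {
    my ($text) = @_;
    1 while $text =~ s/$find/$1/g;   # s///g returns the number of replacements
    return $text;
}

print strip_separators('<Ref>000*000001234</Ref>'), "\n";
print strip_separators('<Ref>1234/5/6/7890</Ref>'), "\n";
print strip_separators('<nah>1234/567890</nah>'), "\n";
```

Because the pattern only accepts digits between <Ref> and the separator, a value with a character outside the class (a colon, say) is left alone for manual review instead of being silently changed.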

However, I would say that for files of that size it is better to use a dedicated tool instead of a text editor. In thread215-1789240 ("Inserting same line of text of every web page - "link" insert?") our fellow member spamjim recommended grepWin as such a tool. It would be worth seeing whether you can get approval to install it (or maybe you already have it).

Feherke.
feherke.github.io

RE: Find and Replace Regex question please

feherke,
Nice example about using regular expressions in Perl.
Today I learned something from you again. You deserve the star.

RE: Find and Replace Regex question please

Just curious, why don't they allow Perl to be used? I work for a multi-billion dollar company and we use a number of scripting languages, including Perl. If they don't want it installed on the network, there are many free versions that can be installed on Windows. ActiveState is a good one. I say Windows because the POSIX systems like Unix, Solaris, and Linux (among others) come with Perl in the base install.

Bill
Lead Application Developer
New York State, USA
