Find and Replace Regex question please

PaulSc · Aug 22, 2018

Hi all

After a bit of help again please

Every day I have to go through a 300-400 MB XML file correcting errors. Some of these can be fixed with a standard find/replace. However,
I have a set of issues where I guess regex is better placed to help, as I need to identify and remove /'s, which of course also appear in every closing tag.
The problem is I'm very new to regex and need some pointers/advice, please.

Being XML, it contains start/end tags, which also contain the / character I need to remove. However, some of the fields will only accept A-Z0-9 as valid input,

e.g. <Ref>1234/567890</Ref>. The other stumbling block is that the numbers are random.

I first need to identify the tags which contain this error, which I've been able to do using a find for "<Ref>[0-9]{4}/", and I then manually correct the
data. That's fine if you find 3 or 4, but finding 100+ is a bit of a b****r.

Knowing that it's always inside the <Ref> tags, what I need to be able to do is remove the /'s between the tags, leaving just the numbers,
i.e. <Ref>1234567890</Ref>

Can anyone help/provide any pointers on how, or indeed if, this can be done?

Many thanks. (Apologies if this is in the wrong forum, but this Perl forum has provided a good deal of regex help.)

PaulSc

feherke · Aug 23, 2018

Hi

Are you sure there will always be one and only one slash?

Anyway, here are 3 regular expressions; play with commenting them out to see which one you need:

Perl:

use strict;
use warnings;

my $xml = do { local $/; <DATA> };

# assume there is at most 1 slash there
$xml =~ s:(<Ref>\d*)/:$1:g;

# remove any amount of slashes
$xml =~ s:(<Ref>)(.+?)(</Ref>):$1.$2=~y!/!!dr.$3:ge;

# remove any amount of non-digit characters
$xml =~ s:(?<=<Ref>).+?(?=</Ref>):$&=~s!\D!!gr:ge;

print $xml;

__DATA__
<foo>
    <Ref>1234/567890</Ref>
    <bar>
        <Ref>1234/567890</Ref><Ref>1234/567890</Ref>
    </bar>
    <nah>1234/567890</nah>
    <Ref>1234/5/6/7890</Ref>
    <Ref>1234:5/6-7890</Ref>
</foo>
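
To make the difference between the three substitutions easier to see, here is a minimal sketch that wraps each one in its own sub and applies them separately to a few of the sample lines. The sub names are hypothetical and not part of the script above. The /r flag used here, and in the script, needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper names; each wraps one substitution from the script above.
# The /r flag (Perl 5.14+) returns a modified copy instead of editing in place.

# Removes one slash that directly follows the leading digits of a <Ref>.
sub strip_first_slash {
    my ($s) = @_;
    return $s =~ s:(<Ref>\d*)/:$1:gr;
}

# Removes every slash between <Ref> and </Ref>.
sub strip_all_slashes {
    my ($s) = @_;
    return $s =~ s:(<Ref>)(.+?)(</Ref>):$1.$2=~y!/!!dr.$3:ger;
}

# Removes every non-digit character between <Ref> and </Ref>.
sub strip_non_digits {
    my ($s) = @_;
    return $s =~ s:(?<=<Ref>).+?(?=</Ref>):$&=~s!\D!!gr:ger;
}

my @samples = (
    '<Ref>1234/5/6/7890</Ref>',
    '<Ref>1234:5/6-7890</Ref>',
    '<nah>1234/567890</nah>',
);

for my $line (@samples) {
    print "input             : $line\n";
    print "  first slash only: ", strip_first_slash($line), "\n";
    print "  all slashes     : ", strip_all_slashes($line), "\n";
    print "  digits only     : ", strip_non_digits($line),  "\n";
}
```

For the second sample only the third variant yields <Ref>1234567890</Ref>: the first finds no slash directly after the leading digits, and the second deletes slashes but leaves the : and - alone. The <nah> line is untouched by all three. Since \D also strips letters, a field that may legitimately hold A-Z would presumably need [^A-Za-z0-9] in its place.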

Feherke.
feherke.github.io

PaulSc · Sep 5, 2018

Feherke, thank you for your detailed answer, much appreciated.

Unfortunately I've discovered we're not going to be allowed to install/run Perl.

We do have Notepad++ (in its basic install form, i.e. no XML Tools plugin etc.), so can anyone please suggest how we could use regex to remove any /'s (or -'s or *'s etc.) that appear between <Ref></Ref> tags, whilst still keeping the standard XML tags and the rest of the data?

e.g. <Ref>1234/567890</Ref>
should become <Ref>1234567890</Ref>
or
<Ref>000*000001234</Ref>
should become <Ref>000000001234</Ref>

We've seen that we can have 4 numbers/then 8+ numbers, or 3 numbers/numbers, so there is no fixed format bar the (so far) single / between the <Ref> tags that we don't want to be there.

We know we could first change </ to <#, then change every / to "", then change <# back to </, but that's not really safe or feasible on what's now a 1.7 million row XML file, and it makes a massive assumption that there are no /'s elsewhere.

Cheers and Thanks again

feherke · Sep 6, 2018

Hi

I have never used Notepad++ before, but this works for my test data:
Find what : (<Ref>\d*)[/*]
Replace with : $1

However, I would say that for files of that size it is better to use a dedicated tool instead of a text editor. In thread215-1789240 our fellow member spamjim recommended grepWin as such a tool. It may be worth checking whether you can get approval to install it (or maybe you already have it).
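
The pattern above removes one / or * per <Ref> and only when digits precede it. As a hedged sketch of a more general variant, the sub below (clean_refs is a hypothetical name) strips every character that is not A-Z, a-z or 0-9 between <Ref> and </Ref>, assuming the element holds plain text with no nested tags or entities. It is written in Perl only so that the pattern can be checked. It relies on \G and \K rather than code in the replacement, so the same Find what with an empty Replace with may carry over to Notepad++, whose regex engine is documented as supporting both, though that has not been tested here.

```perl
use strict;
use warnings;

# Hypothetical helper: removes every character other than A-Z, a-z, 0-9
# from the text between <Ref> and </Ref>. Assumes no nested tags or
# entities inside <Ref>.
sub clean_refs {
    my ($xml) = @_;
    # <Ref>      start right after an opening tag, or
    # \G(?!\A)   continue where the previous match ended (but not at offset 0)
    # [A-Za-z0-9]*  skip over the characters we want to keep
    # \K         drop everything matched so far from the replaced text
    # [^A-Za-z0-9<]+  the unwanted run; "<" is excluded so </Ref> is never touched
    $xml =~ s{(?:<Ref>|\G(?!\A))[A-Za-z0-9]*\K[^A-Za-z0-9<]+}{}g;
    return $xml;
}

my $sample = <<'XML';
<foo>
    <Ref>1234/567890</Ref>
    <nah>1234/567890</nah>
    <Ref>1234:5/6-7890</Ref>
    <Ref>000*000001234</Ref>
</foo>
XML

print clean_refs($sample);
```

Because each match has to begin either at <Ref> or exactly where the previous one stopped, slashes in closing tags and in other elements such as <nah> are left alone.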

Feherke.
feherke.github.io

mikrom · Sep 7, 2018

feherke,
Nice example about using regular expressions in Perl.
Today I learned something from you again. You deserve the star.

Beilstwh · Sep 7, 2018

Just curious, why don't they allow Perl to be used? I work for a multi-billion-dollar company and we use a number of scripting languages, including Perl. If they don't want it installed on the network, there are many free versions that can be installed on Windows. ActiveState is a good one. I say Windows because POSIX systems like Unix, Solaris, and Linux (among others) come with Perl in the base install.

Bill
Lead Application Developer
New York State, USA

PaulSc · Nov 13, 2018

Thanks for your help...
I work for a "bank", meaning that everything's controlled/locked down etc., so no additional software, whether it's free or not. So we have to make do with the tools available, and Notepad++ is the tool of choice.
