Remove unwanted character from within field

confuseddddd · Sep 10, 2003

Have a file that uses * as its field separator. Within in the file, there are is a certain type of record that is containing * where it should not be, (within a specific field).

I wanted to write a script that would read in the file and edit out the * from the field where it should not be...

Here is some examples of the "bad" records:
CLM*1065WC*12755*173.00*WC**11^^1******EM^^^TXLM*1065WC*12755*173.00*WC**11^^1******EM^^^TX
(remove asterick between 1065WC and 12755, leave the other astericks)

CLM*829495*1*6*2430.00*WC**22^^1******EM^^^TX
(remove astericks between 829495 and 1 and 6, leave the other astericks) CLM*829495*1*6*2430.00*WC**22^^1******EM^^^TX

Here are some examples of the "good" records:
CLM*G3I261193*111.00*WC**11^^1******EM^^^TX
CLM*NA*26.70*WC**12^^1******EM^^^TX

I was hoping to find the CLM* record, then find within the CLM* record, the amount field with the leading * and then what is between those 2 fields, to check if there is another asterick(s). If there is, to remove the asterick(s).

Anyone have any ideas on how to accomplish this???

bi · Sep 10, 2003

Try to get the program that generates the file to use something else for a field separator?

confuseddddd · Sep 10, 2003

cannot, its "vendor" generated.
That's why I was trying to fix the records before it goes onto the set of processing.

Clairvoyant1332 · Sep 10, 2003

The main problem here, since the field separator is also a data character, is that you have no real way of knowing where one field ends and the next begins. The best way of handling this would be to check and edit the individual fields BEFORE they are written to the file. If that's not possible, then you'll have use some less exact methods.

In your examples, it was the 2nd field that contained the * character. I also noticed tht the 3rd field contains a decimal point. If you know for sure that ONLY the 2nd field has the problem, AND the 2nd field NEVER has a decimal point, AND the 3rd field ALWAYS has a decimal point, then you can use that to key off of. If that's not the case, then you risk "fixing" data you shouldn't be.

You also know how many field separators should exist (11 in this case), and should only attept to make a fix if you count more than that.

Dennis

vgersh99 · Sep 10, 2003

How about this assumptiom:

the number of numeric fields preceeding the with field with the value of "WC" should always be one.

If that's a valid assumption, we could possibly come up with with a regex and use either awk/sed to do the job.

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

Annihilannic · Sep 10, 2003

Working on vlad's assumption, I think this will do the trick:

[tt]perl -nwe '
# if it's a CLM* record
if (/^CLM\*/) {
# while there are still asterisks in the
# middle of the field, search and replace.
while (s/(CLM\*[^*]*)\*(.*\*[0-9.]+\*WC\*)/$1$2/) { };
print;
}
' inputfile > outputfile[/tt]

I'm sure there must be a neater version of that regex...

Annihilannic.

confuseddddd · Sep 11, 2003

Yes, the safe "assumption" is the 1st field will be the CLM* identifier and the 3rd field will be an amount field with a decimal point.

confuseddddd · Sep 11, 2003

I took the suggestion from above and created a nawk script; however, I am receiving syntax errors on the while statement. Can anyone help me out with this. Am new to using nawk (our system does not support perl), any assistance would be most appreciated.

nawk 'BEGIN{
}

##Now the processing begins, looking for all CLM* records
{
if (/^CLM\*/)
{
# while there are still asterisks in the
# middle of the field, search and replace.
while (s/(CLM\*[^*]*)\*(.*\*[0-9.]+\*WC\*)/$1$2/) { }
print
}

}

END {

Annihilannic · Sep 11, 2003

nawk's regex isn't quite as clever as Perl. To do substitution in nawk use the sub() or gsub() functions. I don't think they are capable of inserting matched strings in the resultant output though, on which my solution is dependent.

An alternative solution using sed:

[tt]grep '^CLM\*' inputfile | sed -e '
s/$CLM\*[^*]*$\*$.*\*[0-9.][0-9.]*\*WC\*$/\1\2/
' > outputfile[/tt]

Unfortunately I'm not aware of a while construct in sed, so if there are multiple asterisks the only way I know of to handle that is to have multiple search-and-replaces, e.g.:

[tt]grep '^CLM\*' inputfile | sed -e '
s/$CLM\*[^*]*$\*$.*\*[0-9.][0-9.]*\*WC\*$/\1\2/
s/$CLM\*[^*]*$\*$.*\*[0-9.][0-9.]*\*WC\*$/\1\2/
s/$CLM\*[^*]*$\*$.*\*[0-9.][0-9.]*\*WC\*$/\1\2/
' > outputfile[/tt]

Keep in mind that you may need to shorten the regex if some fields don't contain *WC*.

Annihilannic.

vgersh99 · Sep 11, 2003

# a bit hacky, but.....
BEGIN {
FS="*"
rightAncor="WC"
pat="[*][0-9.*]+[*]" rightAncor
}

{
while (match($0, pat)) {
before=RSTART-1;
after=RSTART+RLENGTH;
str=substr($0, RSTART+1, RLENGTH-2);
gsub("[*]", "", str)
str=FS str FS rightAncor
$0=substr($0, 1,before) str substr($0, after)
}
gsub("W[*]", "*", $0)
#print "output->[" $0 "]"
print $0
}

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

vgersh99 · Sep 11, 2003

a bit better:

BEGIN {
FS="*"
rightAncor="WC"
pat="[*][0-9.*]+[*]" rightAncor
}

{
while (match($0, pat)) {
before=RSTART-1;
after=RSTART+RLENGTH;
str=substr($0, RSTART+1, RLENGTH-2);
gsub("[*]", "", str)
sub(".$", SUBSEP, str)
str=FS str FS rightAncor
$0=substr($0, 1,before) str substr($0, after)
}
gsub(SUBSEP, "", $0)
#print "output->[" $0 "]"
print $0
}

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

Ygor · Sep 11, 2003

Using the assumptions given above...

awk 'BEGIN { FS=OFS="*" }
{
while ( $3 !~ "^[0-9]*\.[0-9][0-9]$" ) {
$2=$2 $3;
for (x=3; x<split($0,a); x++)
$x=$(x+1);
}
print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12;
}' file1 > file2

Tested....
$ diff file1 file2
1,2c1,2
< CLM*[blue]1065WC*12755[/blue]*173.00*WC**11^^1******EM^^^TX
< CLM*[red]829495*1*6[/red]*2430.00*WC**22^^1******EM^^^TX
---
> CLM*[blue]1065WC12755[/blue]*173.00*WC**11^^1******EM^^^TX
> CLM*[red]82949516[/red]*2430.00*WC**22^^1******EM^^^TX

vgersh99 · Sep 11, 2003

As far as I understood, you might potentially have 'bad' parts embeded into the same record:
CLM*1065WC*12755*173.00*WC**11^^1******EM^^^TXLM*1065WC*12755*173.00*WC**11^^1******EM^^^TX

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

confuseddddd · Sep 17, 2003

All your suggestions were great and I thank you but I tried the awk script and it doesn't appear to be working...

This is the error I receive:
awk: 0602-537 A single giant record was built
starting at `REF*87004010X0980040...'

The input line number is 5. The file is file1.
The source line number is 5.

The input file is many records all with the * as the delimiter. The complete record the system is balking at is...(with the REF*87*004010X098 referenced above)
ST*837*091500360
BHT*0019*00*2003091518000067*20030915*1800*CH
REF*87*004010X098
NM1*41*2*TEST*****46*999999999
PER*IC*TEST1*TE*555-555-5555
NM1*40*2*TEST*****46*999999999
HL*1**20*1
NM1*85*2*TEST*****24*2222222222
N3*ADDR1*
N4*CITY*CA*90675
HL*2*1*22*0
SBR*P*18*1234567 02
NM1*CA*1*LAST*FIRST****ZZ*1111111111
N3*ADDR1
N4*CITY*CA*90041
DMG*D8*19690827*M
REF*23*UNKNOWN
CLM*274*1979731*4.00*WC**21^^1******EM^^^IL
DTP*439*D8*20021015
DTP*523*D8*20030807
AMT*ST*4.00
REF*Y4*2700204525001
REF*D9*2003091510014338
NTE*TPO*000001
HI*BK^V72.6
NM1*82*2*NAME*****24*222222222222
PRV*PE*ZZ*888B99999N
N3*ADDR1
N4*CITY*CA*90045
LX*1
SV1*HC^85018^26*4.00*UN*1*21**1
DTP*472*D8*20030607
AMT*ST*4.00
SE*34*091500360

Any suggestions????

vgersh99 · Sep 17, 2003

if on Solaris, use either /usr/bin/nawk OR /usr/xpg4/bin/awk

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

Ygor · Sep 18, 2003

awk 'BEGIN { FS=OFS="*" }
{
if ($1=="CLM&quot

{
while ( $3 !~ "^[0-9]*\.[0-9][0-9]$" ) {
$2=$2 $3;
for (x=3; x<split($0,a); x++)
$x=$(x+1);
}
print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12;
} else {
print $0;
}
}' file1 > file2

confuseddddd · Sep 26, 2003

Tried with the both example and worked except for a few items:
diff CLM_TEST1 CLM_FIX1
These did not remove the unwanted * but removed the other fields at the end of the record....
58c58
< CLM*852005*916612*220.00*WC**11^^1******EM^^^JJ
---
> CLM*852005*916612*220.00*WC**11^^1*****
288c288
< CLM*62710337*1532*255.00*WC**11^^1******EM^^^MM
---
> CLM*62710337*1532*255.00*WC**11^^1*****
338c338
< CLM*LLR*16254*001*156.00*WC**11^^1******EM^^^LL
---
> CLM*LLR*16254*001*156.00*WC**11^^1****

These worked, which is pretty cool....
20c20
< CLM*3061951*11*725.00*WC**22^^1******EM^^^NN
---
> CLM*306195111*725.00*WC**22^^1******EM^^^NN
104c104
< CLM*15200591661*2*610.00*WC**11^^1******EM^^^JJ
---
> CLM*152005916612*610.00*WC**11^^1******EM^^^JJ
178c178
< CLM*1162842*1*12*1469.00*WC**24^^1******EM^^^NN
---
> CLM*1162842112*1469.00*WC**24^^1******EM^^^NN
216c216
< CLM*A8376281*1*405.00*WC**21^^1******EM^^^XX
---
> CLM*A83762811*405.00*WC**21^^1******EM^^^XX
254c254
< CLM*B22302848*1*1*80.00*WC**22^^1******EM^^^JJ
---
> CLM*B2230284811*80.00*WC**22^^1******EM^^^JJ

Ygor · Sep 29, 2003

I've just tried the code as posted. It seemed to work okay....

< CLM*852005*916612*220.00*WC**11^^1******EM^^^JJ
< CLM*62710337*1532*255.00*WC**11^^1******EM^^^MM
< CLM*LLR*16254*001*156.00*WC**11^^1******EM^^^LL
---
> CLM*852005916612*220.00*WC**11^^1******EM^^^JJ
> CLM*627103371532*255.00*WC**11^^1******EM^^^MM
> CLM*LLR16254001*156.00*WC**11^^1******EM^^^LL

Perhaps you have not cut-and-pasted the code correctly? Check that there is a backslash on line 4...

while ( $3 !~ "^[0-9]*[red]\[/red].[0-9][0-9]$" ) {

confuseddddd · Sep 29, 2003

This is the code being executed:
awk 'BEGIN { FS=OFS="*" }
{
if ($1=="CLM&quot

{
while ( $3 !~ "^[0-9]*\.[0-9][0-9]$" ) {
$2=$2 $3;
for (x=3; x<split($0,a); x++)
$x=$(x+1);
}
print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12;
} else {
print $0;
}
}' CLM_TEST1 > CLM_FIX1

and I do have a back slash

I just reran the script again and the same results as before.

Ygor · Sep 29, 2003

As I said, I tested the code and it worked fine. What version are you using? You could try a slight variation....

while ( $3 !~ /^[0-9]*[.][0-9][0-9]$/ ) {

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Remove unwanted character from within field

Programmer

Technical User

Programmer

Programmer

Programmer

MIS

Programmer

Programmer

MIS

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Similar threads

Log in

Part and Inventory Search

Sponsor