Re-formatting a file 3

Status
Not open for further replies.

meinida

Technical User
Dec 7, 2006
US
I have a fairly large file (40 meg) of text that looks like this

IMSI = 123456000000049
APNTPLID = 3
QOSTPLID = 108
APNTPLID = 1
QOSTPLID = 108
APNTPLID = 2
QOSTPLID = 108
APNTPLID = 5
QOSTPLID = 108
APNTPLID = 6
QOSTPLID = 108
IMSI = 123456000000011
APNTPLID = 3
QOSTPLID = 108
APNTPLID = 1
QOSTPLID = 108
APNTPLID = 2
QOSTPLID = 108
APNTPLID = 5
QOSTPLID = 108
APNTPLID = 6
QOSTPLID = 108
IMSI = 123456000000050
APNTPLID = 3
QOSTPLID = 108
APNTPLID = 1
QOSTPLID = 108
APNTPLID = 2
QOSTPLID = 108
APNTPLID = 5
QOSTPLID = 108
IMSI = 123456000000075
APNTPLID = 3
QOSTPLID = 108
APNTPLID = 1
QOSTPLID = 108
APNTPLID = 2
QOSTPLID = 108
APNTPLID = 5
QOSTPLID = 108
APNTPLID = 6
QOSTPLID = 108

I would like to make it look like this

IMSI = 123456000000049,APNTPLID = 3,QOSTPLID = 108,APNTPLID = 1,QOSTPLID = 108,APNTPLID = 2,QOSTPLID = 108,APNTPLID = 5,QOSTPLID = 108,APNTPLID = 6,QOSTPLID = 108
IMSI = 123456000000011,APNTPLID = 3,QOSTPLID = 108,APNTPLID = 1,QOSTPLID = 108,APNTPLID = 2,QOSTPLID = 108,APNTPLID = 5,QOSTPLID = 108,APNTPLID = 6,QOSTPLID = 108
IMSI = 123456000000050,APNTPLID = 3,QOSTPLID = 108,APNTPLID = 1,QOSTPLID = 108,APNTPLID = 2,QOSTPLID = 108,APNTPLID = 5,QOSTPLID = 108
IMSI = 123456000000075,APNTPLID = 3,QOSTPLID = 108,APNTPLID = 1,QOSTPLID = 108,APNTPLID = 2,QOSTPLID = 108,APNTPLID = 5,QOSTPLID = 108,APNTPLID = 6,QOSTPLID = 108

Is AWK the right tool to use? I have heard it is very powerful but I have not been able to figure out how to get the result I want.

I have tried different commands but they either don't work or don't have any change on the format.

awk -F'=' '$1=="IMSI" $2=="APNTPLID" $3=="QOSTPLID" {print $1, $2, $3}' output.txt
awk -F'=', '"$1=="IMSI", $2=="APNTPLID", $3=="QOSTPLID"" {print $1, $2, $3}' output.txt
awk '"$1 =/IMSI/, $2 =/APNTPLID/, $3 =/QOSTPLID/" {print $1, $2, $3}' output.txt
 
awk '
NR==1 {LINE=$0
next}
/^IMSI/ {print LINE
LINE=$0
next}
{LINE=LINE "," $0}
END {print LINE}' your-input-file>your-output-file

This assumes that the input file starts with the 'IMSI' record.
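
If the file might not start with an 'IMSI' record, or might be empty, a variant along these lines could be used. This is only a sketch, not the script posted above: the function name join_records and the variables line and have are made up for illustration, and it assumes that any lines before the first 'IMSI' record should simply be dropped.

```shell
# Hypothetical helper (the name join_records is an assumption for this sketch).
# Joins each IMSI line and the lines that follow it into one comma-separated line.
join_records() {
    awk '
    /^IMSI/ { if (have) print line    # flush the previous record, if there is one
              line = $0; have = 1; next }
    have    { line = line "," $0 }    # append to the current record
    END     { if (have) print line }  # flush the last record
    ' "$@"
}
```

It would be called the same way as the original, for example: join_records your-input-file > your-output-file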
 
Thank you for your reply

This is the error I get when I try to run the command

awk 'NR==1 {LINE=$0 next} /^IMSI/ {print LINE LINE=$0 next}{LINE=LINE "," $0}END{print LINE}' output.txt
awk: syntax error at source line 1
context is
NR==1 {LINE=$0 >>> next <<< } /^IMSI/ {print LINE LINE=$0 next}{LINE=LINE "," $0}END{print LINE}
awk: illegal statement at source line 1

Is it possible the formatting of the input file is bad?
 
Looks like the awk program was reformatted. My original post has 8 distinct lines. Also, what platform are you running on? I tested my script under AIX.
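
For anyone who wants the program on a single line: in awk, statements that share a line have to be separated with semicolons, which is what the reformatted version was missing (hence the syntax error at 'next'). A sketch of the same logic on one line, wrapped in a shell function whose name (join_imsi) is just an illustration:

```shell
# Hypothetical wrapper (the name join_imsi is an assumption for this sketch).
# Same logic as the 8-line script, with ";" between statements on the same line.
join_imsi() {
    awk 'NR==1 {LINE=$0; next} /^IMSI/ {print LINE; LINE=$0; next} {LINE=LINE "," $0} END {print LINE}' "$@"
}
```

Like the original, this still assumes that the first line of the input is an 'IMSI' record.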
 
Turns out I am an idiot. I did not realize there was significance in the 8 distinct lines. I ran the command again as you posted it and it works.

Is there something other than AWK that might be better? The file I am running this on is around 40 meg in size and I think it is pulling it all into ram before it writes the output.

Thank you again for your help
 
Not sure if awk pulls the entire file into RAM for this type of processing. How long did it take to process? How many input lines?

I doubt that there is something better in terms of risk/reward. A high-level language might process it a few seconds faster, but the only high-level language I know is COBOL. It would have taken me maybe 30 minutes to write, compile and test in COBOL; it took me under two minutes in awk.
 
COBOL is compiled, so it will be faster than interpreted awk. On the other hand, awk (compared to COBOL) is free, simpler, and supports regular expressions.
 
You can compare the awk solution with my C solution.
I tried it only for educational purposes, to see if I'm able to do it in C :)
It took me a long time, because I'm not an experienced C programmer and didn't know the library functions for working with strings.
IMO, doing it in awk (or another scripting language) is simpler than in C.

meinida.c
Code:
#define MAXLINE 10000  // maximum line length
#define substr  "IMSI" // substring that starts a new record

#include <stdio.h>
#include <string.h>
#include <stdlib.h>


int main(int argc, char *argv[]) {
  const char *filename;
  char line[MAXLINE], line_out[MAXLINE];
  FILE *file;
  long nr_line;

  // the input file name is required
  if (argc < 2)
  {
    fprintf(stderr, "usage: %s file\n", argv[0]);
    exit(EXIT_FAILURE);
  }
  filename = argv[1];
  file = fopen(filename, "r");

  // if file doesn't open then exit with error
  if (file == NULL)
  {
    perror(filename);
    exit(EXIT_FAILURE);
  }

  nr_line = 0;
  line_out[0] = '\0';  // start with an empty output line
  while (fgets(line, sizeof(line), file) != NULL) {
    nr_line++;
    // chomp line
    line[strcspn(line, "\n")] = '\0';
    if (strstr(line, substr)) {
      if (nr_line > 1) {
         // print line_out for output
         printf("%s\n", line_out);
      }
      // create new line_out
      strcpy(line_out, line);
    }
    else {
      // add line to line_out
      // (no bounds check: assumes a joined record fits in MAXLINE)
      strcat(line_out, ";");
      strcat(line_out, line);
    }
  }
  // at end: print last line
  printf("%s\n", line_out);

  // close file
  fclose(file);

  // at end return
  return 0;
}

Compilation and running:
Code:
$ gcc meinida.c -o meinida
$ ./meinida meinida.txt > meinida_out.csv

Output: meinida_out.csv
Code:
IMSI = 123456000000049;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108
IMSI = 123456000000011;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108
IMSI = 123456000000050;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108
IMSI = 123456000000075;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108
 
You have made my point!! It only took me a couple of minutes to write the awk program, in only 8 lines. How did the run time compare?
 
Hi

meinida said:
The file I am running this on is around 40 meg in size and I think it is pulling it all into ram before it writes the output.
No, regular Awk implementations do not slurp in the entire input at once. RS may change at any time, affecting the next record to be read.

michaelvv's code reads one line of input at a time, builds up one output line in memory, then outputs it and discards it.

There is a way to do it even more efficiently in terms of memory usage, although it runs slower than michaelvv's code:
Code:
{
    printf("%s%s", NR == 1 ? "" : $1 == "IMSI" ? "\n" : ",", $0)
}
END {
    print ""
}
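This reads one line of input and outputs it immediately. The trick is that it prints nothing before the first line; for every later line it first prints a separator (a newline before an 'IMSI' line, a comma otherwise) and then the line itself. The END block prints the final newline.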

Feherke.
feherke.ga
 
michaelvv said:
You have made my point!! It only took me a couple of minutes to write the awk program, in only 8 lines. How did the run time compare?
I didn't compare the run time of the C and awk programs - maybe the OP could do it.

But IMO the awk solution is more flexible, because awk is a language specialised for text processing.
In awk we don't need to care about opening files, reading them line by line, the maximum length of a string, etc. The string operations are very simple in comparison to C. If the example were more complicated - for example, if we had to use regexes - then the C code would have more lines.

You mentioned COBOL - I know it too, but didn't have a free compiler available on my desktop.
In COBOL it would be similar to C: we have to declare the file, open it and read it line by line. Maybe the string operations would be a little bit simpler, but the resulting code would not be comparable with the simplicity of awk.
The resulting program in COBOL would be more verbose than in C.
IMO awk is easier to learn than most other programming languages, and more productive for this kind of task.

When you say that it took you only a few minutes, I am ashamed and have to confess that it took me a few hours
:)
 
Thank you all for your responses. I will try to answer all of the questions asked.
The number of lines in the input file is 965971.
The original AWK script took around 3.5 hours to run.
The C program took less than 10 seconds to run, but the format when opened with Excel isn't quite right. After I loaded it in Excel I split the text into columns using ; as the separator, but the output didn't end up on one line per "IMSI".

IMSI = 123456000000049
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108
IMSI = 123456000000011
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108
IMSI = 123456000000050
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108
IMSI = 123456000000075
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108

After looking at the AWK output, it was the same way when loaded into Excel. Maybe I am doing something wrong.
 
I used feherke's code in a script and it runs pretty quickly too, in less than 10 seconds. I must be doing something wrong with michaelvv's script. Anyway, they are all giving the same result now, some with ; and some with , separators:
IMSI = 123456000000049
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108
IMSI = 123456000000011
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108
IMSI = 123456000000050
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108


Any ideas why the output is not like the one below?


IMSI = 123456000000049;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108
IMSI = 123456000000011;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108
IMSI = 123456000000050;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108
IMSI = 123456000000075;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108

 
IMO, your problems are caused by the format of your file.
You probably have problems with end-of-line characters, i.e. "\r\n" vs. "\n", or with some blank characters at the beginning or end of lines.
Either set the variable RS properly and/or try to remove these characters from each line.
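
One way to check for and remove such characters inside awk itself, as a sketch: count the lines that end in a carriage return, and strip a trailing carriage return from every record before joining. The function names below (count_cr_lines, join_imsi_crlf) are hypothetical, and the joining logic is the one posted earlier in this thread.

```shell
# Hypothetical helpers (both names are assumptions for this sketch).

# Print how many input lines end with a carriage return (DOS line endings).
count_cr_lines() {
    awk '/\r$/ { n++ } END { print n + 0 }' "$@"
}

# Join each IMSI record into one line, removing a trailing carriage return first.
join_imsi_crlf() {
    awk '
    { sub(/\r$/, "") }                        # drop a trailing carriage return, if any
    NR == 1 { line = $0; next }               # first line starts the first record
    /^IMSI/ { print line; line = $0; next }   # new record: flush the previous one
    { line = line "," $0 }                    # otherwise append to the current record
    END { if (NR > 0) print line }            # flush the last record
    ' "$@"
}
```

If count_cr_lines output.txt prints anything other than 0, the file has DOS line endings, and join_imsi_crlf output.txt > newoutput.txt should give one line per 'IMSI' without stray line breaks.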
 
tr -d '\015' <output.txt >newoutput.txt

Did the trick.
 
To ALL
I have not used TekTips much in the past. My experience with all of the people who posted on this was outstanding. Is there a way to rate or acknowledge the programmers that helped me? You are all top-notch!
Thank you all!
 
meinida said:
Is there a way to rate or acknowledge the programmers that helped me?

To rate the answers, you can click the star shown on the right of every reply.
 