Re-formatting a file

(OP)
I have a fairly large file (40 meg) of text that looks like this

IMSI = 123456000000049
APNTPLID = 3
QOSTPLID = 108
APNTPLID = 1
QOSTPLID = 108
APNTPLID = 2
QOSTPLID = 108
APNTPLID = 5
QOSTPLID = 108
APNTPLID = 6
QOSTPLID = 108
IMSI = 123456000000011
APNTPLID = 3
QOSTPLID = 108
APNTPLID = 1
QOSTPLID = 108
APNTPLID = 2
QOSTPLID = 108
APNTPLID = 5
QOSTPLID = 108
APNTPLID = 6
QOSTPLID = 108
IMSI = 123456000000050
APNTPLID = 3
QOSTPLID = 108
APNTPLID = 1
QOSTPLID = 108
APNTPLID = 2
QOSTPLID = 108
APNTPLID = 5
QOSTPLID = 108
IMSI = 123456000000075
APNTPLID = 3
QOSTPLID = 108
APNTPLID = 1
QOSTPLID = 108
APNTPLID = 2
QOSTPLID = 108
APNTPLID = 5
QOSTPLID = 108
APNTPLID = 6
QOSTPLID = 108

I would like to make it look like this

IMSI = 123456000000049,APNTPLID = 3,QOSTPLID = 108,APNTPLID = 1,QOSTPLID = 108,APNTPLID = 2,QOSTPLID = 108,APNTPLID = 5,QOSTPLID = 108,APNTPLID = 6,QOSTPLID = 108
IMSI = 123456000000011,APNTPLID = 3,QOSTPLID = 108,APNTPLID = 1,QOSTPLID = 108,APNTPLID = 2,QOSTPLID = 108,APNTPLID = 5,QOSTPLID = 108,APNTPLID = 6,QOSTPLID = 108
IMSI = 123456000000050,APNTPLID = 3,QOSTPLID = 108,APNTPLID = 1,QOSTPLID = 108,APNTPLID = 2,QOSTPLID = 108,APNTPLID = 5,QOSTPLID = 108
IMSI = 123456000000075,APNTPLID = 3,QOSTPLID = 108,APNTPLID = 1,QOSTPLID = 108,APNTPLID = 2,QOSTPLID = 108,APNTPLID = 5,QOSTPLID = 108,APNTPLID = 6,QOSTPLID = 108

Is AWK the right tool to use? I have heard it is very powerful but I have not been able to figure out how to get the result I want.

I have tried different commands but they either don't work or have no effect on the format.

awk -F'=' '$1=="IMSI" $2=="APNTPLID" $3=="QOSTPLID" {print $1, $2, $3}' output.txt
awk -F'=', '"$1=="IMSI", $2=="APNTPLID", $3=="QOSTPLID"" {print $1, $2, $3}' output.txt
awk '"$1 =/IMSI/, $2 =/APNTPLID/, $3 =/QOSTPLID/" {print $1, $2, $3}' output.txt

RE: Re-formatting a file

awk '
NR==1 {LINE=$0
next}
/^IMSI/ {print LINE
LINE=$0
next}
{LINE=LINE "," $0}
END {print LINE}' your-input-file>your-output-file

this assumes that the input file starts with the 'IMSI' record
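If the input might not start with an IMSI line, the same idea can be written so that it does not depend on NR==1. This is only a minimal sketch: the function name join_records and the sample values are made up for illustration, and silently skipping any lines that come before the first IMSI record is an assumption about what you would want.

```shell
# join_records is a hypothetical helper name; it reads stdin and writes stdout.
join_records() {
  awk '
    /^IMSI/ {                      # an IMSI line starts a new record
      if (have) print line         # flush the previous record, if there was one
      line = $0
      have = 1
      next
    }
    have { line = line "," $0 }    # append continuation lines to the current record
    END  { if (have) print line }  # flush the last record
  '
}

# Made-up sample input; the first line is deliberately not an IMSI line.
printf 'junk\nIMSI = 1\nAPNTPLID = 3\nQOSTPLID = 108\nIMSI = 2\nAPNTPLID = 1\n' | join_records
```

On a real file it would be run as: join_records < your-input-file > your-output-file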

RE: Re-formatting a file

(OP)
Thank you for your reply

This is the error I get when I try to run the command

awk 'NR==1 {LINE=$0 next} /^IMSI/ {print LINE LINE=$0 next}{LINE=LINE "," $0}END{print LINE}' output.txt
awk: syntax error at source line 1
context is
NR==1 {LINE=$0 >>> next <<< } /^IMSI/ {print LINE LINE=$0 next}{LINE=LINE "," $0}END{print LINE}
awk: illegal statement at source line 1

Is it possible the formatting of the input file is bad?

RE: Re-formatting a file

Looks like the awk program was reformatted. My original post has 8 distinct lines. Also, what platform are you running on? I tested my script under AIX.
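The syntax error most likely comes from the lost line breaks: awk needs a newline or a semicolon between statements inside a block, so "LINE=$0 next" on one line is not valid. If the program has to fit on one line, a sketch with explicit semicolons could look like this (the sample data is made up, and the shell variable names sample and flat exist only for this illustration):

```shell
# Made-up sample records standing in for the real file.
sample=$(printf 'IMSI = 1\nAPNTPLID = 3\nQOSTPLID = 108\nIMSI = 2\nAPNTPLID = 1')

# Same logic as the multi-line program; ';' now separates the statements
# that were previously on separate lines.
flat=$(printf '%s\n' "$sample" |
  awk 'NR==1 {LINE=$0; next} /^IMSI/ {print LINE; LINE=$0; next} {LINE=LINE "," $0} END {print LINE}')
printf '%s\n' "$flat"
```

Like the original, this still assumes the first input line is an IMSI record.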

RE: Re-formatting a file

(OP)
I am running it on Mac OS X 10.10.3

RE: Re-formatting a file

(OP)
Turns out I am an idiot. I did not realize there was significance in the 8 distinct lines. I ran the command again as you posted it and it works.

Is there something other than AWK that might be better? The file I am running this on is around 40 meg in size and I think it is pulling it all into RAM before it writes the output.

Thank you again for your help

RE: Re-formatting a file

Not sure if awk pulls the entire file into RAM for this type of processing. How long did it take to process? How many input lines?

I doubt that there is something better in terms of risk/reward. A high level language might process it a few seconds faster, but the only high level language I know is COBOL. It would have taken me maybe 30 minutes to write, compile and test in COBOL; it took me under two minutes in awk.

RE: Re-formatting a file

COBOL is compiled, so it will be faster than interpreted awk. But on the other hand awk (compared to COBOL) is free, simpler and supports regular expressions.

RE: Re-formatting a file

You can compare the awk solution with my C solution.
I tried it only for my own educational purposes, to see if I'm able to do it in C :-)
It took me a long time, because I'm not an experienced C programmer and didn't know the library functions for working with strings.
IMO, doing it in awk (or another scripting language) is simpler than in C.

meinida.c

CODE

#define MAXLINE 10000  // maximum line length (input and joined output)
#define substr  "IMSI" // substring that marks the start of a new record
/***                                    ***/

#include <stdio.h>
#include <string.h>
#include <stdlib.h>


int main(int argc, char *argv[]) {
  char line[MAXLINE], line_out[MAXLINE] = "";
  FILE *file;
  long nr_line;

  // a file name is required as the only argument
  if (argc < 2)
  {
    fprintf(stderr, "usage: %s file\n", argv[0]);
    exit(EXIT_FAILURE);
  }

  // if file doesn't open then exit with error
  file = fopen(argv[1], "r");
  if (file == NULL)
  {
    perror(argv[1]);
    exit(EXIT_FAILURE);
  }

  nr_line = 0;
  while (fgets(line, sizeof(line), file) != NULL) {
    nr_line++;
    // chomp line (removes only "\n", not a preceding "\r")
    line[strcspn(line, "\n")] = '\0';
    if (strstr(line, substr)) {
      if (nr_line > 1) {
        // print line_out for output
        printf("%s\n", line_out);
      }
      // create new line_out
      strcpy(line_out, line);
    }
    else {
      // add line to line_out, but never write past the end of the buffer
      if (strlen(line_out) + 1 + strlen(line) >= sizeof(line_out)) {
        fprintf(stderr, "line %ld: output line too long\n", nr_line);
        exit(EXIT_FAILURE);
      }
      strcat(line_out, ";");
      strcat(line_out, line);
    }
  }
  // at end: print last line
  printf("%s\n", line_out);

  // close file
  fclose(file);

  // at end return
  return(0);
}


Compilation and running:

CODE

$ gcc meinida.c -o meinida
$ ./meinida meinida.txt > meinida_out.csv

Output: meinida_out.csv

CODE

IMSI = 123456000000049;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108
IMSI = 123456000000011;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108
IMSI = 123456000000050;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108
IMSI = 123456000000075;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108 

RE: Re-formatting a file

You have made my point!! It only took me a couple of minutes to write the awk program, in only 8 lines. How did the run time compare?

RE: Re-formatting a file

Hi

Quote (meinida)

The file I am running this on is around 40 meg in size and I think it is pulling it all into ram before it writes the output.

No, regular Awk implementations do not slurp in the entire input at once. The RS may change anytime, affecting the next record to be read.

michaelvv's code reads one line of input, builds up one output line in the memory, then outputs it and discards it.

There is a way to do it with even less memory, though it runs slower than michaelvv's code:

CODE --> Awk

{
    printf("%s%s", NR == 1 ? "" : $1 == "IMSI" ? "\n" : ",", $0)
}
END {
    print ""
}

This reads one line of input and outputs it immediately. The trick is that it prints no separator before the first line, and before every later line it prints a separator first (a newline if the line starts a new IMSI record, a comma otherwise), then the line itself.

Feherke.
feherke.ga

RE: Re-formatting a file

Quote (michaelvv)


You have made my point!! It only took me a couple of minutes to write the awk program, in only 8 lines. How did the run time compare?

I didn't compare the run time of the C and awk programs - maybe the OP could do it.

But, IMO the awk solution is more flexible, because awk is a language specialised for text processing.
In awk we don't need to care about opening files, reading them line by line, the maximum length of a string, etc. The string operations are very simple in comparison to C. If the example were more complicated, for example if we had to use regex, then the C code would have more lines.

You mentioned COBOL - I know it too but didn't have a free compiler available on my desktop.
In COBOL it would be similar to C: we have to declare the file, open it and read it line by line. Maybe the string operations would be a little bit simpler, but the resulting code would not be comparable with the simplicity of awk.
The resulting program in COBOL would be more verbose than in C.
Awk is easier to learn than any other programming language and more productive.

When you say that it took you only some minutes, then I am ashamed and have to confess that it took me some hours :-)

RE: Re-formatting a file

(OP)
Thank you all for your responses. I will try to answer all of the questions asked.
The number of lines in the input file is 965971
The original AWK script took around 3.5 hours to run.
The C program took less than 10 seconds to run, but the format when opened with Excel isn't quite right. After I loaded it in Excel I split the text to columns using ; as the separator, but the output didn't end up as one line per "IMSI":

IMSI = 123456000000049
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108
IMSI = 123456000000011
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108
IMSI = 123456000000050
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108
IMSI = 123456000000075
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108

After looking at the AWK output, it came out the same way when loaded into Excel. Maybe I am doing something wrong.

RE: Re-formatting a file

(OP)
I used feherke's code in a script and it runs pretty quickly too, in less than 10 seconds. I must be doing something wrong with michaelvv's script. Anyway, they are all giving the same result now, some with ; and some with , separators:
IMSI = 123456000000049
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108
IMSI = 123456000000011
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108
IMSI = 123456000000050
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108


Any ideas why the output does not look like this?


IMSI = 123456000000049;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108
IMSI = 123456000000011;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108
IMSI = 123456000000050;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108
IMSI = 123456000000075;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108

RE: Re-formatting a file

IMO, your problems are caused by the format of your file.
You probably have problems with end-of-line characters, i.e. "\r\n" vs. "\n", or with some blank characters at the beginning or end of each line.
Either set the RS variable properly and/or try to remove these characters from each line.
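With "\r\n" input, every joined piece would still end in a carriage return, which would be consistent with Excel showing each field on its own row. As a sketch of the "remove these characters on each line" option, the carriage return can be stripped inside awk before joining. The function name strip_cr_and_join and the sample data are made up for illustration, and the first input line is assumed to be an IMSI record, as in the rest of the thread. Setting RS to "\r\n" is the other option, but how awk treats a multi-character RS differs between implementations, so this sketch avoids it.

```shell
# strip_cr_and_join is a hypothetical helper name; it reads stdin and writes stdout.
strip_cr_and_join() {
  awk '
    { sub(/\r$/, "") }                          # drop a trailing CR, if present
    /^IMSI/ { if (NR > 1) print line; line = $0; next }
    { line = line "," $0 }                      # append to the current record
    END { if (NR > 0) print line }              # flush the last record
  '
}

# Made-up sample with Windows-style CRLF line endings.
printf 'IMSI = 1\r\nAPNTPLID = 3\r\nIMSI = 2\r\nQOSTPLID = 108\r\n' | strip_cr_and_join
```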

RE: Re-formatting a file

(OP)
tr -d '\015' <output.txt >newoutput.txt

Did the trick.
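For anyone hitting the same thing: \015 is the octal code for the carriage return character. A quick way to check whether a file has such line endings, before and after the tr step, is to count the lines that end in a carriage return. This is only a sketch; count_cr_lines is a made-up helper name and the sample data is invented.

```shell
# count_cr_lines is a hypothetical helper: prints how many lines end in CR.
# It reads the files given as arguments, or stdin if there are none.
count_cr_lines() {
  awk '/\r$/ { n++ } END { print n + 0 }' "$@"
}

# Made-up sample with CRLF line endings, counted before and after tr.
before=$(printf 'IMSI = 1\r\nAPNTPLID = 3\r\n' | count_cr_lines)
after=$(printf 'IMSI = 1\r\nAPNTPLID = 3\r\n' | tr -d '\015' | count_cr_lines)
echo "before: $before, after: $after"
```

On the real files this would be count_cr_lines output.txt and count_cr_lines newoutput.txt.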

RE: Re-formatting a file

(OP)
To ALL
I have not used Tek-Tips much in the past. My experience with all of the people who posted on this was outstanding. Is there a way to rate or acknowledge the programmers that helped me? You are all top-notch!
Thank you all!

RE: Re-formatting a file

Quote (meinida)


Is there a way to rate or acknowledge the programers that helped me?

To rate the answers, you can click on the star ("Great post? Star it!") shown at the right of every reply.
