Smart questions
Smart answers
Smart people
INTELLIGENT WORK FORUMS
FOR COMPUTER PROFESSIONALS

Member Login

Come Join Us!

Are you a
Computer / IT professional?
Join Tek-Tips now!
  • Talk With Other Members
  • Be Notified Of Responses
    To Your Posts
  • Keyword Search
  • One-Click Access To Your
    Favorite Forums
  • Automated Signatures
    On Your Posts
  • Best Of All, It's Free!

Join Tek-Tips
*Tek-Tips's functionality depends on members receiving e-mail. By joining you are opting in to receive e-mail.

LINK TO THIS FORUM!

Add Stickiness To Your Site By Linking To This Professionally Managed Technical Forum.
Just copy and paste the
code below into your site.

Partner With Us!

"Best Of Breed" Forums Add Stickiness To Your Site
Partner Button
(Download This Button Today!)

Feedback

"...Praise should be given to the Forum managers or the Tipmasters - they are what make it work - give them extra recognition!!! They are timely (prompt - unlike ACTUAL support sites) and on the ball!!!..."

Geography

Where in the world do Tek-Tips members come from?

Joining and sorting multiple text filesHelpful Member!(2) 

cdj02 (TechnicalUser)
2 Jun 12 8:28
Hello, I run systems that collect measurement data and dump it into delimited text files. I'm trying to join these files and sort them by Date, Target_Name, and site so that i can plot the data. Currently, i have the following code that does work to join the files into a single file called All_CST.cst but I'm new to Perl and I'm having trouble with sorting. I know how to sort simple lists in Perl but re-ordering more complex data like I have in these logfiles is a bit beyond me at the moment (and I'd really like to avoid having to use MS Excel to do my sorting). I'd appreciate any help in pointing me in the right direction on sorting this type of data. Thanks in advance for your time.

My Script
#!/usr/bin/perl
open (LOGFILEOUT, "> ALL_CST.cst");

@files = <*.cst,*>;

$totallines=0;

foreach(@files){
if ($_ =~ /ALL_CST.cst/){next;};
$filename=$_;
open (LOGFILEIN, "< $filename");
while (<LOGFILEIN>){
print LOGFILEOUT "$_ \n";
}

}

The Data files are here:
http://www.mediafire.com/?bu2xsk1652cajxv
http://www.mediafire.com/?hqlie3gjcpwzhi4
http://www.mediafire.com/?1k4p4wc2z479ovk
Helpful Member!  feherke (Programmer)
2 Jun 12 9:50
Hi

Sorting huge amount of data is not a trivial task. Based on the line separators used in the log files I suppose you are on a Unix-like operating system. In that case I would start with sort :

CODE

sort -k20 -k16 -k1n *.cst,* > ALL_CST.cst
Or to skip the header lines :

CODE

tail -qn+2 *.cst,* | sort -k20 -k16 -k1n > ALL_CST.cst

But there is a big problem : the date format. If it would be ISO date, it could be sorted as string, but this one need abit of preprocessing :

CODE

# if 04/12/2012-14:10:37 means 2012 April 12 tail -qn+2 *.cst,* | awk '!NF{next}$20=substr($20,7,4)"/"substr($20,1,5)substr($20,11)' | sort -k20 -k16 -k1n > ALL_CST.cst # if 04/12/2012-14:10:37 means 2012 December 04 tail -qn+2 *.cst,* | awk '!NF{next}$20=substr($20,7,4)substr($20,3,4)substr($20,1,2)substr($20,11)' | sort -k20 -k16 -k1n > ALL_CST.cst
This changes the data format. If you need it unchanged, then better add another column at the end of each line, for sorting purpose only.

As you asked in the Perl forum, here is a Perl solution too :

CODE

perl -nlae '$h=$_ if$.==1;($d=$F[19])=~s,(../..)/(....),$2/$1,;push@d,[@F,$d]if@F&&$_ ne$h;END{print join"\t",splice@{$_},0,-1for sort{$a->[-1]cmp$b->[-1]||$a->[15]cmp$b->[15]||$a->[0]<=>$b->[0]}@d}' *.cst,* > ALL_CST.cst
This one interprets the date as mm/dd/yyyy. All header lines are skipped. Note that field separators are replaced with tab characters.

This one preserves the original format, but may fail sooner :

CODE

perl -nlae '$h=$_ if$.==1;($d=$F[19])=~s,(../..)/(....),$2/$1,;push@d,[@F[0,15],$d,$_]if@F&&$_ ne$h;END{print $_->[3]for sort{$a->[2]cmp$b->[2]||$a->[1]cmp$b->[1]||$a->[0]<=>$b->[0]}@d}' *.cst,* > ALL_CST.cst

As both Perl codes load all data into the memory, they will hit memory allocation error at a certain file size. To avoid it a lot of extra work is needed.

Feherke.
http://feherke.github.com/

cdj02 (TechnicalUser)
2 Jun 12 10:44
Thanks for the speedy response!

The data is generated on a unix system but I pull the files remotely to my Windows based laptop and process it with strawberry perl. I'd be ok with processing the files line by line if that is a better way to do it. As i said, i'm new to perl so slurping a whole file at a time seemed easiest to me. If there is a better way to do this please let me know. I'm typically grabbing only a sample of 10 files at a time so I would think I'd be ok memory wise (an assumption on my part). At the most extreme I can think I might ever need would be a grab of 50 files. Based on the file sizes I've shown you (and without knowing my laptop specs), do you think I'd be likely to run into trouble with memory allocation?

btw- the date format is "04/12/2012-14:10:37 means 2012 April 12" and i have no problem transposing the date format to whatever is easiest to work with. I'm thinking I'm going to need to use the time from the date stamp for sorting as well to get things in the right order.

I'm going to give the code snippets you've shown me a try. If you can think of a better way to do the combined join/sort for these files please let me know. Thanks again for your time.
feherke (Programmer)
2 Jun 12 11:07
Hi

Quote (cdj02)

I'd be ok with processing the files line by line if that is a better way to do it.
Well, the sorting part could be hardly done that way, so there will be a moment anyway when all the data will be in memory.

Sorry, I can not estimate the memory requirement for that amount of files. I just mentioned the allocation error as a weak point of the concept. If you will have problems, my first suggestion would be to remove the columns containing nothing else than "NA".

Quote (cdj02)

I'm thinking I'm going to need to use the time from the date stamp for sorting as well to get things in the right order.
Parsing a string into time I would expect to be slower than just rearranging substrings. ( I mean, using substr(). I used the regular expression based mm/dd <-> yyyy swapping just because got bored by typing substr() while used it in the Awk code earlier in my post... If you notice speed problems, we can change that part. )

Feherke.
http://feherke.github.com/

cdj02 (TechnicalUser)
2 Jun 12 13:58
Feherke, I'm testing out this line

CODE

perl -nlae '$h=$_ if$.==1;($d=$F[19])=~s,(../..)/(....),$2/$1,;push@d,[@F,$d]if@F&&$_ ne$h;END{print join"\t",splice@{$_},0,-1for sort{$a->[-1]cmp$b->[-1]||$a->[15]cmp$b->[15]||$a->[0]<=>$b->[0]}@d}' *.cst,* > ALL_CST.cst ]

but I'm getting Bareword found near "* > ALL_CST" <Missing operator before ALL_CST?> and a syntax error near "*."

I think it doesn't like the ",*" at the end of the "*.cst,*". In my original code I had to put <> around the "*.cst,* to get perl to accept it but that doesn't seem to work here.

any thoughts on what I'm missing?
cdj02 (TechnicalUser)
2 Jun 12 14:01
I don't know where that ] came from at the end of the code snippet in my post above but it doesn't belong there....sorry if it caused any confusion.
feherke (Programmer)
3 Jun 12 8:44
Hi

There are years since I used Perl on Windows. Mostly CygWin's Perl, rarely ActivePerl, never Strawberry Perl. So I have no proper idea how and in what circumstances that works.

The code I wrote certainly works on Linux and I am sure it would work perfectly on CygWin. I usually recommend installing it to everybody. It provides the usual GNU tools ( and much more ) compiled into native Windows executables, so they are fast and functional. As CygWin has Bash too, that expands the *.cst,* wildcard so the script receives the list of file names as parameter.

In case you still prefer Strawberry Perl, save this in a file :

CODE --> Perl

#!/usr/bin/perl foreach $name (<*.cst,*>) { open FIL,"<$name"; <FIL>; while (chomp($line=<FIL>)) { next unless $line; @field=split /\s+/,$line; ($date=$field[19])=~s,(../..)/(....),$2/$1,; push @data,[@field,$date]; } close FIL; } @data=sort { $a->[-1] cmp $b->[-1] || $a->[15] cmp $b->[15] || $a->[0] <=> $b->[0] } @data; open FIL,">ALL_CST.cst"; foreach $one (@data) { print FIL join("\t",splice @{$one},0,-1),"\n"; } close FIL;
Then run it from the directory where the .cst files are. Needs no parameter.

Feherke.
http://feherke.github.com/

prex1 (Programmer)
4 Jun 12 4:31
cdj02, you didn't specify the typical size of your actual log files; if they were as small as the examples, then even thousands of them wouldn't be difficult to process.
Also you didn't describe the exact sorting procedure: the field site appears to be a dot separated list of two integers, so how to sort it? And in sorting by 'Date, Target_Name, and site', you mean in that order or first by site, then by Target_Name and finally by Date? Or what else? And in which order (asc or desc)?
This is a sketch of the (untested) code I would use for this task, disregarding the remark above concerning site.

CODE --> perl

my($header,@lines,$site,$target,$date,@sites,@targets,@dates,@days); for $name(<*.cst,*>){ open FIL,"<$name"; <FIL>; $header=$_ unless $header; while(<FIL>){ push @lines,$_; ($site,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef,$target,undef,undef,undef,$date)=split; push @sites,$site; push @targets,$target; @days=split/\D/,$date; push @dates,join('',$days[2],$days[1],$days[0],$days[3],$days[4],$days[5]); } close FIL; } open FIL,">ALL_CST.cst"; for(sort{$sites[$a] cmp $sites[$b] || $targets[$a] cmp $targets[$b] || $dates[$a] cmp $dates[$b]}0..$#lines){ print FIL $lines[$_]; } close FIL;

Franco
http://www.xcalcs.com : Online engineering calculations
http://www.megamag.it : Magnetic brakes for fun rides
http://www.levitans.com : Air bearing pads

cdj02 (TechnicalUser)
5 Jun 12 7:08
Prex1 - The example sizes are the typical file sizes I'm working with and as for sorting by site, the site represents XY coordinates of the measurements. I was thinking I'd remove the comma to make it easier to sort. Ultimately, I only need to match up the measurements from file to file so I can chart them.

Feherke - I haven't tried CygWin but I'll give it a try today.

Thanks again for the help!
cdj02 (TechnicalUser)
5 Jun 12 7:31
prex1 - your code works perfectly and sorts exactly the way I wanted. The only odd thing it does is that it adds a blank line at the top of the file for each .cst file input. I ran my test with 10 files and it put 10 blank lines at the top.

Helpful Member!  prex1 (Programmer)
5 Jun 12 15:32
There's a blank line at the end of your filesmad.
Simply add a line after the while (as feherke didblush):

CODE --> perl

while(<FIL>){ next unless $_; ...
If you want the header on the sorted file, just add the line (that I forgot) after the open:

CODE --> perl

open FIL,">ALL_CST.cst"; print FIL $header; ...

Franco
http://www.xcalcs.com : Online engineering calculations
http://www.megamag.it : Magnetic brakes for fun rides
http://www.levitans.com : Air bearing pads

Reply To This Thread

Posting in the Tek-Tips forums is a member-only feature.

Click Here to join Tek-Tips and talk with other members!

Back To Forum

Close Box

Join Tek-Tips® Today!

Join your peers on the Internet's largest technical computer professional community.
It's easy to join and it's free.

Here's Why Members Love Tek-Tips Forums:

Register now while it's still free!

Already a member? Close this window and log in.

Join Us             Close