
Add and update column based on date/time to csv file


(OP)
Friends,

I am new to the forum and posting for the first time. I am looking for some help from the community
and would really appreciate it if you folks could provide some input and share some knowledge.

I am trying to achieve 3 things together. The source file is around 800 million records
(17 columns) and each step takes a long time. Hardware on AWS: 61 GB RAM, 8 CPUs, 3 TB disk.

Sample csv file with header, pipe-delimited:
ID |DATE
1234|2015-01-01 00:01:15

1) I want to split the DATE column into two (DATE and TIME), so the output should look
like the following 3 columns.

ID |DATE |TIME
1234|2015-01-01|00:01:15

Used this code -> awk -F'|' -v OFS='|' '{gsub(" ", "|", $2) ; print }' file1.txt > file1.out
It works, but takes around 2 hours 30 minutes.

2) I want to add a fourth column (TIMEMASK) based on the TIME value:
if TIME is 00:00:00 there should be a 1 in the fourth column, and so on.

ID |DATE |TIME |TIMEMASK
1233|2015-01-01|00:00:00|1
2345|2015-01-01|00:00:15|2
...
1234|2015-01-01|00:01:15|6

Used this code, which works, but took close to 5 hours.
awk -F '|' -vOFS='|' '{
gsub("00:00:00", "00:00:00|1|", $3) ;
....
gsub("23:45:00", "23:45:00|96|", $3) ;
print }' file2.fle > file2.out

3) I have to add a 5th column: based on the date, we have to generate the
day of the week. The output should look like this.

ID |DATE |TIME |TIMEMASK|DAYOFWEEK
1233|2015-01-01|00:00:00|1 |4
2345|2015-01-02|00:00:15|2 |5
...
1234|2015-01-03|00:01:15|6 |6

Used this code, which works but is too slow; for just one file it was running
for over 24 hours and I had to kill the process. Since we have multiple files, looping
was done in this, which is probably not efficient either.

#!/bin/bash

while IFS='|' read col1 col2 DATE col4 col5 col6 col7 col8 col9 col10 col11 \
                   col12 col13 col14 col15 col16 col17
do
    col6=`date --date="$DATE" '+%u'`
    echo "$col1|$col2|$DATE|$col4|$col5|$col6|$col7|$col8|$col9|$col10|$col11|$col12|$col13|$col14|$col15|$col16|$col17"
done < file3.fle > file4.out

RE: Add and update column based on date/time to csv file

Hi

Please provide further details so we can think of a suitable strategy :
  • Confirm whether those 3 steps must stay separate. ( I would try a single Awk script. )
  • Explain how the TIMEMASK column's value is calculated. ( I would try to pick it from an array. )
  • Which Awk implementation and version do you use ? ( I would use Awk itself to get the week day, assuming you have GNU Awk 2.13 or newer. )
  • Are the records sorted chronologically ? ( I would try to cache the already computed week days, as there is a chance that some of those 800 million records share the same date; a rough sketch of that idea follows below. )
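
For illustration, the caching mentioned in the last point is just memoization keyed on the date string. A minimal sketch, assuming GNU Awk and assuming the date has already been isolated as YYYY-MM-DD in the second field :

CODE --> Awk ( sketch only, not a final script )

BEGIN { FS = OFS = "|" }
{
    # compute the week day only once per distinct date, then reuse it
    if (! ($2 in weekday))
        weekday[$2] = strftime("%u", mktime(gensub(/-/, " ", "g", $2) " 00 00 00"))
    print $0, weekday[$2]
}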

Feherke.
feherke.github.io

RE: Add and update column based on date/time to csv file

(OP)
Thanks for your response, here are the answers to your questions.

1) The 3 steps can be one single awk script.
2) Timemask calculation (used an Excel file to generate it; see also the sketch after this list):
hh:mm:ss TimeMask
00:00:00 -> 1
00:15:00 -> 2
00:30:00 -> 3
00:45:00 -> 4
01:00:00 -> 5
01:15:00 -> 6
.....
23:45:00 -> 96

3) GNU Awk 4.0.1
4) They are not sorted, but we can sort them.
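
For reference, the table in 2) is just quarter-hour buckets, so the mask can also be computed arithmetically as int((hour * 60 + minute) / 15) + 1. A minimal sketch (the timemask() helper below is made up purely for illustration):

CODE --> Awk ( sketch )

# maps "HH:MM:SS" to the 1..96 quarter-hour bucket from the table above
function timemask(t) {
    return int((substr(t, 1, 2) * 60 + substr(t, 4, 2)) / 15) + 1
}
BEGIN {
    print timemask("00:00:00")   # 1
    print timemask("01:15:00")   # 6
    print timemask("23:45:00")   # 96
}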

RE: Add and update column based on date/time to csv file

Hi

Try running this code :

CODE --> sushilps.awk

BEGIN {
    FS = "|"
    OFS = "|"

    for (hour = 0; hour < 24; hour++)
        for (minute = 0; minute < 60; minute++)
            timepart[sprintf("%02d:%02d", hour, minute)] = int((hour * 60 + minute) / 15) + 1
}

NR == 1 {
    $2 = "DATE"
    $3 = "TIME"
    $4 = "TIMEMASK"
    $5 = "DAYOFWEEK"

    print

    next
}

{
    $3 = substr($2, 12)
    $2 = substr($2, 1, 10)
    $4 = timepart[substr($3, 1, 5)]

    if (! ($2 in weekday)) {
        weekday[$2] = strftime("%u", mktime(gensub(/-/, " ", "g", $2) " 00 00 00"))
    }
    $5 = weekday[$2]

    print
} 
As

CODE --> line

awk -f sushilps.awk file1.txt > file1.out 

It uses Awk's own mktime() and strftime() functions to get the week day. Although this is faster than running an external command, it is still quite slow, so it caches the week days. For the time part calculation it generates a lookup table on start, then just picks values from there. The code assumes that month, day, hour and minute are always padded to 2 digits.
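
To sanity-check those two time functions in your own gawk before running against the full file, a quick throwaway test could be (the date is just an example) :

CODE --> line

gawk 'BEGIN { d = "2015-01-01"; print d, strftime("%u", mktime(gensub(/-/, " ", "g", d) " 00 00 00")) }'

It should print "2015-01-01 4", as 2015-01-01 was a Thursday.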

Tested with a 100 million line file spanning three years and a quarter; GNU Awk 4.1.4 processed it in ~3 minutes.

Feherke.
feherke.github.io

RE: Add and update column based on date/time to csv file

(OP)
Thanks, I appreciate your help. I did some testing and the timing looks good on test data. I am going to test the
results on live data over the weekend and post the results. Sorry, but I have to make two changes: the date and
time are coming as 201706070050130 and not as 2017-06-07 05:01:30. I was able to fix the time part, but not
the date part. Also, I have to add more columns which are coming from the incoming source; how do I add them to the
script so it spits out the actual data without any manipulation on those columns?

RE: Add and update column based on date/time to csv file

Hi

You mean the new columns need to be inserted, not appended ? Something like this :

Before :
12345|201706070050130|some|more|columns 
After :
12345|20170607|050130|5|3|some|more|columns 

BTW, is there an extra "0" between date and time ?

Feherke.
feherke.github.io

RE: Add and update column based on date/time to csv file

(OP)
Actually, the incoming file has 17 columns; I gave you just the initial/front two columns.
Basically, the original file will be

"ID |DATE |COL1|COL2|COL3....
1234|2015-01-01 00:01:15file is "|ABC|DEF|123....

And we have the inserted/computed columns after DATE, which is OK. Those new
columns can be kept anywhere, no problem.

I see there is a "T" between DATE and TIME in the input file, not sure what that means.

RE: Add and update column based on date/time to csv file

Hi

Quote (sushilps)

Those new columns can be kept anywhere, no problem.
That is great. Moving a dozen existing columns to the right would slow it down even more.

Quote (sushilps)

I see there is a "T" between DATE and TIME in the input file, not sure what that means.
That is how ISO 8601 says the date and time should be separated (e.g. 2017-06-07T05:01:30).

But earlier you wrote that "Date and time is coming as 201706070050130", so I am not sure where that "T" could be.

Feherke.
feherke.github.io

RE: Add and update column based on date/time to csv file

(OP)
OK, I was not aware of that (re. the "T" and ISO 8601).

I must have manually typed an extra zero while testing; in the raw data there is a "T".

RE: Add and update column based on date/time to csv file

(OP)
I've made the change; please take a look, it might not be that efficient.

#!/usr/bin/awk -f

BEGIN {
    FS = "|"
    OFS = "|"

    for (hour = 0; hour < 24; hour++)
        for (minute = 0; minute < 60; minute++)
            timepart[sprintf("%02d%02d", hour, minute)] = int((hour * 60 + minute) / 15) + 1
}

NR == 1 {
    $12 = "DATE"
    $13 = "TIME"
    $14 = "TIMEMASK"
    $15 = "DAYOFWEEK"

    print

    next
}

{
    # $2 arrives as YYYYMMDD"T"HHMMSS as discussed above: time starts at position 10, date is the first 8 digits
    $13 = substr($2, 10)
    $12 = substr($2, 1, 8)
    $14 = timepart[substr($13, 1, 4)]

    # rebuild the date as YYYY-MM-DD, keeping year/month/day in variables
    $12 = ((year = substr($12, 1, 4)) "-" (month = substr($12, 5, 2)) "-" (day = substr($12, 7, 2)))

    if (! ($12 in weekday)) {
        weekday[$12] = strftime("%u", mktime(gensub(/-/, " ", "g", $12) " 00 00 00"))
    }
    $15 = weekday[$12]

    print
}

RE: Add and update column based on date/time to csv file

Hi

Yes, it looks good.

Just one thing to mention : regular expression matching is among the slower operations, so since you already put the date parts in dedicated variables, better to use them :

CODE --> Awk ( fragment )

weekday[$12] = strftime("%u", mktime(year " " month " " day " 00 00 00")) 

Feherke.
feherke.github.io

RE: Add and update column based on date/time to csv file

(OP)
Hi Feherke,

Here is the update.

First test.
Complete process finished in 2 hours and 40 minutes.

Second test.
(I re-arranged the columns going to my output file)
Completed in 3 hours and 7 mins.

Appreciate your help on this, and a big thank you.


Rgs,
Sushil...
