Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gmmastros on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Add and update column based on date/time to csv file

Status
Not open for further replies.

sushilps

MIS
Jun 5, 2017
7
0
0
US
Friends,

New to the forum, posting for the first time. Looking for some help from the community,
will really appreciate if you folks could provide some input and share some knowledge.

I am trying to achieve 3 things together, source file is around 800 million records
(17 columns) and takes long time for each step. Hardware on AWS - 61gb RAM, 8CPU, 3TB disk.

Sample csv file with header pipe delimiter.
ID |DATE
1234|2015-01-01 00:01:15

1) Want to split DATE column into two (DATE and TIME) should look
like the following 3 columns.

ID |DATE |TIME
1234|2015-01-01|00:01:15

Used this code -> awk -F'|' '{gsub(" ", "|", $2) ; print )' file1.txt > file1.out
works fine takes around 2hr:30 mins.

2) Want to add 3rd column based on the TIME criteria.
if TIME is 00:00:00 there should be 1 in the fourth column and so on.

ID |DATE |TIME |TIMEMASK
1233|2015-01-01|00:00:00|1
2345|2015-01-01|00:00:15|2
...
1234|2015-01-01|00:01:15|6

Used this code works fine, but took close to 5 hours.
awk -F '|' -vOFS='|' '{
gsub("00:00:00", "00:00:00|1|", $3) ;
....
gsub("23:45:00", "23:45:00|96|", $3) ;
print )' file2.fle > file2.out

3) Have to add 5th column, based on the date we have to generate
day of the week. Output should look like this.

ID |DATE |TIME |TIMEMASK|DAYOFWEEK
1233|2015-01-01|00:00:00|1 |4
2345|2015-01-02|00:00:15|2 |5
...
1234|2015-01-03|00:01:15|6 |6

Used this code which is working but too slow, just for one file it was running
for over 24 hours. Had to kill the process, since we have multiple files looping
was done in this, which probably not efficient either.

#!/bin/bash

while IFS='|' read col1 col2 DATE col4 col5 col6 col7 col8 col9 col10 col11
col12 col13 col14 col15 col16 col17 ;
do
col6=`date --date="$DATE" '+%u'`
echo
"$col1|$col2|$DATE|$col4|$col5|$col6|$col7|$col8|$col9|$col10|$col11|$col12|
$col13|$col14|$col15|$col16|$col17"
done < file3.fle > file4.out
 
Hi

Please provide further details so we can think to a suitable strategy :
[ul]
[li]Confirm whether those 3 steps must stay separated. ( I would try a single Awk script. )[/li]
[li]Explain how the TIMEMASK column's value is calculated. ( I would try to pick it from an array. )[/li]
[li]Which Awk implementation and version you use. ( I would use Awk to get the week day, assuming you also have GNU Awk 2.13 or newer. )[/li]
[li]Are the records sorted chronologically. ( I would try to cache the already got weekdays, as there is a chance some of those 800 million records to share the same date. )[/li]
[/ul]


Feherke.
feherke.github.io
 
Thanks for your response, here are the answers to your question.

1) 3 steps can be one single awk script.
2) Timemask calculation.(used excel file to generate.)
hh:mm:ss TimeMask
00:00:00 -> 1
00:15:00 -> 2
00:30:00 -> 3
00:45:00 -> 4
01:00:00 -> 5
01:15:00 -> 6
.....
23:45:00 -> 96

3) GNU Awk 4.0.1
4) They are not sorted, but we can sort.
 
Hi

Try running this code :
Code:
[b]BEGIN[/b] [teal]{[/teal]
    FS [teal]=[/teal] [i][green]"|"[/green][/i]
    OFS [teal]=[/teal] [i][green]"|"[/green][/i]

    [b]for[/b] [teal]([/teal]hour [teal]=[/teal] [purple]0[/purple][teal];[/teal] hour [teal]<[/teal] [purple]24[/purple][teal];[/teal] hour[teal]++)[/teal]
        [b]for[/b] [teal]([/teal]minute [teal]=[/teal] [purple]0[/purple][teal];[/teal] minute [teal]<[/teal] [purple]60[/purple][teal];[/teal] minute[teal]++)[/teal]
            timepart[teal][[/teal][b]sprintf[/b][teal]([/teal][i][green]"%02d:%02d"[/green][/i][teal],[/teal] hour[teal],[/teal] minute[teal])] =[/teal] [COLOR=orange]int[/color][teal](([/teal]hour [teal]*[/teal] [purple]60[/purple] [teal]+[/teal] minute[teal]) /[/teal] [purple]15[/purple][teal]) +[/teal] [purple]1[/purple]
[teal]}[/teal]

NR [teal]==[/teal] [purple]1[/purple] [teal]{[/teal]
    [navy]$2[/navy] [teal]=[/teal] [i][green]"DATE"[/green][/i]
    [navy]$3[/navy] [teal]=[/teal] [i][green]"TIME"[/green][/i]
    [navy]$4[/navy] [teal]=[/teal] [i][green]"TIMEMASK"[/green][/i]
    [navy]$5[/navy] [teal]=[/teal] [i][green]"DAYOFWEEK"[/green][/i]

    [b]print[/b]

    [b]next[/b]
[teal]}[/teal]

[teal]{[/teal]
    [navy]$3[/navy] [teal]=[/teal] [b]substr[/b][teal]([/teal][navy]$2[/navy][teal],[/teal] [purple]12[/purple][teal])[/teal]
    [navy]$2[/navy] [teal]=[/teal] [b]substr[/b][teal]([/teal][navy]$2[/navy][teal],[/teal] [purple]1[/purple][teal],[/teal] [purple]10[/purple][teal])[/teal]
    [navy]$4[/navy] [teal]=[/teal] timepart[teal][[/teal][b]substr[/b][teal]([/teal][navy]$3[/navy][teal],[/teal] [purple]1[/purple][teal],[/teal] [purple]5[/purple][teal])][/teal]

    [b]if[/b] [teal](! ([/teal][navy]$2[/navy] [b]in[/b] weekday[teal])) {[/teal]
        weekday[teal][[/teal][navy]$2[/navy][teal]] =[/teal] [COLOR=orange]strftime[/color][teal]([/teal][i][green]"%u"[/green][/i][teal],[/teal] [COLOR=orange]mktime[/color][teal]([/teal][COLOR=orange]gensub[/color][teal]([/teal][fuchsia]/-/[/fuchsia][teal],[/teal] [i][green]" "[/green][/i][teal],[/teal] [i][green]"g"[/green][/i][teal],[/teal] [navy]$2[/navy][teal])[/teal] [i][green]" 00 00 00"[/green][/i][teal]))[/teal]
    [teal]}[/teal]
    [navy]$5[/navy] [teal]=[/teal] weekday[teal][[/teal][navy]$2[/navy][teal]][/teal]

    [b]print[/b]
[teal]}[/teal]
As
Code:
awk -f sushilps.awk file1.txt > file1.out

It uses Awk's own [tt]mktime()[/tt] and [tt]strftime()[/tt] functions to get the week day. Although is faster than running external command, is still quite slow, so it caches the week days. For the time part calculation it generates a lookup table on start, then just picks from there. The code assumes that month, day, hour and minute are always padded to 2 digits.

Tested with a 100 million lines file spanning over 3 years and a quarter. GNU Awk 4.1.4 processed it in ~3 minutes.


Feherke.
feherke.github.io
 
Thanks and appreciate your help. I did some testing, timing looks good on test data. I am going to test the
results on live data over the weekend and post you the results. Sorry but I have to make two changes, Date and
time is coming as 201706070050130 and not as 2017-06-07 05:01:30, I was able to fix the timing part, but not
the Date part. Also I have to add more columns which are coming from incoming source, how do I add them to the
script to spit out the actual data without any manipulation on those columns.
 
Hi

You mean the new columns need to be inserted, not appended ? Something like this :

Before :
[pre]12345|201706070050130|some|more|columns[/pre]​
After :
[pre]12345|20170607|050130|5|3|some|more|columns[/pre]​

BTW, is there an extra "0" between date and time ?

Feherke.
feherke.github.io
 
Actually, incoming file has 17 columns. I gave you just the initial/front two columns.
Basically, the original file will be

"ID |DATE |COL1|COL2|COL3....
1234|2015-01-01 00:01:15file is "|ABC|DEF|123....

And we have inserted columns and computation after DATE, which is ok. Those new
columns can be kept anywhere no problem.

I see there is a "T" between DATE and TIME in the input file, not sure what that means.
 
Hi

sushilps said:
Those new columns can be kept anywhere no problem.
That is great. Moving a dozen of existing columns to right would slow it down even more.

sushilps said:
I see there is a "T" between DATE and TIME in the input file, not sure what that means.
That is how ISO 8601 defines the date and time need to be separated.

But earlier you wrote that "Date and time is coming as 201706070050130", so not sure where could be that "T".


Feherke.
feherke.github.io
 
Ok, was not aware of it. Re. "T" ISO8601.

Must have typed in manually extra zero for testing, in raw data there is "T".
 
I've fixed the change, pl. take a look it might not be that efficient.

#!/usr/bin/awk




BEGIN {
FS = "|"
OFS = "|"

for (hour = 0; hour < 24; hour++)
for (minute = 0; minute < 60; minute++)
timepart[sprintf("%02d%02d", hour, minute)] = int((hour * 60 + minute) / 15) + 1
}

NR == 1 {
$12 = "DATE"
$13 = "TIME"
$14 = "TIMEMASK"
$15 = "DAYOFWEEK"

print

next
}

{
$13 = substr($2, 10)
$12 = substr($2, 1, 8)
$14 = timepart[substr($13, 1, 4)]

$12=((year = substr($12, 1,4)) "-" (month= substr($12, 5,2)) "-" (day = substr($12, 7,2)))

if (! ($12 in weekday)) {
weekday[$12] = strftime("%u", mktime(gensub(/-/, " ", "g", $12) " 00 00 00"))
}
$15 = weekday[$12]

print
}
 
Hi

Yes, it looks good.

Just one thing to mention : regular expression matching is among the slower operations, so while you already put the date parts in dedicated variables, better use them :
Code:
    weekday[teal][[/teal][navy]$12[/navy][teal]] =[/teal] [COLOR=orange]strftime[/color][teal]([/teal][i][green]"%u"[/green][/i][teal],[/teal] [COLOR=orange]mktime[/color][teal]([/teal]year [i][green]" "[/green][/i] month [i][green]" "[/green][/i] day [i][green]" 00 00 00"[/green][/i][teal]))[/teal]


Feherke.
feherke.github.io
 
Hi Feherke,

Here is the update.

First test.
Complete process finished in 2 hours and 40 minutes.

Second test.
(I re-arranged the columns going to my output file)
Completed in 3 hours and 7 mins.

Appreciate your help on this, and a big thank you.


Rgs,
Sushil...
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top