×
INTELLIGENT WORK FORUMS
FOR COMPUTER PROFESSIONALS

Log In

Come Join Us!

Are you a
Computer / IT professional?
Join Tek-Tips Forums!
  • Talk With Other Members
  • Be Notified Of Responses
    To Your Posts
  • Keyword Search
  • One-Click Access To Your
    Favorite Forums
  • Automated Signatures
    On Your Posts
  • Best Of All, It's Free!
  • Students Click Here

*Tek-Tips's functionality depends on members receiving e-mail. By joining you are opting in to receive e-mail.

Posting Guidelines

Promoting, selling, recruiting, coursework and thesis posting is forbidden.

Students Click Here

Jobs

Installing AWK on a Windows 7 64-bit computer - Novice
2

Installing AWK on a Windows 7 64-bit computer - Novice

Installing AWK on a Windows 7 64-bit computer - Novice

(OP)
Context: Have a rather large text file - over 800,000 records and over 180 fields that appear to have commas separating the fields and
also quite a few instances whereby several fields have commas within the fields!

Consequently, it appears that I cannot use bulk insert within TSQL to load the "clean records" into the Sql Server database table.

Therefore, it appears that AWK may be used to resolve this problem. Note, I have never used AWK and not at all familiar with its functionality but it appears that resolving this issue via AWK is preferable to opening the text file in MS Excel and using a volatile function to determine the number of commas for each record, extracting the "good" records and then converting to a tab-delimited file for bulk inserting into the Sql Server database.

Over the last hour or so, I have downloaded files from the Cygwin site but then restored the laptop to a system restore point because the laptop appeared to be "bogged down" with unnecessary files! Although I was able to search for "GAWK", it appeared that unnecessary un-related files were downloaded.


What is the simplest method to download and install AWK to a Windows 7 professional 64-bit laptop? Is it preferable to install 32-bit AWK instead of 64-bit?


What is an honest assessment as to a learning curve for a Novice to master the quick review of a text file via AWK to determine the proportion of "bad" records within any given text file relative to all of the records within the text file. Then, I can quickly send the file back to the source department instead of spending an inordinate amount of time to extract the "good" records.

(In other words, is it possible for me to develop an "AWK Template" that I can use to quickly review every text file from the source department to validate various fields such as;

1. Zipcode - is it a valid length and is it valid for the particular city?
2. Dates - is it a valid date
3. Other fields - Is it numeric or alpha when it should be the other...?
4. For comma delimited, are there commas within one or more fields?, etc. etc.

For example, if the number of "bad" records relative to "good" records is beyond the "Bad Record" Threshold (say, 50%), I can readily send back to the source department instead of spending three to four weeks to validate and load only the good records!

Any additional insight is appreciated that relates to the use of AWK to resolve the issue of "bad" records within a relatively large text file.








RE: Installing AWK on a Windows 7 64-bit computer - Novice

Hi BxWill,

When I get a new PC with Windows, first I install some utilities on it and one of them is MSYS = Minimal SYStem. MSYS is bash plus other basic utilities like awk, sed, ....
I'm not sure because I'm on Linux now and not on my windows PC, but I think that I used for the installation this installer mingw-get-setup.exe from the MSYS Getting Started page.

The other option would be to use for example this port of Gawk for Windows.

However, the most complete Linux-like environment for Windows is Cygwin, you mentioned that you tried it, but Cygwin is very huge, so be aware and don't try to install everything on your Laptop.

To learn awk, for the first steps you can use some tutorials from the web e.g. http://www.grymoire.com/Unix/Awk.html or buy a nice book about it.
Then later for the reference helps for example this manual: https://www.gnu.org/software/gawk/manual/html_node...

Regarding your other questions on how to validate a text file, please post a specific minimal example to this forum a we will look at it.

RE: Installing AWK on a Windows 7 64-bit computer - Novice

(OP)
Displayed below is a sample text file from the source department.

Is it possible for AWK to quickly validate the following fields;
1. Equipment_Flag - Should be one character, only the code - N for New or O for Old
2. Tracking Number - In record 1, there are two tracking numbers instead of just one. Hence, this record will have 9 commas whereas
the other records will have just have eight commas that delimit the 9 fields
3. Sale_Date - Records 2 and 4 are incorrect. Record 2 contain asterisks and Record 4 is nonsensical - not a date

So, is it possible for me to run several AWK commands and then quickly send this text file back to the source
department stating that all four records contain errors? ("Fix the errors and return the file!")








RE: Installing AWK on a Windows 7 64-bit computer - Novice

Hi BxWill,
You say that it's a text file, but you posted only a picture from it.
Can you please post here the example as text ?

i.e. like this:

CODE

No.; [Account_Number]; ...
  1;             1345; ...
  2;             1456; ... 
I think that it's CSV-file, but then I need to know how are the columns separated: with commas, semicolons or spaces ?

RE: Installing AWK on a Windows 7 64-bit computer - Novice

(OP)
The fields within the text file is indeed delimited by commas and there is no double quotes around each field. Hence, not a true CSV file. Usually the data is sent as a pipe delimited text file.

So, what is displayed in the image is exactly what I see upon opening the file in a text editor.

Based on research, it appears that I can extract the "bad" records by using something similar to;

CODE

awk -F , 'NF != 11' <file 

Now, the challenge remains - how to quickly load AWK...

RE: Installing AWK on a Windows 7 64-bit computer - Novice

Why NF != 11, If I look at the picture you posted I see 9 fields.

RE: Installing AWK on a Windows 7 64-bit computer - Novice

I took this example file
bxwill

CODE

No, [Account_Number],[Sequence_Number],[Equipment_Flag],[State_Code],[Tracking_Number],[Sale_Date],[Sale_Amount],[Purchase_Zipcode]
  1,            1345,                2,N               ,Ca          ,233432, 289908   ,   20170215,        10000,             90001
  2,            1456,                1,N               ,Ca          ,            23678,***********,       120000,             90003
  3,           15443,                4,New             ,Ca          ,            58973,   20170402,       498211,             85007
  4,           58722,                1,O               ,Az          ,            87963,ABC        ,       449390,             85011 

and according to your criterions given above I wrote this validation script
bxwill.awk

CODE

# Run:
#   awk -f bxwill.awk bxwill.txt
BEGIN {
  FS = ","
  # number of errors
  nr_err = 0
}

# Line rules
NR == 1 {
 # skip header line
 next
}

{ 
  # chomp current line
  chomp_line()
  # remove spaces
  gsub(/[ ]+/, "", $0)
}

$0 ~ /^[ ]*$/ {
  # skip empty line
  next
}

{
  #printf("Processing #%2d :  %s\n", NR, $0)
  if (NF != 9) {
    errors[++nr_err] = "Line " NR " : Unexpected number of columns =" NF " !"
    next
  }  

  equipment_flag = $4
  if ((equipment_flag == "N" ) || (equipment_flag == "O")) {
    # pass
  }
  else {
    errors[++nr_err] = "Line " NR " : Unexpected value of equipment_flag = \"" equipment_flag "\"!"
  }
  
  tracking_number = $6
  if (tracking_number ~ /^[0-9]+$/) {
    # pass because it's number 
  }
  else {
    errors[++nr_err] = "Line " NR " : Unexpected value of tracking_number = \"" tracking_number "\"!"
  }

  sale_date = $7
  if (sale_date ~ /^[0-9]{8}$/) {
    # pass because it's date YYYYMMDD
  }
  else {
    errors[++nr_err] = "Line " NR " : Unexpected value of sale_date = \"" sale_date "\"!"
  }
}

END {
  # *** print result ***
  if (nr_err == 0) {
    print "File \"" ARGV[1] "\" is valid."
  }
  else {
    print "Following errors were found:"
    for (j=1; j <= nr_err; j++) {
      printf("#%02d : %s\n", j, errors[j])
    }
  }
}

# -- fuctions --
function chomp_line() {
  # strip out the carriage return or line feed at the end of current line
  # the function modifies global variable $0 (current line)
  sub(/\r$/, "", $0)
  sub(/\n$/, "", $0)
} 

For the given file the script delivers this result:

CODE

$ awk -f bxwill.awk bxwill.txt
Following errors were found:
#01 : Line 2 : Unexpected number of columns =10 !
#02 : Line 3 : Unexpected value of sale_date = "***********"!
#03 : Line 4 : Unexpected value of equipment_flag = "New"!
#04 : Line 5 : Unexpected value of sale_date = "ABC"! 

When I use the corrected file
bxwill_correct.txt

CODE

No, [Account_Number],[Sequence_Number],[Equipment_Flag],[State_Code],[Tracking_Number],[Sale_Date],[Sale_Amount],[Purchase_Zipcode]
  1,            1345,                2,N               ,Ca          ,           233432,   20170215,        10000,             90001
  2,            1456,                1,N               ,Ca          ,            23678,   20180214,       120000,             90003
  3,           15443,                4,N               ,Ca          ,            58973,   20170402,       498211,             85007
  4,           58722,                1,O               ,Az          ,            87963,   20180101,       449390,             85011 

then the script delivers this:

CODE

$ awk -f bxwill.awk bxwill_correct.txt
File "bxwill_correct.txt" is valid. 

RE: Installing AWK on a Windows 7 64-bit computer - Novice

(OP)
Yes, I misstated - the line should have been "NF != 9."

Appreciate the insight!

Had no idea whatsoever that AWK could perform as such! Or, if any application/program could.

Definitely appears that this is worth learning! Although, the learning curve appears steep...

Stepping through each line of the code and will probably have to obtain a book for reference before I fully understand what is going on.

For large text files such as near 1,000,000 records and over 180 columns, will AWK still be able to process with ease?

By the way, how were you able to present the tabular data as a table?

RE: Installing AWK on a Windows 7 64-bit computer - Novice

Quote (BxWill)


Definitely appears that this is worth learning! Although, the learning curve appears steep...
But learning of each programming language requires a bit of effort and AWK is one of the simplest.

Quote (BxWill)


For large text files such as near 1,000,000 records and over 180 columns, will AWK still be able to process with ease?
I never had so huge files, so have zero experience with it. Try it and you will see ...

Quote (BxWill)


By the way, how were you able to present the tabular data as a table?
What you mean with tabular data? Data separated by tabulator (character \t)?
The best would be, when you post an example, how your input data look like and what would be the desired output.


RE: Installing AWK on a Windows 7 64-bit computer - Novice

(OP)
Was referring to how you display the data in aligned columns.

When I paste the data from MS Excel, I was not able to display the data in aligned columns...

RE: Installing AWK on a Windows 7 64-bit computer - Novice

(OP)
Still not able to successfully download and use Gawk for Windows.

Clicked "Gawk for Windows" in mikrom's post and then clicked "Setup" next to "Complete package except sources."

At this point, I assumed that the setup is completed.

Is it necessary to download "Sources?"

Currently at the command line prompt...

Will re-read the Instructions - GnuWin Installation and Usage...

Any additional insight as to what I may be overlooking is appreciated.

RE: Installing AWK on a Windows 7 64-bit computer - Novice

You don't need sources. When setup not works then download ZIP with binaries.

RE: Installing AWK on a Windows 7 64-bit computer - Novice

Now I tried the setup program.
I clicked at the Setup and then the setup program gawk-3.1.6-1-setup.exe was downloaded. I clicked at the exe and installed awk. It seems to work:

CODE

c:\Program Files (x86)\GnuWin32\bin>awk --version
GNU Awk 3.1.6
Copyright (C) 1989, 1991-2007 Free Software Foundation.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/. 



Red Flag This Post

Please let us know here why this post is inappropriate. Reasons such as off-topic, duplicates, flames, illegal, vulgar, or students posting their homework.

Red Flag Submitted

Thank you for helping keep Tek-Tips Forums free from inappropriate posts.
The Tek-Tips staff will check this out and take appropriate action.

Reply To This Thread

Posting in the Tek-Tips forums is a member-only feature.

Click Here to join Tek-Tips and talk with other members!

Close Box

Join Tek-Tips® Today!

Join your peers on the Internet's largest technical computer professional community.
It's easy to join and it's free.

Here's Why Members Love Tek-Tips Forums:

Register now while it's still free!

Already a member? Close this window and log in.

Join Us             Close