AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol
(OP)
thread271-1784472: Installing AWK on a Windows 7 64-bit computer - Novice

Revisited the post above regarding the processing of text files using AWK.

I have a large CSV file - 220 megabytes with approximately 1,300,000 records.

Upon reviewing the CSV file, I found that one of the fields - the "City" field - contains a comma within its data, such as "New York City, New York."

Because the CSV file contains a comma within one of the fields, I am not able to successfully import the data into a SQL Server database.

Is it possible for AWK to change the field delimiter within a CSV file from a comma to a pipe symbol, without opening the CSV file in an editor?

RE: AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol

How will awk know if the comma is data or a field delimiter?

I would consider using sed to change "New York City, New York" to "New York City New York", and the same for any other similar occurrences of a comma within the City field.

You could also look at the Unix/Linux tr (translate) command for switching all commas to pipes (or to any other character).


RE: AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol

The problem with specifically changing "New York City, New York" to "New York City New York" is that it's very specific and won't catch something like "Sacramento, CA". You'll end up playing whack-a-mole with every load.

Each record/row in the CSV should have one less comma than the number of fields you are expecting. That is, if you expect 10 fields, there should be 9 commas. I would first "filter" the file with grep and spit out those rows without the expected number of commas.

Example using five fields per valid record in the CSV...

CODE

$ cat commas.csv
one,two,three,four,five
1,2,3,4,5
a,b,hello, world,d,e
123,234,345,456,567
1,2,3
$
$ grep '^[^,]*,[^,]*,[^,]*,[^,]*,[^,]*$' commas.csv | tee commas.good
one,two,three,four,five
1,2,3,4,5
123,234,345,456,567
$
$ grep -v '^[^,]*,[^,]*,[^,]*,[^,]*,[^,]*$' commas.csv | tee commas.bad
a,b,hello, world,d,e
1,2,3 

Process the "good" file as normal, and manually inspect the "bad" records. The number should be fairly small. Fix those, then load them. You could even make your script notify you only when there are bad records to fix. That way you're not putting a manual step in an automated process. Only to handle exceptions.

RE: AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol

(OP)
Excellent thought - differentiating between a comma within a particular field and a comma separating fields.

Great insight to compare against the total number of commas expected and to separate the "good" records from the "bad" ones.

Especially considering that there may be more than one instance where commas appear within a particular field.

One other point: in changing to a pipe-delimited format, I want to ensure that there is not a trailing pipe at the end of each record.

Will review grep and the example code.

RE: AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol

The problem with changing your comma separator to something like a pipe character is that you still have the same problem, just with a different character.

For example, changing the commas to pipes in this record gets you a mess...

CODE

one,two,a|b|c,four,five 

If the data can contain a pipe character, you just broke a good record. In that example, a record with five valid fields becomes seven fields and will fail to load.

RE: AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol

SamBones makes a good point. You need to search the file first to find an ASCII character that can become the delimiter (one that isn't already within the file). When using the translate command to change the delimiter, translate is NOT limited to printable characters; it can take hex or octal values for the translation.
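For illustration, a minimal sketch of that check-then-translate idea (the file name data.csv is just a placeholder, and the pipe is only one candidate; tr also accepts octal escapes such as '\001' if you want a non-printable delimiter):

CODE

$ grep -c '|' data.csv                     # should report 0 if the candidate delimiter never occurs in the data
$ tr ',' '|' < data.csv > data_piped.csv   # translate every comma to the new delimiter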


RE: AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol

Actually, my point was to leave the separator as a comma. I speak from experience: universally changing the separator character throughout the entire file is, in effect, altering your source data. There are a lot of hidden problems that will result in data corruption, or in having to intervene every time you run the load. Plus, if this data plays into any kind of financial or regulatory accountability, an auditor will freak when you tell him/her how much you're altering the data prior to loading it.

Personally, I would spend a little more time learning the data. Just using grep and the simple pattern I gave, you can find every single record that has too many, or not enough, delimiters. Look at those bad records and see what you're getting. Is it an occasional "City, State"? A name like "John Smith, Jr."? Maybe a numeric field with something like "1,200"? The problem is, you just can't do a global search and replace of commas with some other character. The tools you are talking about don't have enough brains to do it right, and you'll just be altering your data.

Actually, if you look at the "spec" for CSV files (RFC 4180), it does allow for quoting fields so they can legally contain delimiters. For example...

CODE --> data.csv

231,"Jim Jones, Junior","123 Main St, Apt 12a","New York, NY",10038 

That's five fields, the middle three enclosed in quotes. So, even though there are 7 commas in the record, there are only 4 comma delimiters. Load programs like Oracle's SQL*Loader handle this with no problems at all. A lot of other load programs I've used in the past can handle commas in quoted fields too. Excel can load CSVs with extra commas in quoted fields, but it's limited to 1 million rows. Informatica can handle this kind of CSV too.

https://en.wikipedia.org/wiki/Comma-separated_valu...

So what are you trying to load this data into? What kind of database, and using what kind of load tools?

Is there any way you could share a few records? Maybe obfuscate any PII details.


RE: AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol

(OP)
Reviewed the data in detail and it appears that there is indeed just one field within the CSV file that has a comma within the field.

That field is titled "City" as originally stated.

Further, it appears that this will always be the case.

Yes, considering the auditor's perspective, altering the source data should not be performed, as SamBones stated.

However, considering that there is only one field that contains a comma, the idea of searching a 220 megabyte CSV file and replacing the delimiter with one that is not within the file is a novel approach.


Loading into a SQL Server database via BULK INSERT; example below:

CODE

BULK INSERT #TestData1
	FROM 'C:\Project\EquipmentData.txt' 
   WITH
      (
		 --FIELDTERMINATOR = ',',  --CSV Field Delimiter  Cannot bulk load...
		 FIELDTERMINATOR = '\|',  
		 --ROWTERMINATOR = '\n',    Error - column is too long ... for Row 1 column 34
		 --ROWTERMINATOR = '\r\n',  Error - column is too long ... for Row 1 column 34 
		 --ROWTERMINATOR = '\r',    Error - column is too long ... for Row 1 column 34 
		 ROWTERMINATOR = '0x0a',    
		 FIRSTROW = 2,
		 LASTROW = 100
      );
GO 


Noted is SamBones' comment:

"A lot of other load programs I've used in the past can handle commas in quoted fields too. Excel can load CSVs with extra commas in quoted fields, but it's limited to 1 million rows. Informatica can handle this kind of CSV too."


After several days of trial and error, this coincides with my present thought: importing a subset of the data from the CSV file into MS Excel and then saving/converting it to a pipe-delimited format for import into the SQL Server database.

But I want to load all of the data at the same time.

johnherman, is it possible to provide an example of your approach - searching the CSV file and replacing the delimiter with a delimiter that is not within the file?

SamBones, is it possible to provide a list of the other load programs that can handle commas in quoted fields for CSV files with over a million rows?

Thanks in advance for any additional insight!

RE: AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol

You still haven't shown us an example record with a comma within a field. I don't mind you changing names and numbers to protect the innocent, but it would be good to see an example record. If your comma-split field is enclosed in quotes, you might be able to fix it with sed or awk as per your original request. Something like...

CODE

$ cat tt.csv
one,two,three
one,"hi, there",three
"123 main st., Apt 22","New York, NY",90023
$
$ awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) } 1' tt.csv
one,two,three
one,hi there,three
123 main st. Apt 22,New York NY,90023
$ 

The field has to be properly quoted though, otherwise there's no way for it to know which commas to lose.

Looking at the syntax for the bulk insert, it does look like there is a "FIELDQUOTE" parameter that can be specified. Something like this?

CODE

BULK INSERT #TestData1
	FROM 'C:\Project\EquipmentData.txt' 
   WITH
      (
		 FORMAT = 'CSV' ,
		 FIELDTERMINATOR = ',' ,  
		 FIELDQUOTE = '"' ,
		 ROWTERMINATOR = '0x0a' , -- This is a newline char (\n), is this coming from *nix?
		 FIRSTROW = 2 ,
		 LASTROW = 100 ,
		 FORMATFILE = '<path_to_format_file_if_used>'
      );
GO 

Also, that 'FORMATFILE' option can give you more granular control, I believe.

This is a T-SQL BULK INSERT, right? Have you looked into SSIS BULK INSERT? One of my DBA homies here said SSIS BULK INSERT or BCP can handle quoted fields properly.

One last thing: most programs that can create a CSV file can specify a different field separator character. See if the app/program creating it could use a pipe character, or some other character, from the beginning instead of a comma. I would avoid anything that's whitespace (i.e., tab) and would pick a character that's not easy to type - maybe something that requires an Alt-### sequence to enter (see charmap for ideas).

RE: AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol

(OP)
Using EmEditor, I was able to review the CSV file in more detail.

Noted are the following:

1. There are 1,259,409 records.
2. From row 1,094,712 to row 1,259,408, there is a comma at the end of the record.
3. The last row contains text - evidently a comment.

Consequently, upon receiving large text files whose number of records exceeds what can be opened with MS Excel, I need to develop a process to quickly review and clean the data.

My understanding is that AWK can perform the "cleansing" of the data.

Further, I can see why I was not readily able to BULK INSERT the data into the SQL Server database: the inconsistent record endings and the comment on the last line.

In this situation, there were just 1,259,409 records. What if there were 5,000,000 + records?!

Consequently, I need to develop a process to identify records with errors such as commas at the end of some of the records, comments on the last row, etc.

Question - Does functionality exist within AWK to eliminate the commas at the end of the line for some of the records and any text on the last record that has a different pattern than the preceding records?

Attached is a copy of sample data for reference. Note, within the sample file, records 16 through 8061 have a comma at the end of each record. There are a total of 8061 records within the sample file.

Thanks in advance for any additional insight.

RE: AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol

Yes, with 'sed' and 'awk' you can do anything you need to do with the records.

What OS are you on? If this is Linux or some Unix, you're good to go. If it's Windows, your options change a little.

This command will remove that comment on the last row...

CODE

head -n -1 filename.csv 

This command will remove the first line of headers if you so desire...

CODE

tail -n +2 filename.csv 

This command will remove that extra comma in the quotes (replacing it with <space><dash><space>, "New York, NY" becomes "New York - NY")...

CODE

sed 's/\(".*\), \(.*"\)/\1 - \2/1' filename.csv 

If that trailing comma at the end of some lines is an extra, this will remove it...

CODE

sed 's/,$//g' filename.csv 

Actually, I just checked the data and that trailing comma is not extra. You actually have a NULL field at the end of some of the records. So the trailing comma needs to stay. This will show you that the comma at the end needs to stay...

CODE

head -20 filename.csv | sed 's/[^,]//g' 

Putting it all together, this might clean your file...

CODE

head -n -1 filename.csv | tail -n +2 | sed 's/\(".*\), \(.*"\)/\1 - \2/1' > filename_cleaned.csv 

Again, I'm more comfortable with 'sed', so this is using 'sed', not 'awk'. Just personal preference (and more appropriate for a serial edit of a file).
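As a quick sanity check on the cleaned file (a sketch building on the comma-count trick above), the number of commas per record can be histogrammed; ideally every line lands in a single bucket:

CODE

$ sed 's/[^,]//g' filename_cleaned.csv | awk '{ print length($0) }' | sort -n | uniq -c 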

RE: AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol

(OP)
Great catch on the trailing comma issue!

Upon opening the file with EmEditor, I was not able to display the data in a columnar mode where I could have readily seen that there was a null value at the end of some of the records.

Now that I think about this a little more, actually parsing the field into two columns may provide more flexibility within the database. Instead of "New York, NY" in one field, maybe have the city in one column and the state in another.

Therefore, can AWK or sed remove quotes and allow for the parsing of the data that results in the city name being in one column and the State abbreviation being in a separate column?

My initial thought is that as long as there is a "city name, state name" within quotes on each record, there will not be an issue.

The other thought that I have is - Can there be conditional tests performed using AWK and/or sed?

For example, maybe like nested IF statements in MS Excel -

If there are comments at the end of a file, delete the comments; if there is this problem, resolve it; if there is that problem, resolve it; and so on?

Sort of like a process that can be run on any large text file to detect the common problems (and the not-so-common problems), catalog them, and clean the text files so that they can be bulk inserted into SQL Server?
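For reference, awk's pattern { action } rules already behave like a chain of IF tests, and an explicit if/else is available inside an action. A minimal gawk sketch of that kind of conditional cleanup, assuming the header is on the first line, the trailing comment line contains no commas, and the quoted City field is the only place a stray comma appears (gensub is gawk-specific):

CODE --> clean.awk

NR == 1   { next }    # if this is the header row, skip it (drop this rule if the loader already skips it with FIRSTROW = 2)
$0 !~ /,/ { next }    # if the line has no commas at all, treat it as the trailing comment and skip it
{
    if ($0 ~ /"[^"]*, [^"]*"/)                                   # if there is a comma inside the quoted field...
        $0 = gensub(/("[^"]*), ([^"]*")/, "\\1 - \\2", "g", $0)  # ...replace ", " with " - "
    print
}

It would be run something like this (file names are only examples):

CODE

$ gawk -f clean.awk filename.csv > filename_cleaned.csv 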


RE: AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol

(OP)
As far as whether AWK can process the text file via nested IF functionality, this was already addressed in the previous thread.

Will re-visit the previous thread...

However, I am still interested in whether functionality exists within AWK or sed to parse the text that is within the quotes into two separate columns.

RE: AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol

Quote (BxWill)

Therefore, can AWK or sed remove quotes and allow for the parsing of the data that results in the city name being in one column and the State abbreviation being in a separate column?

CODE

$ # Remove double quote and space between comma and state abbreviation
$ sed 's/"//g;s/, /,/g' filename.csv > filename_fixed.csv 

Just killing the double quotes and the space before the state abbreviation is all that's needed; the comma is still in between the fields. The only possible problem is if you have any record that doesn't match this format ("City, ST").
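Purely as an awk counterpart of that sed one-liner (the same two substitutions, with the same caveat about records that don't match the "City, ST" format):

CODE

$ awk '{ gsub(/"/, ""); gsub(/, /, ","); print }' filename.csv > filename_fixed.csv 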

RE: AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol

(OP)
Downloaded the "Complete Package, except sources" for sed (version 4.2.1) and installed on a Windows 7 64-bit laptop.


Copied and pasted the following

"sed 's/\(".*\), \(.*"\)/\1 - \2/1' C:\EquipmentData_Latest.txt"

at the command prompt "C:\Program Files (x86)\GnuWin32\bin>" to remove the comma within the field "City."

Just to recap, the field "City" contains the city and state with a comma between them, and the data within this field is enclosed in double quotes for each of the over 1.3 million records.


Received the following error:

sed: -e expression #1, char 1: unknown command: '''


Currently attempting to resolve the error. However, considering that I am using sed for the first time, I am not yet familiar enough with its nuances/intricacies to readily determine what is causing the error.

Any immediate insight as to the resolution of the above-mentioned error?



RE: AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol

That command line was assuming a Unix-like environment. Windows handles single and double quotes differently. You could try putting the sed commands into a file.

CODE --> sed_commands.txt

s/\(".*\), \(.*"\)/\1 - \2/1 

Then run it like this...

CODE

sed -f sed_commands.txt C:\EquipmentData_Latest.txt 
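To keep the result, redirect the output to a new file rather than letting it scroll by in the console window (the output file name here is just an example):

CODE

sed -f sed_commands.txt C:\EquipmentData_Latest.txt > C:\EquipmentData_Cleaned.txt 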
