INTELLIGENT WORK FORUMS
FOR COMPUTER PROFESSIONALS

Log In

Come Join Us!

Are you a
Computer / IT professional?
Join Tek-Tips Forums!
  • Talk With Other Members
  • Be Notified Of Responses
    To Your Posts
  • Keyword Search
  • One-Click Access To Your
    Favorite Forums
  • Automated Signatures
    On Your Posts
  • Best Of All, It's Free!

*Tek-Tips's functionality depends on members receiving e-mail. By joining you are opting in to receive e-mail.

Posting Guidelines

Promoting, selling, recruiting, coursework and thesis posting is forbidden.

Jobs

Remove quotes inside a text field

Remove quotes inside a text field

(OP)
HI,
I have been given a csv file to import that contains almost every type of problem you can think of.
I have dealt with tabs, commas and pipes, which are all normally used as separators. But I am stuck with the quote char.

An example. the following is part of a text field: "UK games firms "stuck in the slow lane", HTC M8 landing March, cheap Xbox One denied"
I need to keep the surrounding quotes but the ones in the middle need to go. If I use the replace, then the outside pair also go.
How would I remove the middle ones? Note only very few records in the database have this issue.

I also have no connection to the original data and have to work from the csv they supply - they will not make any changes to the output in any way.

thanks for reading

regards
Graham B.

pc

Graham

RE: Remove quotes inside a text field

Quote:

Note only very few records in the database have this issue.

Is this only a one time thing, or will this happen very often? That is can you do it manually or does it need to be automated to be done regularly?

Otherwise, you would likely need a regular expression to get the interior quotes only. And doing it in a query is not the best approach.

Perhaps a script (PHP, or something else) that can process the data before insertion to the DB may be a better alternative.



----------------------------------
Phil AKA Vacunita
----------------------------------
Ignorance is not necessarily Bliss, case in point:
Unknown has caused an Unknown Error on Unknown and must be shutdown to prevent damage to Unknown.

Web & Tech

RE: Remove quotes inside a text field

(OP)
HI Phil,

Unfortunately, this can happen daily. The data is e-mail extracts, where the subject line can contain any number of quotes etc. I receive the file into a windows server directory and have to work on it from there.

pc

Graham

RE: Remove quotes inside a text field

Hi Koresnordic

if you're familiar with sed and regular expressions,
i recommend that you use a sed (4.2.1) script from dos.
i use it extensively with csv and dos files before i import
them into our database (foxpro)

RE: Remove quotes inside a text field

(OP)
Hi coldbohemia,

Unfortunately, this will have to run exclusively inside MySQL on a box not under my control, else I would create a bit of vb.net to do what I need. Thanks for the option though.

pc

Graham

RE: Remove quotes inside a text field

Quote (OP)


I need to keep the surrounding quotes but the ones in the middle need to go

may I ask why? perhaps you could give a larger segment of the csv so we can see the pain you're having.

but i suspect if it is as messy as you make out, the best bet is to pre-parse and clean it rather than do this in the database engine. if you really need to do it in mysql then a regex is an option. but for that to be successful you need to have data that is consistently dirty.

for example i was recently asked to process about 800 songs and parse the data into the song lyrics, title and author etc. there was no consistency in how each datablock was presented so no way of using a regex. I ended up writing a line-level lexer in a finite state-machine. which worked very well. I felt that writing a regex to deal with every situation would have taken longer than a more brute force approach. and potentially longer than a manual approach.

anyway - show us the damage and maybe we'll come up with something good.

RE: Remove quotes inside a text field

(OP)
Two example lines that are causing me an issue:

[user]@[company],04/02/2014 02:11:59,customer-reviews-messages@amazon.co.uk,[user]@[company],"[name], did 'Kindle Fire HD 7", HD Display, Wi-Fi, 8 GB - Includes Special Offers' meet your expectations? Review it on Amazon.co.uk",14349,<00000143faaa7dac-bfa654d3-aca4-4115-898e-3d2058000540-000000@eu-west-1.amazonses.com>
&
[name]@[company],04/02/2014 14:05:01,[Sender e-mail],[user]@[company],"New Helpdesk #40328 Firecrest order upload- If no P.O value on EDI file then default \"NA\"",4168,<20140204140212.1D2813AC001@[company]>

I want to use the following but it fails because of the quote after HD 7:

load data infile "\\\\[path to file]\\DetailedMessageStats-04-02-14.csv"
ignore into table import_temp
fields terminated by ','
optionally enclosed by '"'
lines terminated by '\r\n';

I have removed the users e-mail address to stop scrappers picking it up.

pc

Graham

RE: Remove quotes inside a text field

that is unfortunate....

are you certain that you cannot pre-process the file? that use case is solvable through (for example) using php as the import conduit. I see that you say it would take a lot of work in vb. it would not in php; unless there are other use cases that you haven't included here?

RE: Remove quotes inside a text field

(OP)
OK,

I left it to start with and hand balled the data in correctly, so I could work on the what I needed to do with it after this stage. Then I came back to it.

The way I resolved this is:
Load the whole record into one data field with:

CODE -->

-- Empty the temp table
TRUNCATE temp_holding;
-- Load in the data into the temp table as one log string
LOAD DATA INFILE "C:\\DetailedMessageStats-04-02-14.csv"
IGNORE INTO TABLE temp_holding
-- select charcters that are very unlikely to exist in the record
FIELDS TERMINATED BY '!!**@'
LINES TERMINATED BY '\r\n';

CALL import_routine; 

With the import_routine stored procedure as:

CODE -->

-- Set up variables
DECLARE temp_text VARCHAR(4096);
DECLARE temp_int INT DEFAULT 0;
DECLARE counter INT default 0;
DECLARE i INT default 0;
DECLARE first_quote INT default 0;
DECLARE next_quote INT default 0;
DECLARE split_one INT;
DECLARE split_two INT;
DECLARE split_three INT;
DECLARE split_four INT;
DECLARE split_five INT;
DECLARE split_six INT;
DECLARE comma_loc INT;

DECLARE testmode BOOLEAN DEFAULT false;

-- Empty the table the records will go into
TRUNCATE import_complete;

-- Remove blank lines & the header line
DELETE FROM temp_holding WHERE LENGTH(data) = 0 OR LEFT(DATA,9)='User,Time';

-- Remove the "total" lines, found by checking for no commas
DELETE FROM temp_holding WHERE INSTR(DATA,',') = 0;

-- Get the number of records
SELECT COUNT(*) FROM temp_holding INTO counter;

IF testmode THEN
	-- SET counter = 6;
	SELECT counter;
END IF;

SET i = 0;

-- loop through the records
WHILE i < counter do

	-- Get the data line
	SET temp_text = (SELECT data FROM temp_holding LIMIT i,1);
	
	-- Find how many quotes are in the data
	SET temp_int = LENGTH(temp_text) - LENGTH(REPLACE(temp_text,'"',''));

	-- only need to worry if there are more than 2 quotes in the data
	IF temp_int > 2 THEN
		
		IF testmode then
			SELECT i,temp_int,temp_text;
		END IF;

		-- Find the first quote
		SET first_quote = LOCATE('"',temp_text);

		-- Loop until there are only 2 quotes
		WHILE temp_int > 2 DO

			-- Get the next quote
			SET next_quote = LOCATE('"',temp_text,first_quote + 1);

			SET temp_text = CONCAT(LEFT(temp_text,next_quote - 1),MID(temp_text,next_quote + 1, 4096));

			SET temp_int = length(temp_text) - length(replace(temp_text,'"',''));

		END WHILE;

	END IF;

	-- Check between the quotes for commas
	SET first_quote = LOCATE('"',temp_text);
	SET next_quote = LOCATE('"',temp_text,first_quote + 1);
	SET comma_loc = LOCATE(',',temp_text,first_quote + 1);
		
	-- If the comma is beyond the second quote, then this is ok
	WHILE comma_loc < next_quote DO
		IF testmode THEN
			SELECT i,first_quote,next_quote,comma_loc,temp_text;
		END IF;

		SET temp_text = CONCAT(LEFT(temp_text,comma_loc - 1),' ',mid(temp_text,comma_loc + 1,4096));
		SET comma_loc = LOCATE(',',temp_text,first_quote + 1);
	END WHILE;

	-- Find where the fields break
	SET split_one = LOCATE(',',temp_text);
	SET split_two = LOCATE(',',temp_text,split_one + 1);
	SET split_three = LOCATE(',',temp_text,split_two + 1);
	SET split_four = LOCATE(',',temp_text,split_three + 1);
	SET split_five = LOCATE(',',temp_text,split_four + 1);
	SET split_six = LOCATE(',',temp_text,split_five + 1);

	-- Put into the table
	INSERT INTO import_complete
		VALUES(	
		MID(temp_text,1,split_one - 1), 								-- User
		MID(temp_text,split_one + 1, split_two - split_one - 1),		-- Date / time
		MID(temp_text,split_two + 1, split_three - split_two - 1),		-- Sender
		MID(temp_text,split_three + 1, split_four - split_three - 1),	-- Recipient
		MID(temp_text,split_four + 1, split_five - split_four - 1),		-- Subject
		MID(temp_text,split_five + 1, split_six - split_five - 1),		-- Size
		MID(temp_text,split_six + 1,4096)								-- Message ID
		);

	SET i = i + 1;

END WHILE; 
I am not saying this is the best or quickest way, but it works, so for the moment that is good enough for me.
Now I just need to find a way of getting an equivalent of the 'load data infile' that works in a stored procedure.

pc

Graham

Red Flag This Post

Please let us know here why this post is inappropriate. Reasons such as off-topic, duplicates, flames, illegal, vulgar, or students posting their homework.

Red Flag Submitted

Thank you for helping keep Tek-Tips Forums free from inappropriate posts.
The Tek-Tips staff will check this out and take appropriate action.

Reply To This Thread

Posting in the Tek-Tips forums is a member-only feature.

Click Here to join Tek-Tips and talk with other members!

Resources

Close Box

Join Tek-Tips® Today!

Join your peers on the Internet's largest technical computer professional community.
It's easy to join and it's free.

Here's Why Members Love Tek-Tips Forums:

Register now while it's still free!

Already a member? Close this window and log in.

Join Us             Close