Can this be optimized? 1

MikeBronner · Nov 5, 2007

Hey Everyone!

I have a script that I am using to clean up some legacy data. It runs, and functions exactly like it should. However, there is one problem: it takes much too long to run.

Here's the scenario:
- I have 100,000 records that need to be processed.
- Currently it processes about 10,000 in the first hour, but it slows down more and more the farther it gets along.
- The main query (the one for the cursor) takes only ~ 4 seconds to complete when running by itself.

I've posted the code below, and was wondering if there is something obvious that I have missed regarding optimizing for performance?

Thanks

Code:

BEGIN
	DECLARE @s_account_no VARCHAR(20)
		,@s_company_name VARCHAR(255)
		,@s_tax_id VARCHAR(30)
		,@s_web_url VARCHAR(255)
		,@n_original_account_id BIGINT
		,@n_existing_company_id BIGINT
		,@s_company_reference VARCHAR(255)

	DECLARE c_list CURSOR FOR
		SELECT c1.accountno
			,LTRIM(RTRIM(c1.company)) AS company
			,LTRIM(RTRIM(cs_tax.contsupref)) AS tax_id
			,CASE
				WHEN dbo.regex_match('^(http[s]?://)?([a-zA-Z0-9]\.)?([a-zA-Z0-9]){1}([a-zA-Z0-9\-\.])+(\.[a-zA-Z]+){1}(/([a-zA-Z0-9\-\_])*)*(\.([a-zA-Z0-9\_]+))?$', LTRIM(RTRIM(cs_web.contsupref))) = 1 THEN
					LTRIM(RTRIM(cs_web.contsupref))
				ELSE
					NULL
				END AS web_url
			,dbo.strip_non_textual(LTRIM(RTRIM(REPLACE(REPLACE(REPLACE(REPLACE(c1.company, 'ltd', ''), 'llc', ''), 'inc', ''), ' ', '')))) AS company_name_reference
		FROM goldmine_test..contact1 AS c1
			LEFT OUTER JOIN goldmine_test..contsupp AS cs_tax ON c1.accountno = cs_tax.accountno
				AND (cs_tax.contact = 'Resale Tax ID'
					OR cs_tax.contact = 'Tax ID#')
			LEFT OUTER JOIN goldmine_test..contsupp AS cs_web ON c1.accountno = cs_web.accountno AND cs_web.contact = 'Web Site'
		WHERE LEN(ISNULL(c1.company, '')) > 0

	OPEN c_list

	FETCH NEXT FROM c_list
	INTO @s_account_no
		,@s_company_name
		,@s_tax_id
		,@s_web_url
		,@s_company_reference

	WHILE (@@FETCH_STATUS = 0)
	BEGIN
		SELECT @n_existing_company_id = NULL

		SELECT @n_existing_company_id = id
		FROM gm_companies
		WHERE dbo.strip_non_textual(LTRIM(RTRIM(REPLACE(REPLACE(REPLACE(REPLACE(name, 'ltd', ''), 'llc', ''), 'inc', ''), ' ', '')))) = @s_company_reference

		IF (ISNULL(@n_existing_company_id, -1) < 1)
		BEGIN
			INSERT INTO gm_companies
			(
				name
				,tax_id
				,web_url
			)
			VALUES
			(
				@s_company_name
				,@s_tax_id
				,@s_web_url
			)

			SELECT @n_existing_company_id = SCOPE_IDENTITY()
		END

		INSERT INTO gm_original_accounts
		(
			gm_id
			,company_id
		)
		VALUES
		(
			@s_account_no
			,@n_existing_company_id
		)

		FETCH NEXT FROM c_list
		INTO @s_account_no
			,@s_company_name
			,@s_tax_id
			,@s_web_url
			,@s_company_reference
	END

	CLOSE c_list
	DEALLOCATE c_list
END
GO

Take Care,
Mike

SQLSister · Nov 5, 2007

Never use a cursor to do an insert. It is exptremely bad for performance. SQL Server should be done ina setbased fashion.

you have the information in a select, use that instead of the values clause in the insert. The basic syntax is

Code:

insert table1 (field1, field2)
select field1, field2 from table2

of course considerably more complicated selects can be used as long as it has the same number of fields as you want to insert.

Now sinceyou are using regular expressions in some kind of UDF, I wouls suggest that is also a time consumer. Perhaps you should do the regular expression work forst and put the results to a table variable or temp table and then selct from there for the import rather than locking up prodcution tables doping that kind of conversion work.

"NOTHING is more important in a database than integrity." ESquared

MikeBronner · Nov 5, 2007

Good points, thanks for the feedback.

I have thought of doing straight INSERTS using SELECTS as well, however, for this situation I didn't know how to handle it:
I need to split the data from the source table into two tables, and at the same time preserve the referential integrity (i.e. inserting the foreign key).

How can I achieve that using your methods?

Thanks again!

Take Care,
Mike

SQLSister · Nov 5, 2007

what i usually do is do the first insert and then add the id column to my work table or temp table (well actually I usually havea blank id column form the start) and grab the id by doing a join onthe natural key or all the fields if need bby waht ever it takes to uniquely identify the record. Once I have the id, then I can use it in the next insert in a set-Based fashion.

Alternatively you could use the cursor for just the first insert and add the id field to you r temptable using scope identity. Then do the second insert outside the cursor in a set-based fashion. This would be faster than the current process but slower than it ineeds to be. The only reason I might do this is if for some reason the records you are inserting do not have a natural key and might not be unique even if you match on all the fields you just inserted so that one record at a time is the only way you could guarantee to find the id for a particular record.

"NOTHING is more important in a database than integrity." ESquared

MikeBronner · Nov 5, 2007

Thanks again SQLSister!

I will try that. This means I have to add another field to the table (which I wanted to avoid, but that's ok; I think it will speed up the query enormously).

Then in a second round I will process the saved data with the regex, and dedup, etc.

Thanks for all your suggestions; I'll be back if I run into more trouble.

Take Care,
Mike

ESquared · Nov 5, 2007

And keep in mind that in SQL 2005 you have access to the contents of your modification queries in a new query using the OUTPUT clause:

[tt]<OUTPUT_CLAUSE> ::=
{
[ OUTPUT <dml_select_list> INTO { @table_variable | output_table } [ ( column_list ) ] ]
[ OUTPUT <dml_select_list> ]
}
<dml_select_list> ::=
{ <column_name> | scalar_expression } [ [AS] column_alias_identifier ]
[ ,...n ]

<column_name> ::=
{ DELETED | INSERTED | from_table_name } . { * | column_name }[/tt]

Code:

DELETE Sales.ShoppingCartItem
    OUTPUT DELETED.*;

Code:

USE AdventureWorks;
GO
DECLARE @MyTableVar table( ScrapReasonID smallint,
                           Name varchar(50),
                           ModifiedDate datetime);
INSERT Production.ScrapReason
    OUTPUT INSERTED.ScrapReasonID, INSERTED.Name, INSERTED.ModifiedDate
        INTO @MyTableVar
VALUES (N'Operator error', GETDATE());

--Display the result set of the table variable.
SELECT ScrapReasonID, Name, ModifiedDate FROM @MyTableVar;
--Display the result set of the table.
SELECT ScrapReasonID, Name, ModifiedDate 
FROM Production.ScrapReason;

[COLOR=black #e0e0e0]For SQL and technical ideas, visit my blog, Squared Thoughts.

_{The best part about anything that has cheese is the cheese.}[/color]

MikeBronner · Nov 5, 2007

Ah yes, if only we had SQL 2005 on that server.

Unfortunately it's SQL 2000. I have had problems setting up a linked server connection in the past, but I'll post that in a separate thread (don't want to high-jack my own thread).

Take Care,
Mike

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Can this be optimized? 1

MikeBronner

Programmer

SQLSister

Programmer

MikeBronner

Programmer

SQLSister

Programmer

MikeBronner

Programmer

ESquared

Programmer

MikeBronner

Programmer

Similar threads

Part and Inventory Search

Sponsor