INTELLIGENT WORK FORUMS
FOR COMPUTER PROFESSIONALS

Log In

Come Join Us!

Are you a
Computer / IT professional?
Join Tek-Tips Forums!
  • Talk With Other Members
  • Be Notified Of Responses
    To Your Posts
  • Keyword Search
  • One-Click Access To Your
    Favorite Forums
  • Automated Signatures
    On Your Posts
  • Best Of All, It's Free!

*Tek-Tips's functionality depends on members receiving e-mail. By joining you are opting in to receive e-mail.

Posting Guidelines

Promoting, selling, recruiting, coursework and thesis posting is forbidden.

Jobs

Insert Into Production Table from Temp Table - Exclude Duplicate Records
3

Insert Into Production Table from Temp Table - Exclude Duplicate Records

Insert Into Production Table from Temp Table - Exclude Duplicate Records

(OP)
I perform a monthly insert of multiple text files into a 2012 SQL server Database.

I bulk insert the text files into the respective temporary table and then use INSERT INTO to load from the temporary table into the production table after performing various validations.

The text files have duplicate records that I want to exclude from being loaded from the temporary table into the production table.

Duplicate records are defined to be any record in the text file that has the same value for [JurisdictionCode], [AccountNo], [SequenceNumber] and [SaleDate]. Per review, there are relatively few duplicate records - approximately 3% of the total records in any given text file. (Thought - Why let this relatively small number of records prevent the establishment of the composite primary key constraints? Load valid data and keep moving!)

By removing the duplicate records, I will be able to create a primary key constraint based on the afore-mentioned four fields.

The plan is to load all of the valid data. Then, investigate later the invalid data and records with duplicate data.

Using the code below, I have not been able to successfully load the data from the temporary table to the production table and exclude the duplicate records.

There are 0 records in the query result.

Currently trouble shooting but not quite able to locate the reason why there are no records in the query result.


What modifications are needed to load the data from the temporary table and exclude duplicate records that are within the text file?




CODE

INSERT INTO Equipment_2014 
SELECT RTRIM(LTRIM([JurisdictionCode])),
RTRIM(LTRIM([AccountNo])),
RTRIM(LTRIM([SequenceNumber])),
.
.
.

CASE
   when  len([SaleDate]) = 8
   and isdate([SaleDate]) = 1
   and rtrim(ltrim([SaleDate])) like '[0-9][0-9][0-9][0-9][0-1][0-2][0-3][0-9]'
   then convert(date,[SaleDate], 112) 
   else cast(null as date)
end as [SaleDate],
)
from #EquipmentData1
where 
[SaleDate] between convert(date, '20140101') and convert(date, '20141231')  
AND RTRIM(LTRIM([SaleAmount])) is not null
AND RTRIM(LTRIM([SaleDate])) is not null 
AND RTRIM(LTRIM([JurisdictionCode])) is not null 
AND RTRIM(LTRIM([AccountNo])) is not null  
AND NOT EXISTS(
SELECT t1.* FROM #EquipmentData1 t1
JOIN (
    SELECT [JurisdictionCode], [AccountNo], [SequenceNumber], [SaleDate]
    FROM #EquipmentData1
    GROUP BY [JurisdictionCode], [AccountNo], [SequenceNumber], [SaleDate]
    HAVING COUNT(*) > 1
) t2 ON t1.[JurisdictionCode] = t2.[JurisdictionCode] 
AND t1.[AccountNo] = t2.[AccountNo] 
AND t1.[SequenceNumber] = t2.[SequenceNumber] 
AND t1.[SaleDate] = t2.[SaleDate]
order by [JurisdictionCode], [AccountNo], [SequenceNumber], [SaleDate]
)

GO 

Another thought - Is it possible/feasible to bulk insert the text file into a staging table (rather than a temporary table) so I can track the records that were not loaded to the production table and facilitate future review of these "leftover" records? Further, I could have a field that designate the problem with the record - i.e. "Not Valid Sale Date", "Duplicate Record" etc.

Initially I created a "Duplicate Records Table" and loaded the duplicate records. Then, attempted to join the temporary table, #EquipmentData1 with the DuplicateRecordsTable and delete the duplicate records from the temporary table. Then, I was to load the validated data excluding the duplicate records from the temporary table into the production table.

I was not able to get the following code to work.

CODE

DELETE FROM #EquipmentData1 A
INNER JOIN DuplicateEquipmentData B
B.[JurisdictionCode] = A.[JurisdictionCode] 
AND B.[AccountNo] = A.[AccountNo] 
AND B.[SequenceNumber] = A.[SequenceNumber] 
AND B.[SaleDate] = A.[SaleDate] 

Appreciate any additional insight regarding best practices to handle duplicate records in the source text files.

RE: Insert Into Production Table from Temp Table - Exclude Duplicate Records

The patter to apply to bulk text file imports is

TEXTFIOLE -BULK LOAD-> Staging Tables
Staging Tables -MERGE-> Production Tables

What you do is nothing like that.

You decide about duplicates in the MERGE stage, not before.

The details depend on whether you want data coming from text to update the existing data or not. If something coming in text for the same Jurisdictiocode etc (all the criteria identifying an already existing record), than that is handled as an update, all rows with a non duplicate id are new.

And that's it.

We do have a command, that does the merge, it is MERGE.

For a simple case of a production and staging table both in the form (ID, Data) the merge statement is simply:

CODE

MERGE ProductionTabla as Target
USING StagingTable AS source (ID, Data)  
    ON (target.ID = source.ID)  
    WHEN MATCHED THEN   
        UPDATE SET Data= source.Data
WHEN NOT MATCHED THEN  
    INSERT (ID, DATA)  
    VALUES (source.ID, source.Data) 


And the same holds true, if you need more than one roow for the id and for the data, the statement just becomes more complex.

Before going into merge, you might need to do your data cleansing within the staging tables.

The way the merge changes, if you want to skip data already in production is simply skipping the WHEN MATCHED portion:

CODE

MERGE ProductionTabla as Target
USING StagingTable AS source (ID, Data)  
    ON (target.ID= source.ID)  
    WHEN MATCHED THEN   
        UPDATE SET Data= source.Data
WHEN NOT MATCHED THEN  
    INSERT (ID, DATA)  
    VALUES (source.ID, source.Data) 


Bye, Olaf.

RE: Insert Into Production Table from Temp Table - Exclude Duplicate Records

First some comments on your code

CODE

INSERT INTO Equipment_2014 
SELECT RTRIM(LTRIM([JurisdictionCode])),
RTRIM(LTRIM([AccountNo])),
RTRIM(LTRIM([SequenceNumber])),
.
.
.

CASE
   when  len([SaleDate]) = 8
   and isdate([SaleDate]) = 1
   and rtrim(ltrim([SaleDate])) like '[0-9][0-9][0-9][0-9][0-1][0-2][0-3][0-9]'
   then convert(date,[SaleDate], 112) 
   else cast(null as date)
end as [SaleDate],
)
from #EquipmentData1
where 
 -- this is wrong as saledate at this point is not yet a date
 -- if it is indeed a date at this point then the case statement above is not required
[SaleDate] between convert(date, '20140101') and convert(date, '20141231')

-- trims below are not required to check for nulls 
-- Unlike Oracle in sql server an empty string, which would be the result of rtrim/ltrim on a string with only spaces on it, is not converted to a null value
AND RTRIM(LTRIM([SaleAmount])) is not null
AND RTRIM(LTRIM([SaleDate])) is not null 
AND RTRIM(LTRIM([JurisdictionCode])) is not null 
AND RTRIM(LTRIM([AccountNo])) is not null  
AND NOT EXISTS (SELECT t1.*
                FROM #EquipmentData1 t1
                INNER JOIN ( SELECT [JurisdictionCode], [AccountNo], [SequenceNumber], [SaleDate]
                            FROM #EquipmentData1
                            GROUP BY [JurisdictionCode], [AccountNo], [SequenceNumber], [SaleDate]
                            HAVING COUNT(*) > 1
                           ) t2
                ON t1.[JurisdictionCode] = t2.[JurisdictionCode] 
                   AND t1.[AccountNo] = t2.[AccountNo] 
                   AND t1.[SequenceNumber] = t2.[SequenceNumber] 
                   AND t1.[SaleDate] = t2.[SaleDate]
-- remove order by from here as it is invalid code
--order by [JurisdictionCode], [AccountNo], [SequenceNumber], [SaleDate]
              )

GO 

as you with to both only insert into final table the "good" records, and insert into a error table the "bad" records I would advise you to do as follows instead.

CODE

create another intermediary temp table to hold all the records, good and bad including the ones that fail the saledate validadtion (which is wrong on your code)

if object_id('tempdb..#validate_Equipment' is not null
   drop table #validate_Equipment;

select rtrim(ltrim(ed1.JurisdictionCode))
     , rtrim(ltrim(ed1.AccountNo))
     , rtrim(ltrim(ed1.SequenceNumber))
       ...
       ...
       ...
     , case
       when len(ed1.SaleDate) = 8
        and isdate(ed1.SaleDate) = 1
        and rtrim(ltrim(ed1.SaleDate)) like '[0-9][0-9][0-9][0-9][0-1][0-2][0-3][0-9]'
            then convert(date, ed1.SaleDate, 112) 
       else cast(null as date)
       end as SaleDate
-- fields related to validation
     , ed1.SaleDate as Input_SaleDate
     , dup.count_dups
into #validate_Equipment
from #EquipmentData1 ed1
left outer join (select JurisdictionCode
                      , AccountNo
                      , SequenceNumber
                      , SaleDate
                      , count(*) as count_dups
                 from #EquipmentData1
                 group by JurisdictionCode
                        , AccountNo
                        , SequenceNumber
                        , SaleDate
                ) dup
   on dup.JurisdictionCode = ed1.JurisdictionCode
   and dup.AccountNo = ed1.AccountNo
   and dup.SequenceNumber = ed1.SequenceNumber
   and dup.SaleDate = ed1.SaleDate

the above sql will give you all records from table #EquipmentData1, with SaleDate already validated, and with a count of duplicates associated with each entry (or null if one of the 4 fields are null)


Then insert into final table the valid ones

insert into Equipment_2014
select ve.JurisdictionCode
     , ve.AccountNo
     , ve.SequenceNumber
       ...
       ...
       ...
     , ve.SaleDate
from #validate_Equipment ve
where ve.SaleDate between convert(date, '20140101') and convert(date, '20141231') -- note that this will also ignore the records that are null
 and ve.JurisdictionCode is not null
 and ve.AccountNo is not null
 and ve.SequenceNumber is not null
 and ve.count_dups = 0 -- note that this will also ignore the records that are null

and then insert into a errors table the ones that failed for any reason.

insert into Equipment_2014_errors
select ve.JurisdictionCode
     , ve.AccountNo
     , ve.SequenceNumber
       ...
       ...
       ...
     , ve.SaleDate
-- fields related to validation also added to errors table so they can be queried easily
     , ed1.SaleDate as Input_SaleDate
     , dup.count_dups

from #validate_Equipment ve
where not (ve.SaleDate between convert(date, '20140101') and convert(date, '20141231') -- note that this will also ignore the records that are null
       and ve.JurisdictionCode is not null
       and ve.AccountNo is not null
       and ve.SequenceNumber is not null
       and ve.count_dups = 0 -- note that this will also ignore the records that are null
          )

note that on the above I "negated" the valid condition to make it easier to code
Alternative to the above where clause would be

where ve.SaleDate is null 
   or ve.SaleDate not between  convert(date, '20140101') and convert(date, '20141231') 
   or ve.xx is null
   or ve.xx is null
   or ve.xx is null
   or ve.count_dups is null 
   or ve.count_dups > 0 

Regards

Frederico Fonseca
SysSoft Integrated Ltd
www.syssoft-int.com

FAQ219-2884: How Do I Get Great Answers To my Tek-Tips Questions?
FAQ181-2886: How can I maximize my chances of getting an answer?

RE: Insert Into Production Table from Temp Table - Exclude Duplicate Records

(OP)
Would the use of MERGE allow for the exclusion of duplicate data within the source text files or is it primarily to allow for the handling of only duplicate data that is defined as pre-existing data in the production table relative to data within the text file?

I want to exclude the duplicate data that is within the text file as "effortlessly" as possible.

My understanding is that if I set a primary key constraint on the four fields, no duplicate data will be allowed into the production table. However, I believe that the Insert Into statement will fail because there are duplicate data that violate the constraints.

For example, say I have a monthly text file with say, 500,000 records that contain 1000 records that have the same value in the four previously mentioned fields.

Will the MERGE statement allow for the import of the 499,000 "good" records whereas using the "Insert Into" will not or will the MERGE statement allow me to import all 500,000 records as long as there is no "conflict" with any pre-existing records in the table?

RE: Insert Into Production Table from Temp Table - Exclude Duplicate Records

MERGE only compares source and target row by row, that's right.Why would there be duplicate data in the text files? Or does this only come in, because you import multiple text files first? Then just change processing file by file.

To remove duplicate base in an ID, even on a compound ID, you will need one or more columns denoting which record is the one to keep, and you have to define what is the record to keep: The chronologically first? The last? The one with the highest amount stored in it? And if that is undecidable, you really have to go back to the source and correct the way these text files are generated in the first place, because that makes clear the problem is not solvable besides randomly always picking the correct row, which nobody can expect from you.

Bye, Olaf.



RE: Insert Into Production Table from Temp Table - Exclude Duplicate Records

(OP)
Error received upon running the portion of the sql script to insert into #validate_Equipment from #EquipmentData1 ed1;

CODE

An object or column name is missing or empty. For SELECT INTO statements, verify each column has a name. 
For other statements, look for empty alias names. Aliases defined as "" or [] are not allowed. Change 
the alias to a valid name. 

Did create the field "Count_Dups" as the last field (with an integer data type) on the temporary table, #validate_Equipment.

However, the error remains.

Troubleshooting continues...

Will aliasing the fields in the SELECT portion resolve the error?

CODE

select rtrim(ltrim(ed1.JurisdictionCode))
     , rtrim(ltrim(ed1.AccountNo))
     , rtrim(ltrim(ed1.SequenceNumber))
       ...
       ...
       ...
     , case
       when len(ed1.SaleDate) = 8
        and isdate(ed1.SaleDate) = 1
        and rtrim(ltrim(ed1.SaleDate)) like '[0-9][0-9][0-9][0-9][0-1][0-2][0-3][0-9]'
            then convert(date, ed1.SaleDate, 112) 
       else cast(null as date)
       end as SaleDate
-- fields related to validation
     , ed1.SaleDate as Input_SaleDate
     , dup.count_dups
into #validate_Equipment
from #EquipmentData1 ed1
left outer join (select JurisdictionCode
                      , AccountNo
                      , SequenceNumber
                      , SaleDate
                      , count(*) as count_dups
                 from #EquipmentData1
                 group by JurisdictionCode
                        , AccountNo
                        , SequenceNumber
                        , SaleDate
                ) dup
   on dup.JurisdictionCode = ed1.JurisdictionCode
   and dup.AccountNo = ed1.AccountNo
   and dup.SequenceNumber = ed1.SequenceNumber
   and dup.SaleDate = ed1.SaleDate 


RE: Insert Into Production Table from Temp Table - Exclude Duplicate Records

yes - I forgot to put them in.

should be
select rtrim(ltrim(ed1.JurisdictionCode)) as JurisdictionCode
, rtrim(ltrim(ed1.AccountNo)) as AccountNo
, rtrim(ltrim(ed1.SequenceNumber)) as SequenceNumber
...

Regards

Frederico Fonseca
SysSoft Integrated Ltd
www.syssoft-int.com

FAQ219-2884: How Do I Get Great Answers To my Tek-Tips Questions?
FAQ181-2886: How can I maximize my chances of getting an answer?

RE: Insert Into Production Table from Temp Table - Exclude Duplicate Records

(OP)
The error persists.

Comparing each field within the two temporary tables, #validate_Equipment and #EquipmentData1.

Also added the column "count_dups" to the temp table, #EquipmentData1?

Is there an alternative approach that accomplish the objective of loading the valid data into the production table?

RE: Insert Into Production Table from Temp Table - Exclude Duplicate Records

(OP)
Upon highlighting the line "into #validate_Equipment" within the section displayed below, I receive "No column was specified for column 1 of #validate_Equipment

CODE

into #validate_Equipment
from #EquipmentData1 ed1 

Therefore, it appears that a column alias for the Left Join is needed.

However, "dup" is already specified as the alias for the Left Join.

RE: Insert Into Production Table from Temp Table - Exclude Duplicate Records

(OP)

Did resolve the error by modifying the data type for several fields on the temporary table, #validate_Equipment
to coincide with the data type for the field as displayed on the temp table #EquipmentData1.

Therefore, data was successfully loaded into the "validate_Equipment" temporary table when I highlight the "Insert into #validate_Equipment" section.

This leads to another question, should the data type for all fields on the temporary fields mentioned above be "Not Null?"

Or, maybe it does not matter as long as the data types coincide and that the "Null" or "Not Null" is more a concern for the tables in production?

Now, upon highlighting the "insert into Equipment_2014 section of the sql", no records were displayed. Stepping through the sql script by just selecting the "select statement" going forward and excluding the WHERE section still did not result in any rows in the query result.

Only when modifying the line displayed below is when rows began to appear in the query result set.

CODE

From
where ve.SaleDate between convert(date, '20140101') and convert(date, '20141231')

To
where ve.SaleDate between convert(date, '2014-01-01') and convert(date, '2014-12-31') 

Then, re-running the sql script on the section "Insert into Equipment_2014" and highlighting successive lines in the WHERE section, there were fewer and fewer records displayed, as can be expected because "and" precedes each line. However, upon highlighting the full sql - including all of the lines that are preceded with "and" results in no records in the query result. Hence, no records will be imported into the table "Equipment_2014."


CODE

insert into Equipment_2014
select ve.JurisdictionCode
     , ve.AccountNo
     , ve.SequenceNumber
       ...
       ...
       ...
     , ve.SaleDate
from #validate_Equipment ve
where ve.SaleDate between convert(date, '20140101') and convert(date, '20141231') -- note that this will also ignore the records that are null
 and ve.JurisdictionCode is not null
 and ve.AccountNo is not null
 and ve.SequenceNumber is not null
 and ve.count_dups = 0 -- note that this will also ignore the records that are null 


If a field is null on a record, I still would like for that record to be displayed within the database. I believe that the previously recommendation of preceding the lines in the WHERE section with "and" will result in some records not displaying in the database.

Any additional insight is appreciated.

Note, due to setting up a database for the first time, the learning curve is not as steep as previously and hopefully, the questions are not redundant.

RE: Insert Into Production Table from Temp Table - Exclude Duplicate Records

(OP)
Spoke too soon.

Overlooked the statement that "ve.count_dups = 0" already considers the null records.

Therefore, it appears that I need only include this line and not include the previous lines prefaced with "and."

RE: Insert Into Production Table from Temp Table - Exclude Duplicate Records

regarding "where ve.SaleDate between convert(date, '20140101') and convert(date, '20141231')"
- this is an error that I had not noticed on your original code and didn't correct - should have been
"where ve.SaleDate between convert(date, '20140101', 112) and convert(date, '20141231', 112)"

regarding "ve.count_dups = 0" should be "ve.count_dups = 1" - my error on this.
(or "or ve.count_dups > 1" instead of "or ve.count_dups > 0" on the alternate version of the code)

as for having nulls on your database as you mentioned - your original code was excluding them so that is what I did - you will need to look at the data yourself and see if any of the records on table Equipment_2014_errors should be considered (after you do the change above).


Regards

Frederico Fonseca
SysSoft Integrated Ltd
www.syssoft-int.com

FAQ219-2884: How Do I Get Great Answers To my Tek-Tips Questions?
FAQ181-2886: How can I maximize my chances of getting an answer?

RE: Insert Into Production Table from Temp Table - Exclude Duplicate Records

(OP)
Did make the revisions and not able to get past the Insert into Equipment_2014 Errors section.

Error received is

CODE

Operand type clash: date is incompatible with int 


Using the sql like;

CODE

insert into Equipment_2014_Errors
select 
E.[JurisdictionCode],
E.[AccountNumber],
E.[SEQUENCE_NUMBER],
E.[SALE_AMOUNT],
E.[SALE_DATE] as Input_SALE_DATE

from #validate_EquipmentData E  

where not (E.SALE_DATE between convert(date, '20140101',112) and convert(date, '20141231',112)  
       and E.JurisdictionCode is not null
       and E.AccountNumber is not null
       and E.SEQUENCE_NUMBER is not null
       and E.count_dups = 1 
	   ) 

Note, when I query the table #validate_EquipmentData, the format of the Sale Date is like 2014-01-01.

CODE

select distinct top 10 SALE_DATE from #EquipmentData1  --format YYYYMMDD
select distinct top 10 SALE_DATE from Equipment_2014  --format YYYY-MM-DD
select distinct top 10 SALE_DATE from #validate_EquipmentData  --format YYYY-MM-DD
select distinct top 10 SALE_DATE from Equipment_2014_Errors  --format ?? 


Re-visited the dates throughout the sql script;
1. All fields declared as date format
2. Dates enclosed by apostrophe

Also data types across all temporary tables are consistent.


Any clue as to the cause and resolution to this error?

What am I missing?

Tried the following modifications to the filter on the Sale Date but the error persists.

CODE

where E.SALE_DATE between convert(date, '20140101') and convert(date, '20141231') 
where E.SALE_DATE between '20140101' and '20141231'   
where E.SALE_DATE between '2014-01-01' and '2014-12-31'   
where E.SALE_DATE between convert(date, '2014-01-01') and convert(date, '2014-12-31')  
where E.SALE_DATE between convert(date, '20140101',112) and convert(date, '20141231',112) 



RE: Insert Into Production Table from Temp Table - Exclude Duplicate Records

(OP)
Error resolved.

RE: Insert Into Production Table from Temp Table - Exclude Duplicate Records

I didn't read all the posts and this might be the clunkiest way but I use a CTE in a similar situation (Bulk copy and then manipulate)

I left all the column names just to show how granular the criteria can be:

CODE --> sql

WITH CTE AS(
   SELECT
CreateDate,
ServiceVendorNumber, 
ServiceVendorName, 
PolicyNumber, 
BSN,
PolicyName, 
TaxID,
PolicyStatusEffectiveDate, 
PolicyStatus,
ReportingPeriodStartDate,
ReportingPeriodEndDate,
ManualClass, 
ManualClassType,
ManuaClassDescription, 
BWCCustomerIDforincludedIndividuals, 
IndividualFirstName,
IndividualMiddleName,
IndividualLastName,
IndividualTaxID,
ManualClassRate,
ReportingType, 
NumberofEmployees, 
Payroll,
RN = ROW_NUMBER()OVER(
PARTITION BY
CreateDate,
ServiceVendorNumber, 
ServiceVendorName, 
PolicyNumber, 
BSN,
PolicyName, 
TaxID,
PolicyStatusEffectiveDate, 
PolicyStatus,
ReportingPeriodStartDate,
ReportingPeriodEndDate,
ManualClass, 
ManualClassType,
ManuaClassDescription, 
BWCCustomerIDforincludedIndividuals, 
IndividualFirstName,
IndividualMiddleName,
IndividualLastName,
IndividualTaxID,
ManualClassRate,
ReportingType, 
NumberofEmployees, 
Payroll
ORDER BY PolicyNumber)
FROM BWC_RateFileBulkCopy
)
DELETE FROM CTE WHERE RN > 1 

Red Flag This Post

Please let us know here why this post is inappropriate. Reasons such as off-topic, duplicates, flames, illegal, vulgar, or students posting their homework.

Red Flag Submitted

Thank you for helping keep Tek-Tips Forums free from inappropriate posts.
The Tek-Tips staff will check this out and take appropriate action.

Reply To This Thread

Posting in the Tek-Tips forums is a member-only feature.

Click Here to join Tek-Tips and talk with other members!

Resources

Close Box

Join Tek-Tips® Today!

Join your peers on the Internet's largest technical computer professional community.
It's easy to join and it's free.

Here's Why Members Love Tek-Tips Forums:

Register now while it's still free!

Already a member? Close this window and log in.

Join Us             Close