Dedupe but with Time Range Criteria

dpgirl · May 9, 2007

I need help figuring out how to do the following:

I have a table that I want to remove duplicate records. However, my definition of a duplicate is any record that has the same Agent_ID and ANI values, but also have a Call_Date value that is within 1 hour of each other. (Call_Date is a date/time field. Agent_ID and ANI are character fields.) When there is a duplicate, I would like to keep the record with the last Call_Date only.

Here's sample data (where Rec_ID is a unique identifier):

Rec_ID Agent_ID ANI Call_Date
1 JOHN 123 5/9/07 09:23:05 AM
2 JOHN 123 5/9/07 10:09:05 AM
3 JOHN 123 5/9/07 10:57:24 AM
4 JILL 123 5/9/07 02:59:16 PM
5 JESS 456 5/9/07 04:50:22 PM
6 JESS 456 5/9/07 10:03:64 PM

My desired deduped result will be to keep records 3 through 6 but not 1. Note: The 1st record is dropped because it is within 1 hour of the 2nd record. The 2nd record is then compared to the 3rd record. The 2nd record is then dropped because it is within 1 hour of the 3rd record. (So the process needs to order the records by Call_Date first before comparing.)

Feel free to add flags to the existing table (to flag duplicates or non-duplicate records) or create an entirely new table with the desired results. However, I don't want to delete any records from the original table.

Thanks!

rac2 · May 9, 2007

Pseudo-code

Create a cursor
ORDER BY AgentID, ANI, and time.

Fetch the first record as the current record
Fetch the next record

Current record values are stored in @CurrentAgentID, @CurrentANI, and @CurrentTime

Next record values in @NextAgentID, @NextANI, and @NextTime

Begin Loop

If @CurrentAgentID = @NextAgentID
AND @CurrentANI = @NextANI
AND (@NextTime - @CurrentTime) < 1
Then mark current record as a dupe

Increment records
set current record values = next record values
Fetch a new next record

Loop

You would do this in a stored procedure.
If you wish to produce a recordset then follow this with a simple SELECT WHERE Duped IS NULL. Depending of whether this result can change with the passage of time and the addition of new data and performance considerations you may wish to create a temporary table instead of updating the real table.

And think about how to handle the final record in the cursor.

unclejimbob · May 10, 2007

I had a quick go at this as an exercise, I'm a newbie so I'm sure there is a more elegant approach ;-) I do not handle rows in any particular order, I just test that a row will not be within 1 hour of another row, this results in a scary execution plan and I wouldn't run it against a large result set, anyway for what its worth here it is...

[tt]
SELECT
-- Need this because we are not joining the table to itself via an equijoin
-- therefore we will have multiple child rows for each parent row
DISTINCT
mt1.Rec_ID,
mt1.Agent_ID,
mt1.ANI,
mt1.Call_Date
FROM
dbo.mytable mt1
-- Ensure that each row returned is not within 1 hour of another row
JOIN
dbo.mytable mt2
ON
mt2.Rec_ID <> mt1.Rec_ID
AND DATEADD(hour,1,mt2.Call_Date) <= mt1.Call_Date,
(
-- Find duplicates and return as a derived table
SELECT
Agent_ID,
ANI
FROM
dbo.mytable mt2
GROUP BY Agent_ID, ANI
HAVING COUNT(*)>1
) a
-- Ensure that the only results returned are for duplicates
WHERE
a.Agent_ID = mt1.Agent_ID
AND a.ANI = mt1.ANI
[/tt]

ujb

unclejimbob · May 10, 2007

The above solution retutns all rows that are duplicates and not withn 1 hour of each other, if you only want the last row then you could use something like this, but note that this will break if you have the same Agent_ID, ANI and Call_Date combination for 2 rows...

[tt]
SELECT DISTINCT
mt1.Rec_ID,
mt1.Agent_ID,
mt1.,
mt1.Call_Date
FROM
dbo.mytable mt1
JOIN
dbo.mytable mt2
ON
mt2.Rec_ID <> mt1.Rec_ID
WHERE
DATEADD(hour,1,mt2.Call_Date) <= mt1.Call_Date
AND
mt1.Call_Date = (
SELECT
MAX(Call_Date)
FROM
dbo.mytable mt3
WHERE
mt3.Agent_ID = mt1.Agent_ID
AND mt3.ANI = mt1.ANI
HAVING COUNT(*) > 1
)
[/tt]

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Dedupe but with Time Range Criteria

dpgirl

Technical User

rac2

Programmer

unclejimbob

Technical User

unclejimbob

Technical User

Similar threads

Part and Inventory Search

Sponsor