Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations bkrike on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Help with tuning of a query 1

Status
Not open for further replies.

johnnybee

Programmer
Dec 16, 2002
24
GB
I have two tables GCR_CONTACTS and GCR_EADDRESSES , which are linked on contact_id.

I need to select the first 500000 contact_id's of those contacts where there are duplicates on first name & email address (they can be different case, and have extra spaces).

I have written this query:

Code:
select * from
(select c.contact_id,'same first_name||email'
from cccadm.gcr_contacts c, cccadm.gcr_eaddresses e 
where c.contact_id = e.contact_id
and exists
(select 'x' from
(select upper(trim(first_name))||upper(trim(eaddress)) as first_name_email 
from cccadm.gcr_contacts c, cccadm.gcr_eaddresses e 
where e.contact_id = c.contact_id
and e.eaddress_type ='EMAIL' and primary_eaddress_flag = 'Y'
group by upper(trim(first_name))||upper(trim(eaddress))
having count(distinct c.contact_id) > 1) z
where z.first_name_email = upper(trim(c.first_name))||upper(trim(e.eaddress))))
where rownum <= 500000

Contact_id is the primary key of GCR_CONTACTS.

Counts are:

GCR_CONTACTS : 10 million records
GCR_EADDRESSES : 18 million records
GCR_EADDRESSES where primary_eaddress_flag = 'Y' and eaddress_type = 'EMAIL' : 5 million records

Explain Plan (apologies for rubbish formatting):

Operation Object Name Rows Bytes Cost

SELECT STATEMENT Hint=CHOOSE 500 K 36 G
COUNT STOPKEY
FILTER
HASH JOIN 885 K 28 M 92265
TABLE ACCESS FULL GCR_CONTACTS 9 M 100 M 48082
VIEW index$_join$_003 17 M 384 M 31520
HASH JOIN 885 K 28 M 92265
INDEX FAST FULL SCAN GCR_EADDRS_CONTACT_ID_FK_I 17 M 384 M 4875
INDEX FAST FULL SCAN GCR_EADDRS_EADDRESS_I 17 M 384 M 4875
FILTER
SORT GROUP BY 886 36 K 73723
HASH JOIN 17 K 726 K 73644
VIEW index$_join$_006 1 M 51 M 21077
HASH JOIN 1 M 51 M
INDEX RANGE SCAN GCR_EADDR_FUNC2_I 1 M 51 M 9262.439453125
INDEX FAST FULL SCAN GCR_EADDRS_EADDRESS_I 1 M 51 M 9262.439453125
TABLE ACCESS FULL GCR_CONTACTS 9 M 100 M 48082

There are indexes on contact_id,eaddress_type and primary_eaddress_flag.

The query is taking far too long to exexcute, any ideas as to how to change/tune it?
 
Did you try something more simple, kinda like:
Code:
select contact_id, first_name_email, count(*)
  from (
select c.contact_id
     , upper(trim(first_name))||upper(trim(eaddress)) 
        as first_name_email
  from cccadm.gcr_contacts c, cccadm.gcr_eaddresses e 
 where c.contact_id = e.contact_id
   and eaddress_type ='EMAIL' 
   and primary_eaddress_flag = 'Y')
 group by contact_id, first_name_email
having count(*) > 1 
/

Check the explain plan and see if this answers your requirement.



----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
Maybe still more concise:
Code:
select * from (
select c.contact_id
     , upper(trim(first_name))||upper(trim(eaddress)) 
        as first_name_email
     , count(*)   
  from cccadm.gcr_contacts c, cccadm.gcr_eaddresses e 
 where c.contact_id = e.contact_id
   and eaddress_type ='EMAIL' 
   and primary_eaddress_flag = 'Y'
 group by contact_id, first_name_email
having count(*) > 1 )
where rownum <= 500000
/
HTH



----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
Maybe still more concise:
Code:
select * from (
select c.contact_id
     , upper(trim(first_name))||upper(trim(eaddress)) 
        as first_name_email
     , count(*)   
  from cccadm.gcr_contacts c, cccadm.gcr_eaddresses e 
 where c.contact_id = e.contact_id
   and eaddress_type ='EMAIL' 
   and primary_eaddress_flag = 'Y'
 group by contact_id, 
     , upper(trim(first_name))||upper(trim(eaddress)) 
having count(*) > 1 )
where rownum <= 500000
/
HTH



----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
Thanks for the reply, however, this does not quite solve the problem.

What I am after is those with duplicate first_name & email address, regardless of contact_id.

As you are grouping by both contact_id and first_name_email, your query will only pick out duplicates within contact_id.

Which I think is why my query ended up more complicated.

e.g. For this data...

1,'anemail@hotmail.com'
2,'anemail@hotmail.com'

I would need both contact_id's 1 & 2 returned.

 
Sorry, clicking on any of these does not work, only opens e-mail 'compose' window:

1,'anemail@hotmail.com'
2,'anemail@hotmail.com'

I would need both contact_id's 1 & 2 returned.





----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
Ooops, thought I had to click on the example...



----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
If it's a one time deal it may work faster if you do it in two parts:

a) Create an indexed table:
Code:
create table tmp_email_dups
(first_name_email 
,cnt
,constraint pk_tmp_email_dups primary key(first_name_email)
)
organization index pcttheshold 40 including cnt
overflow tablespace users nologging
as
select upper(trim(first_name))||upper(trim(eaddress)) 
          as first_name_email
     , count(*) cnt
  from cccadm.gcr_contacts c, cccadm.gcr_eaddresses e 
 where c.contact_id = e.contact_id
   and eaddress_type ='EMAIL' 
   and primary_eaddress_flag = 'Y'
 group by upper(trim(first_name))||upper(trim(eaddress)) 
having count(*) > 1 
/

b) Join the tables:

Code:
select * from 
  from cccadm.gcr_contacts c, cccadm.gcr_eaddresses e 
     , tmp_email_dups z
 where c.contact_id = e.contact_id
   and z.first_name_email = upper(trim(c.first_name))
                          ||upper(trim(e.eaddress))
   and rownum <= 500000
/


----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top