Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations bkrike on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Using DISTINCT Properly 1

Status
Not open for further replies.

jw2000

Programmer
Aug 12, 2005
105
US
I am try to retrieve all fields for non Southwest Pilots from the Pilot table but I only want unique emails. The reason is because there are some pilots with duplicate records.

--no error but I need all fields
select distinct(pilotEmail1)
from Pilot
where pilotEmail1 NOT LIKE '%@southwest.com%'

--returns all records, distinct doesn't work here, using a group by returns an error
select distinct(pilotEmail1), *
from Pilot
where pilotEmail1 NOT LIKE '%@southwest.com%'


How can I return the correct records for pilots with unique Emails here?

 
the problem is, which ones are the "correct" records?

if you have two rows with the same email, which row do you want? you have to answer this question in terms of the values of the other columns, you can't say "oh, just pick one"



r937.com | rudy.ca
 
If 2 rows have the same email, I would like to return the record that has a value for pilotEmail2. Pilots can have up to 3 emails.
 
Basically pilots have pilotId (primary key), pilotFirstName, pilotLastName, pilotEmail1, pilotEmail2, pilotEmail3, pilotPhone, pilotFax, pilotAddress, pilotCity, pilotState, pilotCountry, and pilotNumberOfEmails.

pilotNumberOfEmails is the number of email addresses that a particular pilot has. The original database was designed poorly so there are some duplicate pilot records (pilotNumberOfEmails would be 1). Older pilot records only have pilotEmail1. Thus, if a pilot with duplicate records has a value for pilotEmail2, I only want to return that record.
 
Try this:
Code:
select     * 
from       Pilot
where      pilotemailid2 in (select pilotemail2
                             from   pilot 
                             where  politemail1 in (select    pilotEmail1
                                                    from      Pilot
                                                    where     pilotEmail1 NOT LIKE '%@southwest.com%'
                                                    group by  politEmail1
                                                    having    count(*) > 1)
                                    and pilotemail2 is not null)
union 
select     * 
from       Pilot
where      pilotemailid1 in (select    pilotEmail1
                             from      Pilot
                             where     pilotEmail1 NOT LIKE '%@southwest.com%'
                             group by  politEmail1
                             having    count(*) = 1)

Regards,
AA
 
Hi AA,

Can you explan how your selects and union works? I am not sure how it works.
 
AA,

Your select returns 680 pilots whereas:

select distinct(pilotEmail1)
from Pilot
where pilotEmail1 NOT LIKE '%@southwest.com%'

returns over 1,400 pilots (more than double) with unique pilotEmail1 ... why is this happening?


 
ok, Here goes the explanation for the code:
From what I understand, pilotEmail1 would be repeated if there were more than one entry for a pilot.

Something like
PolitID, PilotName, PolitEmail1, PolitEmail2
1, 'Jack', jack@nw.com, NULL
2, 'Jack', jack@nw.com, jack@cont.com

In which case this select would get jack@nw.com
(select pilotEmail1
from Pilot
where pilotEmail1 NOT LIKE '%@southwest.com%'
group by politEmail1
having count(*) > 1)
the outer select would get the corresponding politemail2 which is not null.

The code after the union gets all polits who have only 1 politemail1.

Try breaking down the code and running each part to figure out whats missing.

Regards,
AA
 
AA,

I broke down the code and still can't figure out why there are less than half of the pilots returned as compared to using:

select distinct(pilotEmail1)
from Pilot
where pilotEmail1 NOT LIKE '%@southwest.com%'

Is there another way to use distinct or union to return all pilot information for unique pilotEmail1?
 
This query will only return pilots who have one entry for pilotemail1. Is that a valid assumption?
Code:
select     * 
from       Pilot
where      pilotemailid1 in (select    pilotEmail1
                             from      Pilot
                             where     pilotEmail1 NOT LIKE '%@southwest.com%'
                             group by  politEmail1
                             having    count(*) = 1)

The other query returns all pilots who have more than one entry for politemailid1 and where pilotemailid2 is not null.

Are there any other cases that we are missing?
 
The problem with the above query is it only returns pilots with unique politEmail1, and it does not return pilots with duplicate pilotEmail1. The query should return pilots with unique politEmail1 as well as one of the records for pilots with duplicate pilotEmail1.

I think the problem could be that a pilot with duplicate pilotEmail1's may have null pilotemail2's.
 
Eg.

PolitID, PilotName, PolitEmail1, PolitEmail2
1, 'Jack', jack@nw.com, NULL
2, 'Jack', jack@nw.com, NULL
 
To handle cases where a pilot has more than one entry for pilotemailID1 we have the other select statement but Yes, if the PolitEmail2 for both the entries is null then that pilotemail2 is not picked.

How do you want to handle that?



 
Do this and let us know what you get:
Code:
select   pilotemailid1
from     pilot a
where    pilotemailid2 is null
         and not exists (select 'x' from pilot b where a.pilotemailid1 = b.pilotemailid1 and b.pilotemailid2 is not null)
group by pilotemailid1
having   count(*) > 1
 
The query above returns 760 pilot emails.

"To handle cases where a pilot has more than one entry for pilotemailID1 we have the other select statement but Yes, if the PolitEmail2 for both the entries is null then that pilotemail2 is not picked. How do you want to handle that?"

If both entries for PilotEmail2 are null and both PilotEmail1 are the same, I would return Pilot information for where RecordQuality = 1 (another field we are using to determine which records with duplicate PilotEmail1 is better quality. This field is also used for pilots with non-duplicate PilotEmail1. RecordQuality = 1 or 2 with 1 being better quality).
 
Questions for you:

1) Is this a one time select or will be used as a canned report or something?

2) how many total records are we talking about here?

3) Are you ok with using temp tables?

If the record set is not huge then adding another union should not hurt. I agree this is not the most efficient way to do it but should work.
Code:
union
select     * 
from       Pilot
where      pilotemailid1 in (select   pilotemailid1
                             from     pilot a
                             where    pilotemailid2 is null
                                      and not exists (select   'x' 
                                                      from     pilot b 
                                                      where    a.pilotemailid1 = b.pilotemailid1 
                                                               and b.pilotemailid2 is not null)
                             group by pilotemailid1
                             having   count(*) > 1)
           and RecordQuality = 1

Let us know your concerns.

Regards,
AA
 
Adding the 3rd union above returns incorrect results - pilots with duplicate PilotEmail1 are being returned.

1) This will be used often for reporting purposes.
2) There is around 2,700 total records.
3) Would prefer not to use temp tables. A cursor would be used for the records returned to print out pilot information.


Regards,
JW
 
My mistake ... the problem is that pilots with duplicate pilotEmail1 can both have RecordQuality = 1. How can I change this union to return records with the smaller pilotId (primary key for the pilot table)?
 
Well, the problem is you did not provide any sample data to test the code, there would be no way to know where the problem is? When you say pilots with duplicate pilotemail1 are returned, did you check what the recordquality for them is? Is it 1 for both?

If you currently do not know what data is the error data then take a subset of the pilots by adding a where condition to pick up only few pilots and test the select statement individually. You will then know where the problem is.

If you identify the problem fixing it is not an issue.

 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top