Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chriss Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

SQL Puzzle #... 8?: Median values 4

Status
Not open for further replies.

vongrunt

Programmer
Mar 8, 2004
4,863
HR
Overview
We all probably know wtf is median value, but anyway...

Imagine any ordered set of values, like this one:

1 3 7 8 17 23 42

Median represents "middle" value in set, or in above case 4th element - 8. For even count of elements median becomes mean (average) of two middle values. So if we remove 8, median becomes (7+17)/2 = 12.

Purpose of median value is statistical - it gives "representative" value from set, and usually does it better than average value - assuming data distribution is not too f**ed up. Visit wikipedia for more information.

Objective
The goal of this puzzle is to calculate median per group. Sorta like custom aggregate, though even good SQL2005/CLR implementation is somewhat of tricky. Here is sample gen script:
Code:
create table sampleData ( groupID int, numValue int )
insert into sampleData values ( 1,  1 )
insert into sampleData values ( 1,  2 )
insert into sampleData values ( 1,  6 )
insert into sampleData values ( 1,  16 )
insert into sampleData values ( 1,  7 )
insert into sampleData values ( 2,  3 )
insert into sampleData values ( 2,  9 )
insert into sampleData values ( 2,  23 )
insert into sampleData values ( 2,  31 )
insert into sampleData values ( 3,  10 )
insert into sampleData values ( 3,  17)
insert into sampleData values ( 3,  52 )
insert into sampleData values ( 3,  66 )
insert into sampleData values ( 4,  18 )
To make things simpler, let's assume numValue cannot be NULL. Expected result set is:
Code:
groupID medianValue
-------.-----------
      1         6.0
      2        16.0
      3        34.5
      4        18.0
Rules & restrictions
No OLAP/Analysis services allowed :p

You can do it with SQL2000, Yukon, whatever. Personally I'd still value 2000 code more - simply because it is harder to do.

There are no restrictions about T-SQL features used - as long as code is written set-based.

The most elegant answers (yup, subjective) get some stars...

-----
P.S. to George: what happened with points in polygon thing? Shall we continue testing?

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
How 'bout ROW_NUMBER()? [wink]

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
'xcuse me for some time, I found thread183-1192934 also quite challenging...

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
yep, take out distinct and use ROW_NUMBER fixes it

Code:
select avg( convert(money,numvalue)),z.groupid from (
select case when count(groupid)%2 =0 then (count(groupid)/2)else (count(groupid)/2) + 1  end as GroupStart,
case when count(groupid)%2 =0 then (count(groupid)/2)  + sign(count(groupid) +1 %2) else (count(groupid)/2)+1 end as GroupEnd,groupid 
from(
select ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY numValue ) as 'rownumberdense',* 
from (select  * from sampleData) zz
) c
group by groupid) z join (select ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY numValue ) as 'rownumberdense',* 
from (select  * from sampleData) xx
) p on z.groupid =p.groupid
and p.rownumberdense between groupstart and GroupEnd
group by z.groupid

Denis The SQL Menace
SQL blog:
Personal Blog:
 
and even shorter
Code:
select avg( convert(money,numvalue)),z.groupid from (
select case when count(groupid)%2 =0 then (count(groupid)/2)else (count(groupid)/2) + 1  end as GroupStart,
case when count(groupid)%2 =0 then (count(groupid)/2)  + sign(count(groupid) +1 %2) else (count(groupid)/2)+1 end as GroupEnd,groupid 
from(
select ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY numValue ) as 'rownumberdense',* 
from sampleData) c
group by groupid) z join (select ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY numValue ) as 'rownumberdense',* 
from sampleData ) p on z.groupid =p.groupid
and p.rownumberdense between groupstart and GroupEnd
group by z.groupid

Denis The SQL Menace
SQL blog:
Personal Blog:
 
Here is another idea.

Suppose that number of items per group is N.
Find item m1 at location FLOOR((N+1)/2)
Find item m2 at location CEILING((N+1)/2)
Find average of these two values.

Or:
N1 = 1 + COUNT(*)
For each group, calculate average for items with row_numbers between N1/2 and N1/2+N1%2

In other words: global AVG aggregate on CTE with ROW_NUMBER, joined with BETWEEN on vanila COUNT(*) aggregate query.

Thoughts?

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
Or in other words (in this case CTE does nothing that derived table can't do so...):
Code:
select R.groupID, AVG(1.0*R.numValue) as medianValue
from 
(	select GroupID, numValue, row_number() over(partition by groupID order by NumValue) as rowno
	from sampleData
) R
inner join
(	select GroupID, 1+count(*) as N
	from sampleData
	group by GroupID
) G
on R.GroupID = G.GroupID and R.rowNo between N/2 and N/2+N%2
group by R.groupID
I guess it should work but am not sure...

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
How about NTILE(2)?

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
I posted my SQL2000 solution, what did you guys think?

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
I'm jus testing it. That said, how about:
Code:
create function fn_getMedian( @groupID int )
returns decimal(5, 3)
as
begin
	declare @m1 int, @m2 int, @ret decimal(5, 3)
	select top 50 percent @m1= numValue from sampleData where groupID=@groupID order by numValue asc
	select top 50 percent @m2= numValue from sampleData where groupID=@groupID order by numValue desc

	set @ret = (@m1+@m2)/2.0

	return @ret
end
go
Test:
Code:
select groupID, dbo.fn_getMedian( groupID)
from sampleData
group by groupID
order by groupID
?

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
I don't think that works, NTILE(2) will put the whole set into 2 buckets

if your set has 8 rows the first 4 will be 1 the next 4 will be 2

if your set has 7 rows the first 4 will be 1 the next 3 will be 2


if your set has 6 rows the first 3 will be 1 the next 3 will be 2

if your set has 9 rows the first 5 will be 1 the next 4 will be 2


divide set by 2 and put extra in first bucket etc etc

I don't see a real use for this function





Denis The SQL Menace
SQL blog:
Personal Blog:
 
> I posted my SQL2000 solution, what did you guys think?

Yup, somewhat classic idea with identity table. I think it can be done without UNION. Otherwise looks OK... and works OK.

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
I tested the performance of your function based solution against my table variable solution on a table that has 10,800 records. My solution was slightly faster than yours, but not enough to be significant.

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
> I don't think that works, NTILE(2) will put the whole set into 2 buckets

Here is my thought process:

Find m1 = MAX() value from 1st bucket.
Find m2 = MIN() value from 2nd bucket.
If count(*) per group is even number, then m2=m1
Calculate (m1+m2)/2.0

I think it's worth trying...

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
ATM I'm trying to simplify George's code... maybe we can squeeze out some extra 10% speed or something.

And... stars for everyone.

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
Hehe... how about this one:
Code:
Declare @Temp Table(RowId Integer Identity(1,1), GroupId Integer, NumValue Integer)

Insert Into @Temp(GroupId, NumValue)
Select     GroupId, NumValue
From     SampleData
Order By GroupId, NumValue

select A.GroupID, avg(1.0*A.NumValue) as medianValue
from @Temp A
inner join
(	select GroupID, floor(avg(1.0*RowID)) as m1, ceiling(avg(1.0*RowID)) as m2
	from @temp
	group by GroupID
) B 
on A.GroupID = B.GroupID and A.RowID between B.m1 and B.m2
group by A.GroupID

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top