SQL Puzzle #... 8?: Median values 4

vongrunt · Feb 17, 2006

Overview
We all probably know wtf is median value, but anyway...

Imagine any ordered set of values, like this one:

1 3 7 8 17 23 42

Median represents "middle" value in set, or in above case 4th element - 8. For even count of elements median becomes mean (average) of two middle values. So if we remove 8, median becomes (7+17)/2 = 12.

Purpose of median value is statistical - it gives "representative" value from set, and usually does it better than average value - assuming data distribution is not too f**ed up. Visit wikipedia for more information.

Objective
The goal of this puzzle is to calculate median per group. Sorta like custom aggregate, though even good SQL2005/CLR implementation is somewhat of tricky. Here is sample gen script:

Code:

create table sampleData ( groupID int, numValue int )
insert into sampleData values ( 1,  1 )
insert into sampleData values ( 1,  2 )
insert into sampleData values ( 1,  6 )
insert into sampleData values ( 1,  16 )
insert into sampleData values ( 1,  7 )
insert into sampleData values ( 2,  3 )
insert into sampleData values ( 2,  9 )
insert into sampleData values ( 2,  23 )
insert into sampleData values ( 2,  31 )
insert into sampleData values ( 3,  10 )
insert into sampleData values ( 3,  17)
insert into sampleData values ( 3,  52 )
insert into sampleData values ( 3,  66 )
insert into sampleData values ( 4,  18 )

To make things simpler, let's assume numValue cannot be NULL. Expected result set is:

Code:

groupID medianValue
-------.-----------
      1         6.0
      2        16.0
      3        34.5
      4        18.0

Rules & restrictions
No OLAP/Analysis services allowed

You can do it with SQL2000, Yukon, whatever. Personally I'd still value 2000 code more - simply because it is harder to do.

There are no restrictions about T-SQL features used - as long as code is written set-based.

The most elegant answers (yup, subjective) get some stars...

-----
P.S. to George: what happened with points in polygon thing? Shall we continue testing?

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why
[banghead]

vongrunt · Feb 17, 2006

How 'bout ROW_NUMBER()? [wink]

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why
[banghead]

vongrunt · Feb 17, 2006

'xcuse me for some time, I found thread183-1192934 also quite challenging...

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why
[banghead]

SQLDenis · Feb 17, 2006

yep, take out distinct and use ROW_NUMBER fixes it

Code:

select avg( convert(money,numvalue)),z.groupid from (
select case when count(groupid)%2 =0 then (count(groupid)/2)else (count(groupid)/2) + 1  end as GroupStart,
case when count(groupid)%2 =0 then (count(groupid)/2)  + sign(count(groupid) +1 %2) else (count(groupid)/2)+1 end as GroupEnd,groupid 
from(
select ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY numValue ) as 'rownumberdense',* 
from (select  * from sampleData) zz
) c
group by groupid) z join (select ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY numValue ) as 'rownumberdense',* 
from (select  * from sampleData) xx
) p on z.groupid =p.groupid
and p.rownumberdense between groupstart and GroupEnd
group by z.groupid

Denis The SQL Menace
SQL blog:

http://sqlservercode.blogspot.com/

Personal Blog:

http://otherthingsnow.blogspot.com/

SQLDenis · Feb 17, 2006

and even shorter

Code:

select avg( convert(money,numvalue)),z.groupid from (
select case when count(groupid)%2 =0 then (count(groupid)/2)else (count(groupid)/2) + 1  end as GroupStart,
case when count(groupid)%2 =0 then (count(groupid)/2)  + sign(count(groupid) +1 %2) else (count(groupid)/2)+1 end as GroupEnd,groupid 
from(
select ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY numValue ) as 'rownumberdense',* 
from sampleData) c
group by groupid) z join (select ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY numValue ) as 'rownumberdense',* 
from sampleData ) p on z.groupid =p.groupid
and p.rownumberdense between groupstart and GroupEnd
group by z.groupid

Denis The SQL Menace
SQL blog:

http://sqlservercode.blogspot.com/

Personal Blog:

http://otherthingsnow.blogspot.com/

vongrunt · Feb 17, 2006

Here is another idea.

Suppose that number of items per group is N.
Find item m1 at location FLOOR((N+1)/2)
Find item m2 at location CEILING((N+1)/2)
Find average of these two values.

Or:
N1 = 1 + COUNT(*)
For each group, calculate average for items with row_numbers between N1/2 and N1/2+N1%2

In other words: global AVG aggregate on CTE with ROW_NUMBER, joined with BETWEEN on vanila COUNT(*) aggregate query.

Thoughts?

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why
[banghead]

vongrunt · Feb 17, 2006

Or in other words (in this case CTE does nothing that derived table can't do so...):

Code:

select R.groupID, AVG(1.0*R.numValue) as medianValue
from 
(	select GroupID, numValue, row_number() over(partition by groupID order by NumValue) as rowno
	from sampleData
) R
inner join
(	select GroupID, 1+count(*) as N
	from sampleData
	group by GroupID
) G
on R.GroupID = G.GroupID and R.rowNo between N/2 and N/2+N%2
group by R.groupID

I guess it should work but am not sure...

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why
[banghead]

SQLDenis · Feb 17, 2006

vongrunt, yes it works

Denis The SQL Menace
SQL blog:

http://sqlservercode.blogspot.com/

Personal Blog:

http://otherthingsnow.blogspot.com/

SQLDenis · Feb 17, 2006

Other solutions?

Denis The SQL Menace
SQL blog:

http://sqlservercode.blogspot.com/

Personal Blog:

http://otherthingsnow.blogspot.com/

vongrunt · Feb 17, 2006

How about NTILE(2)?

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why
[banghead]

gmmastros · Feb 17, 2006

I posted my SQL2000 solution, what did you guys think?

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom

vongrunt · Feb 17, 2006

I'm jus testing it. That said, how about:

Code:

create function fn_getMedian( @groupID int )
returns decimal(5, 3)
as
begin
	declare @m1 int, @m2 int, @ret decimal(5, 3)
	select top 50 percent @m1= numValue from sampleData where groupID=@groupID order by numValue asc
	select top 50 percent @m2= numValue from sampleData where groupID=@groupID order by numValue desc

	set @ret = (@m1+@m2)/2.0

	return @ret
end
go

Test:

Code:

select groupID, dbo.fn_getMedian( groupID)
from sampleData
group by groupID
order by groupID

?

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why
[banghead]

SQLDenis · Feb 17, 2006

I don't think that works, NTILE(2) will put the whole set into 2 buckets

if your set has 8 rows the first 4 will be 1 the next 4 will be 2

if your set has 7 rows the first 4 will be 1 the next 3 will be 2

if your set has 6 rows the first 3 will be 1 the next 3 will be 2

if your set has 9 rows the first 5 will be 1 the next 4 will be 2

divide set by 2 and put extra in first bucket etc etc

I don't see a real use for this function

Denis The SQL Menace
SQL blog:

http://sqlservercode.blogspot.com/

Personal Blog:

http://otherthingsnow.blogspot.com/

vongrunt · Feb 17, 2006

> I posted my SQL2000 solution, what did you guys think?

Yup, somewhat classic idea with identity table. I think it can be done without UNION. Otherwise looks OK... and works OK.

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why
[banghead]

SQLDenis · Feb 17, 2006

Let's start the award ceremony

Denis The SQL Menace
SQL blog:

http://sqlservercode.blogspot.com/

Personal Blog:

http://otherthingsnow.blogspot.com/

gmmastros · Feb 17, 2006

I tested the performance of your function based solution against my table variable solution on a table that has 10,800 records. My solution was slightly faster than yours, but not enough to be significant.

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom

vongrunt · Feb 17, 2006

> I don't think that works, NTILE(2) will put the whole set into 2 buckets

Here is my thought process:

Find m1 = MAX() value from 1st bucket.
Find m2 = MIN() value from 2nd bucket.
If count(*) per group is even number, then m2=m1
Calculate (m1+m2)/2.0

I think it's worth trying...

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why
[banghead]

SQLDenis · Feb 17, 2006

I get you, you mean buckets per partition, yes that would work

Denis The SQL Menace
SQL blog:

http://sqlservercode.blogspot.com/

Personal Blog:

http://otherthingsnow.blogspot.com/

vongrunt · Feb 17, 2006

ATM I'm trying to simplify George's code... maybe we can squeeze out some extra 10% speed or something.

And... stars for everyone.

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why
[banghead]

vongrunt · Feb 17, 2006

Hehe... how about this one:

Code:

Declare @Temp Table(RowId Integer Identity(1,1), GroupId Integer, NumValue Integer)

Insert Into @Temp(GroupId, NumValue)
Select     GroupId, NumValue
From     SampleData
Order By GroupId, NumValue

select A.GroupID, avg(1.0*A.NumValue) as medianValue
from @Temp A
inner join
(	select GroupID, floor(avg(1.0*RowID)) as m1, ceiling(avg(1.0*RowID)) as m2
	from @temp
	group by GroupID
) B 
on A.GroupID = B.GroupID and A.RowID between B.m1 and B.m2
group by A.GroupID

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why
[banghead]

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

SQL Puzzle #... 8?: Median values 4

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Similar threads

Log in

Part and Inventory Search

Sponsor