Yup. Same subject, 3rd FAQ
Tricks and methods found in this FAQ are recycled from thread183-1193016. First we'll explain yet another way how to calculate single median value, then how to calculate median per group without cursors and dynamic SQL - and finally how to exploit SQL2005 ranking functions.
Sample data
Code:
CREATE TABLE sampleData ( groupID int, numValue int )
INSERT INTO sampleData VALUES ( 1, 1 )
INSERT INTO sampleData VALUES ( 1, 2 )
INSERT INTO sampleData VALUES ( 1, 6 )
INSERT INTO sampleData VALUES ( 1, 16 )
INSERT INTO sampleData VALUES ( 1, 7 )
INSERT INTO sampleData VALUES ( 2, 5 )
INSERT INTO sampleData VALUES ( 2, 5 )
INSERT INTO sampleData VALUES ( 2, 5 )
INSERT INTO sampleData VALUES ( 2, 11 )
INSERT INTO sampleData VALUES ( 3, 10 )
INSERT INTO sampleData VALUES ( 3, 17 )
INSERT INTO sampleData VALUES ( 3, 52 )
INSERT INTO sampleData VALUES ( 3, 66 )
INSERT INTO sampleData VALUES ( 4, 18 )
INSERT INTO sampleData VALUES ( 5, 0 )
INSERT INTO sampleData VALUES ( 5, 0 )
Five groups of rows = five median values to calculate at once. For purpose of this testing we assumed numValue cannot be NULL.
Single median value
Let's take 1st group (groupID = 1). To calculate median value we need last value from first half and first value from second half, while taking care about odd count of elements. Looks like TOP 50 PERCENT works exactly as desired:
Code:
DECLARE @groupID int; SET @groupID = 1
DECLARE @M1 int, @M2 int
SELECT TOP 50 PERCENT @M1 = numValue FROM sampleData WHERE groupID = @groupID ORDER BY numValue ASC
SELECT TOP 50 PERCENT @M2 = numValue FROM sampleData WHERE groupID = @groupID ORDER BY numValue DESC
SELECT (@M1+@M2)/2.0
Credits go to donutman who posted that trick on TT long time ago.
Median values per group (SQL Server 2000 and above)
Simple: wrap above code into user-defined function:
Code:
CREATE FUNCTION fn_getMedian( @groupID int )
RETURNS decimal(5, 3)
AS
BEGIN
DECLARE @m1 int, @m2 int, @ret decimal(5, 3)
SELECT TOP 50 PERCENT @m1= numValue FROM sampleData WHERE groupID=@groupID ORDER BY numValue ASC
SELECT TOP 50 PERCENT @m2= numValue FROM sampleData WHERE groupID=@groupID ORDER BY numValue DESC
SET @ret = (@m1+@m2)/2.0
RETURN @ret
END
GO
... and then use it as custom aggregate:
Code:
SELECT groupID, dbo.fn_getMedian( groupID)
FROM sampleData
GROUP BY groupID
ORDER BY groupID
Although this code looks fairly elegant, scanning rows ASC/DESC combined with sequential nature of UDFs (called for each new groupID) may cause excessive amount of I/O reads (aka: query will run a bit slower on larger data set).
Another way is to select all rows into identity table and then use identity "row numbers" to get "middle" rows:
Code:
DECLARE @temp TABLE( rownum int identity(1, 1), GroupID int, numValue int )
INSERT INTO @temp( groupID, numValue )
SELECT groupID, numValue
FROM SampleData
ORDER BY groupID, numValue
SELECT A.groupID, AVG(1.0*A.NumValue) AS medianValue
FROM @temp A
INNER JOIN
( SELECT groupID, FLOOR(AVG(1.0*rownum)) AS m1, CEILING(AVG(1.0*rownum)) AS m2
FROM @temp
GROUP BY groupID
) B
ON A.groupID = B.groupID AND A.rownum BETWEEN B.m1 AND B.m2
GROUP BY A.groupID
Alternatively you can use temp #table instead of table @variable. Weird as it may sound, this works faster on larger sets of data - at least in lab conditions where tempdb is not tortured by concurrent users/processes.
Median values per group (SQL Server 2005 only)
Yukon provides several new ways to calculate median. Arguably the most obvious one is with ROW_NUMBER() function:
Code:
SELECT R.groupID, AVG(1.0*R.numValue) AS medianValue
FROM
( SELECT GroupID, numValue, ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY NumValue) AS rowno
FROM sampleData
) R
INNER JOIN
( SELECT GroupID, 1+count(*) as N
FROM sampleData
GROUP BY GroupID
) G
ON R.GroupID = G.GroupID AND R.rowNo BETWEEN N/2 AND N/2+N%2
GROUP BY R.groupID
Note that RANK() and DENSE_RANK() cannot be used for this purpose due to possible duplicate numValues in group. But NTILE() can be used, though code is not so elegant:
Code:
SET ANSI_WARNINGS OFF
SELECT groupID, CASE WHEN cnt%2=1 THEN t1max ELSE (t1max+t2min)/2.0 END AS medianValue
FROM
( SELECT groupID, COUNT(numValue) AS cnt,
MAX(CASE WHEN tile=1 THEN numValue END) AS t1max,
MIN(CASE WHEN tile=2 THEN numValue END) AS t2min
FROM
( SELECT groupID, numValue, NTILE(2) OVER( PARTITION BY groupID ORDER BY numValue ) AS tile
FROM sampleData
) T
GROUP BY T.groupID
) G
ORDER BY groupID
FYI method with ROW_NUMBER() is by far and large the fastest of all mentioned.
Revision history:
------------------
2006/02/21 - initial draft