
SQL Challenge 1


ESquared

Programmer
Dec 23, 2003
6,129
US
Compete to write the most efficient query that, for each key, returns a column's value taken from the row with the most recent date.

I will run your queries against a 359,000 row table I have on a production server. This table records certain information about certain occurrences.

CREATE TABLE DoneHistory (
   Thing char(20) NOT NULL,
   TimeDone datetime NOT NULL,
   PlaceDone char(10) NOT NULL,
   OtherInfo1 char(10) NULL,
   OtherInfo2 char(10) NULL,
   OtherInfo3 char(10) NOT NULL
)

The table has two nonclustered indexes but no clustered index. I'm willing to test against other tables that do have clustered indexes but this one simply does not (and don't ask me why, I don't know).

Indexes:

Thing
Thing, OtherInfo3

Expected output:

Thing, TimeDone (max per Thing), PlaceDone (from the same row as the TimeDone).

I do have a particular query in mind that beat the pants off the other ones I tried, so I wanted both to share it with you and to see if anyone else can find it.
 
version 1

Code:
SELECT a.*, t.PlaceDone
FROM (SELECT Max(TimeDone), Thing
      FROM DoneHistory
      GROUP BY Thing) AS a (MaxTimeDone, Thing)
JOIN DoneHistory t ON t.Thing = a.Thing
                  AND t.TimeDone = a.MaxTimeDone
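For anyone who wants to poke at the pattern outside SQL Server, here's a minimal sketch of the same join-back approach in Python with an in-memory SQLite table. The sample data and TEXT dates are stand-ins, not the real 359,000-row DoneHistory:

```python
import sqlite3

# Tiny SQLite sketch of the classic join-back pattern: find each Thing's
# max TimeDone in a derived table, then join back to fetch PlaceDone.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DoneHistory (Thing TEXT, TimeDone TEXT, PlaceDone TEXT);
INSERT INTO DoneHistory VALUES
  ('widget', '2003-01-05 10:00', 'NYC'),
  ('widget', '2003-06-20 09:30', 'LON'),
  ('gadget', '2003-03-15 14:00', 'PAR');
""")

rows = con.execute("""
SELECT a.Thing, a.MaxTimeDone, t.PlaceDone
FROM (SELECT Thing, Max(TimeDone) AS MaxTimeDone
      FROM DoneHistory
      GROUP BY Thing) AS a
JOIN DoneHistory t
  ON t.Thing = a.Thing AND t.TimeDone = a.MaxTimeDone
ORDER BY a.Thing
""").fetchall()
print(rows)  # each Thing, its latest TimeDone, and that row's PlaceDone
```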

I wonder if using exists would make a difference

Denis The SQL Menace
--------------------
SQL Server Code,Tips and Tricks, Performance Tuning
SQLBlog.com, Google Interview Questions
 
I suspect there's going to be a bit of creativity involved here. My first thoughts were exactly what Denis posted. The real question is... can we do better than the stock answer?

I assume that changing indexes and/or table structure is outside the scope of this challenge.

Let's put a deadline on this, shall we? Let's say... Thursday. Of course, ESquared, this is your challenge and you make the rules. I'm just thinking that anything beyond the standard answer may take a little while to accomplish.



-George

"The great thing about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
I'll modify answers to give just the correct columns, but technically your answers shouldn't be using *. :)

Thursday is a good deadline. My pants-kicking answer on Friday.

Should I tell you the approximate difference in reads, CPU, and duration for my final query? That might spur some ideas, but it might be considered cheating or poisoning the challenge. :)
 
If you like, I'll add indexes. There's only about 60 MB of data so I'm willing to copy it to another table so I can modify it at will. Of course all tests will be done on the same server so comparisons will be valid. So throw in your index creation script.

Here's the thinking that got me to the query I found. Don't read this if you want your mental space untainted by my ideas (which might be good, you might find your own way).

In MS Access, there are the aggregate functions First and Last. With those we could do this query very easily:

[tt]SELECT
Thing,
Last(TimeDone), --or max, either way works in this query
Last(PlaceDone)
FROM
DoneHistory
GROUP BY
Thing
ORDER BY
TimeDone -- but not legal in SQL Server to order by this aggregated column
-- , PlaceDone (not necessary but could make results deterministic per a given set of data)[/tt]

First and Last only select a known record when there is an ORDER BY clause; otherwise they return a row dependent on the order in which the engine encountered the rows while reading.

By flipping the sort order in the ORDER BY clause, you can swap every First aggregation for Last, and vice versa.

I understand that SQL Server can't provide these aggregate functions because ordering in SQL Server always comes last, after all rows are materialized (except in TOP queries, or where the sort order is already correct and can be exploited, such as from an index or from a sort in the data stream used to make a nested loop possible--or was it a merge join that requires sorting?). To do Last or First you'd need an aggregate function, which comes much earlier in the processing of the query.

So my thoughts were... how do I simulate that? How do I get the data I need in the Max() pass without having to join back to get it again?

 
Shhhh !!!

I'm trying to think here!

-George

"The great thing about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
By the way, it's SQL 2000, though I'm willing to try SQL 2005-specific code (I'll find some suitable table somewhere).

It appears that the duration on my queries is unreliable, as this is a server that is in use (though I am being careful not to impact anyone negatively). Durations for my "improved" query tend to be a little longer than the generic one's, even 80% longer, but sometimes half the duration. So I'll have to run this on a server I have exclusive access to in order to get better duration metrics. CPU and reads should be all the story we need, though, and reads are where this really shines.

And here is a hint if anyone wants one:

There are no subqueries or derived tables in this solution. In one pass with one query (and no variables or temp tables), I return the desired result set. There's only one read in the execution plan.
 
How 'bout this?

Good thread, E^2!

Code:
select Thing
     , Max(TimeDone) TimeDone
     , Substring(Max(convert(varchar(16), TimeDone, 121) + PlaceDone), 17, 10) PlaceDone
from DoneHistory
group by Thing
order by Thing
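The same pack/unpack idea can be sketched against a toy SQLite table in Python. Here the TEXT dates are already 16-character sortable strings, standing in for convert(varchar(16), TimeDone, 121), so PlaceDone starts at offset 17:

```python
import sqlite3

# SQLite sketch of the character pack/unpack trick: prepend a fixed-width
# sortable date string to PlaceDone, take Max() of the concatenation, then
# slice PlaceDone back out. Sample data only; widths are assumptions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DoneHistory (Thing TEXT, TimeDone TEXT, PlaceDone TEXT);
INSERT INTO DoneHistory VALUES
  ('widget', '2003-01-05 10:00', 'NYC'),
  ('widget', '2003-06-20 09:30', 'LON'),
  ('gadget', '2003-03-15 14:00', 'PAR');
""")

rows = con.execute("""
SELECT Thing,
       Max(TimeDone)                          AS TimeDone,
       substr(Max(TimeDone || PlaceDone), 17) AS PlaceDone
FROM DoneHistory
GROUP BY Thing
ORDER BY Thing
""").fetchall()
print(rows)  # one row per Thing, PlaceDone from the latest-dated row
```

Because the date prefix is fixed-width and sorts lexically in date order, the Max() of the packed string belongs to the row with the latest date, and the suffix is that row's PlaceDone.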

[small]----signature below----[/small]
I'm pushing an elephant up the stairs

My Crummy Web Page
 
You got it, Alex!

George actually had the packing and unpacking principle before you did, sent to me in chat, though you caught that the Max(TimeDone) can be done without having to unpack it as well.

Some significant CPU and duration gain can be had by converting the date to binary instead of char...
 
So here's my answer.

This is complicated by NULLs and by the potential for there being more than one equal max(date) for each key.

So part of this challenge is up to the interpretation of the individual person. We may not all get the exact same recordsets. But as long as the results are approximately right (for each key, one or more values that share the maximum date, or NULL if there is none), I consider it a win. Each query could be modified to achieve the same results as the others with some coalesces in strategic places.

So the classic query is as Denis said:

Code:
SELECT
   L.Thing,
   L.TimeDone,
   L.PlaceDone
FROM
   DoneHistory L
   INNER JOIN (
      SELECT
         Thing,
         TimeDone = Max(TimeDone)
      FROM DoneHistory
      WHERE TimeDone IS NOT NULL AND PlaceDone IS NOT NULL
      GROUP BY Thing
   ) X ON L.Thing = X.Thing AND L.TimeDone = X.TimeDone
WHERE L.TimeDone IS NOT NULL AND L.PlaceDone IS NOT NULL
ORDER BY
   L.Thing
And here's my new idea, which is to "pack" the correlated row value into the Max() function and extract it afterward:

Code:
SELECT
   Thing,
   TimeDone = Max(TimeDone),
   PlaceDone =
      Convert(char(10),
         Substring(Max(Convert(binary(8), TimeDone) + Convert(binary(10), PlaceDone)), 9, 10)
      )
FROM DoneHistory L
WHERE L.TimeDone IS NOT NULL AND L.PlaceDone IS NOT NULL
GROUP BY Thing
ORDER BY Thing
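Why the binary pack sorts correctly: a big-endian encoding of the timestamp compares byte-by-byte in the same order as the timestamps themselves, so Max() over date-bytes + place-bytes picks the row with the latest date. A Python sketch of the idea (the 8-byte big-endian integer layout here is an assumption for illustration, not SQL Server's internal datetime format):

```python
import struct
from datetime import datetime, timezone

# Pack a timestamp as 8 big-endian bytes followed by a fixed-width place.
# For non-negative timestamps, byte-wise comparison of the prefix matches
# numeric comparison of the timestamps, so max() over packed values picks
# the row with the latest date.
def pack(ts: datetime, place: str) -> bytes:
    date_part = struct.pack(">q", int(ts.timestamp()))  # 8 bytes, big-endian
    return date_part + place.encode().ljust(10)         # 10-byte place field

rows = [
    (datetime(2003, 1, 5, 10, 0, tzinfo=timezone.utc), "NYC"),
    (datetime(2003, 6, 20, 9, 30, tzinfo=timezone.utc), "LON"),
]
packed_max = max(pack(ts, place) for ts, place in rows)
place_done = packed_max[8:18].decode().rstrip()  # "unpack" bytes 9..18
print(place_done)  # place from the row with the latest date
```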
This actually takes a little bit longer in duration than the first query for the table in question, but it has half the reads and 60-80% of the CPU. So that makes it a clear winner (this is a production server so the duration is not reliable anyway, but the CPU and reads are a good measure of real cost to the server).

I also ran the same style queries against a much larger 1.7 million row table, and had the surprising results of 1/12th the reads for the "packed" method! For some bizarre reason this huge table also has no clustered index, but even with a clustered index it will be able to do fewer reads, by definition.

So, final results:
[tt]Query                       CPU    Reads  Duration
Simple (359K table)        2516    30757      5797
Packed (359K table)        1859    16685      6297
Simple (1.7M table)        8328  2316863     16640
Packed (1.7M table)        3938   186900     16094
Simple, 3 columns (1.7M)   8281  2317314     15000
Packed, 3 columns (1.7M)   5238   188277     12610[/tt]

The 3 columns thing was exactly that: returning 3 columns from the max row per group instead of 1. I tried 2 methods, but they had no appreciable difference between them (packing the columns once in a derived table and unpacking that value three times, versus a flat query that packed all three columns in each of three Max() calls and unpacked just the desired column from each).
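A rough sketch of the flat three-column variant in Python (the fixed widths of 16 for the date and 10 per column are assumptions): pack everything into one sortable string, take the max once, and slice each column out by offset.

```python
# Pack a 16-char sortable date plus three 10-char columns into one string;
# max() picks the latest-dated row, then each column is sliced out by offset.
rows = [
    ("2003-01-05 10:00", "NYC", "a1", "b1"),
    ("2003-06-20 09:30", "LON", "a2", "b2"),
]
packed = max(t + p.ljust(10) + o1.ljust(10) + o2.ljust(10)
             for t, p, o1, o2 in rows)
place, other1, other2 = (packed[16:26].rstrip(),   # chars 17-26
                         packed[26:36].rstrip(),   # chars 27-36
                         packed[36:46].rstrip())   # chars 37-46
print(place, other1, other2)  # all three columns from the latest-dated row
```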

I wouldn't use this technique just any old place; I'd probably reserve it for critical situations where that last little bit of performance was really important. And I'd document the code well, probably including in comments a query that was functionally the same but easier to understand.

[color=black]For SQL and technical ideas, visit my blog, Squared Thoughts.

The best part about anything that has cheese is the cheese.[/color]
 