Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chriss Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

SQL/Select Random records from a table 7

Status
Not open for further replies.

tamayok

Programmer
Sep 4, 2001
99
PR
I am using VFP 8 and would like to be able to obtain a RANDOM selection of X number of records from a given table.

These records need to be chosen at RANDOM so that everytime the Select Statement is executed a different result is obtained. The resulting output/cursor/table cannot contain duplicate records.

 
Stella,

If you need an exact number, it sounds like a combination of my and TheRambler's techniques should be the best bet. Assuming MyField is any character, date, datetime or numeric field in the original table:

SELECT * FROM BigTable HAVING BETWEEN(RAND(),0.0,0.15) INTO CURSOR Temp

SELECT TOP 1000 * FROM Temp ORDER BY MyField INTO TABLE Sample

Note that I have substantially increased the second parameter in BETWEEN(). That will give you around 50% more records than you need in the intermediate cursor. The number in the TOP clause in the second command represents the exact number you need.

The point is that this will be substantially faster than TheRambler's method. That's because Rambler has to sort the entire input table in order to get the top so-many. I only have to sort the intermediate cursor, which is much smaller.

If speed isn't an issue, Rambler's original code would be OK.

Mike


Mike Lewis
Edinburgh, Scotland

My Visual Foxpro web site: My Crystal Reports web site:
 
Thanks Stella. This is a problem I didn't solve before, but if I had to, I would have done with xBase code.
But yesterday (instead of doing my work) I was trying to code your solution, then I read Ramani didn't recommend the use of RECNO() with SQL, and also I read Rick's code to generate unique random numbers and I guess they inspired me!
Somehow I feel there is a faster solution, maybe in the path outlined by Mike above...
 
MikeLewis,
Your strategy pays off for those LARGE tables...

TheRambler,
You are an INMORTAL from now on!!

-------------------------------------------

PD: I've made my fir$t contribution to Tek-Tips... Great vehicle and fantastic input from all!!!
 
Mike,

You are right, two statements sometimes are faster than one complex statement (although I like theRambler's statement anyway, it's neat and readable, and doesn't require much time for adjusting arguments :) ).

What I was actually trying to do with TOP X that I mentioned didn't work correctly was a combination of your statement with TOP X in the same statement. Although at first the number of selected records was less than X required, for the sake of experiment I adjusted X, not the arguments. But my main mistake was ordering by a wrong field (used numbers, not names, and miscalculated). If I had more time to experiment, I would probably get it right.

But on a big table your method should be faster. Star for you, too.

I also like theRambler's second method, but probably wouldn't test it for speed at this time, just keep in mind.

Stella.
 
Hi, everyone,

While I was typing my answer to Mike, so many more messages appeared.

I have done some time testing and noticed an important side effect. On a very big table with SET DECIMAL set to a small number, RAND() will often return not so unique results, so even with TOP X the actual number of records selected might be more than X if they have the same value in this field. I guess, this was also part of the problem I faced when tested TOP X. SET DECIMALS as high as possible.

Here is the test I've done just now (the BigTable contains over 2 million records, in 2-step test over 3000000 was selected to the Temp cursor. Yes, 2-step approach is actually faster. Actual times vary.

Code:
SET DECIMALS TO 16    && was 4 in first test, returned > 1000 records

start1=SECONDS()

SELECT RAND() AS dummy, * ;
   FROM BigTable ;
   HAVING BETWEEN(dummy, 0.0, 0.15) ;
   INTO CURSOR Temp

SELECT TOP 1000 * ;
   FROM Temp ;
   ORDER BY dummy ;
   INTO TABLE d:\temp\Sample1

end1=SECONDS()

start2=SECONDS()

SELECT TOP 1000 RAND() AS dummy, * ;
   FROM BigTable ;
   ORDER BY dummy ;
   INTO TABLE d:\temp\Sample2

end2=SECONDS()

?end1-start1
?end2-start2

Stella
 
tamayok,

While I said in my last post to SET DECIMALS as high as possible, it would be better to find an optimal SETting for the DECIMALS, since too little wouldn't do the job more or less often, and too many will slow down the processing. The speed seems to be definitely dependent on DECIMALS setting.

HAVING BETWEEN(dummy, 0.0, 0.15) selected >300 thousand records from >2 million into intermediate cursor, BETWEEN(dummy, 0.0, 0.1) selected >200 thousand, and BETWEEN(dummy, 0.0, 0.01) - only >20 thousand, which also speeded up the processing significantly in the second query, while SET DECIMALS TO 10 was enough to do the job in my case.

Find your optimum DECIMALS and RAND() range combination.
 
Stella,

Let's say nRecCnt = reccount("BigTable") then

BETWEEN(dummy, 0.0, 0.01) selects about 1% of nRecCnt
BETWEEN(dummy, 0.0, 0.10) selects about 10% of nRecCnt
BETWEEN(dummy, 0.0, 0.15) selects about 15% of nRecCnt

so, if you want to select 'n' records,

BETWEEN(dummy, 0.0, nRecords/nRecCnt)
selects about nRecords of nRecCnt
+ + + +
vanakkam Ramani!
Have you read Tomayok PD?
We also appreciate your spirit and gestures...
 
CREATE OR REPLACE VIEW random_employees AS
SELECT DBMS_UTILITY.GET_HASH_VALUE
(TO_CHAR(dbms_utility.get_time)||ename
,2
,1048576
) "RANDOM_ORDER"
, emp.*
FROM EMP
ORDER BY RANDOM_ORDER

SQL> select ename from random_employees where rownum < 5;

ENAME

----------

WARD

KING

ALLEN

JAMES



SQL> select ename from random_employees where rownum < 5;



ENAME

----------

JONES

TURNER

WARD

CLARK


In this example I chose 1048576 for the size of the hash table as oracle recommends that the size be a power of 2. Given that the lowest value is only 2, the highest will be 1048578 (2 + 1048576). You may not have seen dbms_utility.get_time function before – this is a nice (and also pragma’d) function that returns elapsed time in 100th of a second.

When you have created the view, select from it where the rownum is less than your desired maximum number or required records. Because the hash value for each row is dependant upon the time in hundredths of a second concatenated to the surname, the hash value for each row will be different each time you select from the view.

In this way the order or the records changes with time and hence if you select only the first 4 records you will get different records each time.

Hope this helps


Heydyrtt
 
Heydyrtt,
While very interesting as a way to do it in Oracle, how does this apply to a VFP situation?

Rick
 
theRambler,

Yes, I got this part. If you want to select approximately n records,
BETWEEN(dummy, 0.0, nRecords/nRecCnt)
will work just fine.

What I was actually talking about is the case when you need an exact number of records and use either 1-step or 2-step TOP N approach ordering by results of RAND() function. As per VFP Help,
Code:
Rows with identical values for the columns specified in the ORDER BY clause are included in the query result. Therefore, if you specify 10 for nExpr, the query result can contain more than 10 rows if there are more than 10 rows with identical values for the columns specified in the ORDER BY clause.
The results returned by RAND() function will be rounded according to your DECIMALS setting. So, if you table is big, and your DECIMALS setting is small, chances are you will get many non-unique values to order by. In this case, TOP N may return more than N records, not the exact number you are looking for. (That's what I got when I experimented with TOP N, having >2mln records table and DECIMALS set to 4. At first I thought about different reasons for that behavior.) If your DECIMALS setting is too big, it will prolong the processing time. So an optimum setting should be found.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top