Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations bkrike on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Producing a frequency distribution from Oracle 8i numeric data

Status
Not open for further replies.

Custom24

Programmer
Nov 27, 2001
591
GB
Hi
This seems to me such a common thing to want to do
I've got a table with a numeric column. In this case, it is a purity value, which ranges from 0 to 100%, to two decimal places. There are a few hundred thousand rows.

What I'd like to do is produce a histogram showing the frequency distribution for each "bin" range that I define.

The first step is to get the SQL

So I'd like the following result set

0-20% - 100,000
20-40% - 200,000
40-60% - 400,000
etc.

The only way I can think to do this is

Select Count(ID) from Table Group By MyBinFunction(Purity);

where MyBinFunction would be a PL/SQL function which returns the correct bin for any numeric value. For example, for 15.56% it would return "0-20%"

This seems inelegant to me, and my question is "Is there an easier way"

All my searching on the net points my to analyzing tables for the CBO, which is not what I want, but indicates that Oracle already does something like this internally?

Thanks
Mark

Mark [openup]
 
You can do this in one query by using something like

select
sum(decode(value_field,GREATEST(LEAST(value_field,20),0),'1','0')) "0 - 20",
sum(decode(value_field,GREATEST(LEAST(value_field,40),21),'1','0')) "21 - 40",
sum(decode(value_field,GREATEST(LEAST(value_field,60),41),'1','0')) "41 - 60",
sum(decode(value_field,GREATEST(LEAST(value_field,80),61),'1','0')) "61 - 80",
sum(decode(value_field,GREATEST(LEAST(value_field,100),81),'1','0')) "81 - 100"
from table_name

or something like that, i.e. I've just typed that out so there could be typos and misakes!!

Good luck
 
Hello.
Hope you don't mind if I 'tag' a comment and question on to your question.
I used to work in SAS (another database) all the time. This program (SAS) is totally jam-packed with analysis functions and built-in reports. Getting a frequency report was one of the most fundamental features of the system (proc freq was the command, I think...not sure). So I join you in being a little surprised that so much work has to be done in Oracle to get this information. Maybe my perception just reflects my ineptitude with the Oracle product. I have to believe it is...at least to some degree.
So much for my comment...
The tag-along question is this: I understand and have used GaryMiles suggestion above, and it worked well for me. What I'm dealing with now, however, is the need in having the ability to return a dataset that contains (what Microsoft Office calls the MODE) the maximum frequency in a set of values.
What I mean is, suppose you have a listing of 500,000 trip listings for delivery company. The 500k listings have 3k unique points of origin. What we are trying to find out is, "What is the most frequent (again MODE in MS Office) range of distances travelled for each point of origin?"
So, if one point of origin has 100 deliveries within 3 miles, 300 within 10 miles, 1000 within 20, 5000 within 50, and 10,000 within 100 miles, then the number we would be looking for here is the 100 mile radius.
Ultimately, we'd like to take the top three ranges (in terms of frequency) and then run an average on those distances only.
I mean I think I can do this with Oracle, but it's going to require one heck of a lot of work.
Any enlightenment you can provide to this data pilgrim would be very much appreciated.
Many thanks.
-Mike
 
First, thanks for Gary for his suggestion - sorry for not replying sooner.
In the end, I elected to use a custom PL/SQL function to do my histogram

Code:
CREATE OR REPLACE function MyBinRange(TheValue in number, TheBinWidth in number) return number
is
Multiple number;
begin
Multiple := floor(TheValue/TheBinWidth);
return (Multiple + 0.5) * TheBinWidth;
end;
/

and then to produce histogram data, the query is
Code:
select  count(ar.ID), mybinrange(ar.purity, 5) from analysis_reviews ar group by mybinrange(ar.purity, 5)
[code]

the 5 here is the bin width, or resolution to which we want the histogram.  I used this idea rather than Gary's solution because it is more flexible.

Mike777 - without knowing the definition of the table(s) you are looking at and the relationships between them, if any, I can't help.  Are you just looking at a single column and you want to find the range of values which have the highest frequency?  In that case, you should be able to do something like this, following on from the query I've just posted here

[code]
select SubQuery.*, rownum from (
select  count(ar.ID) frequency, mybinrange(ar.purity, 5) BinRange from analysis_reviews ar group by mybinrange(ar.purity, 5) order by BinRange) SubQuery
where RowNum = 1

Mark [openup]
 
Mark -
You might also want to take a look at the cume_dist command. This will give you the cumulative distribution over a range of values, which might be useful in future requirements.
 
Just to tag on to what carp says... Oracle offers several statistical functions similar to the already mentioned cume_dist function - under the heading of "Analytical Functions". Even the lowly count function has an analytical version that returns counts of items in sliding ranges. These are documented in the SQL Reference manual.
 
I did have a look at that, but unfortunately, we're using the Standard edition, and I think windowing and analytical functions are only enabled in the Enterprise edition?
:(

Mark [openup]
 
That may be true - sometimes it's hard to determine what is available with what version of Oracle. However, if you are trying to use these functions in PL/SQL, try using dynamic SQL - that may make it work. I think these functions don't work directly in PL/SQL until 9i.
 
Nope - I was just trying to use them in straightforward SQL. It may be that some functions are enabled and others are not. Never mind. It's a good argument to get enterprise anyway (yeah right, as if we can afford it!)

Mark [openup]
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top