INTELLIGENT WORK FORUMS FOR COMPUTER PROFESSIONALS
Come Join Us!
Are you a Computer / IT professional? Join Tek-Tips now!
- Talk With Other Members
- Be Notified Of Responses
To Your Posts
- Keyword Search
- One-Click Access To Your
Favorite Forums
- Automated Signatures
On Your Posts
- Best Of All, It's Free!
*Tek-Tips's functionality depends on members receiving e-mail. By joining you are opting in to receive e-mail.
Partner With Us!
"Best Of Breed" Forums Add Stickiness To Your Site

(Download This Button Today!)
Feedback
"...keep up the good work with this forum, I think this
is the best one around. ...you actually try to help
people learn for themselves. ...I commend you on providing a
very good, open learning atmosphere, where usually egos are left behind..."
Geography
Where in the world do Tek-Tips members come from?
|
Count unique codes by group
|
|
|
cte1 (TechnicalUser) |
16 Aug 10 14:29 |
I am trying to count the number of unique codes in a group. Each group consist of multiple rows and the codes are pipe delimited in each cell. A sample of my dataset looks like this:
GROUP CODES 1 |231|322|414| 1 |231|322| 2 |231| 2 |231|114| 2 3 3
So in GROUP 1 there are 3 unique codes (231, 322, and 414) and in GROUP 2 there are 2 unique codes (231 and 114) There are 0 codes in GROUP 3. Can anyone tell me how to get SAS to give me an output dataset like this:
Group Count 1 3 2 2 3 0
Thanks in advance.
Chris |
|
You just count the number of "|" and subtract 1?
data x; set master; rcount =(length(Name)-length(compress(Name,"|"))) - 1; run;
compress removes the '|' from whatever the variable is leaving you with the number of '|'. Since there is an extra '|' just subtract one. There would be other ways to do this, but this was the easiest I could think of. |
|
Oh, and if some don't have the '|', you could do the index function to make sure and then use an If, then statement.
So, basically:
if index(name,'|') > 0 then rcount = (statement above); else 0; |
|
In SAS 9 there is a countw function that counts the 'words' in your string. All you have to do is set the delimiter to the pipe '|' char. ex CODEtotal = countw(codes,'|'); Hope this helps.. Klaz |
|
|
cte1 (TechnicalUser) |
19 Aug 10 15:18 |
Thanks for the help. I really appreciate it. That worked great for counting the number of codes within each cell. The problem is I need to know the number of unique codes across multiple observations (or rows). See example above. Any ideas?
|
|
I figured that you would want the unique codes count too (and I was curious) so I experimented with some code. CODEdata YouroutputDS(keep=outstr wrdtot agestr total_unq); set inputds end = last; *** INITIALIZE YOUR OUTPUT STRING VAR ***; length outstr $500; *** SET UP AN ARRAY WITH 5000 FIELDS ***; array strsort (*) $500 strs1-strs5000 ; *** COUNT THE ITEMS IN YOUR STRING (CODES) ***; wrdtot = countw(CODE,'|'); do i=1 to wrdtot; strsort(i) = scan(CODE,i,'|'); end; *** THIS IS A SORT FUNCTION ***; *** THIS FUNCTION SHOULD HAVE A UNIQUE OPTION ***; *** WE SHOULD PETITION SAS :) ***; call sortc(of strsort (*)); *** CHECK FOR UNIQUE CODES ***; flag = 0; do i=1 to dim(strsort); *** WE DONT CHECK ON THE FIRST POSITION ***; if i gt 1 and trim(strsort(i)) ne '' then do; if trim(strsort(i))= trim(strsort(i-1)) then do; *** BLANK OUT ALL DUPS ***; strsort(i)=''; call sortc(of strsort (*)); end; end; end;
CALL CATX('|', OF strs1-strs5000); outstr = strsort(1); total_unq = countw(outstr,'|'); run; I hope this helps... Klaz |
|
|
cte1 (TechnicalUser) |
20 Aug 10 9:17 |
Thanks Klaz. I ran the code, however it is only giving the number of unique codes within each cell. For example, if my dataset looks like this:
CODE |191| |191| |187|189|191|278|934| |187|189|191|278| |191|278|
When I run the code I get a dataset that looks like this:
outstr wrdtot total_unq |191| 2 1 |191| 2 1 |187|189|191|278|934| 6 5 |187|189|191|278| 5 4 |191|278| 3 2
It is giving the total unique codes within each cell but not the total unique codes accross the group. In this case there are 6 unique codes within this group (or within these 5 rows).
I really appreciate your help. Do you have any other ideas?
Thanks, Chris |
|
|
cte1 (TechnicalUser) |
20 Aug 10 9:25 |
Sorry, there was an error in my last post. The example should look like this.
CODE |190| |191| |187|189|191|278|934| |187|189|191|278| |191|278|
When I run the code I get a dataset that looks like this:
outstr wrdtot total_unq |190| 2 1 |191| 2 1 |187|189|191|278|934| 6 5 |187|189|191|278| 5 4 |191|278| 3 2
It is giving the total unique codes within each cell but not the total unique codes accross the group. In this case there are 6 unique codes within this group (or within these 5 rows).
Thanks again, Chris |
|
Yes the code only gives you the unique count in each cell (record). If you wanted to combine rows and then get calculate the unique codes you have two options. 1) Use the retain statement and 'grow' your code string until all codes are in one record. Then run the script you have.
2) Use Proc transpose to transpose your data values into a vertical structure (1 column for all your values by groupID) then use Proc SQL to count with a distinct option.
Personaly, I would use the second option as the first may run out of room if you have a few hundred codes in a group.
Klaz |
|
|
cte1 (TechnicalUser) |
20 Aug 10 9:45 |
Thanks Klaz. I will try that. |
|
Here is an example of the Proc Transpose method. CODEdata f2; set your_ds end = last; array strsort (*) $500 strs1-strs5000 ; wrdtot = countw(code,'|'); do i=1 to wrdtot; strsort(i) = scan(code,i,'|'); end; run; *** DATA MUST BE SORTED BY ID VAR ***; proc transpose data =f2 out =f3; by id; var strs1-strs5000; run;
*** COUNTS NON-MISSING VALUES ***; proc sql; create table f4 as select id , count(distinct col1) as unq_tot from f3 group by id; quit; I hope this helps you. Klaz |
|
|
cte1 (TechnicalUser) |
24 Aug 10 16:57 |
Hi Klaz,
First, thanks so much for you help. The code you posted ran fine but it is only giving me the unique (or distinct) count for col1, which is not the distinct count across the entire group. In the sql statement I need the distinct count across all of the columns (i.e. col1, col2, col3, etc.). Any ideas?
Thanks again,
Chris |
|
I thought that you wanted the unique total across all your columns. If you don't all you have to do is in the PROC SQL step add a group by varname.
Look at the transposed dataset and you will find a column that has the original transposed variable name. Add that column to both the select and group statements.
select id, thetransposedcolumnname, count(distinct col1) as...
group by id, thetransposedcolumnname
change the above variable name to the correct one (I can't recall off-hand what that name is)
klaz |
|
|
cte1 (TechnicalUser) |
25 Aug 10 16:31 |
Hi Klaz, Thanks again for getting back to me. I do want the unique total across all columns. The code is not giving that result. Here is the code that you sent that I ran: data f2; set cltest end = last; array strsort (*) $500 strs1-strs5000 ; wrdtot = countw(code,'|'); do i=1 to wrdtot; strsort(i) = scan(code,i,'|'); end; run; proc sort data=f2; by id; run; proc transpose data =f2 out =f3; by id; var strs1-strs5000; run; *** COUNTS NON-MISSING VALUES ***; proc sql; create table f4 as select id , count(distinct col1) as unq_tot from f3 group by id; quit; The input data file, cltest, is available here: http://rapidshare.com/files/415124745/cltest.sas7bdatFor example there are 11 unique CODES for ID 601 in the cltest dataset. The codes are 178, 181, 186, 278, 204, 816, 208, 212, 214, 176, and 202. However the dataset f4 only shows 9 unique codes for ID 601, as it is only counting the unique observations in the first transposed column (col1). Sorry for so many questions. This stuff seems so easy for you. If you have any ideas to fix this problem it would really help me out a lot. Again, I really appreciate all of your help. thanks, chris |
|
|
kdt82 (Programmer) |
27 Aug 10 9:44 |
Hi Chris, Hopefully this is what you are looking for (one of many ways): First transposing, then getting the unique group/counts and using proc summary to perform the counts. The completetypes and classdata options allow proc summary to retrieve the groups that have 0 codes. CODEdata have ; infile cards missover ; input group codes :$20. ; cards ; 1 |231|322|414| 1 |231|322| 2 |231| 2 |231|114| 3 3 3 ;;;;; proc print;run; data long (keep=group code); set have ; do _n_=1 to countW(codes,'|')-1; code =input(scan(codes,_n_,'|'),best.) ; output; end ; run; proc sort data=long out=unique nodupkey ; by group code ; run; proc summary data=unique missing completetypes classdata=have nway ; class group ; output out= want(keep=group count) n(code)=count ; run; proc print;run; |
|
|
cte1 (TechnicalUser) |
30 Aug 10 11:11 |
That worked perfectly! Thank you so much.
Chris |
|
|
 |
|