Tek-Tips is the largest IT community on the Internet today!


Scanning values from within a file 2

Status
Not open for further replies.

devarshi77

Programmer
Oct 31, 2004
10
US
Hi, I'm new here and need some guidance, as I don't want to swim any longer!
The problem is that I have created a file in R and I want to read p-values from it. The file holds 1000 simulation results. Someone told me that C can do this effortlessly.
This is the final aim of the project I'm working on:

I want to read the file and get back the list of p-values smaller than 0.05.


The file output looks like this (1000 such blocks):
------
One sample Kolmogorov-Smirnov Test of Composite Normality

data: y
ks = 0.1062, p-value = 0.5
alternative hypothesis: True cdf is not the normal distn. with estimated parameters
sample estimates:
mean of x standard deviation of x
0.7940168 0.8786753
------------
If anyone can show me a way out of this, I'd appreciate it.

best
 
I am assuming this is a text file. Try forum68 with your question; I believe you can parse the file into either Access or Excel to do the sorting/selection you want.
 
Hello devarshi77,

I hope this script can help you as a quick-development stopgap. Save it as a .vbs file and double-click it to run. The p-values will be extracted to a text file, one occurrence per line.
Code:
infilespec="d:\test\abc.txt"    [blue]'edit to point to your data file[/blue]
outfilespec="d:\test\def.txt"   [blue]'edit to point to p-value file[/blue]
 
set fso=createobject("scripting.filesystemobject")
if not fso.fileexists(infilespec) then
    set fso=nothing : wscript.echo "Data file does not exist. Operation aborted."
    wscript.quit 9
end if
sdata=fso.opentextfile(infilespec,1,false).readall

spattern="\bp-value\s*=\s*[0-9]+(\.[0-9]+)?\s"
set regex=new regexp
with regex
	.pattern=spattern
	.ignorecase=true
	.global=true
end with
sout=""
set matches=regex.execute(sdata)
for each match in matches
    snum=trim(replace(replace(match,"p-value",""),"=",""))
    if sout<>"" then sout=sout & vbcrlf & snum else sout=snum
next
set regex=nothing

set ots=fso.opentextfile(outfilespec,2,true)
ots.write sout
ots.close : set ots=nothing
set fso=nothing
regards - tsuji
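For readers without Windows Script Host, the same extraction can be sketched in Python. The file paths and function names below are placeholders, and the regex mirrors the pattern in the script above:

```python
import re

# Regex mirroring the VBScript pattern: matches "p-value = 0.5" style
# tokens (case-insensitive) and captures just the number.
PVALUE_RE = re.compile(r"\bp-value\s*=\s*([0-9]+(?:\.[0-9]+)?)", re.IGNORECASE)

def extract_pvalues(text):
    """Return every p-value found in the report text, as strings."""
    return PVALUE_RE.findall(text)

def extract_file(infilespec, outfilespec):
    """Read the whole report file and write one p-value per line."""
    with open(infilespec) as f:
        pvalues = extract_pvalues(f.read())
    with open(outfilespec, "w") as f:
        f.write("\n".join(pvalues))
```

Calling extract_file(r"d:\test\abc.txt", r"d:\test\def.txt") would then reproduce what the VBScript writes.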
 
tsuji,

Since you are reading the thread, what about a conversion to a comma-delimited file so that an import into Excel or Access is easier? The block could be restructured into record fields like this:


data: y
ks = 0.1062
p-value = 0.5
description: "alternative hypothesis: True cdf is not the normal distn. with estimated parameters"
mean of x = 0.7940168
standard deviation of x = 0.8786753

With these six record fields, the output would be a CSV-format file something like:
[tt]
"y","0.1062","0.5","alternative hypothesis: True cdf is not the normal distn. with estimated parameters","0.7940168","0.8786753"
[/tt]

The user would have to make certain that the data did not contain a comma (likely in the description field), but a find-and-replace with a semicolon prior to converting should handle that issue.

I believe you could then directly import the file into Excel or Access.

No? It seems to me a more flexible plan of attack, and it would allow the use of charting and other features of the Office applications.
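As a sketch of that plan, assuming every block keeps the six-line layout shown earlier, Python's csv module can emit fully quoted rows (its quoting also protects any embedded commas in the description field, so the semicolon workaround becomes unnecessary). The field order follows the six-field record proposed above:

```python
import csv
import io
import re

# One report block in the layout shown earlier in the thread.
SAMPLE = """One sample Kolmogorov-Smirnov Test of Composite Normality

data: y
ks = 0.1062, p-value = 0.5
alternative hypothesis: True cdf is not the normal distn. with estimated parameters
sample estimates:
mean of x standard deviation of x
0.7940168 0.8786753
"""

def block_to_record(block):
    """Pull the six proposed fields out of one report block.
    Assumes the layout shown above; returns a list of strings."""
    data = re.search(r"data:\s*(\S+)", block).group(1)
    ks = re.search(r"\bks\s*=\s*([\d.]+)", block).group(1)
    p = re.search(r"p-value\s*=\s*([\d.]+)", block).group(1)
    desc = re.search(r"(alternative hypothesis:.*)", block).group(1)
    # The last line holds mean and standard deviation, whitespace-separated.
    mean, sd = block.strip().splitlines()[-1].split()
    return [data, ks, p, desc, mean, sd]

def blocks_to_csv(blocks):
    """Write one CSV row per block, quoting every field."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    for block in blocks:
        writer.writerow(block_to_record(block))
    return buf.getvalue()
```

Running blocks_to_csv over all 1000 blocks would yield a file ready for direct import into Excel or Access.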



 
Hi bcastner,

If the data blocks are structured as shown, parsing them into comma-delimited form is relatively easy for the 1st and 2nd lines. The 3rd and 4th serve as description headers and can easily be tolerated as further entries. The most annoying part is the 5th and 6th lines, where headers and data are presented column-wise.

If it were not for the structure of the 5th and 6th lines, we could easily transform a block into comma-delimited or similar form, ready for further Office processing.

That said, transforming the full data block can still be done relatively easily, given that the 5th line is always the same and the 6th line's data are space- or tab-delimited.

So I totally agree with you.

I would say it is only a matter of "will" and "utility". I extracted the single useful datum the poster wanted, for quick (and hopefully not totally dirty) development/analysis. As you know, this kind of simulation output for statistical analysis can quickly be discarded in favor of one hypothesis over another, so an extract for quick analysis makes some sense. I don't know...

regards - tsuji
 
tsuji,

That makes sense.

I do not know whether these values are tab-delimited or the output file is using spaces:

mean of x standard deviation of x
0.7940168 0.8786753

The delimiter could be either as you suggest.
 
Hi tsuji and bcastner,
Thanks for getting this started; I was stuck!! Anyway, I tried the scan syntax in S-Plus 2000, and that should hopefully work. The problem is that I have scant knowledge of programming, hence this dilemma!!

The scan syntax in S-Plus 2000 looks like this (I'm sure you have seen something like it):

------------
scan(file="", what=numeric(), n=<<see below>>, sep=<<see below>>,
multi.line=F, flush=F, append=F, skip=0, widths=NULL,
strip.white=<<see below>>)

--------------------

Is it possible to read the p-values somehow and then apply some conditional statement that would output all the p-values less than 0.05 (the significance level), in any file format?
And one more thing: as I don't have C, I could not work with the code you turned in... I'm hoping it will work!!

best
dev
 
Yes, the simulation gets stored as output in the form of a text file.


take care
 
The only suggestion I can make is to avoid this in the output file:

y,0.1062,0.5,"alternative hypothesis: True cdf is not the normal distn. with estimated parameters","0.7940168",0.8786753

The comma should be the delimiter, and all values should be enclosed in "" marks. A CSV file with mixed quoting just does not work.

As in my earlier comment:

"y","0.1062","alternative hypothesis: True cdf is not the normal distn. with estimated parameters","0.7940168","0.8786753"

Then this is an importable CSV. You decide the column types in Excel, or the field types in Access.

 
devarshi77,

If you can tell me what final form of filtered (p < 0.05) data you want to squeeze out of the output file of 1000+ blocks (or fewer), I can help you get it via a purpose-built script. Are you willing to try?

In the previous script, it can already be done by modifying this segment.
Code:
for each match in matches
    snum=trim(replace(replace(match,"p-value",""),"=",""))
    [blue]if cdbl(snum)<0.05 then[/blue]
        if sout<>"" then sout=sout & vbcrlf & snum else sout=snum
    [blue]end if[/blue]
next
The output text file will then contain only the p-values, no other data. If you are happy with that, the script is already done. Try it.

- tsuji
 
tsuji,
I am willing to try it. The output should look like this:

"The null hypothesis was rejected n times"

or

"The power of the test is "

based on the condition that whenever the p-value is less than 0.05, the null hypothesis gets rejected. Actually, the aim of the exercise is to calculate the power of the test.

That would be 'total number of null hypotheses rejected / total number of runs'.

If the null hypothesis is rejected, say, 109 times out of 1000 simulations, then the power would be 0.109.

Low power implies the test isn't any good, and that's what I'm working on!
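That arithmetic can be sketched in Python (the p-value lists here are illustrative):

```python
def power_of_test(pvalues, alpha=0.05):
    """Estimate test power: the fraction of simulation runs whose
    p-value falls below the significance level alpha."""
    rejected = sum(1 for p in pvalues if p < alpha)
    return rejected / len(pvalues)
```

With 109 p-values below 0.05 out of 1000 runs, this returns 0.109, matching the example above.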

Well, tsuji, is that C code that you turned in?
And is it possible to import it into R or S-Plus and run it? I don't have any access to C...


Best, and thanks a ton,

dev
 
And yes, I could import it into Excel and sort the p-values into one column.
Is there a macro that would calculate the power of the test in Excel?
Any inputs will be deeply appreciated!!

best
 
devarshi77,

The script should in large part accomplish what is needed. My only worry is that you have not yet run it and reported back what is lacking.

This is the revised script to produce the report.
Code:
const plimit=0.05    'This is the reject/accept limit

infilespec="d:\test\abc_data.txt"    'edit to point to your data file

set fso=createobject("scripting.filesystemobject")
if not fso.fileexists(infilespec) then
	set fso=nothing : wscript.echo "Data file does not exist. Operation aborted."
	wscript.quit 9
end if

outfilespec=fso.getparentfoldername(infilespec) & "\" & _
	fso.getbasename(infilespec) & "_summary"
if fso.getextensionname(infilespec)<>"" then
	outfilespec=outfilespec & "." & fso.getextensionname(infilespec)
end if
	
sdata=fso.opentextfile(infilespec,1,false).readall

spattern="\bp-value\s*=\s*[0-9]+(\.[0-9]+)?\s"
set regex=new regexp
with regex
	.pattern=spattern
	.ignorecase=true
	.global=true
end with
sout=""
set matches=regex.execute(sdata)
isamsize=matches.count
isam=0 : ireject=0
for each match in matches
	isam=isam+1
	snum=trim(replace(replace(match,"p-value",""),"=",""))
	snum=trim(replace(snum,chr(13),""))
	if cdbl(snum)<plimit then	'reject criteria (CDbl, not CCur: currency rounds to 4 decimal places)
		sout=sout & isam & vbtab & snum & vbcrlf
		ireject=ireject+1
	else
		sout=sout & isam & vbtab & vbtab & snum & vbcrlf
	end if
next
set regex=nothing

if isam<>0 then power=ireject/isam else power="n/a"
sout="The power of the test : " & power & vbcrlf & vbcrlf & sout
sout="The null hypothesis was rejected : " & ireject & vbcrlf & sout
sout="Sample size : " & isam & vbcrlf & sout

set ots=fso.opentextfile(outfilespec,2,true)
ots.write sout
ots.close : set ots=nothing
set fso=nothing
If the data file produced by your statistical package is:
[tt] d:\test\abc_data.txt[/tt]
the script will automatically produce the summary file:
[tt] d:\test\abc_data_summary.txt[/tt]
with a structure like this:
[tt]
Sample size : 1500
The null hypothesis was rejected : 300
The power of the test : 0.2

1 0.5232
2 0.04999
3 0.015
4 0.5
5 etc...
[/tt]
- tsuji
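For readers outside Windows, the same summary can be sketched in Python; the report layout mirrors the script above, and the 0.05 threshold and tab columns are kept:

```python
import re

# Matches "p-value = 0.5" style tokens and captures just the number.
PVALUE_RE = re.compile(r"\bp-value\s*=\s*([0-9]+(?:\.[0-9]+)?)", re.IGNORECASE)

def summarize(report_text, plimit=0.05):
    """Build the same summary as the VBScript: sample size, rejection
    count, power, then one numbered line per p-value (rejected values
    in the first tab column, accepted values in the second)."""
    pvalues = PVALUE_RE.findall(report_text)
    nreject = sum(1 for p in pvalues if float(p) < plimit)
    power = nreject / len(pvalues) if pvalues else "n/a"
    lines = [
        "Sample size : %d" % len(pvalues),
        "The null hypothesis was rejected : %d" % nreject,
        "The power of the test : %s" % power,
        "",
    ]
    for i, p in enumerate(pvalues, 1):
        if float(p) < plimit:
            lines.append("%d\t%s" % (i, p))
        else:
            lines.append("%d\t\t%s" % (i, p))
    return "\n".join(lines)
```

Writing summarize(open(r"d:\test\abc_data.txt").read()) to a file would produce the same report structure as shown above.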
 
tsuji,

Thanks for the script, but will it run in VB opened from Excel?

devarshi
 
devarshi77,

Not without further effort.

- tsuji
 
tsuji,

Thanks for all you have done. I will run it and let you know how it goes!!

best

devarshi
 