stronger correlation formula

bardcan · Nov 6, 2009

I have two series of numbers. Series A contains only 1s and 0s, depending on whether a patient took a pill or not. Series B contains random numbers. The series B numbers that coincide with the patient taking a pill average 100, whereas those that coincide with NOT taking a pill average 101. There is a HUGE amount of data, so I am trying to find the formula that will show that there is a strong correlation between the two - that if the patient takes the pill, the most likely result is that their B measurement will change by 1 point. A standard correlation coefficient shows a low correlation... around 0.15. Any help would be greatly appreciated.

johnherman · Nov 10, 2009

This is more of a statistical test than data mining. Actually, a paired t-test. You have two sets of data, one with pill, one without. The null hypothesis is that the two data sets have the same mean. Then you (likely) reject that hypothesis at (say) the 10% significance level using the t-test.

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright

johnherman · Nov 10, 2009

Correction - this is not a paired t-test, it is an unpaired t-test.
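For anyone who wants to see what the unpaired test actually computes, here is a minimal sketch in Python using only the standard library. It uses Welch's form of the test (no equal-variance assumption), which is a choice made here and not something stated in the thread. The data are synthetic: the group means of 100 and 101 come from the question, but the spread of 3 and the sample sizes are assumptions, and the function name welch_t_test is hypothetical. The p-value uses a normal approximation, which is only reasonable for large samples.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_t_test(sample_a, sample_b):
    """Return (t, degrees_of_freedom, approx_two_sided_p) for an unpaired t-test.

    Welch's form: the two groups may have different variances.
    The p-value uses a normal approximation to the t distribution,
    which is only appropriate when both samples are large.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    var_a, var_b = variance(sample_a), variance(sample_b)
    se_sq = var_a / n_a + var_b / n_b  # squared standard error of the difference
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_sq)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se_sq ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, df, p


# Synthetic stand-in for series B (assumed spread and sizes, not real data).
rng = random.Random(0)
with_pill = [rng.gauss(100.0, 3.0) for _ in range(5000)]
without_pill = [rng.gauss(101.0, 3.0) for _ in range(5000)]

t_stat, dof, p_value = welch_t_test(with_pill, without_pill)
print(f"t = {t_stat:.2f}, df = {dof:.0f}, approx p = {p_value:.3g}")
```

With a large sample, even a 1-point gap between the groups gives a very large t statistic, which is the "strong" evidence the original question is after.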


lionelhill · Nov 20, 2009

johnherman, it would have been better had the study been designed properly to allow a paired t-test. The usual strategy is to give the product or a placebo to everyone at day 1, do the analysis, wait until you are certain the effects of the product have gone, then give product to all the former placebo people, and the placebo to all the product people. This way you have paired measurements, and the analysis gets much more sensitive.
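To illustrate why the paired design is more sensitive, here is a rough Python sketch on made-up crossover data. Everything about the data is an assumption for illustration: 200 patients, a personal baseline that varies between patients with a spread of 3, a measurement noise of 0.5 per visit, and a 1-point gap between product and placebo. The function name paired_t_statistic is hypothetical.

```python
import math
import random
from statistics import mean, stdev, variance


def paired_t_statistic(with_product, with_placebo):
    """Return (t, df) for a paired t-test on the per-patient differences."""
    if len(with_product) != len(with_placebo):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(with_product, with_placebo)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical crossover data: the same patients measured under both conditions.
rng = random.Random(1)
baselines = [rng.gauss(100.0, 3.0) for _ in range(200)]
product = [b + rng.gauss(0.0, 0.5) for b in baselines]
placebo = [b + 1.0 + rng.gauss(0.0, 0.5) for b in baselines]

t_paired, df_paired = paired_t_statistic(product, placebo)

# For contrast: treat the same numbers as two unrelated groups (unpaired).
n = len(product)
se_unpaired = math.sqrt(variance(product) / n + variance(placebo) / n)
t_unpaired = (mean(product) - mean(placebo)) / se_unpaired

print(f"paired t   = {t_paired:.2f} (df = {df_paired})")
print(f"unpaired t = {t_unpaired:.2f}")
```

The paired statistic comes out much larger in magnitude because subtracting within each patient cancels the patient-to-patient baseline variation, leaving only the measurement noise.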

bardcan · Nov 20, 2009

What formula would you enter to get the kind of result I'm looking for?

johnherman · Nov 23, 2009

Data mining is used to find "unknown" trends and relationships in the data. You have a hypothesis regarding a relation in the data and are seeking to prove or disprove it, or, in other words, to determine the degree of confidence with which the data supports your hypothesis. I would venture to guess that every statistical package on the market supports the t-test.


Predictor · Dec 11, 2009

"I have two series of numbers. Series A contains only 1s and 0s, depending on whether a patient took a pill or not. Series B contains random numbers. The series B numbers that coincide with the patient taking a pill average 100, whereas those that coincide with NOT taking a pill average 101. There is a HUGE amount of data, so I am trying to find the formula that will show that there is a strong correlation between the two - that if the patient takes the pill, the most likely result is that their B measurement will change by 1 point. A standard correlation coefficient shows a low correlation... around 0.15. Any help would be greatly appreciated."

The most commonly used correlation measure (Pearson's correlation) is not well-suited to this problem. I suggest that you measure two things:

1. Magnitude: The difference between the mean of variable B where variable A is 0 and the mean of variable B where variable A is 1.

and...

2. Significance: Try a t-test or bootstrap to establish that the difference between the two means is unlikely to be zero.
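The two measurements above can be sketched in Python roughly as follows. This is only an illustration on synthetic data: the group means of 100 and 101 are from the question, while the spread of 3, the 4000 rows, and the helper names (split_groups, mean_difference, bootstrap_ci, pearson_r) are assumptions. The interval is a simple percentile bootstrap, one of several ways to do it.

```python
import math
import random
from statistics import fmean


def split_groups(a_flags, b_values):
    """Split B into (values where A == 1, values where A == 0)."""
    took = [b for a, b in zip(a_flags, b_values) if a == 1]
    skipped = [b for a, b in zip(a_flags, b_values) if a == 0]
    return took, skipped


def mean_difference(a_flags, b_values):
    """Magnitude: mean of B where A == 1 minus mean of B where A == 0."""
    took, skipped = split_groups(a_flags, b_values)
    return fmean(took) - fmean(skipped)


def bootstrap_ci(a_flags, b_values, n_resamples=500, alpha=0.05, seed=0):
    """Significance: percentile bootstrap interval for the mean difference."""
    rng = random.Random(seed)
    took, skipped = split_groups(a_flags, b_values)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping group sizes fixed.
        t = rng.choices(took, k=len(took))
        s = rng.choices(skipped, k=len(skipped))
        diffs.append(fmean(t) - fmean(s))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def pearson_r(xs, ys):
    """Plain Pearson correlation, included only for comparison."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Synthetic stand-in for series A and B (assumed spread and size).
rng = random.Random(42)
series_a = [rng.randint(0, 1) for _ in range(4000)]
series_b = [rng.gauss(100.0 if a == 1 else 101.0, 3.0) for a in series_a]

observed = mean_difference(series_a, series_b)
ci_low, ci_high = bootstrap_ci(series_a, series_b)
r = pearson_r(series_a, series_b)

print(f"mean difference = {observed:.2f}")
print(f"95% bootstrap interval = ({ci_low:.2f}, {ci_high:.2f})")
print(f"Pearson r = {r:.2f}")
```

If the bootstrap interval excludes zero, the difference is unlikely to be zero, even though Pearson's r stays small. The correlation is low because a 1-point gap is small relative to the scatter within each group, not because the effect is doubtful.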

bardcan · Dec 12, 2009

Right, but how would you write this as a formula in a spreadsheet format?

johnherman · Dec 12, 2009

It's probably not worth the effort to write your own t-test within Excel. It's been done, and statistical packages are relatively cheap. Some stat packages might have Excel compatibility or plug-ins. Good luck.

