What does the p-value mean?

ncwalker · Oct 7, 2016

So I keep getting tripped up on the conclusion of a p-value.

We just had a vice presidential debate. Since my null hypothesis is my dull hypothesis, I submit the following two:

H0: The vice-presidential debate had no effect on voter decision.
HA: The vice-presidential debate changed the minds of voters.

So, assuming I do all the surveying and statistical math right, looking for a changed poll after the debate...

Case 1: I get a p < 0.05 : or "p is low, so reject H0."
I reject my null hypothesis. Does this p-value tell me:

The vice-presidential debate changed the minds of voters.

Essentially, that the alternative hypothesis is true?

I think this is a misinterpretation. The definition of the thing is "the probability of obtaining a result equal to or 'more extreme' than what was actually observed, when the null hypothesis is true."

In my example, my low p means that if the debate had no effect, the change in polling results after the debate had a low probability of occurring at random. Why does this not mean the vp debate DID have an effect?

Case 2: I get a high p-value. So I do NOT reject the null hypothesis.
The high p-value says that if my null hypothesis is true (there is no effect) then my observed change in polling results could have happened at random. In other words, I have learned nothing other than the observed change is not large enough to make a difference. Can I say:

The vice-presidential debate had no effect on voter decision.

Or can I only say I cannot tell?
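
To make the quoted definition concrete, here is a minimal sketch that estimates a p-value by brute force: it simulates many pairs of polls in a world where the debate changed nothing and counts how often the shift is at least as large as the one observed. The poll counts (240 of 500 before, 270 of 500 after), the function name and the use of two independent samples are assumptions for illustration, not data from any real poll.

```python
import random

def simulated_p_value(before_yes, after_yes, n, trials=2000, seed=1):
    """Estimate a two-sided p-value by simulating polls under H0.

    Under H0 both polls sample the same true support level, which is
    estimated here by pooling the two polls (an assumption of this sketch).
    """
    rng = random.Random(seed)
    pooled = (before_yes + after_yes) / (2 * n)
    observed_shift = abs(after_yes - before_yes)
    at_least_as_extreme = 0
    for _ in range(trials):
        # Two polls of n voters each, drawn from the same "no effect" world.
        before = sum(rng.random() < pooled for _ in range(n))
        after = sum(rng.random() < pooled for _ in range(n))
        if abs(after - before) >= observed_shift:
            at_least_as_extreme += 1
    return at_least_as_extreme / trials

# Hypothetical polls: 48% support before the debate, 54% after.
print("estimated p-value:", simulated_p_value(240, 270, 500))
```

The number printed is only the share of "no effect" worlds that produce a shift this large or larger. It is not the probability that the debate had an effect, and it says nothing about why the polls moved.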

Bev D · Oct 17, 2016

WARNING: this is a long, two-part post. The first part addresses the question asked. The second part addresses the 'tyranny of the null hypothesis ritual' on the quality profession. (I paraphrased two others in that phrase...)

There are several issues with the p value and how it is used in the "null hypothesis" construct.

First the .05 limit is arbitrary and was never really intended to be used without replication of results. (Fisher*).

Second, it is important to understand what the response factor is in this, or any, scenario as well as the sampling and test structure, particularly when dealing with sociological and other areas where replication and controlled experiments are difficult or impossible. To address the exact question that you asked, let’s assume that your response is the results of various polls (well written and from truly random and representative samples of those who are likely to vote, OR from some a priori selected balanced group of likely voters who were known to be undecided or likely to vote for one or the other candidate at the top of the ticket before the VP debate, which will give you direct before and after results; in other words, NOT from some voluntary SurveyMonkey post on Facebook). A comparison of the poll results across the 3 categories (undecided, Candidate A or Candidate B) will result in a p value that indicates how likely a change of that magnitude is to occur if there were no real change. In the world of the null hypothesis, a high p value would indicate that the magnitude of the change in the sample was likely even though there was no real change in the ‘population’. A low p value would indicate that the change was large enough to be unlikely to have occurred if there really were no change. One can then conclude that the change was statistically significant. But one cannot conclude that the debate was causal merely from the p value.

Now two things to remember:

Statistically significant doesn’t mean large or even of any practical importance.

Statistical significance of the change doesn’t mean that your theory of the cause of the change is the cause of the change.

Unfortunately, the second point is incredibly important for sociological studies that cannot be planned or replicated. The results are often confounded by many other things that are not controlled for. One might be able to discern cause through interviews about the opinions and thought processes of those polled; in other words, old fashioned scientific research instead of blind application of statistical tests. So many more things than the statistical math are critical to the reliability of the statistical test: sample size, sample selection criteria, structure of the samples (two random samples or matched pairs), choice of statistical test to match the data type (in this case categorical data) and structure of the test (random groups or matched pairs and 3 comparison categories), and what other factors are confounded with the event of interest. This is why one day coffee is bad for you and the next day it is good for you…
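
For the case of two independent random polls with three answer categories, the comparison described above can be sketched as a chi-square test of homogeneity. This is only one possible choice: a matched before/after panel would need a paired method instead. The counts and the function name are made up for illustration.

```python
import math

def chi_square_2x3(before, after):
    """Chi-square test of homogeneity for two independent 3-category polls.

    Returns (statistic, p_value). A 2 x 3 table has (2-1)*(3-1) = 2 degrees
    of freedom, and for 2 degrees of freedom the chi-square tail
    probability is exactly exp(-statistic / 2).
    """
    rows = [before, after]
    row_totals = [sum(row) for row in rows]
    col_totals = [b + a for b, a in zip(before, after)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for r, row in enumerate(rows):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, math.exp(-statistic / 2)

# Hypothetical counts: (undecided, Candidate A, Candidate B)
before_debate = [200, 400, 400]
after_debate = [150, 440, 410]
stat, p = chi_square_2x3(before_debate, after_debate)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

A small p here would only say the two polls differ by more than sampling noise would usually produce. It does not check the sampling, and it does not identify the debate as the cause.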

* Fisher said “He should only claim that a phenomenon is experimentally demonstrable when he knows how to design an experiment so that it will rarely fail to give a significant result. Consequently, isolated significant results which he does not know how to reproduce are left in suspense pending further investigation.” (from “Proceedings of the Society for Psychical Research” in 1929 as quoted by David Salsburg in his book: The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century, (p. 100). Henry Holt and Co.)

=========================================
Warning Rant Ahead: Interestingly, the way in which we use the p value and the null hypothesis ‘ritual’ was not endorsed by either of the two (opposing) forces that are often touted as championing them: Fisher and the Neyman-Pearson duo. (See “Bad statistical practice in pharmacology (and other basic biomedical disciplines): you probably don’t know P”, Michael J Lew, British Journal of Pharmacology, 2012, pp. 1959-1967 and “Mindless Statistics” by Gerd Gigerenzer, The Journal of Socio-Economics 33 (2004) pp. 587–606).

This approach was popularized by text book writers and ingrained into the common quality tool box by second generation 'six sigma' instructors. It was not endorsed by most of the Quality profession's founders and 'heroes' such as Shewhart, Deming, et al. (See “On Probability as a Basis for Action” by W. Edwards Deming, American Statistician, November 1975, Vol. 29, No. 4, pp. 146-152 https://www.deming.org/media/pdf/145.pdf ; “Shewhart-Deming Critique of Classical Statistics”, by Jonathon Siegel, originally posted on the Deming Electronic Network Web Site, available in the "Practical Quality Engineering Resources" area of this forum; Statistical Method from the Viewpoint of Quality Control by Walter Shewhart, Dover Press.)

Well I suppose that is enough to start a debate that certainly shouldn't be 'dull'

ncwalker · Oct 11, 2016

Bev D said: ↑

.... In the world of the null hypothesis, a high p value would indicate that the magnitude of the change in the sample, was likely even though there was no real change in the ‘population’. A low p value would indicate that the change was large enough to be unlikely to have occurred if there really were no statistically significant change. One can then conclude that the change was statistically significant. But one cannot conclude that the debate was causal merely from the p value.

Now two things to remember:

Statistically significant doesn’t mean large or even of any practical importance.

Statistical significance of the change doesn’t mean that your theory of the cause of the change is the cause of the change.

OK. Focusing on this. And restating my example hypotheses.

H0 (Null): The vice presidential debate had no effect on voter decision.
H1 (Alternative): The vice presidential debate changed the minds of voters.

If my p-value is low: (Reject H0!):

1) I CAN say that my sample population is a good reflector of the population at large - in other words, the statistics say I can use my sample as an estimator of the larger population. So in words I would say: "It is unlikely, with a probability of p, that this subset doesn't match the population as a whole."

2) I CANNOT say that the debate changed everyone's minds. My only conclusion is that I think statistically there was a change before and after the debate. Not that the debate was the cause. Only that my sample group has enough separation that a change (for some reason) is evident before and after the debate.

3) My overall conclusions from this study with a low p-value is that I have "grabbed" a good, representative subset of the population THAT I CAN USE FOR A CAUSAL STUDY.

If my p-value is high: (NOT accept H1, only cannot reject H0)

4) I CANNOT say the debate had no effect.

5) I CAN say there is not enough separation to tell either way.

6) My overall conclusions from the study is that the subgroup I grabbed is not effective for any sort of causal studies. This COULD be there's no effect, this could ALSO be I have grabbed a bunch of never changing party line voters in my study. (Or it contains enough of them I won't detect any results).

1-6, how correct are these statements?

Miner · Oct 11, 2016

1), 3) & 6) NO. The p-value says absolutely nothing about the quality of the sample. Look up SAMPLING BIAS. Remember the polls that overestimated Mitt Romney's ratings during his presidential run? A poor sample can bias the results in either direction.

2) Very close. The statistical difference could be 1) real due to theorized cause, 2) random chance, 3) sampling bias, 4) real, but due to another cause. I'm sure others can come up with additional possibilities.

4) & 5) True. The impact may have been real, but the noise drowned out the signal.

ncwalker · Oct 11, 2016

OK. Let me try again.

H0 (Null): The vice presidential debate had no effect on voter decision.
H1 (Alternative): The vice presidential debate changed the minds of voters.

If my p-value is low (reject H0):
My conclusion is - SOMETHING definitely happened, there was a cause/effect that cannot be attributed to random chance. (Further testing needed to establish the relationship).

If my p-value is high (cannot reject H0):
I can't draw a conclusion other than there was no statistical significance. My options are: drop it, or conduct a larger or possibly more accurate study.

Bev D · Oct 11, 2016

ncwalker said: ↑

If my p-value is low (reject H0):
My conclusion is - SOMETHING definitely happened, there was a cause/effect that cannot be attributed to random chance. (Further testing needed to establish the relationship).

Closer, but you can only say that it is unlikely that the observed difference occurred simply by chance. (the p value is the likelihood of seeing a difference at least this large simply by chance if there were no real change; the alpha risk is the cutoff you chose before sampling)

ncwalker said: ↑

If my p-value is high (cannot reject H0):
I can't draw a conclusion other than there was no statistical significance. My options are: drop it, or conduct a larger or possibly more accurate study.

The observed difference was not statistically significant. (this is where power comes in, but power must be established prior to selecting the samples) If there is a difference it is likely less than the detectable difference of your sample size.

the dilemma you are facing is not in understanding what you can and cannot claim with the p value and the null hypothesis approach. Your dilemma is that you want it to say more than it can. by itself it is a VERY weak 'tool'...
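
To put a rough number on the "detectable difference of your sample size" mentioned above, here is a sketch using the usual normal approximation for comparing two proportions. The 50% baseline, the 5% alpha and the 80% power are conventional placeholder values assumed for the example, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def detectable_difference(n_per_group, baseline=0.5, alpha=0.05, power=0.80):
    """Approximate smallest shift in a proportion that two polls of
    n_per_group respondents each would detect with the given power
    (two-sided test, normal approximation, assumed baseline proportion)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    standard_error = math.sqrt(2 * baseline * (1 - baseline) / n_per_group)
    return (z_alpha + z_power) * standard_error

for n in (500, 2000, 8000):
    shift = detectable_difference(n)
    print(f"n = {n:5d} per poll -> detectable shift of about {shift:.1%}")
```

Under these assumptions, quadrupling the sample size only halves the detectable shift, which is one reason this has to be settled before the samples are taken rather than after a disappointing result.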

Bev D · Oct 11, 2016

ncwalker said: ↑

My options are: drop it, or conduct a larger or possibly more accurate study

this is a slippery slope. it traps many many people.
If the study is important enough to do the first time, it's important enough to do it well. a well designed study involves a lot more than taking some data and putting it through some statistical software.
the whole issue with Mitt Romney was people's biased view of the voting process and their almost complete misunderstanding of statistics. Mitt was always within the 'margin of error' of Obama. (in p value terms the p value was high) So people who wanted him to win kept believing that he was going to win. the key here was replication (it's always replication...and homogeneity). EVERY credible poll had him on the losing side of Obama. true it was always within the margin of error of the poll so it was 'close'. Even Mitt thought he was going to win on election night. He - and others - thought that if they just had a bigger sample (everyone who was going to vote) the 'sampling error' would be eliminated. but it was always on the losing side, always. when you replicate the same result over and over and over, the margin of error of a single poll doesn't apply.

think of it this way. there is a 50% chance you will flip a head on a fair coin with a fair flip. every time you flip that coin there is a 50% chance that you will flip a head. that is always true. but it isn't the whole truth. If you flip a coin 5 times and get a head every time, what do you think the odds REALLY are that you will flip a head again on the 6th flip? If you have a two headed coin the odds are 100%. If you have a fair coin, the probability of getting 5 heads in a row is only 3.1%, and 6 in a row only 1.6%, so the run itself is evidence about which coin you are holding...this is conditional probability. A quote from Deming: "Levels of significance furnish no measure of belief in a prediction. Probability has use; tests of significance do not."
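
The coin argument can be worked through numerically. This sketch assumes, purely for illustration, a 1% prior chance that the coin is two-headed; that prior and the function names are not from the post.

```python
def prob_heads_run(n_flips, p_head=0.5):
    """Probability of n_flips heads in a row for a coin with P(head) = p_head."""
    return p_head ** n_flips

def posterior_two_headed(n_heads, prior_two_headed):
    """Bayes' rule: P(coin is two-headed | n_heads heads in a row)."""
    evidence_two_headed = prior_two_headed * 1.0
    evidence_fair = (1 - prior_two_headed) * prob_heads_run(n_heads)
    return evidence_two_headed / (evidence_two_headed + evidence_fair)

def prob_next_head(n_heads, prior_two_headed):
    """P(next flip is a head | n_heads heads so far), mixing both coin types."""
    posterior = posterior_two_headed(n_heads, prior_two_headed)
    return posterior * 1.0 + (1 - posterior) * 0.5

print("P(5 heads in a row | fair coin):", prob_heads_run(5))
print("P(6 heads in a row | fair coin):", prob_heads_run(6))
# Hypothetical prior: 1 coin in 100 is two-headed.
print("P(two-headed | 5 heads):", round(posterior_two_headed(5, 0.01), 3))
print("P(head on 6th flip | 5 heads):", round(prob_next_head(5, 0.01), 3))
```

If the coin is known to be fair, the sixth flip is still 50/50. The odds only move once you allow that the run of heads is telling you something about the coin, which is the conditional probability point.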

so when you get a result you don't like and it looks 'close' the answer isn't to do a different study. the answer is to design a better study for your next question. otherwise you keep chasing your bias.

Miner · Oct 11, 2016

Part of the problem is that today we are trying to stretch the null hypothesis test way beyond what it was originally designed to do, which was to deal with small samples. It is part of the iterative inductive-deductive experimentation process and cannot be separated from that without profound risk. In that iterative process, you develop a hypothesis and run an experiment to test that hypothesis. If the results of the experiment are not significant, you modify your hypothesis. If they are significant you run a series of follow up experiments to validate the hypothesis. Many of the studies today that end up debunked fail to go beyond the "statistically significant" result and declare early victory only to be shamed in Andrew Gelman's blog. Bottom line, the hypothesis test was never intended to be the final say. It was just more information on which to base further experiments.

ncwalker · Oct 11, 2016

OK. I am going to do some reading.

But I have one more question for today ... why do we even bother with an alternative hypothesis statement at all? I think many of us that go down the wrong p-value road are thrown by this second hypothesis.

Because it is there (H1 - the debate had an effect), when one says the null hypothesis is rejected, it brings with it the assumption that the alternative hypothesis must therefore be true. It is the old symbolic logic problem of the difference between an "if ... then ..." statement and an "if and only if ... " statement.

To me, it would make more sense were there JUST the null hypothesis.

H: The debate had no effect on people's minds.

p<alpha, no, there was some effect. The polling results are unlikely to have been random.
p>alpha, there was nothing statistically significant about the result. No conclusion.

Why even mention H1?

Bev D · Oct 12, 2016

Why do we bother with an alternative hypothesis? Well, the null hypothesis ritual stems from old Greek logic. But there is NO law that says you have to follow it. Many thought leaders are now advocating for what Deming and Shewhart advocated for in the last century. There are MANY alternative theories and all of them matter.
I stopped teaching the Null hypothesis thing years ago. None of my students ever understood it and it isn't necessary. I've solved hundreds of complex problems, validated process changes, designed new processes and products, established process capability and reduced variation and never used the p value or the null hypothesis thingy.

Face it: the reason it doesn't make sense to you is that it isn't a very useful model. Miner's description of the iterative approach (famously explained by George Box) is a useful model. And you can use p values for it - or not. But you don't need to think about the null and alternate hypothesis at all...
You might find two presentations I've posted in the resources section somewhat helpful: A Scientific Approach to determining root cause and Power Tools for Six Sigma. They involve an approach that doesn't require the null or p values...

Miner · Oct 12, 2016

Several reasons: First, it helps clarify your thinking. Second, it helps to formulate whether you need a 1-tailed or 2-tailed test. If you are very careful in how you word the null hypothesis, it might not be necessary, but many times the alternative is worded more clearly.

ncwalker · Oct 12, 2016

OK. Both of you are making sense to me.

Miner: After thinking about it, I agree that it needs to be formulated. In any sort of modeling, you must ensure that the underlying assumptions either apply or can be neglected without effect. And I can see where the formulation of the alternative is sort of a back check to make sure the null is stated in a way that satisfies the p-value model.

BevD: I also agree. I too have solved many problems without the p-value. But I wish to understand it because it is brought up. And if it disagrees with other findings, I need to be able to evaluate if it is a valid p-value (arrived at properly) or an invalid one.

I have watched "p-value extravaganza" a dozen times now, very carefully. And what's interesting to me is that it is a monotonic response. That if I did an experiment and got a high p-value, if I kept replicating the experiment, I would eventually get a p-value under alpha because, when the null hypothesis is true, the likelihood of getting any p-value is uniform. In other words, if I selected my hurdle at the standard 0.05, and did my experiment 100 times with no real effect present, I would expect about 5 of them to have a p-value under my hurdle and 95 would have a high p. Right away, this tells me that replication is needed. Which is kind of common sense.
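
The "5 out of 100" expectation can be checked with a quick simulation. This sketch assumes the simplest possible setup (normally distributed data, known standard deviation of 1, a two-sided z-test of "mean = 0") so that it runs with the standard library alone; the experiment count, sample size and names are arbitrary choices for the example.

```python
import random
from statistics import NormalDist

def null_p_values(experiments=2000, n=30, seed=7):
    """Run many experiments in which H0 (true mean = 0) really holds and
    return the two-sided z-test p-value of each one."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(sample) / n) * n ** 0.5  # sigma is known to be 1 here
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values

p_values = null_p_values()
share_below_005 = sum(p < 0.05 for p in p_values) / len(p_values)
share_below_050 = sum(p < 0.50 for p in p_values) / len(p_values)
print(f"share of p-values below 0.05: {share_below_005:.3f}")
print(f"share of p-values below 0.50: {share_below_050:.3f}")
```

Every "significant" result in this simulation is a false alarm by construction, which is the practical argument for replicating before believing a single low p-value.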

If a vendor comes along and presents me with a new surface coating claiming that my parts will go longer without oxidizing, I can see doing a test where I coat 50 parts in current and 50 in new and measure the time until oxidizing appears. I can see taking this mean time and doing a t-test. And if my t statistic is greater than my tcrit I say "This is statistically significant...." (I then have to make the financial decision ... is the new coating also WORTH it...) And if my t statistic is the other way, I look at the new vendor and say "Sorry, it's not making much of a difference compared to what I am doing." And I CAN calculate a p-value for this test.

Though I am not sure what it tells me ...
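
As a rough sketch of the coating comparison, the following computes Welch's t statistic from summary numbers and then approximates the two-sided p-value with the normal distribution, which is close but slightly optimistic at about 98 degrees of freedom (a statistics package would use the t distribution itself). The means and standard deviations are invented for the example, as are the function names.

```python
import math
from statistics import NormalDist

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and approximate degrees of freedom for B minus A."""
    var_a = sd_a ** 2 / n_a
    var_b = sd_b ** 2 / n_b
    t = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df

def approx_two_sided_p(t):
    """Large-sample (normal) approximation to the two-sided p-value."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

# Hypothetical hours until oxidation: current coating vs. new coating.
t_stat, df = welch_t(400.0, 60.0, 50, 430.0, 60.0, 50)
p_approx = approx_two_sided_p(t_stat)
print(f"t = {t_stat:.2f}, df = {df:.0f}, approximate p = {p_approx:.4f}")
```

Comparing t to tcrit and comparing p to alpha are the same decision. What the p-value adds is a sense of scale: if the two coatings truly had the same mean life, a gap at least this large would turn up in roughly that share of repeated tests. It says nothing about whether the gain is worth the money.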

Miner · Oct 12, 2016

ncwalker said: ↑

If a vendor comes along and presents me with a new surface coating claiming that my parts will go longer without oxidizing, I can see doing a test where I coat 50 parts in current and 50 in new and measure the time until oxidizing appears. I can see taking this mean time and doing a t-test. And if my t statistic is greater than my tcrit I say "This is statistically significant...." (I then have to make the financial decision ... is the new coating also WORTH it...) And if my t statistic is the other way, I look at the new vendor and say "Sorry, it's not making much of a difference compared to what I am doing." And I CAN calculate a p-value for this test.

I would add another step to this. If it looks favorable from a time and cost perspective, run a confirmation pilot to validate before committing.

Bev D · Oct 12, 2016

ncwalker said: ↑

And if it disagrees with other findings, I need to be able to evaluate if it is a valid p-value (arrived at properly) or an invalid one.

let's try this: read "On Probability as a basis for Action" then let's see what questions you have.

ncwalker said: ↑

And I CAN calculate a p-value for this test. Though I am not sure what it tells me ...

I think Fisher and Deming said it pretty well. Unfortunately too many people have fallen for the old "bright line" requirement. a p-value of .049 is no different than a p value of .051. the 3 of us keep circling around the same thing: p values, RPNs, Cpk values, R&R numbers. it's really all the same. the bright line is a seductive concept but "the calculation of mathematical formulas is no substitute for thinking". Replication, iteration, understanding the homogeneity of the process, study design, science, understanding simple probability and plotting your data: these are the critical elements we need to understand.

I understand that people keep hitting you with the p value and the NH thingy, but the only way to understand those weaknesses and be able to articulate them to others is to study and understand the alternate approaches.

Englishman Abroad · Oct 12, 2016

Bev,

Thanks for citing “Bad statistical practice in pharmacology (and other basic biomedical disciplines): you probably don’t know P”, Michael J Lew, British Journal of Pharmacology, 2012, pp. 1959-1967.

At least in the Automotive industry we don't care that we abuse the core foundational statistical concepts, just as long as the RPNs, Cpks, and R&Rs are green then that is good and stop looking...

Seriously to all (Bev, Miner & NC) thanks for some interesting discussion, I am going to have to see my 6sigma sensei again.

Miner · Oct 12, 2016

Bev D said: ↑

let's try this: read "On Probability as a basis for Action"

Bev, Your link has an extra %20 in front that prevents it from working. When I strip that off the link works.

Bev D · Oct 12, 2016

Thanks Miner - I didn't notice that...strange

ncwalker · Oct 13, 2016

It IS annoying. This is statistics, which is based in PROBABILITY, not certainty. If I know two sides of a triangle and the angle between them, I can calculate all the other sides and angles with CERTAINTY. This doesn't work like that. They want that bright line. Years ago I had an OEM SQ looking at me telling me he wanted the sheet tracking all my suppliers to be green - like his other suppliers give him. I looked at him and said - if someone gives you a tracking sheet that has THIS MANY complicated parts on it and it is all "green" (no problems), they are LYING to you.

He still wanted it green. So, we colored it green and made him happy and were forced to maintain two trackers. We have gotten intellectually very lazy as a society.

ncwalker · Oct 13, 2016

@Englishman Abroad: Yes. I LOVE these types of discussions that make me question my own understanding of a concept. If you don't do this, you don't have a lot of confidence in your understanding. One of my favorite Twain quotes is: "In religion and politics people's beliefs and convictions are in almost every case gotten at second-hand, and without examination."

I wonder if he would have added "statistics" to that list (or perhaps science) today ...

Miner · Oct 13, 2016

Bev D said: ↑

Warning Rant Ahead: Interestingly, the way in which we use the p value and the null hypothesis ‘ritual’ was not endorsed by either of the two (opposing) forces that are often touted as championing them: Fisher and the Neyman-Pearson duo. (See “Bad statistical practice in pharmacology (and other basic biomedical disciplines): you probably don’t know P”, Michael J Lew, British Journal of Pharmacology, 2012, pp. 1959-1967 and “Mindless Statistics” by Gerd Gigerenzer, The Journal of Socio-Economics 33 (2004) pp. 587–606).

This approach was popularized by text book writers and ingrained into the common quality tool box by second generation 'six sigma' instructors. It was not endorsed by most of the Quality profession's founders and 'heroes' such as Shewhart, Deming, et al. (See “On Probability as a Basis for Action” by W. Edwards Deming, American Statistician, November 1975, Vol. 29, No. 4, pp. 146-152 https://www.deming.org/media/pdf/145.pdf ; “Shewhart-Deming Critique of Classical Statistics”, by Jonathon Siegal, Deming Electronic Network Web Site, http://deming.ces.clemson,edu/pub/den/deming_siegal1.htm ; Statistical Method from the Viewpoint of Quality Control by Walter Shewhart, Dover Press.)

Well I suppose that is enough to start a debate that certainly shouldn't be 'dull'

I just read the article referenced above. It was an excellent read. While having heard of the disagreement between the two factions, I had never seen it explained so clearly. Fisher's philosophy really resonated with me and closely matches my actual practice, whereas I have always been uncomfortable with the arbitrariness of the 0.05 threshold, particularly if you do not temper it with risk based thinking (i.e., 0.1 vs. 0.05 vs. 0.01) and follow up with a validation experiment.

I agree that the textbook authors popularized the Neyman-Pearson approach (corrupting it in the process by garbling the correct interpretation of the p-value), but this has been going on for much longer than Six Sigma has been around. I think it was already firmly implanted in the psyche before Six Sigma came along. Look at its prevalence in the soft sciences. In defense of "reputable" Six Sigma training organizations, they do emphasize running confirmation experiments, which safeguards against a lot of mistakes. However, there are a lot of schlocky Six Sigma organizations out there which do not.
