Calculate p-values for z-tests, t-tests, and chi-square tests. One-tailed and two-tailed results with instant significance interpretation at any alpha level.
A p-value is the probability of obtaining results at least as extreme as your observed data, assuming the null hypothesis is true. It is the cornerstone of statistical hypothesis testing used in medicine, psychology, economics, biology, and every quantitative field. A small p-value means your data would be unlikely to occur by chance if the null hypothesis were true — giving you evidence to reject it.
The p-value does not tell you the probability that your hypothesis is correct, nor the probability that the result was due to chance. This is the single most common misinterpretation in science. If p = 0.03, it means results this extreme would occur only 3% of the time under the null hypothesis — not that there is a 3% chance you are wrong. Your significance level (alpha) — set before data collection — determines your threshold. If p is less than alpha, reject the null hypothesis.
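To make the definition concrete, here is a minimal Python sketch (using numpy and scipy, with a made-up observed z statistic of 2.17, which happens to give p ≈ 0.03). It computes the two-tailed p-value analytically, then approximates the same quantity by simulating test statistics under the null hypothesis and counting how often they are at least as extreme as the observed one:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical observed z statistic from a two-tailed z-test
z_observed = 2.17

# Analytic two-tailed p-value: probability of a result at least
# this extreme in either direction, assuming the null is true
p_analytic = 2 * norm.sf(abs(z_observed))

# Simulate the same idea: draw test statistics under the null and
# count how often they are at least as extreme as the observed one
rng = np.random.default_rng(seed=42)
z_null = rng.standard_normal(1_000_000)
p_simulated = np.mean(np.abs(z_null) >= abs(z_observed))

print(f"analytic p  = {p_analytic:.4f}")   # ~0.0300
print(f"simulated p = {p_simulated:.4f}")  # close to 0.0300
```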
The threshold alpha = 0.05 (5%) is a convention from Ronald Fisher in the 1920s, not a universal truth. Physics requires p less than 0.0000003 (five sigma) for new particle discoveries. The FDA often requires p less than 0.025 for drug approvals (two-sided). Some psychology journals now recommend p less than 0.005 to reduce false positives. Always choose alpha before you collect data — changing it after seeing results is called p-hacking and invalidates the test.
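For reference, the five-sigma threshold can be reproduced directly from the standard normal tail, for example with scipy:

```python
from scipy.stats import norm

# One-sided tail probability beyond five standard deviations:
# the "five sigma" discovery threshold used in particle physics
p_five_sigma = norm.sf(5)
print(f"{p_five_sigma:.1e}")  # ~2.9e-07, i.e. about 0.0000003
```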
| Test | Use When | Requires |
|---|---|---|
| Z-test | Large sample (n > 30) or known population SD | Normal distribution, continuous data |
| T-test | Small sample (n ≤ 30), unknown population SD | Approximately normal data |
| Chi-square | Categorical data, independence or goodness-of-fit | Expected cell counts ≥ 5 |
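As an illustration of the three tests in the table, here is a sketch using scipy.stats with made-up data. scipy has no built-in one-sample z-test, so that one is computed by hand from the normal distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# t-test: small sample, unknown population SD (made-up data)
sample = rng.normal(loc=5.3, scale=1.2, size=15)
t_stat, t_p = stats.ttest_1samp(sample, popmean=5.0)

# z-test: large sample with known population SD; scipy has no
# one-sample z-test, so compute it from the normal distribution
known_sd, n, mu0 = 1.2, 100, 5.0
big_sample = rng.normal(loc=5.3, scale=known_sd, size=n)
z_stat = (big_sample.mean() - mu0) / (known_sd / np.sqrt(n))
z_p = 2 * stats.norm.sf(abs(z_stat))

# chi-square goodness-of-fit: observed vs. expected category counts
observed = np.array([18, 22, 29, 31])
chi_stat, chi_p = stats.chisquare(observed)  # expected: uniform by default

print(f"t-test:     t = {t_stat:.3f}, p = {t_p:.4f}")
print(f"z-test:     z = {z_stat:.3f}, p = {z_p:.4f}")
print(f"chi-square: chi2 = {chi_stat:.3f}, p = {chi_p:.4f}")
```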
A two-tailed test tests for any difference (either direction). Use this by default. A one-tailed test tests for a difference in one specific direction only — it is more powerful but must be justified by a prior directional hypothesis stated before data collection. One-tailed tests halve the p-value compared to two-tailed, which is why they're sometimes misused to push borderline results past the significance threshold.
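The halving is easy to see numerically. In this sketch (hypothetical statistic z = 1.75), the two-tailed p-value of about 0.08 would miss the 0.05 threshold while the one-tailed value of about 0.04 would clear it, which is exactly the borderline situation where misuse happens:

```python
from scipy.stats import norm

z = 1.75  # hypothetical test statistic

# Two-tailed: extreme in either direction
p_two_tailed = 2 * norm.sf(abs(z))  # ~0.0801

# One-tailed (upper): extreme in the hypothesized direction only
p_one_tailed = norm.sf(z)           # ~0.0401, half the two-tailed value

print(p_two_tailed, p_one_tailed)
```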
| P-value | Interpretation | Notation |
|---|---|---|
| p < 0.001 | Extremely significant | *** |
| 0.001 ≤ p < 0.01 | Very significant | ** |
| 0.01 ≤ p < 0.05 | Significant | * |
| 0.05 ≤ p < 0.10 | Marginal trend | † |
| p ≥ 0.10 | Not significant | ns |
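If you want this mapping in code, a small helper (purely illustrative) might look like:

```python
def significance_stars(p: float) -> str:
    """Map a p-value to the notation in the table above."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    if p < 0.10:
        return "†"
    return "ns"

print(significance_stars(0.003))  # **
print(significance_stars(0.23))   # ns
```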
The t-distribution was developed by William Sealy Gosset in 1908 (published under the pseudonym "Student") to handle small-sample statistics. It resembles the normal distribution but with heavier tails — reflecting the extra uncertainty when estimating the population standard deviation from a small sample. As degrees of freedom (df = n - 1) increase, the t-distribution converges to the standard normal. At df = 120, the difference is negligible. For small samples (df < 30), the t-distribution produces larger critical values and therefore higher p-values than the z-test for the same test statistic — this is the correct conservative behavior.
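A quick numerical check of this behavior, comparing two-tailed p-values from the t- and z-distributions for the same hypothetical statistic:

```python
from scipy.stats import norm, t

stat = 2.1  # same hypothetical test statistic under both distributions

for df in (5, 30, 120):
    p_t = 2 * t.sf(stat, df)
    print(f"df = {df:>3}: two-tailed t p = {p_t:.4f}")

p_z = 2 * norm.sf(stat)
print(f"normal:   two-tailed z p = {p_z:.4f}")
# The t p-values are larger at small df and shrink toward
# the z p-value as degrees of freedom grow
```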
The chi-square distribution is always right-skewed and defined only for non-negative values. It arises when you square standard normal variables: if Z is standard normal, then Z² follows chi-square with 1 degree of freedom. A chi-square test with df degrees of freedom tests whether observed categorical frequencies differ from expected. For a 2x2 contingency table (two binary variables), df = (rows - 1)(columns - 1) = 1. For a 3x4 table, df = 2 x 3 = 6. The chi-square statistic equals the sum of (Observed - Expected)² / Expected across all cells.
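For a contingency table, scipy computes the statistic, degrees of freedom, and expected counts in one call; the counts below are made up for illustration. Note that scipy applies Yates' continuity correction to 2x2 tables by default:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: treatment vs. outcome
table = np.array([[30, 10],
                  [20, 25]])

# For a 2x2 table, df = (2 - 1)(2 - 1) = 1
chi2, p, df, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, df = {df}, p = {p:.4f}")
print("expected counts:\n", expected)
```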
A common mistake is applying chi-square when expected cell counts are less than 5. With small expected counts, the chi-square approximation breaks down and you should use Fisher's exact test instead. Fisher's exact test calculates the exact probability of observing your contingency table or a more extreme one, without relying on distributional approximations.
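A minimal example with scipy's fisher_exact, again with made-up counts small enough that the chi-square approximation would be questionable:

```python
from scipy.stats import fisher_exact

# Small expected counts: the chi-square approximation is unreliable here
table = [[2, 7],
         [8, 2]]

odds_ratio, p = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.3f}, p = {p:.4f}")
```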
If you run 20 independent hypothesis tests at alpha = 0.05, you expect one false positive on average, even when all null hypotheses are true. This is the multiple comparisons problem. Solutions include the Bonferroni correction (divide alpha by the number of tests), the Benjamini-Hochberg procedure for controlling false discovery rate, and pre-registration of hypotheses before data collection. In genomics studies testing millions of genetic variants, researchers use alpha = 5 × 10⁻⁸ as the genome-wide significance threshold to account for this.
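Both corrections are short enough to sketch directly. The implementations below assume a plain list of p-values and follow the textbook definitions:

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0 where p < alpha / m (m = number of tests)."""
    pvals = np.asarray(pvals)
    return pvals < alpha / len(pvals)

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject H0 for the largest k with p_(k) <= (k / m) * alpha."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    thresholds = alpha * (np.arange(1, m + 1) / m)
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest passing rank (0-based)
        reject[order[: k + 1]] = True      # reject everything up to rank k
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]
print(bonferroni(pvals).sum(), "rejections under Bonferroni")        # 1
print(benjamini_hochberg(pvals).sum(), "rejections under BH")        # 2
```

As the example shows, Benjamini-Hochberg is less conservative than Bonferroni: it controls the expected proportion of false discoveries rather than the probability of any false positive at all.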