Hey, I’m Kay! This guide introduces the fundamental concepts of hypothesis testing, effect size, and power analysis, and the relationships between them, using the one-sample z-test as the running example. While the primary goal is to convey the idea behind hypothesis testing, the guide also tries to carefully derive the math behind the test, in the hope that this makes things clearer. DISCLAIMER: The one-sample z-test is rarely used in practice because of its restrictive assumptions. As a result, there are limited resources on the subject, so I had to derive most of the formulas, particularly those related to power, on my own, which increases the likelihood of errors. If you spot any inaccuracies or inconsistencies, please don’t hesitate to let me know, and I’ll make the necessary updates. Happy learning! ;)
In a one-sample z-test, our data generating process (DGP) assumes that our observations of a random variable \(X\) are independent and identically distributed (i.i.d.) draws from one distribution with mean \(\mu\) and variance \(\sigma^2\).
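Important Notation: we use \(\bar{X}\) to refer to the sample mean viewed as a random variable, and \(\bar{x}\) to refer to the sample mean observed in a fixed sample.
The sample mean is defined as below. As indicated in the previous guide, the sample mean is an unbiased estimator of the population expectation under the i.i.d. assumption.
\[ \bar{X} = \frac{1}{n} \cdot \sum^n_{i=1} X_i \]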
The expectation of the sample mean should be:
\[ \begin{aligned} E(\bar{X}) =& E\left(\frac{1}{n} \cdot \sum^n_{i=1} X_i\right) \\ =& \frac{1}{n} \cdot \sum^n_{i=1} E(X_i)\\ =&\frac{1}{n}\cdot n \cdot \mu\\ =& \mu \end{aligned} \]
and the variance of the sample mean would be:
\[ \begin{aligned} Var(\bar{X}) =& Var\left(\frac{1}{n} \cdot \sum^n_{i=1} X_i\right)\\ =& \frac{1}{n^2} \cdot \sum^n_{i=1} Var(X_i)\\ =&\frac{1}{n^2} \cdot n \cdot \sigma^2\\ =& \frac{\sigma^2}{n}\\[2ex] *\text{Since } & Var(X_1 +X_2) = Var(X_1) + Var(X_2) + 2 \cdot Cov(X_1, X_2)\\ &\text{and, under independence, } Cov(X_1, X_2) =0, \\ &Var(X_1 +X_2) = Var(X_1) + Var(X_2)\\ \end{aligned} \]
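The two results above can be checked with a small simulation. This is only a sketch under assumed values: the normal DGP, the numbers for `mu`, `sigma`, `n`, and the helper name `simulate_sample_means` are all illustrative choices, not part of the derivation.

```python
import random
import statistics

def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` samples of size `n` from N(mu, sigma^2) and return their means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]

# Assumed, purely illustrative parameters.
mu, sigma, n = 10.0, 2.0, 25
means = simulate_sample_means(mu, sigma, n, reps=4000)

# The average of the sample means should sit near mu,
# and their variance near sigma^2 / n.
print("average of sample means: ", statistics.fmean(means))
print("variance of sample means:", statistics.pvariance(means))
print("theoretical variance:    ", sigma**2 / n)
```

With more repetitions the simulated values should move closer to \(\mu\) and \(\sigma^2 / n\).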
More importantly, according to the Central Limit Theorem (CLT), even though we did not specify the original distribution of \(X\), as long as that distribution has a finite variance, the distribution of \(\bar{X}\) approaches a normal distribution as n becomes sufficiently large (rule of thumb: n > 30):
\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]
Given the nature of the normal distribution, we know the probability density function of \(\bar{X}\) would be
\[ f(\bar{x}) = \frac{1}{\sqrt{2\pi\sigma^2/n}} \cdot \exp\left(-\frac{(\bar{x}-\mu)^2}{2\sigma^2/n}\right) \]
This can be tedious to calculate, so we could standardize the normal distribution to a standard normal distribution (\(N(0, 1)\)):
\[ Z = \frac{\sqrt{n}}{\sigma} \cdot (\bar{X} - \mu) \sim N(0, 1) \]
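To see the CLT and the standardization at work on non-normal data, here is a rough simulation sketch. The Exponential(1) distribution (mean 1, standard deviation 1), the sample size, and the function name `standardized_means` are assumptions made only for this illustration.

```python
import random
import statistics
from math import sqrt

def standardized_means(n, reps, seed=1):
    """Standardized sample means of Exponential(1) draws (mu = 1, sigma = 1)."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0
    zs = []
    for _ in range(reps):
        x_bar = statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        zs.append(sqrt(n) / sigma * (x_bar - mu))  # standardize the sample mean
    return zs

zs = standardized_means(n=50, reps=4000)

# If the normal approximation is reasonable, roughly 95% of the
# standardized means should fall within +/- 1.96.
share_within = sum(abs(z) <= 1.96 for z in zs) / len(zs)
print("share of |Z| <= 1.96:", share_within)
```

The underlying data are strongly skewed, yet the standardized sample means should behave approximately like draws from \(N(0, 1)\).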
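Important Notation: Similar to \(\bar{X}\) and \(\bar{x}\), we use \(Z\) to refer to the random variable and \(z\) to refer to the observation from a fixed sample.
Also, we could get the theoretical probability of \(Z\) falling within an interval \([a, b]\) from the distribution by
\[ P(a \leq Z \leq b) = \int_a^b \frac{1}{\sqrt{2\pi}} \cdot e^{-\frac{z^2}{2}} dz = \Phi(b) - \Phi(a) \]
where \(\Phi\) denotes the cumulative distribution function of the standard normal distribution.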
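As a small numerical illustration, the standardization and the interval probability can be computed with Python's standard library. The function names `z_statistic` and `interval_probability` and the example numbers are hypothetical, chosen only for this sketch.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(x_bar, mu, sigma, n):
    """Standardize an observed sample mean: sqrt(n) / sigma * (x_bar - mu)."""
    return sqrt(n) / sigma * (x_bar - mu)

def interval_probability(a, b):
    """P(a <= Z <= b) for a standard normal Z, via the CDF."""
    std_normal = NormalDist(0, 1)
    return std_normal.cdf(b) - std_normal.cdf(a)

# Assumed example: sample mean 10.5, mean 10, sigma 2, n = 64.
z = z_statistic(x_bar=10.5, mu=10.0, sigma=2.0, n=64)
print(z)                                  # 2.0
print(interval_probability(-1.96, 1.96))  # roughly 0.95
```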
For a one-sample Z-test, we assume we know the variance parameter \(\sigma^2\) of our data generating distribution (a very unrealistic assumption, but let’s stick with it for now).
Given a sample, we also know the sample size n and the observed sample mean \(\bar{x}\) (remember we use lower case so it doesn’t get confused with the sample mean \(\bar{X}\), which we view as a random variable in our DGP).
The aim of our hypothesis testing is then, given our knowledge of \(\sigma\), n, and \(\bar{x}\), to test a hypothesis about our population mean \(\mu\). Specifically, we decide whether to reject the null hypothesis (\(H_0\)), which states that the population mean equals a hypothesized value \(\mu_{H_0}\) (written \(\mu_0\) for short):
\[ H_0: \mu = \mu_{H_0} \]
We make this decision following the logic that: if, given the null hypothesis is true, the probability of getting a sample mean \(\bar{X}\) (or its corresponding test statistic \(Z\)) that is as extreme as or more extreme than the observed sample mean \(\bar{x}\) (or its corresponding test statistic \(z\)) is smaller than some threshold (\(\alpha\)), we would rather believe the null hypothesis is not true.
The p-value represents the probability of observing a test statistic \(Z = \frac{\sqrt{n}}{\sigma} \cdot (\bar{X} - \mu_0)\) as extreme as, or more extreme than, the one computed from the sample, \(z = \frac{\sqrt{n}}{\sigma} \cdot (\bar{x} - \mu_0)\), given that the null hypothesis is true.
The threshold we set is called the significance level, denoted as \(\alpha\). As we reject the null if the p-value is below \(\alpha\), this also means that, when the null is actually true, we have a probability of \(\alpha\) of falsely rejecting it simply because our observed sample happens to be extreme (known as a Type I error).
Moreover, given the distribution under the null, \(\alpha\) corresponds to specific value(s) of z called the critical value(s), which we can denote as \(z_c\).
There are two practical ways we could conduct this hypothesis test (they are actually equivalent): we could either calculate the p-value and compare it to \(\alpha\), or compare the test statistic \(z\) with the critical value \(z_c\).
If we are concerned with the probability that our actual \(\mu\) is different from (either larger or smaller than) \(\mu_{H_0}\), we are doing a two-tailed test.
For a two-tailed test, when we refer to values that are “as extreme or more extreme” than the observed test statistic, we’re considering deviations in both positive and negative directions from zero.
Therefore, the two-tailed p-value is:
\[ p = P(|Z| \geq |z| \mid H_0) = 2 \cdot P(Z \geq |z| \mid H_0) \]
And if we are only concerned with the probability that our actual \(\mu\) is larger (or smaller) than \(\mu_{H_0}\), we are doing a one-tailed test.
For a one-tailed test, when we refer to values that are “as extreme or more extreme” than the observed test statistic, we’re considering deviations only in one direction from zero.
Therefore, the one-tailed p-value is:
\[ \begin{aligned} p =& P(Z \geq z \mid H_0) \quad \text{if we test whether } \mu \text{ is larger than } \mu_{H_0}\\ p =& P(Z \leq z \mid H_0) \quad \text{if we test whether } \mu \text{ is smaller than } \mu_{H_0} \end{aligned} \]
If the p-value is smaller than our significance level \(\alpha\) , we can reject the null.
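The p-value route can be sketched in a few lines of Python. The function name `p_value`, its `tail` labels, and the sample numbers below are hypothetical and serve only as an illustration of the definitions above.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist(0, 1)

def p_value(z, tail="two-sided"):
    """p-value of an observed z under the null, for the chosen alternative."""
    if tail == "two-sided":   # deviations in both directions count as extreme
        return 2 * (1 - STD_NORMAL.cdf(abs(z)))
    if tail == "greater":     # only large positive z counts as extreme
        return 1 - STD_NORMAL.cdf(z)
    if tail == "less":        # only large negative z counts as extreme
        return STD_NORMAL.cdf(z)
    raise ValueError("tail must be 'two-sided', 'greater', or 'less'")

# Assumed example: n = 36, sigma = 3, hypothesized mean 50, observed mean 51.2.
n, sigma, mu_0, x_bar = 36, 3.0, 50.0, 51.2
z = sqrt(n) / sigma * (x_bar - mu_0)   # about 2.4

alpha = 0.05
print("two-tailed p:", p_value(z))             # about 0.016
print("one-tailed p:", p_value(z, "greater"))  # about 0.008
print("reject H0 (two-tailed)?", p_value(z) < alpha)
```

Note that the two-tailed p-value is exactly twice the one-tailed p-value taken in the direction of the observed deviation.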
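Alternatively, we could choose not to calculate the p-value for our observed \(z\), but instead compare our \(z\) to the z value(s) corresponding to our \(\alpha\).
Under a two-tailed test, we use:
\[ P(|Z| > z_{\alpha/2} \mid H_0) = \alpha \]
The critical value \(z_{\alpha/2}\) is defined as:
\[ P(Z > z_{\alpha/2} \mid H_0) = \frac{\alpha}{2} \]
Due to the symmetry of the standard normal distribution:
\[ P(Z < -z_{\alpha/2} \mid H_0) = \frac{\alpha}{2} \]
Our decision rule then implies:
\[ \text{reject } H_0 \text{ if } |z| > z_{\alpha/2} \]
Similarly for a one-tailed test, the critical value \(z_{\alpha}\) is:
\[ P(Z > z_{\alpha} \mid H_0) = \alpha \]
Then, our conditions to reject the null hypothesis are equivalent to:
\[ \begin{aligned} z >& z_{\alpha} \quad \text{if we test whether } \mu \text{ is larger than } \mu_{H_0}\\ z <& -z_{\alpha} \quad \text{if we test whether } \mu \text{ is smaller than } \mu_{H_0} \end{aligned} \]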
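The critical-value route can be sketched as follows. The helper names `critical_value` and `reject_null` are hypothetical; the only library call relied on is `NormalDist.inv_cdf`, the inverse of the standard normal CDF.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(0, 1)

def critical_value(alpha, two_tailed=True):
    """z value that leaves alpha/2 (two-tailed) or alpha (one-tailed) in the upper tail."""
    tail_area = alpha / 2 if two_tailed else alpha
    return STD_NORMAL.inv_cdf(1 - tail_area)

def reject_null(z, alpha=0.05, tail="two-sided"):
    """Decision rule: compare the observed z with the critical value(s)."""
    if tail == "two-sided":
        return abs(z) > critical_value(alpha, two_tailed=True)
    z_c = critical_value(alpha, two_tailed=False)
    if tail == "greater":
        return z > z_c
    if tail == "less":
        return z < -z_c   # symmetry: the lower critical value is -z_c
    raise ValueError("tail must be 'two-sided', 'greater', or 'less'")

print(critical_value(0.05))                    # about 1.96
print(critical_value(0.05, two_tailed=False))  # about 1.645
print(reject_null(2.4))                        # True
print(reject_null(1.7))                        # False (two-tailed)
print(reject_null(1.7, tail="greater"))        # True (one-tailed)
```

The last two lines show why the choice of tails matters: the same \(z\) can fall short of the two-tailed critical value while exceeding the one-tailed one.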
The idea behind effect size is to calculate a statistic that measures how large the difference actually is, and to make this statistic comparable across different situations.
Our intuitive effect size in the one-sample Z-test might be \(\bar{x} - \mu_0 = \bar{x} - \mu_{H_0}\), given our hypothesized \(\mu_0 = \mu_{H_0}\).
But this statistic is not comparable across situations, as the same difference should matter more to us when the population standard deviation is very small.
So to adjust for this, we could use Cohen’s d, the magnitude of the difference between the sample mean and the hypothesized population mean, relative to the population standard deviation:
\[ d = \frac{\bar{x} - \mu_{H_0}}{\sigma} \]
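A tiny sketch of this idea, with made-up numbers: the same raw difference of 2 units gives a very different Cohen’s d depending on the population standard deviation. The function name `cohens_d` is just an illustrative choice.

```python
def cohens_d(x_bar, mu_0, sigma):
    """Difference between sample mean and hypothesized mean, in units of sigma."""
    return (x_bar - mu_0) / sigma

# Same raw difference (2 units), different population spread.
print(cohens_d(x_bar=102.0, mu_0=100.0, sigma=15.0))  # about 0.13
print(cohens_d(x_bar=102.0, mu_0=100.0, sigma=4.0))   # 0.5
```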
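The power indicates the probability that the Z-test correctly rejects the null (\(H_0: \mu = \mu_{H_0}\)). In other words, if \(\mu \neq \mu_{H_0}\), what’s our chance of detecting this difference?
Suppose the true expectation is \(\mu_{H_1}\), so the difference between the true expectation and our hypothetical expectation is:
\[ \begin{aligned} \Delta =& \mu_{H_1} - \mu_{H_0} \\ \text{or } \mu_{H_0} =& \mu_{H_1} - \Delta \end{aligned} \]
Our original statistic can be written as:
\[ \begin{aligned} Z =& \frac{\sqrt{n}}{\sigma} \cdot (\bar{X} - \mu_{H_0})\\ =& \frac{\sqrt{n}}{\sigma} \cdot (\bar{X} - \mu_{H_1} + \Delta)\\ =& \frac{\sqrt{n}}{\sigma} \cdot (\bar{X} - \mu_{H_1}) + \frac{\sqrt{n}}{\sigma} \cdot \Delta \end{aligned} \]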
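The first term of \(Z\) can be seen as the z-statistic under the true expectation \(\mu_{H_1}\); let’s denote it as \(Z'\). Because \(\mu_{H_1}\) is the true mean, \(Z'\) follows the standard normal distribution:
\[ Z' = \frac{\sqrt{n}}{\sigma} \cdot (\bar{X} - \mu_{H_1}) \sim N(0, 1) \]
Let’s define \(\delta\) as below. \(\delta\) is referred to as the non-centrality parameter (NCP) because it measures how far the distribution of \(Z\) under the true expectation is shifted away from the central (standard normal) distribution followed by \(Z'\).
\[ \delta = \frac{\sqrt{n}}{\sigma} \cdot \Delta \]
\[ Z = Z' + \delta \Rightarrow Z'=Z-\delta \]
Thus, the power would be the probability that \(Z\) falls in the rejection area or, more simply, we use \(Z' + \delta\) to replace the \(z\) in our decision rule above. For a two-tailed test:
\[ \begin{aligned} Power =& P(|Z' + \delta| > z_{\alpha/2})\\ =& P(Z' > z_{\alpha/2} - \delta) + P(Z' < -z_{\alpha/2} - \delta) \end{aligned} \]
and for a one-tailed test (testing whether \(\mu\) is larger):
\[ Power = P(Z' + \delta > z_{\alpha}) = P(Z' > z_{\alpha} - \delta) \]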
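As a hedged sketch of the power calculation, the snippet below shifts the standard normal by the non-centrality parameter \(\frac{\sqrt{n}}{\sigma} \cdot \Delta\) and adds up the probability mass landing in the rejection area. The function name `power`, its arguments, and the example numbers are assumptions for illustration, and since the formulas in this guide are self-derived, treat the output as a sanity check rather than a reference implementation.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist(0, 1)

def power(delta_mean, sigma, n, alpha=0.05, tail="two-sided"):
    """Probability of rejecting H0 when the true mean is mu_H0 + delta_mean."""
    ncp = sqrt(n) * delta_mean / sigma   # non-centrality parameter
    if tail == "two-sided":
        z_c = STD_NORMAL.inv_cdf(1 - alpha / 2)
        upper = 1 - STD_NORMAL.cdf(z_c - ncp)   # P(Z' > z_c - ncp)
        lower = STD_NORMAL.cdf(-z_c - ncp)      # P(Z' < -z_c - ncp)
        return upper + lower
    z_c = STD_NORMAL.inv_cdf(1 - alpha)
    if tail == "greater":
        return 1 - STD_NORMAL.cdf(z_c - ncp)
    if tail == "less":
        return STD_NORMAL.cdf(-z_c - ncp)
    raise ValueError("tail must be 'two-sided', 'greater', or 'less'")

# With no true difference, the rejection probability is just alpha.
print(power(delta_mean=0.0, sigma=1.0, n=30))   # about 0.05
# A true difference of half a standard deviation with n = 32.
print(power(delta_mean=0.5, sigma=1.0, n=32))   # about 0.8
# Power grows with the sample size.
print(power(delta_mean=0.5, sigma=1.0, n=64))
```

Note that `delta_mean / sigma` is the population analogue of Cohen’s d, so power depends on the data only through the effect size, the sample size, and \(\alpha\).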