Sampling theory

As an introduction to sampling theory, consider the problem of estimating the average IQ of students attending the University of Liverpool. To test the entire group, or population, of students would take too long. Instead, it is decided that tests should be handed out to a sample of the student population. From the sample, results regarding the population can be statistically inferred. The reliability of the survey depends on whether the sample is properly chosen.

IQ scores range between 0 and 200. The set of all possible scores can be represented by the sample space \(\Omega = \{0, 1, 2, ..., 200\}\). Let the variable \(X(\omega) = \omega\) represent a particular outcome after completing a test. Clearly \(X\) is a discrete random variable. An alphabetical roll call of students is used to select a systematic sample. The list is first split into groups of \(k\) students (where \(k\) is an integer greater than 1). If the population size, \(N\), is not a multiple of \(k\), the last group is smaller than \(k\). Next a random integer, \(r\), between \(0\) and \(k - 1\) is chosen. Students are included in the sample if their position in the roll call is congruent to \(r\) modulo \(k\).
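
A minimal sketch of this selection scheme in Python (the student names, the function name `systematic_sample`, and the choice \(k = 3\) are illustrative assumptions, not part of the text):

```python
import random

def systematic_sample(roll_call, k):
    """Select a systematic sample: a random start r in 0..k-1,
    then every k-th student thereafter."""
    r = random.randrange(k)   # random integer with 0 <= r <= k - 1
    return roll_call[r::k]    # positions congruent to r modulo k

# Hypothetical roll call of N = 10 students; since 10 is not a
# multiple of k = 3, the sample size varies with r (4 when r = 0).
students = ["Adams", "Baker", "Chen", "Davies", "Evans",
            "Fisher", "Gray", "Hughes", "Irwin", "Jones"]
print(systematic_sample(students, k=3))
# e.g. ['Baker', 'Evans', 'Hughes'] when r = 1
```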

Let the size of the sample be \(n\). On receipt of \(n\) tests, each student in the sample is assigned a score, \(x_{i}\), in the range \(0\) to \(200\) where \(x_{i}\) is the value of a random variable \(X_{i}\). The sample mean is a random variable defined by

$$ \overline{X} = \frac{1}{n} \sum_{i = 1}^{n} X_{i} $$

whose value is

$$ \overline{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}. $$
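
As a concrete instance of this formula, here is a minimal Python computation for a small hypothetical sample of \(n = 5\) scores (the values are made up):

```python
# Hypothetical scores x_1, ..., x_5 from a sample of n = 5 students.
scores = [102, 115, 98, 124, 107]
x_bar = sum(scores) / len(scores)   # (1/n) * sum of the x_i
print(x_bar)                        # 109.2
```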

\(X_{1}, ..., X_{n}\) are independent random variables, each with the same distribution function as the population, which has mean \(\mu\) and variance \(\sigma^{2}\). By linearity of expectation, the expected value of the sample mean, \(\mathbf{E}(\overline{X})\), is the population mean, \(\mu\), because

$$ \begin{align*} \mathbf{E}(\overline{X}) &= \frac{1}{n}\left(\sum_{i=1}^{n}\mathbf{E}(X_{i})\right) \\ &= \frac{1}{n}(n\mu) \\ &= \mu. \end{align*} $$
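
This can also be checked by simulation. The sketch below assumes, purely for illustration, a normally distributed population with \(\mu = 100\) and \(\sigma = 15\) (the result holds for any population with mean \(\mu\)); it averages the sample means over many repeated samples:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 100.0, 15.0, 25      # assumed population parameters
trials = 100_000

# Each row is one sample of size n; average the per-sample means.
sample_means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
print(sample_means.mean())          # close to mu = 100
```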

Furthermore, since \(X_{1}, ..., X_{n}\) are independent, each with variance \(\sigma^{2}\), the variance of \(\overline{X}\), \(\text{Var}(\overline{X})\), is

$$ \begin{align*} \mathbf{E}\left((\overline{X} - \mu)^{2}\right) &= \text{Var}\left(\frac{1}{n}\sum_{i = 1}^{n} X_{i}\right) \\ &= \frac{1}{n^2}\sum_{i = 1}^{n}\text{Var}(X_{i}) \\ &= \frac{1}{n^2} n \sigma^{2} \\ &= \frac{\sigma^{2}}{n}. \tag{19} \end{align*} $$

As the sample size increases, the variation, or scatter, of the sample means tends to zero.
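
The \(\sigma^{2}/n\) scaling in (19) can be observed the same way; the normal population and parameter values below are again assumptions made only for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, trials = 100.0, 15.0, 100_000

for n in (5, 50, 500):
    sample_means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
    # Empirical Var(X_bar) versus the theoretical sigma^2 / n.
    print(n, round(sample_means.var(), 3), sigma**2 / n)
```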

From (5), the random variable \(S^{2}\) giving the sample variance is

$$ S^{2} = \frac{1}{n} \sum_{i = 1}^{n}(X_{i} - \overline{X})^{2}. $$

It turns out that the sample variance, \(S^{2}\), is not an unbiased estimator of the population variance, \(\sigma^{2}\): \(S^{2}\) underestimates \(\sigma^{2}\) by a factor of \((n - 1)/n\), so that

$$ \mathbf{E}(S^{2}) = \frac{n - 1}{n} \sigma^{2}. \tag{20} $$
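
A simulation makes the bias in (20) visible. NumPy's `np.var` with its default `ddof=0` computes exactly the divide-by-\(n\) statistic \(S^{2}\) defined above; the population parameters are, as before, illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 100.0, 15.0, 10
trials = 200_000

samples = rng.normal(mu, sigma, size=(trials, n))
s2 = samples.var(axis=1, ddof=0)    # S^2: divides by n
print(s2.mean())                    # close to (n - 1)/n * sigma^2
print((n - 1) / n * sigma**2)       # = 202.5 for these parameters
```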

The proof of (20) is as follows. Consider the term \(X_{i} - \overline{X} = (X_{i} - \mu) - (\overline{X} - \mu)\). Then, \((X_{i} - \overline{X})^{2} = (X_{i} - \mu)^{2} - 2(X_{i} - \mu)(\overline{X} - \mu) + (\overline{X} - \mu)^2\) and so

$$ \begin{align*} \sum_{i = 1}^{n}(X_{i} - \overline{X})^2 &= \sum_{i=1}^{n}(X_{i} - \mu)^{2} - 2(\overline{X} - \mu)\sum_{i=1}^{n}(X_{i} - \mu) + \sum_{i=1}^{n}(\overline{X} - \mu)^{2} \\ &= \sum_{i=1}^{n}(X_{i} - \mu)^{2} - 2n(\overline{X} - \mu)^{2} + n(\overline{X} - \mu)^{2} \\ &= \sum_{i=1}^{n}(X_{i} - \mu)^{2} - n(\overline{X} - \mu)^{2}, \tag{21} \end{align*} $$

where the second line uses \(\sum_{i=1}^{n}(X_{i} - \mu) = n(\overline{X} - \mu)\).
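
As a numerical sanity check, identity (21) holds exactly (up to floating-point rounding) for any sample; the data below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = 100.0
x = rng.normal(mu, 15.0, size=10)   # an arbitrary sample
x_bar = x.mean()

lhs = ((x - x_bar) ** 2).sum()
rhs = ((x - mu) ** 2).sum() - len(x) * (x_bar - mu) ** 2
print(np.isclose(lhs, rhs))         # True
```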

Taking the expectation of (21) and using (19) gives

$$ \begin{align*} \mathbf{E}\left(\sum_{i=1}^{n}(X_{i} - \overline{X})^{2}\right) &= \mathbf{E}\left(\sum_{i=1}^{n}(X_{i} - \mu)^{2}\right) - n\mathbf{E}\left((\overline{X} - \mu)^{2}\right) \\ &= n\sigma^{2} - n\left(\frac{\sigma^{2}}{n}\right) \\ &= (n - 1)\sigma^{2} \end{align*} $$

so that

$$ \mathbf{E}(S^{2}) = \frac{n-1}{n}\sigma^{2} $$

and

$$ \sigma^{2} = \frac{n}{n-1}\mathbf{E}(S^{2}). \tag{22} $$

Equation (22) is an important result: the population variance, \(\sigma^{2}\), equals the expected sample variance, \(\mathbf{E}(S^{2})\), multiplied by \(n/(n - 1)\). In practice, this means that dividing the sum of squared deviations by \(n - 1\) rather than \(n\) yields an unbiased estimator of \(\sigma^{2}\).
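
In NumPy this correction corresponds to the `ddof` (delta degrees of freedom) argument of `np.var`: `ddof=0` divides by \(n\) and gives the biased \(S^{2}\), while `ddof=1` divides by \(n - 1\) and gives the unbiased estimator. A short illustration with arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(100.0, 15.0, size=10)

print(np.var(x, ddof=0))   # S^2: biased, expected value (n-1)/n * sigma^2
print(np.var(x, ddof=1))   # Bessel-corrected: unbiased for sigma^2
```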