Analysts often confront competing ideas about how financial markets work. Some of these ideas develop through personal research or experience with markets; others come from interactions with colleagues; and many others appear in the professional literature on finance and investments. In general, how can an analyst decide whether statements about the financial world are probably true or probably false?

When we can reduce an idea or assertion to a definite statement about the value of a quantity, such as an underlying or population mean, the idea becomes a statistically testable statement or hypothesis. The analyst may want to explore questions such as the following:

• Is the underlying mean return on this mutual fund different from the underlying mean return on its benchmark?
• Did the volatility of returns on this stock change after the stock was added to a stock market index?
• Are a security's bid-ask spreads related to the number of dealers making a market in the security?
• Do data from a national bond market support a prediction of an economic theory about the term structure of interest rates (the relationship between yield and maturity)?

To address these questions, we use the concepts and tools of hypothesis testing. Hypothesis testing is part of statistical inference, the process of making judgments about a larger group (a population) on the basis of a smaller group actually observed (a sample). The concepts and tools of hypothesis testing provide an objective means to gauge whether the available evidence supports the hypothesis. After a statistical test of a hypothesis, we should have a clearer idea of the probability that a hypothesis is true or not, although our conclusion always stops short of certainty. Hypothesis testing has been a powerful tool in the advancement of investment knowledge and science. As Robert L.
Kahn of the Institute for Social Research (Ann Arbor, Michigan) has written, "The mill of science grinds only when hypothesis and data are in continuous and abrasive contact."

The main emphases of this chapter are the framework of hypothesis testing and tests concerning mean and variance, two quantities frequently used in investments. We give an overview of the procedure of hypothesis testing in the next section. We then address testing hypotheses about the mean and hypotheses about the differences between means. In the fourth section of this chapter, we address testing hypotheses about a single variance and hypotheses about the differences between variances. We end the chapter with an overview of some other important issues and techniques in statistical inference.

2. HYPOTHESIS TESTING

Hypothesis testing, as we have mentioned, is part of the branch of statistics known as statistical inference. Traditionally, the field of statistical inference has two subdivisions: estimation and hypothesis testing. Estimation addresses the question "What is this parameter's (e.g., the population mean's) value?" The answer is in the form of a confidence interval built around a point estimate. Take the case of the mean: We build a confidence interval for the population mean around the sample mean as a point estimate. For the sake of specificity, suppose the sample mean is 50 and a 95 percent confidence interval for the population mean is 50 ± 10 (the confidence interval runs from 40 to 60). If this confidence interval has been properly constructed, there is a 95 percent probability that the interval from 40 to 60 contains the population mean's value.1

The second branch of statistical inference, hypothesis testing, has a somewhat different focus. A hypothesis testing question is "Is the value of the parameter (say, the population mean) 45 (or some other specific value)?" The assertion "the population mean is 45" is a hypothesis.
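The arithmetic of such an interval can be sketched in a few lines. This is an illustration only: the standard error below is a hypothetical value chosen so that the half-width of the interval equals 10, matching the numbers in the text.

```python
from statistics import NormalDist

# Illustrative numbers from the text: sample mean 50, 95% CI of 50 +/- 10.
z = NormalDist().inv_cdf(0.975)      # two-sided 95% quantile, ~1.96
sample_mean = 50.0
standard_error = 10.0 / z            # hypothetical: chosen so z * SE = 10

lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error
print(f"95% CI: ({lower:.0f}, {upper:.0f})")  # 95% CI: (40, 60)
```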
A hypothesis is defined as a statement about one or more populations. This section focuses on the concepts of hypothesis testing. The process of hypothesis testing is part of a rigorous approach to acquiring knowledge known as the scientific method. The scientific method starts with observation and the formulation of a theory to organize and explain observations. We judge the correctness of the theory by its ability to make accurate predictions—for example, to predict the results of new observations.2 If the predictions are correct, we continue to maintain the theory as a possibly correct explanation of our observations. When risk plays a role in the outcomes of observations, as in finance, we can only try to make unbiased, probability-based judgments about whether the new data support the predictions. Statistical hypothesis testing fills that key role of testing hypotheses when chance plays a role.

In an analyst's day-to-day work, he may address questions to which he might give answers of varying quality. When an analyst correctly formulates the question into a testable hypothesis and carries out and reports on a hypothesis test, he has provided an element of support to his answer consistent with the standards of the scientific method. Of course, the analyst's logic, economic reasoning, information sources, and perhaps other factors also play a role in our assessment of the answer's quality.3

We organize this introduction to hypothesis testing around the following list of seven steps.

• Steps in Hypothesis Testing. The steps in testing a hypothesis are as follows:4
  1. Stating the hypotheses.
  2. Identifying the appropriate test statistic and its probability distribution.
  3. Specifying the significance level.
  4. Stating the decision rule.
  5. Collecting the data and calculating the test statistic.
  6. Making the statistical decision.
  7. Making the economic or investment decision.
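The seven steps can be sketched as a short routine. This is a simplified illustration with simulated, hypothetical data; the z-test mechanics used here are developed over the rest of this section.

```python
import math
import random
from statistics import NormalDist, mean, stdev

# Hypothetical sample of 103 monthly risk premiums (simulated, not real data).
random.seed(1)
sample = [random.gauss(0.005, 0.04) for _ in range(103)]

# Step 1: State the hypotheses. H0: mu <= 0 versus Ha: mu > 0.
mu0 = 0.0
# Step 2: Identify the test statistic and its distribution: the sample is
# large, so a z-test (standard normal distribution) is used.
# Step 3: Specify the significance level.
alpha = 0.05
# Step 4: State the decision rule: reject H0 if z exceeds the rejection point.
rejection_point = NormalDist().inv_cdf(1 - alpha)  # about 1.645
# Step 5: Collect the data and calculate the test statistic.
n = len(sample)
z = (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))
# Step 6: Make the statistical decision.
reject_null = z > rejection_point
# Step 7: The economic or investment decision also weighs factors
# outside the test, so it is not reduced to code here.
```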
1 We discussed the construction and interpretation of confidence intervals in the chapter on sampling.
2 To be testable, a theory must be capable of making predictions that can be shown to be wrong.
3 See Freeley and Steinberg (1999) for a discussion of critical thinking applied to reasoned decision making.
4 This list is based on one in Daniel and Terrell (1986).

We will explain each of these steps using as illustration a hypothesis test concerning the sign of the risk premium on Canadian stocks. The steps above constitute a traditional approach to hypothesis testing. We will end the section with a frequently used alternative to those steps, the p-value approach.

The first step in hypothesis testing is stating the hypotheses. We always state two hypotheses: the null hypothesis (or null), designated H0, and the alternative hypothesis, designated Ha.

• Definition of Null Hypothesis. The null hypothesis is the hypothesis to be tested. For example, we could hypothesize that the population mean risk premium for Canadian equities is less than or equal to zero.

The null hypothesis is a proposition that is considered true unless the sample we use to conduct the hypothesis test gives convincing evidence that the null hypothesis is false. When such evidence is present, we are led to the alternative hypothesis.

• Definition of Alternative Hypothesis. The alternative hypothesis is the hypothesis accepted when the null hypothesis is rejected. Our alternative hypothesis is that the population mean risk premium for Canadian equities is greater than zero.

Suppose our question concerns the value of a population parameter, θ, in relation to one possible value of the parameter, θ0 (these are read, respectively, "theta" and "theta sub zero").5 Examples of a population parameter include the population mean, µ, and the population variance, σ².
We can formulate three different sets of hypotheses, which we label according to the assertion made by the alternative hypothesis.

• Formulations of Hypotheses. We can formulate the null and alternative hypotheses in three different ways:
  1. H0: θ = θ0 versus Ha: θ ≠ θ0 (a "not equal to" alternative hypothesis)
  2. H0: θ ≤ θ0 versus Ha: θ > θ0 (a "greater than" alternative hypothesis)
  3. H0: θ ≥ θ0 versus Ha: θ < θ0 (a "less than" alternative hypothesis)

In our Canadian example, θ = µRP and represents the population mean risk premium on Canadian equities. Also, θ0 = 0, and we are using the second of the above three formulations.

The first formulation is a two-sided hypothesis test (or two-tailed hypothesis test): We reject the null in favor of the alternative if the evidence indicates that the population parameter is either smaller or larger than θ0. In contrast, Formulations 2 and 3 are each a one-sided hypothesis test (or one-tailed hypothesis test). For Formulations 2 and 3, we reject the null only if the evidence indicates that the population parameter is, respectively, greater than or less than θ0. The alternative hypothesis has one side.

Notice that in each case above, we state the null and alternative hypotheses such that they account for all possible values of the parameter. With Formulation 1, for example, the parameter is either equal to the hypothesized value θ0 (under the null hypothesis) or not equal to the hypothesized value θ0 (under the alternative hypothesis). Those two statements logically exhaust all possible values of the parameter.

5 Greek letters, such as σ, are reserved for population parameters; Roman letters in italics, such as s, are used for sample statistics.

Despite the different ways to formulate hypotheses, we always conduct a test of the null hypothesis at the point of equality, θ = θ0. Whether the null is H0: θ = θ0, H0: θ ≤ θ0, or H0: θ ≥ θ0, we actually test θ = θ0.
The reasoning is straightforward. Suppose the hypothesized value of the parameter is 5. Consider H0: θ ≤ 5, with a "greater than" alternative hypothesis, Ha: θ > 5. If we have enough evidence to reject H0: θ = 5 in favor of Ha: θ > 5, we definitely also have enough evidence to reject the hypothesis that the parameter, θ, is some smaller value, such as 4.5 or 4. To review, the calculation to test the null hypothesis is the same for all three formulations. What is different for the three formulations, as we will see shortly, is how the calculation is evaluated to decide whether or not to reject the null.

How do we choose the null and alternative hypotheses? Probably most common are "not equal to" alternative hypotheses: We reject the null because the evidence indicates that the parameter is either larger or smaller than θ0. Sometimes, however, we may have a "suspected" or "hoped for" condition for which we want to find supportive evidence.6 In that case, we can formulate the alternative hypothesis as the statement that this condition is true; the null hypothesis that we test is the statement that this condition is not true. If the evidence supports rejecting the null and accepting the alternative, we have statistically confirmed what we thought was true. For example, economic theory suggests that investors require a positive risk premium on stocks (the risk premium is defined as the expected return on stocks minus the risk-free rate). Following the principle of stating the alternative as the "hoped for" condition, we formulate the following hypotheses:

H0: The population mean risk premium on Canadian stocks is less than or equal to 0.
Ha: The population mean risk premium on Canadian stocks is positive.

Note that "greater than" and "less than" alternative hypotheses reflect the beliefs of the researcher more strongly than a "not equal to" alternative hypothesis.
To emphasize an attitude of neutrality, the researcher may sometimes select a "not equal to" alternative hypothesis when a one-sided alternative hypothesis is also reasonable.

The second step in hypothesis testing is identifying the appropriate test statistic and its probability distribution.

• Definition of Test Statistic. A test statistic is a quantity, calculated based on a sample, whose value is the basis for deciding whether or not to reject the null hypothesis.

The focal point of our statistical decision is the value of the test statistic. Frequently (in all the cases that we examine in this chapter), the test statistic has the form

  Test statistic = (Sample statistic − Value of the population parameter under H0) / (Standard error of the sample statistic)   (7-1)

For our risk premium example, the population parameter of interest is the population mean risk premium, µRP. We label the hypothesized value of the population mean under H0 as µ0. Restating the hypotheses using symbols, we test H0: µRP ≤ µ0 versus Ha: µRP > µ0. However, because under the null we are testing µ0 = 0, we write H0: µRP ≤ 0 versus Ha: µRP > 0.

6 Part of this discussion of the selection of hypotheses follows Bowerman and O'Connell (1997, p. 386).

The sample mean provides an estimate of the population mean. Therefore, we can use the sample mean risk premium calculated from historical data, X̄_RP, as the sample statistic in Equation 7-1. The standard deviation of the sample statistic, known as the "standard error" of the statistic, is the denominator in Equation 7-1. For this example, the sample statistic is a sample mean.
For a sample mean, X̄, calculated from a sample generated by a population with standard deviation σ, the standard error is given by one of two expressions:

  σ_X̄ = σ/√n   (7-2)

when we know σ (the population standard deviation), or

  s_X̄ = s/√n   (7-3)

when we do not know the population standard deviation and need to use the sample standard deviation, s, to estimate it. For this example, because we do not know the population standard deviation of the process generating the return, we use Equation 7-3. The test statistic is thus

  (X̄_RP − µ0)/s_X̄ = (X̄_RP − 0)/(s/√n)

In making the substitution of 0 for µ0, we use the fact already highlighted that we test any null hypothesis at the point of equality, as well as the fact that µ0 = 0 here.

We have identified a test statistic to test the null hypothesis. What probability distribution does it follow? We will encounter four distributions for test statistics in this chapter:

• the t-distribution (for a t-test),
• the standard normal or z-distribution (for a z-test),
• the chi-square (χ²) distribution (for a chi-square test), and
• the F-distribution (for an F-test).

We will discuss the details later, but assume we can conduct a z-test based on the central limit theorem because our Canadian sample has many observations.7 To summarize, the test statistic for the hypothesis test concerning the mean risk premium is X̄_RP/s_X̄. We can conduct a z-test because we can plausibly assume that the test statistic follows a standard normal distribution.

The third step in hypothesis testing is specifying the significance level. When the test statistic has been calculated, two actions are possible: (1) We reject the null hypothesis or (2) we do not reject the null hypothesis. The action we take is based on comparing the calculated test statistic to a specified possible value or values. The comparison values we choose are based on the level of significance selected. The level of significance reflects how much sample evidence we require to reject the null.
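Equations 7-2 and 7-3 and the resulting test statistic might be computed as follows. This is a sketch with hypothetical sample values; the assumed value of sigma merely illustrates the known-σ case of Equation 7-2.

```python
import math
from statistics import mean, stdev

# Hypothetical monthly risk premiums, for illustration only.
sample = [0.012, -0.004, 0.021, 0.008, -0.010, 0.015, 0.006, 0.009]
n = len(sample)

sigma = 0.010                            # assumed known population std. dev.
se_known = sigma / math.sqrt(n)          # Equation 7-2: sigma / sqrt(n)

s = stdev(sample)                        # sample standard deviation
se_estimated = s / math.sqrt(n)          # Equation 7-3: s / sqrt(n)

mu0 = 0.0                                # tested at the point of equality
z = (mean(sample) - mu0) / se_estimated  # test statistic for the z-test
```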
Analogous to its counterpart in a court of law, the required standard of proof can change according to the nature of the hypotheses and the seriousness of the consequences of making a mistake. There are four possible outcomes when we test a null hypothesis:

1. We reject a false null hypothesis. This is a correct decision.
2. We reject a true null hypothesis. This is called a Type I error.
3. We do not reject a false null hypothesis. This is called a Type II error.
4. We do not reject a true null hypothesis. This is a correct decision.

We illustrate these outcomes in Table 7-1.

TABLE 7-1 Type I and Type II Errors in Hypothesis Testing

                               True Situation
  Decision                H0 True            H0 False
  Do not reject H0        Correct Decision   Type II Error
  Reject H0 (accept Ha)   Type I Error       Correct Decision

7 The central limit theorem says that the sampling distribution of the sample mean will be approximately normal with mean µ and variance σ²/n when the sample size is large. The sample we will use for this example has 103 observations.

When we make a decision in a hypothesis test, we run the risk of making either a Type I or a Type II error. These are mutually exclusive errors: If we mistakenly reject the null, we can only be making a Type I error; if we mistakenly fail to reject the null, we can only be making a Type II error.

The probability of a Type I error in testing a hypothesis is denoted by the Greek letter alpha, α. This probability is also known as the level of significance of the test. For example, a level of significance of 0.05 for a test means that there is a 5 percent probability of rejecting a true null hypothesis. The probability of a Type II error is denoted by the Greek letter beta, β. Controlling the probabilities of the two types of errors involves a trade-off.
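The two error probabilities can be made concrete with a small simulation. This is a hypothetical sketch, not part of the text's example: repeated sampling shows the rejection rate approximating α when the null is true, and shows the cost in power of choosing a stricter α when the null is false.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def one_sided_z_rejects(true_mu, alpha, n=30, sigma=1.0, trials=2000):
    """Fraction of simulated samples for which H0: mu <= 0 is rejected."""
    critical = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(trials):
        sample = [random.gauss(true_mu, sigma) for _ in range(n)]
        z = mean(sample) / (stdev(sample) / math.sqrt(n))
        if z > critical:
            rejections += 1
    return rejections / trials

random.seed(0)
# With the null true (mu = 0), the rejection rate approximates alpha:
type1_05 = one_sided_z_rejects(0.0, 0.05)
type1_01 = one_sided_z_rejects(0.0, 0.01)
# With the null false (mu = 0.5), the rejection rate is the power;
# the stricter alpha = 0.01 test rejects less often, so beta rises:
power_05 = one_sided_z_rejects(0.5, 0.05)
power_01 = one_sided_z_rejects(0.5, 0.01)
```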
All else equal, if we decrease the probability of a Type I error by specifying a smaller significance level (say, 0.01 rather than 0.05), we increase the probability of making a Type II error because we will reject the null less frequently, including when it is false. The only way to reduce the probabilities of both types of errors simultaneously is to increase the sample size, n.

Quantifying the trade-off between the two types of error in practice is usually impossible because the probability of a Type II error is itself hard to quantify. Consider H0: θ ≤ 5 versus Ha: θ > 5. Because every true value of θ greater than 5 makes the null hypothesis false, each value of θ greater than 5 has a different β (Type II error probability). In contrast, it is sufficient to state a Type I error probability for θ = 5, the point at which we conduct the test of the null hypothesis. Thus, in general, we specify only α, the probability of a Type I error, when we conduct a hypothesis test. Whereas the significance level of a test is the probability of incorrectly rejecting the null, the power of a test is the probability of correctly rejecting the null—that is, the probability of rejecting the null when it is false.8 When more than one test statistic is available to conduct a hypothesis test, we should prefer the most powerful, all else equal.9

To summarize, the standard approach to hypothesis testing involves specifying a level of significance (probability of Type I error) only. It is most appropriate to specify this significance level prior to calculating the test statistic. If we specify it after calculating the test statistic, we may be influenced by the result of the calculation, which detracts from the objectivity of the test.

We can use three conventional significance levels to conduct hypothesis tests: 0.10, 0.05, and 0.01. Qualitatively, if we can reject a null hypothesis at the 0.10 level of significance, we have some evidence that the null hypothesis is false.
If we can reject a null hypothesis at the 0.05 level, we have strong evidence that the null hypothesis is false. And if we can reject a null hypothesis at the 0.01 level, we have very strong evidence that the null hypothesis is false. For the risk premium example, we will specify a 0.05 significance level.

8 The power of a test is, in fact, 1 minus the probability of a Type II error.
9 We do not always have information on the relative power of the test for competing test statistics, however.

The fourth step in hypothesis testing is stating the decision rule. The general principle is simply stated. When we test the null hypothesis, if we find that the calculated value of the test statistic is as extreme as or more extreme than a given value or values determined by the specified level of significance, α, we reject the null hypothesis. We say the result is statistically significant. Otherwise, we do not reject the null hypothesis, and we say the result is not statistically significant. The value or values with which we compare the calculated test statistic to make our decision are the rejection points (critical values) for the test.10

• Definition of a Rejection Point (Critical Value) for the Test Statistic. A rejection point (critical value) for a test statistic is a value with which the computed test statistic is compared to decide whether to reject or not reject the null hypothesis.

For a one-tailed test, we indicate a rejection point using the symbol for the test statistic with a subscript indicating the specified probability of a Type I error, α; for example, zα. For a two-tailed test, we indicate zα/2. To illustrate the use of rejection points, suppose we are using a z-test and have chosen a 0.05 level of significance.

• For a test of H0: θ = θ0 versus Ha: θ ≠ θ0, two rejection points exist, one negative and one positive. For a two-sided test at the 0.05 level, the total probability of a Type I error must sum to 0.05.
Thus, 0.05/2 = 0.025 of the probability should be in each tail of the distribution of the test statistic under the null. Consequently, the two rejection points are z0.025 = 1.96 and −z0.025 = −1.96. Let z represent the calculated value of the test statistic. We reject the null if we find that z < −1.96 or z > 1.96. We do not reject if −1.96 ≤ z ≤ 1.96.

• For a test of H0: θ ≤ θ0 versus Ha: θ > θ0 at the 0.05 level of significance, the rejection point is z0.05 = 1.645, the value of the standard normal distribution such that 5 percent of the outcomes lie to its right. We reject the null hypothesis if z > 1.645.
• For a test of H0: θ ≥ θ0 versus Ha: θ < θ0, the rejection point is −z0.05 = −1.645. We reject the null hypothesis if z < −1.645.

Figure 7-1 illustrates a test of H0: µ = µ0 versus Ha: µ ≠ µ0 at the 0.05 significance level using a z-test. The "acceptance region" is the traditional name for the set of values of the test statistic for which we do not reject the null hypothesis. (The traditional name, however, is inaccurate. We should avoid using phrases such as "accept the null hypothesis" because such a statement implies a greater degree of conviction about the null than is warranted when we fail to reject it.11) On either side of the acceptance region is a rejection region (or critical region). If the null hypothesis that µ = µ0 is true, the test statistic has a 2.5 percent chance of falling in the left rejection region and a 2.5 percent chance of falling in the right rejection region. Any calculated value of the test statistic that falls in either of these two regions causes us to reject the null hypothesis at the 0.05 significance level.
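The rejection points quoted above can be recovered from the standard normal inverse cumulative distribution function, as in this brief sketch:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution

# Two-sided test at the 0.05 level: alpha/2 = 0.025 in each tail.
z_two_sided = z.inv_cdf(1 - 0.025)   # approximately 1.96
# One-sided test at the 0.05 level: all 0.05 in one tail.
z_one_sided = z.inv_cdf(1 - 0.05)    # approximately 1.645

print(round(z_two_sided, 2), round(z_one_sided, 3))  # 1.96 1.645
```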
The rejection points

10 "Rejection point" is a descriptive synonym for the more traditional term "critical value."
11 The analogy in some courts of law (for example, in the United States) is that if a jury does not return a verdict of guilty (the alternative hypothesis), it is most accurate to say that the jury has failed to reject the null hypothesis, namely, that the defendant is innocent