Non-Parametric Statistics for Applied Linguistics Research [1 ed.]



English Pages 165 [164] Year 2009


Table of contents :
PREFACE v
ACKNOWLEDGMENT viii
1 INTRODUCTION TO STATISTICS AND SLAR 1
1.1. A mole wrench or a pipe wrench 4
1.2. Test power 5
1.2.1. Sample size and power 9
1.2.2. Effect size and power 11
1.2.3. Are nonparametric tests less powerful? 13
2 PARAMETRIC/NONPARAMETRIC ASSUMPTIONS 14
2.1. Scales of measurement 18
2.2. Sample Size 20
2.3. Normality 24
3 NONPARAMETRIC TESTS: REVISITED 26
3.1. NP Tests: When and Why? 27
3.2. Misconceptions about NP Tests 29
3.3. The power of NP tests 30
3.4. Common types of NP tests 32
4 NON-NORMALITY TESTS 37
4.1. The normal distribution 38
4.2. Graphical methods of testing normality 41
4.2.1. Histogram 41
4.2.2. Stem and Leaf Plot 42
4.2.3. Boxplot 43
4.2.4. P-P plot 44
4.2.5. Q-Q plot 45
4.3. Numerical methods of testing normality 46
4.3.1. Skewness 46
4.3.2. Kurtosis 49
4.4. Testing normality using SPSS 52
4.4.1. A normally distributed variable 53
4.4.2. Graphical methods 54
4.4.3. Numerical methods 58
5 NONPARAMETRIC TESTS OF DIFFERENCE 65
5.1. Kolmogorov-Smirnov test for one sample 66
5.1.1. SPSS for K-S test analysis 66
5.1.2. SLAR literature: Kolmogorov-Smirnov test 69
5.2. Mann-Whitney U test 71
5.2.1. Carrying out Mann-Whitney test 73
5.2.2. SPSS for Mann-Whitney test 75
5.2.3. SLAR literature: Mann-Whitney U test 76
5.3. Kruskal-Wallis one-way analysis of variance 78
5.3.1. Carrying out Kruskal-Wallis test 79
5.3.2. SPSS for Kruskal-Wallis test 80
5.3.3. SLAR literature: Kruskal-Wallis test 84
5.4. Wilcoxon matched-pairs signed ranks test 85
5.4.1. SPSS for the Wilcoxon tests 87
5.4.2. SLAR literature: Wilcoxon test 95
5.5. Friedman two-way ANOVA tests 97
5.5.1. SPSS for the Friedman test 103
5.5.2. SLAR literature: Friedman two-way ANOVA 105
6 NP TESTS OF DIFFERENCE: CATEGORICAL 107
6.1. Chi-square test for frequency data 108
6.1.1. SPSS for Chi-square 114
6.1.2. SLAR literature: Chi-square test 114
6.2. Fisher test of categorical data 116
6.2.1. SPSS for the Fisher test 121
7 NONPARAMETRIC TESTS OF ASSOCIATION 122
7.1. NP tests for categorical data 123
7.1.1. Phi Coefficient 123
7.1.1.1. SPSS for Phi Coefficient 125
7.1.2. Contingency Coefficient 132
7.1.3. Cramer's V Coefficient 134
7.2. NP tests for non-categorical data 136
7.2.1. Kendall's rank coefficient correlation 136
7.2.1.1. Kendall's tau a 137
7.2.1.2. Kendall's tau b 140
7.2.1.3. Kendall's tau c 141
7.2.1.4. SLA literature: Kendall's tau tests 144
7.2.2. Spearman's rank order correlation (Rho) 145
7.2.3. Using SPSS for Spearman's Rho 151

APPENDIX 1: TWO-TAILED CRITICAL VALUES OF T FOR THE WILCOXON TEST 155
APPENDIX 2: ONE-TAILED CRITICAL VALUES OF T FOR THE WILCOXON TEST 156
APPENDIX 3: CHI-SQUARE TABLE 157
APPENDIX 4: CRITICAL VALUES OF r (and rs) 158
APPENDIX 5: Z-TABLE 159
APPENDIX 6: T-TABLE 160
REFERENCES
GLOSSARY

Karl Popper: All life is problem solving

Statistics and Language Acquisition Research

Statistics is defined as the science of collecting, organizing, presenting, analyzing, and interpreting data in order to support more effective decisions. Two key terms in this definition need elaboration: data and effective decision-making. The final step in statistics is making sound and effective decisions, and the role of data in reaching them is central: the more accurate decisions we want to


make, the more appropriate data we need to collect. How is it possible to arrive at sound decisions when the data on which they are based are not relevant? Questions such as what data are, what appropriate data means, and how to collect and measure data are dealt with in measurement theories. Decision makers make better decisions when they use all available information in an effective and meaningful way. The primary role of statistics is to provide decision makers with methods for obtaining and analyzing information to help make these decisions.

As far as second language acquisition research (SLAR) is concerned, Messick (1989) states that measurement is a data-driven and theory-driven endeavor. Measurement in research provides a systematic means of gathering evidence about human behavior. Using measurement procedures, researchers elicit, observe, record, and analyze the relevant behavior (here, data) of language learners. In addition, measurement theories help SLA researchers interpret the obtained results in the light of SLA theories.

Norris and Ortega (2003) hold that measurement proceeds through several interrelated stages. They divide the measurement process into two broad categories, conceptual and procedural, each consisting of three phases (see Figure 1). The conceptual component of the measurement process consists of construct definition, behavior identification, and task specification. In terms of construct definition, researchers should explicitly explain "what it is they want to know". On current views, a construct refers to the interplay between a theoretical explanation of a phenomenon and the data collected about the phenomenon. Thus, if constructs are not clearly specified, behaviors (data) cannot be linked with them. The second phase, behavior identification, is


related to the first concept mentioned previously. As mentioned, we need learners' behavior for decision making and interpretation. Researchers should know which particular type of behavior must be observed and collected to arrive at relevant interpretations. In the following chapters on parametric and nonparametric statistical tests, you will see how the selection of appropriate and relevant evidence determines the type of statistical indices needed for data analysis. Task specification, the third phase of the conceptual category, refers to the decisions we need to make concerning the specific tasks and situations used to elicit targeted behaviors. If a selected task cannot provide the evidence required, the validity of the measurement is at stake.

Figure 1: Measurement Process (Norris & Ortega, 2003)

In the procedural stages of the measurement process, the researchers' task is to operationalize the outcomes of the conceptual stages and to use


mechanisms to elicit and analyze the data to provide evidence for interpretations. Three stages serve the objectives of the procedural phase of the measurement process. In the first stage, behavior elicitation, researchers use specific tasks to elicit data. Observation scoring, the second stage, deals with scoring the data. Scoring in practice should be clearly linked to the intended interpretations, and it may be categorical, ordinal, interval, or ratio. Finally, in the data analysis stage, the scores are summarized and interpreted in light of numerous statistical paradigms.

It appears that for second/foreign language researchers, statistics is mostly of interest when the procedural stages of the measurement process are involved. Given that the primary purpose of second language acquisition research is to understand and inform improvements in language teaching and learning, the validity of its interpretations depends directly on selecting the appropriate measurement level for the appropriate statistics. Consequently, behavior elicitation, observation scoring, and data analysis are the keys to more effective decisions on the part of SLA researchers.

Which One is Better: a Mole Wrench or a Pipe Wrench?

A decision always facing applied linguistics researchers is the choice between parametric and nonparametric statistical tests. A common misconception appears to be the priority of parametric over nonparametric tests because of the so-called power of parametric tests. In the present section, we examine the critical factors contributing to the selection of appropriate statistical tests. The factors to be discussed are


power, sample size, and effect size as the core issues in decision making. The purpose here is to arrive at the conclusion that the selection of a test depends upon the function for which it was formulated. I surmise that the choice between a parametric and a nonparametric test is similar to the choice between a mole wrench and a pipe wrench. In Webster's Unabridged Dictionary (Copyright Random House 2000), a pipe wrench is defined as "a tool having two toothed jaws, one fixed and the other free to grip pipes and other tubular objects when the tool is turned in one direction only." A wrench is defined as "a tool for gripping and turning or twisting the head of a bolt, a nut, a pipe, or the like, commonly consisting of a bar of metal with fixed or adjustable jaws." In terms of the wrench metaphor, the selection of one wrench over another depends on numerous factors. In the same way, a parametric test is as useful as a nonparametric test, provided that the assumptions on which each test is designed are respected. In chapters two and three we will expound the main assumptions of both parametric and nonparametric tests.

Test Power

A critical decision to be made by applied linguistics researchers has always been the choice among alternative statistical tests to meet the requirements of their research design. Which test is the best? When there are alternative statistical tests to handle the data (parametric and nonparametric tests), a researcher should first of all consider the power of each test and select the most powerful one. A statistical test's power is the probability that the test procedure will result in statistical significance when a real effect exists. Since statistical significance is so often the desired outcome in research, it is of primary significance


for SLA researchers to plan a study to achieve a high power. Because of the difficulty of the calculations, power is often ignored or some so-called rule of thumb is adopted by researchers. Fortunately, thanks to statistical packages and internet-based software, power analysis is no longer a complex procedure (for a free Internet-based trial version of power analysis, see www.Power-Analysis.com). Traditionally in applied linguistics research, statistical tests assume a null hypothesis of no difference, and hypothesis testing means making a decision, based on the sample data obtained from the population, between the null hypothesis and an alternative hypothesis. Hypothesis testing requires a decision as to whether the sample data are consistent or inconsistent with the null hypothesis. The possible outcomes of this decision are shown in Table 1:

Table 1: Type I and Type II Errors

Researcher's Decision      True in the Population
                           H0 is True          H0 is False
Rejects H0 (accepts H1)    Type I error (α)    Correct decision
Accepts H0                 Correct decision    Type II error (β)

As displayed in Table 1, there are two possibilities: either the decision made is correct or it is in error. As far as the former is concerned, if a treatment is really effective and the research is


successful in rejecting the null hypothesis, or if a treatment has no real effect and the research does not reject the null hypothesis, the study's result is correct. However, in many cases this is not so, and we should consider two types of potential errors in decision making by SLA researchers: Type I (α) and Type II (β) errors. A Type I error happens when the treatment really has no effect but we mistakenly reject the null hypothesis. A Type II error occurs when the treatment is effective but we fail to reject the null hypothesis. Supposing that the null hypothesis is true and alpha is set at .05, we expect a Type I error to occur in 5% of all studies. That is, the rate of Type I error is equal to alpha. Supposing the null hypothesis is false, we would expect a Type II error to occur in the proportion of studies denoted by one minus power, and this error rate is known as beta. Test power can be defined as the probability of not committing a Type II error. So, an inverse relationship holds between test power and Type II error: if power increases, the probability of a Type II error decreases. According to Cohen (1988), power refers to "the probability that the statistical test will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the phenomenon exists" (p. 56). In simple terms, we can consider the power of a test as our probability of finding what we were looking for in the research design. In other words, power is the probability of correctly rejecting the null hypothesis and is equal to 1 – β. The desired power value is conventionally set at .80. Therefore, in the same way that α is conventionally set at .05 by researchers, β is set at .20 or less, which corresponds to a power of .80 or greater. A power of .85, for instance, shows that the research has statistically sufficient power to detect the hypothesized effect if it exists.
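The relationship between α, β, and power described above can be illustrated with a small calculation. The following is a minimal sketch, not the procedure used in this book: it assumes a two-sided comparison of two equal-sized groups and uses a normal approximation, and the function name `approx_power` and the example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Assumption (not from the text): a normal approximation with equal
    group sizes and a standardized effect size.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)            # critical value for alpha
    shift = effect_size * sqrt(n_per_group / 2)  # expected test statistic
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


power = approx_power(effect_size=0.5, n_per_group=64)
beta = 1 - power  # Type II error rate
print(f"power = {power:.2f}, beta = {beta:.2f}")

# With no real effect, the rejection rate is simply alpha (Type I error).
print(f"rejection rate under H0 = {approx_power(0.0, 64):.2f}")
```

When the effect size is zero the function returns α itself, which mirrors the statement that the Type I error rate equals alpha; for a non-zero effect, 1 minus the returned value is β.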


Power analysis is usually done a priori, when researchers determine power prior to the study and data collection. In this case, power analysis can be used to decide upon a sample size large enough to achieve the desired power. It helps researchers avoid spending time and resources on a study which has very little chance of finding a significant effect. Moreover, a priori power analysis helps ensure that researchers do not waste time and resources testing more subjects than are necessary to detect an effect. Power analysis can also be conducted post hoc, when a study has been completed, to determine what the power was in the study. Post-hoc analysis can help researchers explain the results if a study did not find any significant effects. Using SPSS, the complex procedure of power analysis can be performed with ease. From the menus, select: Analyze → General Linear Model → Univariate

In the Univariate window, select a variable from the left dialogue box and move it to the Dependent Variable box on the right. Then select Options to open the Univariate Options window. In the Display section, tick Observed power and click Continue. Finally, press the OK button. The SPSS output is a table like the following:


Tests of Between-Subjects Effects (c)
Dependent Variable: Comp2output1

Source            Type II Sum of Squares   df   Mean Square   F        Sig.   Noncent. Parameter   Observed Power (a)
Corrected Model   .000 (b)                 0    .             .        .      .000                 .
Intercept         196.000                  1    196.000       32.667   .011   32.667               .953
Error             18.000                   3    6.000
Total             214.000                  4
Corrected Total   18.000                   3

a. Computed using alpha = .05
b. R Squared = .000 (Adjusted R Squared = .000)
c. Weighted Least Squares Regression - Weighted by preposition

As the table displays, the calculated observed power is .95 which is above .80. The interpretation is that the study had a high power to detect the effect the research hypothesis claimed.

Sample Size and Power

Power is influenced by a host of factors including sample size, effect size, the level of error in experimental measurement, and the type of statistical test used. Sample size is a major factor contributing to test power. In this regard, power analysis is recommended to make certain that the sample size is large enough for the statistical tests to actually detect the differences that they are meant to find. Statistically speaking, if the sample size is too small, the standard statistical tests will not have sufficient statistical power to find differences that actually exist. In that case, no significant difference is found although in reality there is a difference. As was mentioned previously about


Type II error, with the decrease in sample size, the probability of acceptance of a false null hypothesis, referred to as beta, increases. Figure 2 displays the trade-off between power and sample size:

Figure 2: The Trade-off between Power and Sample Size

Based on the graph, we arrive at the following conclusions:

• For α set at 0.01, an N of about 140 per group is needed for power of .80
• For α set at 0.05, an N of about 90 per group is needed for power of .80
• For α set at 0.10, an N of about 70 per group is needed for power of .80

However, we need to note that although an increase in α leads to an increase in power, alpha and beta must be balanced against each other depending on the research circumstances.
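The way the required sample size shrinks as α grows can be sketched with a standard normal-approximation formula. This is an illustration under stated assumptions rather than a reproduction of Figure 2: the effect size behind the figure is not given in the text, so the value 0.5 below is hypothetical, as is the function name `n_per_group`.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate N per group for a two-sided, two-group comparison.

    Assumption (not from the text): normal approximation,
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# A stricter alpha demands more participants for the same power.
for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}: about {n_per_group(0.5, alpha=alpha)} per group")
```

The absolute numbers depend on the assumed effect size and will not match the figure, but the ordering is the same: the smaller the α, the larger the sample needed to reach a power of .80.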


Effect Size and Power

Another factor contributing to the statistical power of tests is effect size, because larger effects are easier to detect than smaller effects. Effect size refers to a set of indices which measure the magnitude of a treatment effect and the strength of association or difference between observations (see the 2001 edition of the Publication Manual of the American Psychological Association, pp. 25-26). In data interpretation we should bear in mind the critical point that statistical significance is different from importance. There are many cases where the results of a study are statistically significant but not important. As Dörnyei (2007) states, statistical significance only shows that the observed phenomenon is probably true; it does not indicate that it is important. Thus, a statistically significant difference is not necessarily big or important for decision-making: statistical significance merely means that we are confident that there is a difference. Therefore, once we have determined statistical significance, we need to compute the related effect size to find out whether the difference is both statistically significant and important. Unfortunately SPSS does not provide us with effect size computation. The easiest method of obtaining an effect size is to calculate the difference between the means of the two groups and divide it by the standard deviation of one group:

Effect size = (Group 1 mean – group 2 mean) / SD
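As a brief illustration of the formula above, the sketch below computes the effect size from two lists of scores. The scores and the function name `effect_size` are hypothetical; dividing by the second group's standard deviation follows the "standard deviation of one group" wording, and the pooled variant is included only as a common alternative, not as the method of this book.

```python
from math import sqrt
from statistics import mean, stdev


def effect_size(group1, group2, pooled=False):
    """(Group 1 mean - Group 2 mean) / SD.

    By default the SD of group2 is used (for example, a control group).
    With pooled=True, the pooled SD of both groups is used instead.
    """
    difference = mean(group1) - mean(group2)
    if pooled:
        n1, n2 = len(group1), len(group2)
        sd = sqrt(((n1 - 1) * stdev(group1) ** 2 + (n2 - 1) * stdev(group2) ** 2)
                  / (n1 + n2 - 2))
    else:
        sd = stdev(group2)
    return difference / sd


# Hypothetical test scores for a treatment group and a control group.
treatment = [2, 4, 6]
control = [1, 3, 5]
print(effect_size(treatment, control))               # SD of the control group
print(effect_size(treatment, control, pooled=True))  # pooled SD
```

Which standard deviation to divide by is a design choice; when the two groups have similar spread, as in this toy example, both versions give the same value.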


The interpretation of effect size is commonly based on the general guide by Cohen (1988), as follows:

• 0.0 – 0.2 Small Effect Size
• 0.3 – 0.5 Medium Effect Size
• 0.6 – 0.9 Large Effect Size

To see the relationship between the statistical significance and the importance of an observed phenomenon, consider the following table taken from Kotrlik and Williams (2003):

The table shows that although the t-test is statistically significant (t = 2.20, p = .03, df = 154), it does not reach even a small effect size based on Cohen's guidelines (d = .09). The interpretation is that even though the difference between time spent and time preferred to be spent in teaching is significant, its effect size is so small that its magnitude is negligible. To sum up, we stated that power is related to Type II (β) error, sample size, and effect size. Cohen (1988) has designed procedures to determine sample size based on the interrelationship between these concepts (experiments with two groups and a significance level of .05). In the


following table, the interrelationship between the three major issues is displayed:

Table 2: Power, Effect Size, and Sample Size Relations

Effect size   Power = .80   Power = .90
.10           786           1052
.20           200           266
.30           88            116
.40           52            68
.50           28            36

Are Nonparametric Tests Less Powerful?

A common misunderstanding concerning nonparametric tests is that they are always less powerful than their parametric alternatives. This assumption rests only on common knowledge and is not scientifically supported. It is true that, provided all assumptions underlying parametric tests are met, parametric tests are more powerful. However, when it comes to real research, the parametric assumptions are rarely met, and researchers simply assume that the assumptions necessary for the application of the tests are satisfied. In a nutshell, nonparametric tests are more powerful than their parametric equivalents if the assumptions underlying parametric tests are not met. A more detailed elaboration on the power of nonparametric tests is given in chapter three, after an explanation of the underlying assumptions of parametric/nonparametric tests and some other related issues.


Parametric/Non-parametric Assumptions

Today, non-parametric techniques are widely applicable to second language acquisition research. Given limited exposure and persistent misconceptions, the challenge seems to be convincing language-related researchers to use nonparametric techniques to arrive at sound interpretations. Data from research in applied linguistics in general and classroom research in particular often


violates one, if not all, of the basic assumptions for the application of parametric statistics. Non-parametric statistics can be considered another package of tests for statistical inference which do not make strict assumptions about the population from which the data have been sampled, and which may be used in studies with small sample sizes, nominal or ordinal data, and non-normally distributed data. The present volume provides a brief introduction to nonparametric statistics as they apply to research conducted in the field of applied linguistics.

Traditionally, many introductory courses in statistics and SLAR tend to focus on what are called parametric statistics. These techniques are named parametric because they concentrate on specific parameters of the population, usually the mean and variance. In order to use these techniques, a number of assumptions regarding the nature of the population from which the data are drawn must be met. Pett (1997) enumerates the basic assumptions to be met for the application of parametric tests:

• Normal distribution of the dependent variable
• A certain level of measurement: Interval data
• Adequate sample size (>30 recommended per group)
• An independence of observations, except with paired data
• Observations for the dependent variable have been randomly drawn
• Equal variance among sample populations
• Hypotheses usually made about numerical values, especially the mean


In addition, Mackey and Gass (2005) mention three assumptions for the use of parametric tests in second language research. The first assumption is the normal distribution of the data, in which case means and standard deviations are suitable measures of central tendency and dispersion. Second, the data are interval in nature. The third assumption is independence of observations, i.e. scores on one test do not influence scores on another measure. Considering the above-mentioned assumptions, one might note that many studies conducted in applied linguistics violate at least one of these parametric assumptions. For statisticians, non-parametric statistics appear to be the solution to this problem in many cases because they do not make strict assumptions concerning the population from which the data have been sampled. Although nonparametric techniques do not need the rigid assumptions associated with their parametric counterparts, this does not imply that they are ‘assumption free’. Pett (1997) introduces some characteristics common to most non-parametric techniques:

• Fewer assumptions regarding the population distribution
• Sample sizes are often less stringent
• Measurement level may be nominal or ordinal
• Independence of randomly selected observations, except when paired
• Primary focus is on the rank ordering or frequencies of data
• Hypotheses are posed regarding ranks, medians, or frequencies of data


According to Hollander and Wolfe (1999), "… roughly speaking, a nonparametric procedure is a statistical procedure that has certain desirable properties that hold under relatively mild assumptions regarding the underlying populations from which the data are obtained" (p. 1). They further enumerate several advantages of nonparametric techniques which have contributed to their rapid and continuous advance.

a. Nonparametric techniques need few assumptions concerning the underlying populations: in particular, the traditional assumption that the underlying populations are normal is not required.
b. Nonparametric methods enable researchers to calculate exact p-values for tests and exact coverage probabilities for confidence intervals.
c. The application of nonparametric procedures is easier than that of their parametric counterparts.
d. Nonparametric methods are easy to comprehend.
e. Although it appears that nonparametric procedures sacrifice much of the underlying information in the sample, theoretical research has shown that this is not true. On the contrary, nonparametric procedures can be mildly or widely more efficient than their parametric counterparts when the underlying populations are not normal.
f. Nonparametric methods are not sensitive to outlying observations.
g. Nonparametric techniques can be used in many settings where normal theory procedures cannot be used. Many nonparametric techniques require merely the ranks of the data rather than the magnitude of the data, whereas the parametric techniques require the magnitude.
h. The advent of computer packages has facilitated rapid computation of p-values for nonparametric tests. Before convenient computer software became available, investigators commonly avoided the excessive computation needed for exact conditional tests and instead relied on large-sample results which produced approximate p-values.

As far as research in applied linguistics is concerned, there are usually three major parametric assumptions which are routinely violated: level of measurement, sample size, and normal distribution of the dependent variable. To illuminate the reason why much of the data in SLAR violate these assumptions, and to justify the application of non-parametric techniques, the following sections will explain these assumptions.

Scale of Measurement

At the very beginning, in order to decide which statistical test to use, it is indispensable to recognize the scale of measurement associated with the dependent variable of interest. Generally speaking, for the application of a parametric test, a minimum of interval level measurement is required. By contrast, nonparametric techniques can be used with all scales of measurement, and are most frequently associated with nominal and ordinal level data. In many cases, it is possible for interval data to be converted into ordinal data.
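Converting interval data into ordinal data usually means replacing each score with its rank. The sketch below shows one common way of doing this, in which tied scores share the average of the ranks they occupy; the scores and the function name `to_ranks` are hypothetical and serve only as an illustration.

```python
def to_ranks(scores):
    """Convert scores to ranks (1 = lowest); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        # Find the end of the run of tied scores beginning at `start`.
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks


# Hypothetical interval-level test scores for five learners.
scores = [55, 70, 70, 90, 62]
print(to_ranks(scores))
```

After the conversion only the order of the learners is retained; the distances between their original scores are deliberately discarded, which is what rank-based nonparametric tests work with.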


Nominal data

The first level of measurement is nominal, or categorical. Nominal scales are composed of two or more mutually exclusive named categories with no implied ordering: yes or no, male or female. Data are placed in one of the categories, and the cases in each category are counted (these counts are known as frequencies). The key to nominal level measurement is that no numerical values are assigned to the variables. In the nominal scale, numbers are used simply as names and have no real quantitative or mathematical value. Given that no ordering or meaningful numerical distances between numbers exist in nominal measurement, we cannot obtain a ‘normal distribution’ of the dependent variable. Descriptive research in applied linguistics frequently makes use of the nominal scale when collecting data about target populations. For example, the native language of participants in a research project, their nationality, and their sex are coded using the nominal scale. According to Spatz and Johnston (1989), a mean is appropriate for interval or ratio scale data, but not for ordinal or nominal distributions.

Ordinal data

The second level of measurement which is also frequently associated with nonparametric statistics is the ordinal scale. The ordinal scale has the characteristics of the nominal scale plus the feature of indicating "greater than" or "less than". In other words, ordinal level measurement gives us a quantitative ‘order’ of variables, in mutually exclusive categories, but no indication as to the value of the differences between the positions. As such,


the differences between positions in the ordered scale cannot be assumed to be equal. Examples of ordinal scales in SLA research include motivation, stress in the classroom, participation in tasks, and so forth. One could estimate that someone with a score of 5 is more fluent or more accurate than someone with a score of 3, but not by how much. As Spatz and Johnston (1989) mention, a median is appropriate for ratio, interval, and ordinal data, but not for nominal data. There are a number of non-parametric techniques available to test hypotheses about differences between groups and relationships among variables, as well as descriptive statistics relying on rank ordering, which will be explained in the next sections.

Sample size

Adequate sample size is another assumption underlying parametric tests. Cozby (2001) argues that increasing sample size increases the probability that results will be statistically significant, because larger samples are more representative of population values. Statistically speaking, it is possible to specify the sample size by means of a mathematical formula which accounts for the size of the confidence interval and the size of the population (see Table 3). For instance, based on the table, if our population size is 5,000, we need a sample of 357 for 5% accuracy; the needed sample size increases to 879 for 3% accuracy. Spatz and Johnston (1989) also state that the larger the sample size, the smaller the standard error of the difference and the standard error of the mean. For research in education, Fraenkel and Wallen (2003) recommend 100 participants for descriptive research, 50 for correlational studies, and about 30 for each group in experimental research designs. According to


Hatch and Farhady (1982), if the sample size is 30, the distribution of data of a sample which is randomly selected is close enough to a normal distribution and will not breach the underlying assumptions of the normal distribution considerably. Bachman (2003) believes that there is not a clear-cut answer to the question of appropriate sample size, but a division of around 20-30 cases is generally accepted.

Table 3: Sample Size and Precision of Population Estimates (95% confidence interval) (Cozby, 2001)

Size of Population   Precision of Estimate
                     ±3%      ±5%      ±10%
2,000                696      322      92
5,000                879      357      96
10,000               964      370      95
50,000               1,045    381      96
100,000              1,056    383      96
Over 100,000         1,067    384      96
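The text does not state which formula lies behind Table 3. The sketch below uses a common textbook formula for estimating a proportion with a finite population adjustment, which is an assumption on my part; with it, the values quoted in the text for a population of 5,000 (357 and 879) are recovered. The function name `sample_size` and its parameters are hypothetical.

```python
from statistics import NormalDist


def sample_size(population, margin, confidence=0.95, proportion=0.5):
    """Sample size needed to estimate a proportion within +/- margin.

    Assumption (not from the text): n0 = z^2 * p * (1 - p) / margin^2,
    adjusted for a finite population as n = n0 / (1 + n0 / population).
    proportion=0.5 is the most conservative choice.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    return round(n0 / (1 + n0 / population))


# Population of 5,000 at the two precision levels discussed in the text.
print(sample_size(5000, 0.05))  # 5% precision
print(sample_size(5000, 0.03))  # 3% precision
```

Not every cell of a published table need agree with this approximation, since rounding conventions and the exact formula used by the original source may differ.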

The sample size required for a study has implications both for the choice of statistical techniques and for the resulting power. Researchers often select a sample size based on what is typical in a particular field of research. Cozby (2001) recommends an alternative approach and suggests selecting a sample size based on a desired probability of correctly rejecting the null hypothesis. This probability is called the power of the statistical test. As was elaborated in the previous chapter, from a statistical point of view, the power of a test is


related to the probability that a Type II error occurs (the failure to reject the null hypothesis when it is false). The power of a test ranges from 0 to 1, with 0 the minimum and 1 the maximum power a test can have. In research, ideally we expect a test to have high power, close to 1. Sample size is directly related to researchers' ability to correctly reject a null hypothesis (power). Hence, small sample sizes often decrease the power of a test and increase the chance of a Type II error. It has been found that it is possible to obtain appropriate power by applying non-parametric techniques with small sample sizes. A researcher's purpose has always been to select the most powerful test in order to test the research hypothesis. It was argued that the choice between parametric and nonparametric tests depends upon their underlying assumptions. When parametric tests are compared with nonparametric tests at the same sample size, the power of the tests is not equal. Power-efficiency is a statistical index which is concerned with the size of a sample and test power. Power-efficiency specifies how much larger a sample is needed for test B to be as powerful as test A, supposing that test A is the most powerful test for a particular application (Siegel, 1956). Power-efficiency is calculated using the following formula:

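Power-efficiency of Test B = (N_A / N_B) × 100 percent

where N_A is the sample size used with Test A and N_B is the sample size Test B needs in order to reach the same power.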
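The arithmetic behind the power-efficiency formula can be sketched in a few lines of Python. This is only an illustration: the function names `power_efficiency` and `required_n_b` are hypothetical helpers introduced here, not part of any statistical package, and rounding the required sample size up to a whole subject is an assumption of this sketch.

```python
import math


def power_efficiency(n_a, n_b):
    """Power-efficiency of Test B in percent: (N_A / N_B) x 100.

    n_a: sample size used with the most powerful test (Test A).
    n_b: sample size Test B needs to reach the same power.
    """
    return 100 * n_a / n_b


def required_n_b(n_a, efficiency_percent):
    """Sample size Test B needs to match Test A, rounded up to a whole subject.

    Rounding up is an assumption of this sketch: a fractional subject
    cannot be recruited, so the next whole number is taken.
    """
    return math.ceil(100 * n_a / efficiency_percent)


print(power_efficiency(7, 10))  # 70.0
print(required_n_b(7, 70))      # 10
print(required_n_b(30, 95))     # 32
```

Read this way, a power-efficiency of 70 means that 10 subjects are needed with Test B for every 7 subjects used with Test A, while a power-efficiency of 95 would raise a sample of 30 to only 32.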

For instance, suppose that Test A is a parametric test which meets all its underlying assumptions. Also, suppose that Test A has the highest power for the required application and that it has a sample size of N. Now, assume that we need to use a nonparametric test (Test B) because the assumptions of the parametric test are not met. In this case, the question is how much larger the sample for Test B must be for its power to equal that of Test A. If the obtained power-efficiency is, for example, 70, it means that for every 7 subjects in sample A we need 10 subjects in sample B. Thus, selecting a larger sample can yield nearly the same power when some assumptions of a more powerful parametric test cannot be satisfied.

Furthermore, the power of a test and normality are interrelated. The power of a test is the probability that the test rejects the null hypothesis when the null hypothesis is not true. One test is more powerful than another when it has a higher probability of detecting that the null hypothesis is not true. In the specific case of tests of normality, one test is said to be more powerful than another when it has a higher probability of rejecting the hypothesis of normality when the distribution is not normal. Of course, to make a fair comparison, all tests must have the same probability of rejecting the null hypothesis when the distribution is actually normal (i.e. they must have the same α, or significance level). Studies that compare the power of tests are commonly carried out through Monte Carlo simulation (for tests of normality see Chapter Three).

However, there is no general consensus among statisticians concerning what constitutes a small sample size. Siegel and Castellan argue that when the sample size in a research project is very small, there may be no alternative to applying a non-parametric statistical test, but the value of ‘very small’ is not specified. As with many aspects of research in SLA, decisions about the choice of statistical technique are not clear-cut, but require us to get dirty with the data. In light of the lack of consensus regarding what constitutes a small sample, Pett (1997) has suggested that the choice between parametric and nonparametric tests just depends. It depends on sample size, level of measurement, the researcher’s knowledge of the variables’ distribution in the population, and the shape of the distribution of the variable of interest. It is not uncommon to see small sample sizes (e.g. n = 15) or case studies (one subject) in the SLAR literature. Language scientists frequently work with small groups of individuals, low-incidence conditions, convenience samples, and limited funding. Thus, the assumption of a large sample size is often violated by studies that use parametric statistical techniques.
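The link between sample size and power discussed in this section can be illustrated with a small Monte Carlo simulation. The sketch below is illustrative only and rests on several assumptions that do not come from the text: both groups are drawn from normal populations with equal spread, the true difference between them is 0.8 standard deviations, the Mann-Whitney U test is evaluated with its large-sample normal approximation (no tie or continuity correction), and the function names are hypothetical. It uses only the Python standard library.

```python
import random
from statistics import NormalDist


def mann_whitney_p(x, y):
    """Two-sided p-value for the Mann-Whitney U test (normal approximation).

    Assumes continuous data without ties, so no tie correction is applied.
    """
    n1, n2 = len(x), len(y)
    # Pool the observations, remembering which group each came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    # Sum of the ranks (1 = smallest) held by the first group.
    rank_sum_x = sum(
        rank for rank, (_, group) in enumerate(pooled, start=1) if group == 0
    )
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulated_power(n_per_group, effect_size, alpha=0.05, n_sims=2000, seed=1):
    """Proportion of simulated studies in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        if mann_whitney_p(group_a, group_b) < alpha:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    # Estimated power for increasing group sizes, same true difference.
    for n in (10, 20, 40):
        print(n, simulated_power(n, effect_size=0.8))
```

Under these assumptions the estimated power should rise as the group size grows, which is the pattern described above: with a true difference held constant, smaller samples leave a greater chance of a Type II error. Setting `effect_size=0.0` instead gives an estimate of the Type I error rate, which should stay close to the chosen significance level.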

Normality

According to Pett (1997), in choosing a test we must consider the shape of the distribution of the variable of interest. In order to use a parametric test, we must assume a normal distribution of the dependent variable. However, in real research contexts data do not come packaged with labels specifying the features of the population of origin. Occasionally, it is possible to base assumptions about population distributions on empirical evidence or prior experience. Frequently, however, sample sizes are too small to support any reasonable assumptions regarding the parameters of the population. In practice, a researcher can generally only assert that a sample appears to come from, for instance, a skewed, very peaked, or very flat population. Even when one has an accurate measurement, it may not be reasonable to assume a normal distribution, because this implies a certain degree of symmetry and spread. Non-parametric statistics are designed to be applied when we know nothing about the distribution of the variable of interest. Therefore, we can apply non-parametric statistics to data in which the variable under study does not belong to any specified distribution (e.g. the normal distribution).

Many variables in the world are normally distributed, such as weight, height and strength. In SLAR, however, this is not true of all variables in learning/acquisition studies. Nevertheless, most researchers applying parametric statistics in their studies appear simply to "assume" normality of the data. In this regard, Micceri (1989) states that the naïve assumption of normality seems to characterize research in many fields of study. Empirical studies, however, have documented non-normal distributions in the literature of a variety of fields. Micceri (1989) investigated the distributions of 440 large-sample achievement and psychometric tests and came to the conclusion that all of the samples were significantly non-normal (p