
Friday, October 19, 2007

 

criticism of by-subject and by-item analysis

4.2.5 By-Subject and By-Item Analyses (from MiniJudge: Software for Small-scale Experimental Syntax)

MiniJudgeJS runs both by-subject and by-subject-and-item analyses, but it reports only the first in the main summary unless it finds that the more complex analysis is really necessary. This approach differs from standard psycholinguistic practice, where both by-subject and by-item analyses are always run. A commonly cited reason for always running a by-item analysis is that it is required to test for generality across items, just as a by-subject analysis tests for generality across subjects. However, this logic is based on a misinterpretation of [Clark 1973], which is the paper usually cited as justification.

First, it is wrong to think that by-item analyses check to see if any item behaves atypically (i.e. is an outlier). For parametric models like ANOVA, it is quite possible for a single outlier to cause an illusory significant result, even in a by-item analysis (categorical data analyses like GLMM don’t have this weakness). To test for outliers, there’s no substitute for checking the individual by-item results manually. MiniJudge helps with this by reporting the by-sentence rates of yes judgments in a table saved as part of the offline analysis file; items with unusually low or high acceptability relative to others of their type stand out clearly.
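
For concreteness, here is a minimal R sketch of how such a by-sentence table of yes-rates could be computed. This is not MiniJudge's actual code; the data frame and column names (judgments, sentence, type, response, with response coded 1 for yes and 0 for no) are assumptions for illustration only.

    # Mean rate of "yes" judgments per sentence, grouped by item type
    # (assumed columns: sentence, type, response with 1 = yes, 0 = no)
    yes.rates <- aggregate(response ~ type + sentence, data = judgments, FUN = mean)
    # Sort within type so unusually low or high items stand out
    yes.rates <- yes.rates[order(yes.rates$type, yes.rates$response), ]
    print(yes.rates)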

In the case of the VOmen experiment, this table did not seem to show any outliers.
The second problem with the standard justification for performing obligatory by-item analyses, as [Raaijmakers et al. 1999] emphasize, is that the advice given in [Clark 1973] actually applies only to experiments without matched items, such as an experiment comparing a random set of sentences with transitive verbs (“eat”, etc.) with a random set of sentences with unrelated intransitive verbs (“sleep”, etc.). Such sentences will differ in more than just the crucial factor (transitive vs. intransitive), so even if a difference in judgments is found, it may actually relate to uninteresting confounded properties (e.g. the lexical frequency of the verbs). However, if lexically matched items are used, as in the VOmen experiment, there is no such confound, since items within each set differ only in terms of the experimental factor(s). If items are sufficiently well matched, taking cross-item variation into account won’t make any difference in the analysis (except to make it much more complicated), but if they are not well matched, ignoring the cross-item variation will result in misleadingly low p values.

Nevertheless, if we only computed models that take cross-item variation into account, we might lose useful information. After all, a high p value does not necessarily mean that there is no pattern at all, just that we have failed to detect the pattern. Thus, it may be useful to know if a by-speaker analysis is significant even if the by-speaker-and-sentence analysis is not. Such an outcome could mean that the significant by-speaker result is an illusion due to an uninteresting lexical confound, but it could instead mean that if we do a better job matching the items in our next experiment, we will be able to demonstrate the validity of our theoretically interesting factor. Moreover, it is quite difficult to compute GLMM models with two random variables, making such models somewhat less reliable than those with only one random variable. Just in the last year, the lme4 package in R has been upgraded, so that the lmer function now gives different results for by-subjects-and-items analyses than it did when MiniJudge was first developed. Due to concerns like these, MiniJudge runs both types of analyses and only chooses the by-subjects-and-items analysis for the main report if a statistically significant confound between factors and items is detected. The full results of both analyses are saved in an off-line file, along with the results of the statistical comparison of them.
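
As a rough illustration of the two kinds of models being discussed, the sketch below fits a by-subjects-only and a by-subjects-and-items binomial GLMM with lme4. The data frame and variable names are hypothetical, not MiniJudge's; note also that current versions of lme4 use glmer() for binomial models, whereas the lmer() calls in use when MiniJudge was first developed took a family argument directly.

    library(lme4)
    # By-subjects-only GLMM: speakers are the only random variable
    m.subj <- glmer(response ~ factor1 * factor2 + (1 | subject),
                    data = judgments, family = binomial)
    # By-subjects-and-items GLMM: sentences added as a second random variable
    m.both <- glmer(response ~ factor1 * factor2 + (1 | subject) + (1 | sentence),
                    data = judgments, family = binomial)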

The R language makes it quite easy to perform this comparison, since the model in which only speakers are treated as random is a special case of the model in which both speakers and sentences are treated as random. This means the two GLMM models can be compared by a likelihood ratio test using ANOVA [Pinheiro and Bates 2000]. As with the output of the lmer function, the output of the lme4 package’s anova function makes it difficult to extract p values, so again the output is “sunk” to the offline analysis file to be read back in as a string.
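
A sketch of that comparison, continuing with the hypothetical objects from the previous snippet: anova() on the two nested fitted models performs the likelihood ratio test, and sink() diverts the printed table to the offline analysis file (the file name here is made up) so the p value can later be read back in as text.

    # Likelihood ratio test comparing the nested GLMMs
    lrt <- anova(m.subj, m.both)
    # Divert the printed comparison to the offline analysis file
    sink("minijudge_offline_analysis.txt", append = TRUE)
    print(lrt)
    sink()
    # The p value sits in the second row of the comparison table
    p.lrt <- lrt[2, "Pr(>Chisq)"]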
Only if the p value is below 0.05 is the more complex model taken as significantly better. If the p value is above 0.2, MiniJudgeJS assumes that items and factors are not confounded and reports only the by-subjects analysis in the main summary. If 0.05 < p < 0.2, MiniJudgeJS errs on the side of caution and gives a warning. In any case, both GLMM analyses are available for inspection in the offline analysis file. Each analysis also includes additional information, generated by lmer, that may help determine which analysis is really more reliable, including the variance of the random variables and the estimated scale (compared with 1); these details are explained in the MiniJudge help page. In the case of the VOmen experiment, the comparison of the two models showed that the by-subjects-only model was sufficient, unsurprisingly, given that the materials were almost perfectly matched and that the items table showed no outliers among the sentence judgments.
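
The decision rule just described might look roughly like this (again a sketch, not MiniJudge's source; p.lrt, m.subj, and m.both are the hypothetical objects from the snippets above, and VarCorr() reports the random-variable variances mentioned in the help page):

    if (p.lrt < 0.05) {
      # Items and factors appear confounded: report the by-subjects-and-items model
      main.model <- m.both
    } else {
      # By-subjects-only model is sufficient for the main summary
      main.model <- m.subj
      if (p.lrt < 0.2) {
        warning("0.05 <= p < 0.2: an item/factor confound cannot be ruled out")
      }
    }
    # Variances of the random variables, saved with the offline analysis
    print(VarCorr(main.model))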
The final problem with the standard justification for automatic by-item analyses is one that even [Raaijmakers et al. 1999] fail to point out: since repeated-measures regression models such as GLMM make it possible to take cross-speaker and cross-sentence variation into account at the same time, without throwing away any data, they are simply superior to standard models like ANOVA. To learn more about how advances in statistics have made some psycholinguistic traditions obsolete, see [Baayen 2004].
