Research Papers

MacCann, R. G. (2009). Standard setting with dichotomous and constructed response items: some Rasch model approaches. Journal of Applied Measurement, 10(4), 438-454.

Abstract
Using real data comprising responses to both dichotomously scored and constructed response items, this paper shows how Rasch modeling may be used to facilitate standard setting. The modeling uses Andrich’s Extended Logistic Model, which is incorporated into the RUMM software package. After a review of the fundamental equations of the model, an application to Bookmark standard setting is given, showing how to calculate the bookmark difficulty location (BDL) for both dichotomous items and tests containing a mixture of item types. An example showing how the bookmark is set is also discussed. The Rasch model is then applied in various ways to the Angoff standard-setting methods. In the first Angoff approach, the judges’ item ratings are compared to Rasch model expected scores, allowing the judges to find items where their ratings differ significantly from the Rasch model values. In the second Angoff approach, the distribution of item ratings is converted to a distribution of possible cutscores, from which a final cutscore may be selected. In the third Angoff approach, the Rasch model provides a comprehensive information set to the judges. For every total score on the test, the model provides a column of item ratings (expected scores) for the ability associated with that total score. The judges consider each column of item ratings as a whole and select the column that best fits the expected pattern of responses of a marginal candidate. The total score corresponding to the selected column is then the performance band cutscore.
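To make the third Angoff approach concrete, here is a minimal sketch under the simple dichotomous Rasch model (the paper itself works with Andrich’s Extended Logistic Model in RUMM, and the item difficulties below are invented for illustration). For each non-extreme total score, the sketch finds the ability whose expected test score equals that total and prints the column of expected item scores a judge would inspect.

    import numpy as np

    def rasch_prob(theta, b):
        # Probability of a correct response under the dichotomous Rasch model.
        return 1.0 / (1.0 + np.exp(-(theta - b)))

    def ability_for_total(total, b, lo=-6.0, hi=6.0, tol=1e-6):
        # Bisection: the expected total score is increasing in ability.
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if rasch_prob(mid, b).sum() < total:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    # Illustrative item difficulties in logits (not from the paper).
    b = np.array([-1.5, -0.8, -0.2, 0.3, 0.9, 1.6])

    for total in range(1, len(b)):
        theta = ability_for_total(total, b)
        print(total, np.round(rasch_prob(theta, b), 2))

Under the same dichotomous model, an item’s bookmark difficulty location at response probability RP lies ln(RP/(1 - RP)) logits above its difficulty; with RP = 0.67, for instance, roughly 0.71 logits above. The abstract does not state which RP criterion the paper adopts.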

MacCann, R. G. (2008). A modification to Angoff and Bookmarking cutscores to account for the imperfect reliability of test scores. Educational and Psychological Measurement, 68, 197-214.

Abstract
It is shown that the Angoff and Bookmarking cutscores are examples of true score equating which, in the real world, must work with observed scores. In the context of defining minimal competency, the percentage ‘failed’ by such methods is a function of the length of the measuring instrument. It is argued that this length is largely arbitrary, being heavily influenced by practical educational constraints. Hence there is an ambiguity or non-uniqueness about the percentage failed. An argument is advanced that the failure rate should reflect the percentage of true scores below the cutscore. A modification to the cutscore is derived, which achieves this outcome and simultaneously removes the non-uniqueness in the percentage failed.
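The argument rests on standard classical test theory relationships; a minimal sketch of those relationships (not the paper’s specific modification) is

    X = T + E, \qquad \sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad \rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}.

Because the error variance shrinks as the test is lengthened, the proportion of observed scores falling below a fixed cutscore changes with test length even when the true-score distribution does not, which is the non-uniqueness the derived modification is designed to remove.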

MacCann, R. G. (2004). Reliability as a function of the number of item options derived from the ‘knowledge or random guessing’ model. Psychometrika, 69, 147-157.

Abstract
For (0, 1) scored multiple-choice tests, a formula giving test reliability as a function of the number of item options is derived, assuming the “knowledge or random guessing model,” the parallelism of the new and old tests (apart from the guessing probability), and the assumptions of classical test theory. It is shown that the formula is a more general case of an equation by Lord, and reduces to Lord’s equation if the items are effectively parallel. Further, the formula is shown to be closely related to another formula derived from Lord’s randomly parallel tests model.
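The “knowledge or random guessing” model named here treats each item response as either known outright or guessed blindly among the k options; a brief sketch of that model (the paper’s derived reliability formula is not reproduced here) is

    P(\text{correct}) = c + (1 - c)\,\frac{1}{k}, \qquad \hat{c} = \frac{p - 1/k}{1 - 1/k},

where c is the probability that the examinee knows the answer, k is the number of options, p is the observed proportion correct, and \hat{c} is the corrected (formula-score) estimate of the proportion known. The paper derives how test reliability varies as k changes under this model combined with the classical test theory assumptions.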

MacCann, R. G. (1990). Derivations of observed-score equating methods which cater to populations that differ in ability. Journal of Educational Statistics, 15, 146-170.

Abstract
For anchor test equating, three linear observed score methods are derived for populations selected to differ in ability. Version A uses the slope, intercept, and standard error of estimate assumptions of selection theory. Version B uses the slope and intercept assumptions and assumes that the equated tests (X, Y) are equally reliable and congeneric. Both employ the synthetic population. Version C requires similar assumptions to Version A but does not use the synthetic population. Each version requires that the correlations of the tests with the selection variable be known. Five further sets of assumptions are made for each model, yielding 15 methods, which are then related to existing equating methods in the literature.
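All of the derived methods are linear observed-score equatings, which share the standard form sketched below; the assumptions listed above serve to supply the moments of X and Y on the relevant population (a sketch of the general form, not the paper’s specific derivations):

    y^*(x) = \mu_Y + \frac{\sigma_Y}{\sigma_X}\,(x - \mu_X),

where the means and standard deviations are those of X and Y on the chosen population. For the versions employing the synthetic population, these moments must be estimated for the combined group, which is where the anchor test and the selection-theory assumptions enter.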

MacCann, R. G. (2006). The equivalence of online and traditional testing for different subpopulations and item types. British Journal of Educational Technology, 37, 79-91.

Abstract
A trial of pen-and-paper and online modes of a computing skills test was conducted for volunteer students of ages 15–16 in New South Wales, Australia. The tests comprised Matching, True/False and 4-option Multiple-Choice items. The aims were to determine whether gender, socioeconomic status (SES), or the type of item interacted with testing mode. No interactions were found for gender and item type, but the SES interaction was statistically significant. For low SES students, the online mode mean was 1 percent lower than the pen-and-paper mean, whereas high SES students had near-equivalent means. These findings should be treated with caution as the groups in the study were self-selected, rather than random samples from the student population.

MacCann, R. G. (1989). A derivation of Levine’s Formulae (for equating unequally reliable tests using random groups) without the assumption of parallelism. Educational and Psychological Measurement, 49, 53-58.

Abstract
Levine’s Equations for random groups and unequally reliable tests can be used to equate tests X and Y through performance on an anchor test, Z. Levine’s derivation assumed that all three tests were parallel in function, with X and Y of different lengths. It is shown that the parallelism requirement is unnecessary, as it is sufficient to assume only that X and Y are congeneric, an assumption that is itself implicit in the definition of linear test equating.
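The congeneric assumption invoked here requires only that the true scores of X and Y be perfectly linearly related; a minimal sketch (standard classical test theory, not specific to the paper) is

    T_X = \alpha + \beta\, T_Y \quad (\beta > 0).

Parallelism imposes stronger conditions (in particular, equal error variances), whereas the abstract’s point is that the weaker linear relation above, which is already implicit in the definition of linear test equating, suffices for Levine’s equations.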