Journal of Applied Measurement
P.O. Box 1283
Maple Grove, MN 55311
Current Volume Article Abstracts
Vol. 10, No. 1 Spring 2009
****
Mapping Multiple Dimensions of Student Learning: The
ConstructMap Program
Cathleen A. Kennedy and Karen Draney
Abstract
In the past, many assessments, especially standardized
assessments, tended to be composed of items with specific right and wrong
answers, such as those found in multiple choice, true-false and short response
items. Performance-based questions that require students to construct answers
rather than select correct responses introduce the complexities of multiple
correct answers, dependence on teacher judgment for scoring, and requisite
ancillary skills such as language fluency, which are technically difficult to
handle, and may even introduce problems such as bias against certain groups of
students. Recent developments in assessment design and psychometrics have
improved the feasibility of assessing performance-based tasks more efficiently
and effectively, thereby providing a rich domain of information from which
interpretations can be made about what students know and what they can do when
they draw upon that knowledge. We developed the ConstructMap computer program
specifically to assist teachers in interpreting and representing this type of
performance data. The program accepts as input student scores on items
associated with one or multiple performance variables, computes proficiencies
using multidimensional item response methods, and produces graphical
representations of students’ estimated proficiency on each of the variables.
****
Response Dependence and the Measurement of Change
Ida Marais
Abstract
Because of confounding effects that can mask change when
persons respond to the same items on more than one occasion, the measurement of
change is a challenge. The specific effect on change studied in this paper is
that observed when responses of persons to items at time 2 are dependent
statistically on their responses at time 1. In addition, because this response
dependence may affect the change differently for different locations of items
relative to persons at time 1, the initial targeting of persons to items was
studied. For a specific change in means of persons, dichotomous data were
simulated according to the Rasch model with varying degrees of dependence and
varying initial targeting of persons to items. Data were analysed, also using
the Rasch model, in two ways: firstly, by treating items used at time 1 and time
2 as distinct ones (rack analysis) and, secondly, by treating persons at time 1
and time 2 as distinct ones (stack analysis). With the rack analysis the change
is revealed through the item parameters and with the stack analysis the change
is revealed through the person parameters. With no response dependence the two
analyses gave equivalent and correct measures of change. With increasing
dependence change was increasingly masked or increasingly amplified, depending
on the targeting of items to persons at time 1. Response dependence affected the
measurement of change in both analyses, but not always in the same way. The
paper serves as a warning against undetected dependence and also considers
evidence that can be used in the analysis of real data sets for detecting the
presence of dependence when measuring change.
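
The rack and stack layouts described above can be illustrated with a minimal Python sketch. It assumes dichotomous Rasch data simulated without any response dependence; the item difficulties, person locations, and size of the mean change are hypothetical values chosen for illustration and are not taken from the paper.

```python
import math
import random

def rasch_prob(theta, delta):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def simulate(thetas, deltas, rng):
    """Simulate a persons-by-items matrix of 0/1 responses."""
    return [[1 if rng.random() < rasch_prob(t, d) else 0 for d in deltas]
            for t in thetas]

def rack(time1, time2):
    """Rack: one row per person; time-1 and time-2 items become distinct columns."""
    return [row1 + row2 for row1, row2 in zip(time1, time2)]

def stack(time1, time2):
    """Stack: one column per item; time-1 and time-2 persons become distinct rows."""
    return time1 + time2

rng = random.Random(0)
deltas = [-1.0, 0.0, 1.0]             # hypothetical item difficulties (logits)
thetas_t1 = [-0.5, 0.0, 0.5, 1.0]     # hypothetical person locations at time 1
shift = 0.5                           # hypothetical change in person means
thetas_t2 = [t + shift for t in thetas_t1]

time1 = simulate(thetas_t1, deltas, rng)
time2 = simulate(thetas_t2, deltas, rng)
racked = rack(time1, time2)     # 4 persons x 6 "items"
stacked = stack(time1, time2)   # 8 "persons" x 3 items
```

In the racked matrix, change would show up as differences between the time-1 and time-2 item estimates; in the stacked matrix, it would show up as differences between the time-1 and time-2 person estimates.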
****
Using Paired Comparison Matrices to Estimate Parameters
of the Partial Credit Rasch Measurement Model for Rater-Mediated Assessments
Mary Garner and George Engelhard, Jr.
Abstract
The purpose of this paper is to describe a technique for
estimating the parameters of a Rasch model that accommodates ordered categories
and rater severity. The technique builds on the conditional pairwise algorithm
described by Choppin (1968, 1985) and represents an extension of a conditional
algorithm described by Garner and Engelhard (2000, 2002) in which parameters
appear as the eigenvector of a matrix derived from paired comparisons. The
algorithm is used successfully to recover parameters from a simulated data set.
No one has previously described such an extension of the pairwise algorithm to a
Rasch model that includes both ordered categories and rater effects. The paired
comparisons technique has importance for several reasons: it relies on the
separability of parameters that is true only for the Rasch measurement model; it
works in the presence of missing data; it makes transparent the connectivity
needed for parameter estimation; and it is very simple. The technique also
shares the mathematical framework of a very popular technique in the social
sciences called the Analytic Hierarchy Process (Saaty, 1996).
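
The eigenvector idea shared with the Analytic Hierarchy Process can be sketched as follows. This is a simplified illustration for dichotomous items only, and it is not the authors' algorithm: it omits ordered categories and rater severity. It assumes a ratio matrix whose entry (i, j) estimates exp(d_i - d_j); with real data that entry could be formed as b_ji / b_ij, where b_ij counts the persons who succeeded on item i and failed on item j. The function names and difficulty values are hypothetical.

```python
import math

def principal_eigenvector(matrix, iterations=200):
    """Power iteration for the dominant eigenvector of a positive square matrix."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        v = [x / total for x in w]   # normalise so the entries sum to 1
    return v

def difficulties_from_ratios(ratio):
    """Convert a paired-comparison ratio matrix into centred logit difficulties.

    ratio[i][j] is assumed to estimate exp(d_i - d_j).
    """
    v = principal_eigenvector(ratio)
    logs = [math.log(x) for x in v]
    mean = sum(logs) / len(logs)
    return [x - mean for x in logs]

# A perfectly consistent ratio matrix built from hypothetical difficulties.
true_d = [-1.0, 0.0, 1.0]
ratio = [[math.exp(di - dj) for dj in true_d] for di in true_d]
estimates = difficulties_from_ratios(ratio)
```

Because the example matrix is perfectly consistent, the recovered difficulties match the generating values; with observed counts the matrix would only be approximately consistent.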
****
Toward a Domain Theory in English as a Second Language
Diane Strong-Krause
Abstract
This paper demonstrates how domain theory development is
enhanced by using both theoretical data and empirical data. The study explored
the domain of speaking English as a second language (ESL) comparing hypothetical
data on speaking tasks provided by an experienced teacher and by a certified
ACTFL oral proficiency interview rater with observed data from scores on a
computer-delivered speaking exam. While the hypothetical data and observed data
showed similar patterns in task difficulty in general, some tasks were
identified as being much easier or harder than expected. These differences raise
questions not only about test task design but also about the theoretical
underpinnings of the domain. The results of the study suggest that this
approach, where theory and data are examined together, will improve test design
as well as benefit domain theory development.
****
Comparison of Single- and Double-Assessor Scoring
Designs for the Assessment of Accomplished Teaching
George Engelhard, Jr. and Carol M. Myford
Abstract
This article is based on a more extensive research
report (Engelhard, Myford and Cline, 2000) prepared for the National Board for
Professional Teaching Standards (NBPTS) concerning the Early
Childhood/Generalist and Middle Childhood/Generalist assessment systems. The
report is available from the Educational Testing Service (ETS). An earlier
version of the article was presented at the American Educational Research
Association Conference in New Orleans in 2000. We would like to acknowledge the
helpful advice of Mike Linacre regarding the use of the FACETS computer program
and the assistance of Fred Cline in analyzing these data. The material contained
in this article is based on work supported by the NBPTS. Any opinions, findings,
conclusions, and recommendations expressed herein are those of the authors and
do not necessarily reflect the views of the NBPTS, Emory University, ETS, or the
University of Illinois at Chicago.
****
A Rasch Model Prototype for Assessing Vocabulary Learning Resulting from Different
Instructional Methods: A Preschool Example
Cynthia B. Leung and William Steve Lang
Abstract
This study explored the effects of using Rasch modeling
to analyze data on vocabulary knowledge of preschoolers who participated in
repeated read-aloud events and hands-on science activities with their classroom
teachers. A Rasch prototype for literacy research was developed and applied to
the preschool data. Thirty-one target words were selected for analysis from
three children’s informational picture books on light and color. After different
instructional activities, each child received scores on individual target words
measured with a total of six assessments, including free response vocabulary
tests and expressive and receptive picture vocabulary tests. Rasch modeling was
used to assess the learning difficulty of target words in different
instructional settings. Suggestions are made for applying Rasch modeling to
classroom studies of instructional interventions.
****
An Empirical Study on the Relationship between Teacher’s
Judgments and Fit Statistics of the Partial Credit Model
Sun-Geun Baek and Hye-Sook Kim
Abstract
The main purpose of the study was to investigate
empirically the relationship between classroom teacher’s judgment and the item
and person fit statistics of the partial credit model. In this study, classroom
teacher’s judgments were made intuitively by checking each item’s consistency with
the general response pattern and each student’s need for additional treatment or
advice. The item and person fit statistics of the partial credit model were
estimated using the WINSTEPS program (Linacre, 2003). The subjects of this study
were 321 sixth grade students in 9 classrooms within 3 elementary schools in
Seoul, Korea. For this research, a performance assessment test for 6th grade
mathematics was developed. It consisted of 20 polytomous response items and its
total scores ranged between 0 and 50. In addition, the 9 classroom teachers made
their judgments for each item of the test and for each student in their own
classroom. They judged intuitively using 4 categories: (1) well fit, (2) fit,
(3) misfit, and (4) badly misfit for each item as well as each student. Their
judgments were scored from 1 to 4 for each item as well as each student. There
are two significant findings in this study. First, there is a statistically
significant relationship between the classroom teacher’s judgment and the item fit
statistic for each item (the median correlation coefficient between the
teacher’s judgment and the item outfit ZSTD is 0.61). Second, there is a
statistically significant relationship between the teacher’s judgment and the
person fit statistic for each student (the median correlation coefficient
between the teacher’s judgment and the person outfit ZSTD is 0.52). In
conclusion, the item and person fit statistics of the partial credit model
correspond with the teacher’s judgments for each test item and each student.
****
Understanding Rasch Measurement: Tools for Measuring
Academic Growth
G. Gage Kingsbury, Martha McCall, and Carl Hauser
Abstract
Growth measurement and growth modeling have gained
substantial interest in the last few years with the development of new
statistical procedures and policy decisions such as the incorporation of growth
into No Child Left Behind. The current study investigates the following four
aspects of growth measurement:
• Issues in the development of vertical scales to measure growth
• Design of instruments to measure academic growth
• Techniques for modeling individual student growth, and
• Uses of growth information in a classroom
Measuring growth has always been a daunting task, but the development
of measurement tools such as the Rasch model and computerized adaptive testing
position us well to obtain high-quality data with which to measure and model the
growth of an individual student across a course of study. This growth
information, in norm-referenced and standards-referenced form, should enhance
educators’ ability to enrich student learning.
Vol. 10, No. 2 Summer 2009
****
The Relationships Among Design Experiments, Invariant
Measurement Scales, and Domain Theories
C. Victor Bunderson and Van A. Newby
Abstract
In this paper we discuss principled design experiments,
a rigorous, experimentally-oriented form of design-based research. We show the
dependence of design experiments on invariant measurement scales. We discuss
four kinds of invariance culminating in interpretive invariance, and how this in
turn depends on increasingly adequate theories of a domain. These theories give
an account of the dimensions and ordered attainments on a set of dimensions that
span a domain appropriately. This account may be called a domain theory or
learning theory of progressive attainments (in a local domain). We show the
direct and the broader benefits of developing and using these descriptive
theories of a domain to guide prescriptive design approaches to research. In the
process of giving an account of this set of interdependencies, we will discuss
aspects of the design method we are using, called Validity-Centered Design. This
design framework guides the development of instruments based on domain theories,
the development of learning opportunities, also based on domain theories, and
the construction of a sound validity argument for systems that integrate
learning with assessment.
****
Considerations About Expected a Posteriori Estimation in
Adaptive Testing: Adaptive a Priori, Adaptive Correction for Bias, and Adaptive
Integration Interval
Gilles Raîche and Jean-Guy Blais
Abstract
In a computerized adaptive test, we would like to obtain
an acceptable precision of the proficiency level estimate using an optimal
number of items. Unfortunately, decreasing the number of items is accompanied by
a certain degree of bias when the true proficiency level differs significantly
from the a priori estimate. The authors suggest that it is possible to reduce
the bias, and even the standard error of the estimate, by applying to each
provisional estimation one or a combination of the following strategies:
adaptive correction for bias proposed by Bock and Mislevy (1982), adaptive a
priori estimate, and adaptive integration interval.
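
Expected a posteriori (EAP) estimation can be illustrated with a minimal Python sketch. It assumes a dichotomous Rasch model, a normal prior, and simple equally spaced quadrature; the function name, defaults, and item difficulties are hypothetical. The prior mean and the integration interval are exposed as parameters only to show where an adaptive a priori estimate or an adaptive integration interval could enter. The sketch does not implement the authors' strategies or the Bock and Mislevy bias correction.

```python
import math

def rasch_prob(theta, delta):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def eap(responses, deltas, prior_mean=0.0, prior_sd=1.0,
        lower=-4.0, upper=4.0, points=81):
    """Return the EAP estimate and posterior SD on the interval [lower, upper]."""
    step = (upper - lower) / (points - 1)
    nodes = [lower + k * step for k in range(points)]
    weights = []
    for theta in nodes:
        # Unnormalised normal prior density at this node.
        prior = math.exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
        # Likelihood of the observed 0/1 responses at this node.
        like = 1.0
        for x, d in zip(responses, deltas):
            p = rasch_prob(theta, d)
            like *= p if x == 1 else 1.0 - p
        weights.append(prior * like)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(nodes, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(nodes, weights)) / total
    return mean, math.sqrt(var)

deltas = [-1.0, 0.0, 1.0]                 # hypothetical item difficulties
high, high_sd = eap([1, 1, 1], deltas)    # all items correct
low, low_sd = eap([0, 0, 0], deltas)      # all items incorrect
shifted, _ = eap([1, 0, 1], deltas, prior_mean=1.0)
default, _ = eap([1, 0, 1], deltas)
```

With few items the estimate is pulled toward the prior mean, which is the source of the bias the abstract describes when the true proficiency is far from the a priori estimate.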
****
Local Independence and Residual Covariance: A Study of
Olympic Figure Skating Ratings
John M. Linacre
Abstract
Rasch fit analysis has focused on tests of global fit
and tests of the fit of individual parameter estimates. Critics have noted that
slight, but pervasive, patterns of misfit to a Rasch model within the data may
escape detection using these approaches. These patterns contradict the Rasch
axiom of local independence, and so degrade measurement and may bias measures.
Misfit to a Rasch model is captured in the observation residuals. Traces of
pervasive, but faint, secondary dimensions within the observations may be
identified using factor analytic techniques. To illustrate these techniques, the
ratings awarded during the Pairs Figure Skating competition at the 2002 Winter
Olympic Games are examined. The intention is to detect analytically the patterns
of rater bias admitted publicly after the event. It is seen that the
one-parameter-at-a-time fit statistics and differential item functioning
approaches fail to detect the crucial misfit patterns. Factor analytic methods
do. In fact, the competition was held in two stages. Factor analytic techniques
already detect the rater bias after the first stage. This suggests that remedial
rater retraining or other rater-related actions could be taken before the final
ratings are collected.
****
Constructing One Scale to Describe Two Statewide Exams
Insu Paek, Deborah G. Peres, and Mark Wilson
Abstract
This study applies two approaches in creating a single
scale from two separate statewide exams (Golden State Math Exam and California
Standard Math Test) and compares some aspects of the two statewide tests. The
first analysis involves a sequence of unidimensional Rasch scalings, using
anchored items to scale the two tests together. The second analysis employs a
2-dimensional Rasch scaling using previous unidimensional analysis results to
link the scales. The linking facilitates the investigation of the measurement
properties of the two exams and is a basis for combining items from both exams
to develop a more efficient testing program. The results of the comparisons of
the two statewide exams based on the linking are shown and discussed.
****
Multidimensional Models in a Developmental Context
Yiyu Xie and Theo L. Dawson
Abstract
The concept of epistemological development is useful in
psychological assessment only insofar as instruments can be designed to measure
it consistently, reliably, and without bias. In the psychosocial domain, most
traditional stage assessment systems rely on a process of matching concepts in a
scoring manual generated from a limited number of construction cases, and thus
suffer to some extent from bias introduced by an over-dependence on particular
content. On the other hand, Commons’ Hierarchical Complexity Scoring System
(HCSS) is an assessment that employs criteria for assessing the hierarchical
complexity of texts that are independent of specific content. This paper
examines whether the HCSS and one of the conventional systems, Kohlberg’s
Standard Issue Scoring System (SISS) measure the same dimension of performance.
A multidimensional partial credit analysis was performed on data collected
between 1955 and 1999. The correlation between performance estimates on the SISS
and HCSS is 0.92. The high correlation provides strong evidence that the order
of hierarchical complexity identified by the HCSS is the same latent dimension
of ability assessed with the SISS. The HCSS produced more distinct patterns of
ordered stages and wider gaps between adjacent stages. This evidence implies
that individual performances display a higher degree of consistency in their
hierarchical complexity under the HCSS. A developmental scoring system that
employs scoring criteria that are independent of particular content might be
more powerful than the traditional scoring systems as it provides ease of
scoring and also possibilities of cross-cultural, cross-gender, cross-context
comparison of conceptual knowledge within developmental levels.
****
An Application of the Multidimensional Random
Coefficients Multinomial Logit Model to Evaluating Cognitive Models of Reasoning
in Genetics
Edward W. Wolfe, Daniel T. Hickey, and Ann C.H. Kindfield
Abstract
This article summarizes multidimensional Rasch analyses
in which several alternative models of genetics reasoning are evaluated based on
item response data from secondary students who participated in a genetics
reasoning curriculum. The various depictions of genetics reasoning are compared
by fitting several models to the item response data and comparing data-to-model
fit at the model level between hierarchically nested models. We conclude that
two two-dimensional models provide a substantively better depiction of student
performance than does a unidimensional model or more complex three- and
four-dimensional models.
****
Understanding Rasch Measurement: The ISR: Intelligent
Student Reports
Ronald Mead
Abstract
Rasch-based Scale Scores are a simple linear
transformation of the basic logit metric. Scale Scores are the quantification of
the measurement continuum. This quantification makes it possible to do
arithmetic, compute differences, and apply standard statistical techniques.
However, qualitative meaning is not in the numbers and must come from experience
with the scale and from the descriptive information that can (and should) be
attached. This includes item content and exemplars, normative information for
relevant groups, historical data for the individual, and evaluative assessment
like performance levels standards. The Scale Score metric is the structure that
manages the organization of intelligent reports and recognizes anomalies. Scale
Scores have no meaning, per se, but can provide a strong framework for
organizing useful reports and presenting meaningful information. They facilitate
diagnosis by “Analysis of Fit” and by “Analysis of Misfit.” The Analysis of Fit
relies on the general definition of the construct to describe what a student at
a particular point on the scale can and cannot do. It is meaningful to the
extent that the student conforms to the expectations of the measurement model.
The Analysis of Misfit uses the model to identify surprises, i.e., departures
from the model expectations. It highlights atypical areas of strong and weak
performance. The intent is to bring these exceptions to the attention of the
experts for informed, substantive interpretation and diagnosis. Intelligent
reports, to be useful, and to justify the time and expense of testing, need to
provide more information in a useable format than the candidate, student,
parent, or educator had available otherwise. This requires more than reporting a
single number or a single decision. It should include sufficient scaffolding to
allow the consumer to extract quickly and efficiently all the useful information
that can be taken from the test. Rasch Scale Scores are an important, perhaps
essential, tool in this process.
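
The two ideas above, a linear Scale Score transformation and the flagging of surprising responses, can be sketched in Python. The slope, intercept, and flagging threshold are hypothetical choices made for illustration, and the residual shown is the usual standardized residual for dichotomous Rasch data; none of these values comes from the article.

```python
import math

def to_scale_score(logit, slope=10.0, intercept=200.0):
    """Linear transformation from the logit metric to a reporting scale."""
    return slope * logit + intercept

def to_logit(scale_score, slope=10.0, intercept=200.0):
    """Inverse of to_scale_score."""
    return (scale_score - intercept) / slope

def rasch_prob(theta, delta):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def surprises(theta, deltas, responses, threshold=2.0):
    """Return indices of items whose standardized residual exceeds the threshold."""
    flagged = []
    for index, (delta, x) in enumerate(zip(deltas, responses)):
        p = rasch_prob(theta, delta)
        z = (x - p) / math.sqrt(p * (1.0 - p))
        if abs(z) > threshold:
            flagged.append(index)
    return flagged

score = to_scale_score(1.5)
# An able student (theta = 2.0) misses a very easy item (delta = -2.0).
flagged = surprises(2.0, [-2.0, 0.0, 2.0], [0, 1, 1])
```

Because the transformation is linear, a difference of one logit corresponds to the same number of Scale Score points everywhere on the scale, which is what makes differences between scores comparable.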
Vol. 10, No. 3 Fall 2009
****
Using Classical and Modern Measurement Theories to
Explore Rater, Domain, and Gender Influences on Student Writing Ability
Ismail S. Gyagenda and George Engelhard, Jr.
Abstract
This study i) examined the rater, domain, and gender
influences on the assessed quality of students’ writing ability and ii)
described and compared different approaches for examining these influences based
on classical and modern measurement theories. Twenty raters were randomly
selected from a group of 87 trained raters contracted to rate essays of the
annual Georgia High School Writing Test. Each rater scored the entire set of 375
essays on a 1-4 rating scale (366 essays were used in the analyses because nine
cases had missing values and were dropped). Two approaches, the classical
approach and the item response theory-based Rasch model, were used to conduct
psychometric measures of reliability and inter-rater reliability, and
statistical analyses with rater and gender as the predictor variables and the
total and domain scores as the dependent variables. To achieve the second
purpose, the Classical Test Model and the Rasch model were compared and
contrasted and their strengths and limitations discussed as they related to
student writing assessment. Analyses from both approaches indicated
statistically significant rater and gender effects on student writing. Using
domain scores as the dependent variables, there was a statistically significant
rater by gender interaction effect at the multivariate level, but not at the
univariate level. The Rasch analysis indicated a statistically significant rater
by gender effect. The comparison between the two approaches highlighted their
strengths and limitations, their different measurement and statistical models,
and their different procedures.
****
The Efficacy of Link Items in the Construction of a
Numeracy Achievement Scale—from Kindergarten to Year 6
Juho Looveer and Joanne Mulligan
Abstract
A large-scale numeracy research project was commissioned
by the Australian Government, involving 4732 Australian students from 91 NSW
primary schools. Rasch analysis was applied in the construction of a Numeracy
Achievement Scale (NAS) in order to measure numeracy growth. Following
trialling, a pool of 244 items was developed to assess number, space and
measurement concepts. Link items were included in test forms within year levels
and across adjacent year levels to enable linking of the forms and the
construction of a scale spanning Kindergarten to Year 6 (5 to 13 years of age).
However, results from the scaling were not consistent with expectations of
increases in student abilities or item difficulties across year levels.
Differential item functioning determined the problematic role of link items
across year levels. After a different set of items was used for linking test
forms, the results were consistent with expectations. A key finding was that
items used to link forms must not exhibit differential item functioning across
those levels.
****
The Study Skills Self-Efficacy Scale for Use with
Chinese Students
Mantak Yuen, Everett V. Smith, Jr., Lidia Dobria, and
Qiong Fu
Abstract
Silver, Smith and Greene (2001) examined the
dimensionality of responses to the Study Skills Self-Efficacy Scale (SSSES)
using exploratory principal factor analysis (PFA) and Rasch measurement
techniques based on a sample of social science students from a community college
in the United States. They found that responses defined three related
dimensions. In the present study, Messick’s (1995) conceptualization of validity
was used to organize the exploration of the psychometric properties of data from
a Chinese version of the SSSES. Evidence related to the content aspect of
validity was obtained via item fit evaluation; the substantive aspect of
validity was addressed by examining the functioning of the rating scales; the
structural aspect of validity was explored with exploratory PFA and Rasch item
fit statistics; and support for the generalizability aspect of validity was
investigated via differential item functioning and internal consistency
reliability estimates for both items and persons. The exploratory PFA and Rasch
analysis of responses to the Chinese version of the SSSES were conducted with a
sample of 494 Hong Kong high school students. Four factors emerged including
Study Routines, Resource Use, Text-Based Critical Thinking, and
Self-Modification. The fit of the data to the Rasch rating scale model for each
dimension generally supported the unidimensionality of the four constructs. The
ordered average measures and thresholds from the four Rasch analyses supported
the continued use of the six-point response format. Item and person reliability
were found to be adequate. Differential item functioning across gender and
language of instruction was minimal.
****
Rasch Family Models in e-Learning: Analyzing
Architectural Sketching with a Digital Pen
Kathleen Scalise, Nancy Yen-wen Cheng, and Nargas Oskui
Abstract
Since architecture students studying design drawing are
usually assessed qualitatively on the basis of their final products, the
challenges and stages of their learning have remained masked. To clarify the
challenges in design drawing, we have been using the BEAR Assessment System and
Rasch family models to measure levels of understanding for individuals and
groups, in order to correct pedagogical assumptions and tune teaching materials.
This chapter discusses the analysis of 81 drawings created by architectural
students to solve a space layout problem, collected and analyzed with digital
pen-and-paper technology. The approach allows us to map developmental
performance criteria and perceive achievement overlaps in learning domains
assumed separate, and then re-conceptualize a three-part framework to represent
learning in architectural drawing. Results and measurement evidence from the
assessment and Rasch modeling are discussed.
****
Measuring Measuring: Toward a Theory of Proficiency with
the Constructing Measures Framework
Brent Duckor, Karen Draney, and Mark Wilson
Abstract
This paper is relevant to measurement educators who are
interested in the variability of understanding and use of the four building
blocks in the Constructing Measures framework (Wilson, 2005). It proposes a
uni-dimensional structure for understanding Wilson’s framework, and explores the
evidence for and against this conceptualization. Constructed and fixed choice
response items are utilized to collect responses from 72 participants who range
in experience and expertise with constructing measures. The data, scored by
two raters, were analyzed with the Rasch partial credit model using ConQuest
(1998). Guided by the 1999 Testing Standards, analyses of validity and
reliability evidence provide support for the construct theory and limited uses
of the instrument pending item design modifications.
****
Plausible Values: How to Deal with Their Limitations
Christian Monseur and Raymond Adams
Abstract
Rasch modeling and plausible values methodology were
used to scale and report the results of the Organization for Economic
Cooperation and Development’s Programme for International Student Assessment
(PISA). This article will describe the scaling approach adopted in PISA. In
particular it will focus on the use of plausible values, a multiple imputation
approach that is now commonly used in large-scale assessment. As with all
imputation models the plausible values must be generated using models that are
consistent with those used in subsequent data analysis. In the case of PISA the
plausible value generation assumes a flat linear regression with all students’
background variables collected through the international student questionnaire
included as regressors. Further, like most linear models, homoscedasticity and
normality of the conditional variance are assumed. This article will explore
some of the implications of this approach. First, we will discuss the conditions
under which the secondary analyses on variables not included in the model for
generating the plausible values might be biased. Secondly, as plausible values
were not drawn from a multi-level model, the article will explore the adequacy
of the PISA procedures for estimating variance components when the data have a
hierarchical structure.
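
Because plausible values are a multiple imputation approach, a statistic is typically computed once per plausible value and the results are then combined with the standard multiple imputation rules (Rubin's rules). The Python sketch below shows that combination step only. The function name and the numbers are hypothetical, and the sketch does not reproduce the PISA variance estimation procedures, which also involve replication methods for the sampling variance.

```python
import statistics

def combine_plausible_values(estimates, sampling_variances):
    """Combine per-plausible-value results using Rubin's rules.

    estimates:          the statistic computed on each plausible value
    sampling_variances: the sampling variance of that statistic for each one
    Returns the combined point estimate and its total variance.
    """
    m = len(estimates)
    point = statistics.fmean(estimates)
    within = statistics.fmean(sampling_variances)   # average sampling variance
    between = statistics.variance(estimates)        # imputation variance, divisor m - 1
    total = within + (1.0 + 1.0 / m) * between
    return point, total

# Hypothetical mean scores from five plausible values, each with variance 4.0.
point, total = combine_plausible_values(
    [500.0, 502.0, 498.0, 501.0, 499.0],
    [4.0, 4.0, 4.0, 4.0, 4.0],
)
```

In this example the between-imputation variance is 2.5, so the total variance is 4.0 + 1.2 × 2.5 = 7.0, larger than the sampling variance of any single plausible value.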
****
Understanding Rasch Measurement: Item and Rater Analysis
of Constructed Response Items via the Multi-Faceted Rasch Model
Edward W. Wolfe
Abstract
This article describes how the multi-faceted Rasch model
(MFRM) can be applied to item and rater analysis and the types of information
that are made available by a multifaceted analysis of constructed-response items.
Particularly, the text describes evidence that is made available by such
analyses that is relevant to improving item and rubric development as well as
rater training and monitoring. The article provides an introduction to MFRM
extensions of the family of Rasch models, a description of item analysis
procedures, and a description of rater analysis procedures, and it concludes
with an example analysis conducted using a commercially available program that
implements the MFRM, Facets.