IOMW2006

 

 

Presenter Abstracts

 

Papers

Alphabetical by first author

 

Reliability and Item Response Modelling: Myths, Observations and Applications
Ray Adams, University of Melbourne, Australia

Test reliability is a concept central to classical test theory; it is commonly stated that a test must attain a certain level of reliability before it is considered of sufficient quality for practical use. Further, reliability is often stated as a requirement for validity, and even an indicator of fit (and hence of unidimensionality). This presentation discusses the role of reliability in item response theory, covering both conditional and marginal item response models. It is argued that in many situations reliability is not a requirement for validity and that reliability is often unrelated to fit of the item response model. Finally, the concept of reliability as a measurement design effect is introduced. This concept parallels the concept of sampling design effects, in that it describes the impact that measurement error at the individual level (quantified through a reliability index) has on the accuracy with which population parameters are estimated.



Using Item Response Modeling Methods to Develop Theory Related to Human Performance
Diane D. Allen, UC Berkeley, USA

Theories arise to explain observed phenomena. Explanations about phenomena affecting human performance lack certainty because of the difficulty in measuring people's perceptions, an intrinsic part of their performance. Item response modeling (IRM) methods can assist in the development and testing of instruments for measuring people's perception and for using such instruments to test theory. The purpose of this paper is to demonstrate IRM methods used to initiate the development and testing of a theory related to human movement via the Movement Ability Measure (MAM), a self-report questionnaire asking for people’s perceptions of their ability to move. The MAM was generated to match closely the construct of human movement theorized. Over 300 people of varying movement abilities, aged 18 to 101, completed the 24-item questionnaire. The partial credit and multidimensional item response models fit the data best. Wright Maps showed the strong relationship between the theorized construct and the empirical data. Comparison of person perceptions of movement within a group of 34 patients before and after undergoing physical therapy showed changes that correspond to those predicted by the proposed theory. Similar IRM methods might enhance the development and testing of additional theories related to human performance.



Rasch model and geostatistics techniques for atmospheric pollution
Pedro Alvarez, Francisco J. Moral, Jose L. Canito, University of Extremadura, Spain

Researchers or decision-makers frequently need information about atmospheric pollution patterns in urbanized areas. The preparation of this type of information is a complex task, due to the influence of several individual pollutants, with different units, on the global air pollution (e.g. nitrogen dioxide concentrations, ppm, and noise, dB). In this work, a new methodology based on the formulation of the Rasch model is proposed to obtain a measure of the atmospheric pollution. Two main results were obtained after applying this method: 1) A classification of all locations according to the pollution level, which was the value of the Rasch measure; 2) The influence on the environmental deterioration of each individual pollutant (particularly, in this work, NO2, NO, CO2, CO and noise). Finally, pollution at locations where no measurements were available was estimated with the optimum interpolation technique, kriging. Kriged estimates were subsequently used to map atmospheric pollution. To illustrate the application of this two-step method (Rasch model plus interpolation), which is useful to generate hazard assessment maps based on the spatial distribution of atmospheric pollution, an example is shown.


Measuring sensorial perception
Pedro Alvarez, M. A. Blanco, University of Extremadura, Spain

A method is described for measuring the sensorial perception of virgin olive oil according to sensorial analysis. It is an application of latent trait theory. "Sensorial perception of virgin olive oil" can be considered a latent variable defined by a set of sensorial factors (items). The subject of this approach is how to develop a technique for measuring that latent variable based on data from olive oil tasting. The Rasch model is used as an instrument for measuring that variable. The technique is applied here to data from 113 different virgin olive oils assessed by 12 expert judges following the International Olive Oil Council (IOOC) tasting sheet for the organoleptic assessment of virgin olive oil. The measurements obtained for oils and items are discussed, as well as a misfit analysis.


On characterising the thickness of an educational measurement
David Andrich, Murdoch University, Australia

Many educational assessments constructed to measure a single variable are nevertheless designed in terms of different aspects. These different aspects take account of degrees of complexity of a construct, making the assessment more valid. This paper resolves each subscale into a variable common among all subscales and a variable unique to each subscale. Then, using traditional test theory and Rasch theory, it derives a formula which estimates a summary value characterising the relative magnitude of the mutually orthogonal variables. The greater this value, the greater the relative impact of the mutually orthogonal variables. The concept of thickness of a scale is invoked as a metaphor for this characterisation. It is analogous to Cronbach's little-used concept of bandwidth. It is argued that the concept of thickness of a measurement is useful in constructing valid measurements at different levels of scale. Simulation studies using the Rasch unidimensional model illustrate the construction and effectiveness of the formula in estimating this thickness. In addition, data from two performance tests, one a scholastic aptitude test and one a non-verbal intelligence test, are analysed and their relative thicknesses estimated. Connections to literature emphasising the multidimensionality of scales are also made.



The Mastery Level Judgment Consistency Rate of a Rasch Model Based Standard Setting Method for Classroom Achievement Tests
Sun-Geun Baek, Seoul National University, Korea; In-Hee Choi, Korea National Open University

The purpose of this study is to investigate the mastery level judgment consistency rate of a Rasch model based standard setting method (Rasch method) for classroom achievement tests at middle schools in Korea. In this study, the procedure of the ‘Rasch method’ is discussed in comparison with both the ‘raw-score method’ and the ‘Angoff method’. The subjects of the study were 422 8th-grade students in 12 classrooms within 4 middle schools. For this research, two achievement tests for science classes were developed with different difficulty levels. Each test consisted of 25 multiple-choice items, and total scores on each ranged from 0 to 100. The mastery level was set at 80% achievement of learning objectives, and 7 teachers were involved as judges. In the results, the mastery level judgment consistency rate between the two achievement tests was 71.3% for the Rasch method, 54.7% for the raw-score method, and 75.4% for the Angoff method. There was a statistically significant difference between the Rasch method and the raw-score method (chi-square=4.916, p<0.01), but no significant difference from the Angoff method. Moreover, the Rasch method has additional advantages; for example, the difficulty of items and the distribution of persons can be represented visually on an 'item-person map'. Therefore the Rasch method may supplement or replace both the raw-score and Angoff methods.



Determining confidence intervals for IRT statistics through parametric bootstrapping
Kirk A. Becker, Promissor, USA; George Karabatsos, University of Illinois at Chicago, USA

When Rasch or other IRT parameters are generated, the fit of the data to the model is determined through some function of the observed and expected values for each person and item. The interpretation of these fit statistics is complicated by the fact that the magnitude and distribution of the statistics are functions of both the model error and the distribution of the latent person and item parameters. This paper uses parametric bootstrapping to estimate the distribution of Rasch mean square infit and outfit statistics, as well as Yen's Q. These analyses demonstrate how parametric bootstrapping can be used to estimate the distribution of any parametric function. Results show that for tests that are too easy or too difficult for a target population, traditional heuristics for interpreting fit statistics can incorrectly flag over 40% of the items in an analysis as misfitting.
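
The general idea of a parametric bootstrap for item fit can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names, sample sizes, and parameter values are hypothetical, and the generating person and item parameters are treated as known rather than re-estimated in each replicate, which a full analysis would normally do.

```python
import numpy as np

def rasch_prob(theta, b):
    """P(X = 1) under the dichotomous Rasch model for every person-item pair."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

def item_fit(x, p):
    """Return (infit, outfit) mean-square statistics for each item."""
    var = p * (1.0 - p)                            # model variance of each response
    resid2 = (x - p) ** 2                          # squared raw residuals
    outfit = (resid2 / var).mean(axis=0)           # unweighted mean square
    infit = resid2.sum(axis=0) / var.sum(axis=0)   # information-weighted mean square
    return infit, outfit

def bootstrap_fit(theta, b, n_rep=500, seed=0):
    """Simulate n_rep data sets from the model and collect the fit statistics."""
    rng = np.random.default_rng(seed)
    p = rasch_prob(theta, b)
    infits = np.empty((n_rep, b.size))
    outfits = np.empty((n_rep, b.size))
    for r in range(n_rep):
        x = (rng.random(p.shape) < p).astype(float)  # simulated responses
        infits[r], outfits[r] = item_fit(x, p)
    return infits, outfits

# Hypothetical setup: 300 persons, 11 items spread from easy to hard.
rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, size=300)
b = np.linspace(-2.0, 2.0, 11)

infits, outfits = bootstrap_fit(theta, b)
# Item-specific 95% reference intervals instead of one fixed rule of thumb.
outfit_ci = np.percentile(outfits, [2.5, 97.5], axis=0)
```

An observed outfit value for an item would then be compared with that item's own interval in `outfit_ci`, which can be wider or narrower than a fixed cutoff depending on how well the item is targeted.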


NetPASS: Construct and content validity in e-learning products
Diana J. Bernbaum, UC Berkeley, USA

The first case study and associated paper in this proposed symposium on e-learning assessment will consider NetPASS, an assessment of computer networking skills from the Cisco Learning Institute (http://www.education.umd.edu/EDMS/mislevy/CLI/). It is a performance-based e-learning product for assessment and uses simulations and live interactions as tasks in computer network design, implementation and troubleshooting. The measurement model is based primarily on Bayesian networks, which represent beliefs about student proficiencies as a joint probability distribution over the proficiency variables. We will consider NetPASS as an example of the use of sound principles for construct and content validity in e-learning, where learning goals are clearly defined and measurement approaches are selected in an evidence-centered design approach. NetPASS supports a developmental perspective in e-product delivery, and positions the product to effectively serve student needs at a variety of performance levels.


Use of the Many-facet Rasch Model in Resolving Standard Setting Issues
Trevor G. Bond, Hong Kong Institute of Education, China; Noor Lide Abu Kassim, International Islamic University, Malaysia

In selecting the “right” standard setting method several issues are of primary concern: correspondence of the judgment task to the measured construct, capacity of the standard setting method to deal with diverse item types and procedures for dealing with judgment-related variability. This paper describes a standard setting study that utilizes a standard setting procedure based on the Objective Standard Setting Method (Stone, 1996) and the Many-facet Rasch model (Linacre, 1989), which has the potential to deal with these standard setting issues. Findings suggest that this approach not only has the capacity to deal with different item types and judgment-related variability but it also facilitates greater efficiency in the standard setting process. However, as with other model-based standard setting methods, the integrity of resulting cutscores is dependent on a clear understanding of the measured construct.



Exploring Bundle Dependencies for the Embedded Attitudinal Items in PISA 2006
Steffen Brandt, University of Kiel, Germany

PISA, like other large-scale assessments, uses tests with sets of item bundles in which the items of an item bundle refer to a common stimulus. An innovation in PISA 2006 is that these item bundles include not only items that measure achievement but also items that measure attitudes. These so-called embedded attitudinal items directly ask for the students' attitudes towards the scientific issue of the given stimulus, so, due to the common stimulus, dependencies among the items within an item bundle can be expected. The IRT models commonly used for the analysis of the achievement tests do not consider dependence between the items of an item bundle. In order to include the expected bundle dependencies of the attitudinal items, the author proposes a random-effects facet model (Keller, Swaminathan, & Sireci, 2003; Wang & Wilson, 2005a, 2005b) for the analysis of these items and compares the results of the proposed model to those obtained with a corresponding standard model. First results show that the proposed model yields more appropriate measures. At the same time, it offers new ways for the analysis of the bundle dependencies and, thereby, the chance for a more thorough interpretation of the data.



Modeling Tests With Sub-Dimensions
Steffen Brandt, University of Kiel, Germany

Many tests that measure aptitudes deal with aptitudes that are themselves assumed to be composed of other aptitudes representing specific facets of the overall measured aptitude. The usual approach to obtaining person measures for each of these aptitudes, as well as for the overall aptitude, is to analyze the data first via a one-dimensional model and thereafter via a multidimensional model with each aptitude as a separate dimension. While this approach yields the needed measures, it is very unsatisfying from a theoretical point of view, since the same data can be either one-dimensional or multidimensional, but not both. The author proposes a model that avoids this contradiction by yielding all needed measures via a single model and, at the same time, offers the chance for a gain in measurement precision.



Vertical Scaling in Value-Added Models for Student Learning
Derek Briggs, Ed Wiley, Jon Weeks, University of Colorado, USA

This study compares different approaches to creating a vertical scale from longitudinal test scores such that the scores have a consistent interpretation over time. We focus on two important choices as part of this sort of vertical scaling exercise: 1) the type of measurement model being specified at the item level, and 2) the type of approach used to link item responses over time. We then assess the impact of vertical scales developed as a function of these different choices on estimates of growth in student learning from value-added models. Two sources of data are considered in this study. The first data source is simulated; the second derives from the longitudinal item responses of students taking standardized tests in math and reading in Colorado school districts from 2001 through 2005. This presentation will focus on the preliminary results from our analysis of simulated data.



Assessing a learning progression in science: Solving psychometric issues
Nathaniel J. S. Brown, Cathleen A. Kennedy, Karen Draney, Mark Wilson, UC Berkeley, USA

The concept of a learning progression has been put forward as a way to help make science curricula and assessments more interpretable (Wilson & Bertenthal, 2005). A learning progression consists of (a) a set of standards that can be structured into a developmental trajectory, and (b) materials that elaborate the standards in order to make them more educationally useful. In this presentation, we will describe a learning progression developed in the area of density and buoyancy, which consists of two strands, one about the science content area, and one about the sophistication of students' explanations (Wilson, Kennedy, Brown & Draney, 2005). The items yielded responses that could be classified into a variety of educationally interesting diagnostic categories, and which map back onto the developmental trajectory in a many-to-one fashion: the ordered partition model (Wilson & Adams, 1993) provides a suitable approach for this feature. In addition, many of the items tapped both strands, so we also used multidimensional item bundling to model this. In the presentation, we report on the usefulness and applicability of this complex modeling approach, and show how it can be used to support diagnostic reporting.



ALEKS: Making e-learning assessment reports useful in classroom instruction
Kristen Burmester, UC Berkeley, USA

The final case study and paper in the proposed e-learning assessment symposium considers good examples of making CBT diagnostic approaches useful for teachers, students and other stakeholders. ALEKS (http://www.aleks.com/) is a math tutoring system, which uses computer-based, authentic student input item responses to determine the student's “knowledge state.” The ALEKS system covers most middle and high school math curriculum content and proportionally weights the topics within each content area, and is used for developmental mathematics instruction for some of the California State University system campuses. Using adaptive item administration, a single 90-minute session enables ALEKS to determine the student's level and produces individualized progress reports. A student-level report presents a pie chart with proportionally weighted topics for each content area and differentiates between the student's progress and the student's outstanding needs in the content area. Teachers and parents are able to view multiple students on a single report to see each student's overall progress, distance to a short-term goal, and distance to mastery within each area. The reports also provide interactive feedback, link to relevant practice, and suggest subsequent instruction that the student is most ready to learn based on their set of skills. ALEKS is an example of how e-learning assessment reports can be useful in classroom instruction.


Examining Type I and Type II Error Rates in Small Sample DIF Statistics
Donna J. Butterbaugh, Richard M. Smith, Data Recognition Corporation, USA; Vincent A. Maurelli, Thompson-Prometrics, USA

A large number of states have turned to the Rasch methodological framework for their No Child Left Behind (NCLB) assessments because a legal precedent has already been established (GI Forum v. Texas Education Agency, 2000). An issue that may arise, as an unintended consequence of the NCLB legislation, is small sample sizes for a differential item functioning (DIF) study. Many states may have a limited item pool with which to construct assessments and/or relatively homogeneous populations that limit the size of the demographic groups. Future test development needs may require the field testing of new items to populate the item pool, the number of which may require a significant number of forms, thereby limiting the number of students within a demographic group that can be included in a DIF study for those field test items. In this study the Type I and Type II error rates, and the power of a number of DIF methods, will be investigated using simulated and real test data. The test consists of both dichotomous and polytomous items. DIF will only be determined using the typical Rasch definition of DIF (i.e., dislocation in the item difficulty parameter).


Rasch and structural equation modelling analyses of teacher observations of school principal leadership
Robert Frederick Cavanagh, Joseph Romanoski, Curtin University of Technology, Australia

A linear scale of teacher observations of their principal’s leadership behaviour was constructed. This used data from 389 teachers in 48 primary and seven high schools in the Canning Education District of Western Australia. The teachers responded to 50 items on a four point rating scale from strongly agree to strongly disagree. The items were organised into 11 groups consistent with a hypothesised model derived from the literature on educational leadership. Data were analysed using the Rasch rating scale model to ensure the scale complied with the requirements for objective measurement. This analysis also produced item difficulty locations which were interpreted as evidence of common and uncommon teacher observations of principal behaviours. LISREL structural equation modelling was then used to test for postulated dependencies between the 11 variables in the hypothesised model. This process led to confirmation of a structural model comprising dependent and independent principal leadership variables. The results have application in describing the leadership process in terms of specific principal behaviours and the interaction between these behaviours.


Rasch Analysis of Teams Abilities and Home Court Advantages for 2005-2006 NBA Ranks
Tsair-Wei Chien, Chi-Mei Medical Center, Taiwan; Mark Linacre, University of Sydney, Australia; Wen-Chung Wang, National Chung Cheng University, Taiwan; Ou Lydia Liu, UC Berkeley, USA

Rasch analysis was used to estimate the ability of all teams and to evaluate the home court advantage of each team across the 473 regular season series of the 2005-2006 NBA season. The odds ratio of any two teams was also calculated. Four conclusions were drawn: 1. NBA teams should be divided into 4 strata. 2. The mean and standard deviation of the odds ratios produced by Rasch analysis were smaller than those produced by the traditional calculation method. 3. Home court advantages for the Spurs and Magic were much greater than for other NBA teams. 4. The abilities of the Eastern and Western divisions were identical. Rasch analysis yields more information than the traditional win-loss based calculation method. Furthermore, a dynamic line chart published on the website highlighted unexpected results, in which a strong favorite lost or a weak team won, as indicated by residual Z-scores beyond 2 produced by the Winsteps software.
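
As a rough illustration of how team abilities on a logit scale translate into odds and win probabilities, the sketch below uses a generic paired-comparison form of the Rasch model with an additive home court term. The exact parameterisation used in the study is not given in the abstract, so the functions and their arguments are assumptions made for illustration only.

```python
import math

def home_win_probability(home_ability, away_ability, home_advantage=0.0):
    """Probability that the home team wins, with all inputs in logits."""
    logit = home_ability - away_ability + home_advantage
    return 1.0 / (1.0 + math.exp(-logit))

def odds_ratio(ability_a, ability_b):
    """Odds of team A beating team B on a neutral court."""
    return math.exp(ability_a - ability_b)

# Hypothetical values: two equally able teams, home advantage of 0.4 logits.
p_home = home_win_probability(0.0, 0.0, home_advantage=0.4)
```

Under this form, a one-logit difference in ability multiplies the odds of winning by e, regardless of where the two teams sit on the scale.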


Cognitive change is stage-like: The cumulative evidence from a decade of Rasch modeling
Theo Dawson-Tunik, Hampshire College & Developmental Testing Service, LLC, USA

Over the course of the last decade, I have employed the Rasch model to examine the psychometric properties of four developmental assessment systems, including Armon's (1984) Good Life Scoring System, Kohlberg's (Colby et al., 1987) Standard Issue Scoring System, Perry's (Perry, 1970) system for determining epistemological positions, and Dawson-Tunik's (2004) Lectical™ Assessment System (LAS, http://www.lectica.info). Patterns of performance on all of these scoring systems support claims for invariant sequence and qualitative--as opposed to cumulative--growth. Some critics have questioned the evidence for qualitative growth, suggesting that it may be an artifact of developmental scoring systems. Models of data scored with the latest (5-phase) version of the LAS provide robust evidence that these qualitative shifts are not an artifact of developmental assessment systems. Instead, they appear to reflect marked shifts in cognitive behavior.


Sequential Tests for the Rasch Model
Clemens Draxler, University of Kiel, Germany

A seldom discussed problem in assessing the fit of item response models by statistical tests is the neglect of the type II error. Since model fit is formulated as the statistical null hypothesis, it is very important to be able to control the risk of wrongly accepting the model. The only way to achieve this is to formulate an effect size and, correspondingly, a specific alternative hypothesis. This paper proposes a simple way of determining specific alternative hypotheses in order to carry out sequential analyses according to the theory developed by Wald, so-called sequential tests, for assessing the fit of the dichotomous Rasch model. It discusses exact tests based on discrete distributions and asymptotic tests, as well as tests for individual items. Sequential tests have the advantage that a decision in favor of either the null hypothesis or the alternative hypothesis can be based on a minimum sample size given α, β, and the effect size.
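
The logic of a Wald sequential test can be sketched with the simplest textbook case, a sequential probability ratio test for a single Bernoulli parameter. This is not the Rasch-specific test proposed in the paper; the function name, the hypotheses p0 and p1, and the default error rates are assumptions chosen only to show how the two error rates and the specified alternative determine the stopping boundaries.

```python
import math

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """Wald's SPRT of H0: p = p0 against H1: p = p1 for 0/1 observations.

    Returns (decision, number of observations used).
    """
    upper = math.log((1 - beta) / alpha)   # cross this boundary: decide for H1
    lower = math.log(beta / (1 - alpha))   # cross this boundary: decide for H0
    llr = 0.0                              # cumulative log-likelihood ratio
    for n, x in enumerate(observations, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "continue sampling", len(observations)

# Hypothetical data: a run of successes quickly favours the alternative.
decision, n_used = sprt_bernoulli([1] * 20, p0=0.5, p1=0.7)
```

Sampling stops as soon as the accumulated evidence crosses either boundary, which is why the required sample size depends on the two error rates and on how far the alternative lies from the null.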


Measuring Measuring: An item response theory approach
Brent Duckor, UC Berkeley, USA

This study offers a conceptual framework for defining the nature of measurement knowledge, and evaluates the effectiveness of measures that are intended to assess that knowledge. A mixed-methods empirical investigation into the scripts, frames, and schemata employed by a range of individuals enrolled in an introductory measurement course aligned with the NRC knowledge framework is presented. The study reports the outcomes from a pre-post test. Person proficiencies and item difficulties are calibrated with a Rasch item response model. Evidence for the nature and structure of four constructs related to course content (construct maps, item design, outcome space, and measurement models) is examined. Results have implications for the use of the instrument in future classroom assessment or professional development program evaluation.


Survey-Based Service Quality Standards under IDEA: An Open Source Platform for Metrological Uniformity
William P. Fisher, Jr., Avatar Institute of Measurement, USA; Batya Elbaum, University of Miami, USA; Lisa Persinger, Alan Coulter, Louisiana State University, USA

States receiving funding under the Individuals with Disabilities Education Act (IDEA) have been required, since the November 2004 reauthorization, to report measures of parents’ perceptions of the quality of the services received by their children. Since January 2003, the National Center for Special Education Accountability Monitoring (NCSEAM) has provided a forum in which parents have been able to voice their concerns about the quality of the services provided to their children. Parents have contributed to the development of a survey states may find useful in meeting their IDEA reporting requirements. Five constructs were calibrated for assessing Part B programs (serving children ages 3-21), and two for Part C (serving children from birth through age 2). Reliabilities and model fit are uniformly high. The surveys, and the scaling and standard setting methodologies employed, incorporate open source design principles that may enable special education accountability practices to lead more directly to improved services.


On the Factor Structure of Standardized Educational Achievement Tests
Tim Gaffney, California Department of Education, USA

This research analyzed the factor structure at both the item and parcel level of California's norm- and criterion-referenced standardized educational achievement tests (SEAT) used in that state's high-stakes educational accountability assessments. It will be shown through full information factor analysis (e.g., NOHARM and Testfact) that at the item level, SEATs are highly unidimensional (i.e., they appear to tap a general test-taking ability) even when items representing broad content areas such as English, science, mathematics, and history are analyzed simultaneously as a single measure. These item-level factors also account for a relatively small proportion (1/4 to 1/3) of the variance. It will also be shown that, when items representing each content domain are combined into parcels, a richer bi-factor structure emerges that accounts for a larger portion (2/3) of the total variance, but in which the pattern of the general and specific factors is still roughly proportional to that of the item-level factor structure. The meaning of these patterns of correspondence between the item and parcel level structures will be discussed (e.g., well articulated structure at the parcel level does not necessarily imply correspondingly clear structure at the item level), as well as the diagnostic and remedial implications of these tests' factor structure for educators and test forms developers.


Using the Rasch Model to Evaluate Programs and Program Theories: An Example from the Evaluation of an Integrated Science and Literacy Curriculum
John Gargani, UC Berkeley, USA

In this presentation, we describe the results of a cluster-randomized trial of the Seeds/Roots Curriculum, a theory-based approach to integrating science and literacy instruction. The curriculum designers take the perspective that scientific reasoning is no different from other forms of reasoning. They define it as a broad and adaptable ability that can be applied to many specific domains. Accordingly, reasoning in other domains, such as literacy, can be used to teach and reinforce scientific reasoning. We demonstrate how formulating variants of the Rasch model as hierarchical logistic regressions allowed us to simultaneously estimate the effectiveness of the curriculum and investigate the validity of this theoretical perspective. We present our results and describe our methods.
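
One common way to write a Rasch model as a hierarchical logistic regression with a treatment effect is shown below. The notation is an assumption introduced for illustration and is not taken from the presentation: $X_{pi}$ is the scored response of person $p$ to item $i$, $\delta_i$ the item difficulty, $T_{c(p)}$ an indicator that the cluster of person $p$ received the curriculum, and $u$ and $\varepsilon$ cluster-level and person-level random effects.

```latex
\begin{align}
\operatorname{logit} \Pr(X_{pi} = 1) &= \theta_p - \delta_i, \\
\theta_p &= \gamma \, T_{c(p)} + u_{c(p)} + \varepsilon_p,
\qquad u_c \sim N(0, \tau^2), \quad \varepsilon_p \sim N(0, \sigma^2).
\end{align}
```

In a formulation of this kind, $\gamma$ plays the role of the curriculum effect on the latent scale, so the measurement model and the program evaluation are estimated in a single step.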


Why not do it differently: analysis of examination results in Slovenia
Sergij Gabrscek, CPZ-Center for Knowledge Promotion, Slovenia

Slovenia introduced its first external high-stakes secondary school leaving examination, Matura, in 1995, and lower secondary school leaving examinations (in mathematics and mother tongue) in 2002. From the very beginning a system was put in place which allowed the capturing of all details about the examination, for each subject, each question paper, each item and each candidate. Data were also gathered for each marker, so that in case of double marking inter-marker agreement could be analysed. But statistical data analysis was (and still is) limited to the classical test analysis after each examination term. No direct comparison was possible between two examination terms; all results were more or less normalised and similar from year to year. It is difficult to compare students' results or even to set standards in individual subjects. Rasch analysis offers a number of possibilities to reassess information from lower secondary school leaving examinations, giving students, teachers and experts preparing assessment materials more details about individual items as well as about the test or examination as a whole. The paper presents some data from the analysis of examination results during the first phase of introduction of these examinations in 2003, to illustrate some of the options that could be used to improve assessment instruments.


The psychometric properties of the Strengths and Difficulties Questionnaire–an analysis of Swedish data based on the Rasch model
Curt Hagquist, Karlstad University, Sweden

The Strengths and Difficulties Questionnaire (SDQ) is increasingly used as a survey instrument for tapping information about mental health problems among children and adolescents. Most of the evaluations of the SDQ are based on traditional analyses indicating sound psychometric properties of the instrument. Rasch analyses do not seem to have been used yet in any study dealing with the SDQ. The purpose of this paper is to analyse the properties of five subscales of the SDQ by using the Rasch model, with a focus on the operating characteristics of the items. The analysis is based on self-reported data collected in Sweden in 2003 and 2004 at four different periods of time among adolescents aged 12-18. In all, six sets of data including 8837 students are used for the analysis. The present analysis detected problems in all dimensions. In particular, the conduct problems scale seemed problematic, with inappropriate categorisation as well as lack of fit of the items. The study demonstrates the efficiency of the Rasch model in examining the measurement requirements of invariance and proper ordering of the data.



FOSS Self-assessment System: Using Rasch modeling to determine student tutoring needs
S. Veeragoudar Harrell, UC Berkeley, USA

The third case study and paper in the proposed e-learning assessment symposium will present aspects of a cognitive diagnoser with hinting capacity using Rasch family models (partial credit) as the underlying measurement model. The diagnostic approach was developed by WestEd researcher Michael Timms as a UC Berkeley dissertation project. It is an example of the use of Rasch models and construct mapping to collect evidence in an e-learning setting. The presentation will include descriptions of student performance in this approach to evidence-based assessment online.


Validity Equivalence Between the Chinese and English Versions of the IEA Child Cognitive Developmental Status Test
Xiaoting Huang, UC Berkeley, USA; "Kelly" Lei Wang, UC Berkeley, USA & National Education Examinations Authority at the Ministry of Education P.R. China

In this study, we examine the validity equivalence between the Chinese and English versions of the IEA Child Cognitive Developmental Status Test, through unidimensional and multidimensional analysis of differential item functioning (DIF). The subjects are 871 Chinese children and 557 American children. It is found that half of the items exhibit substantial DIF and that the item infit statistics are not sensitive enough to detect misfitting items. Three plausible causes of DIF are proposed. Users of the test should be very cautious in using it to compare group differences between Chinese and American children.


Item dependency in an objective structured clinical examination
Cherdsak Iramaneerat, Carol M. Myford, Rachel Yudkowsky, University of Illinois at Chicago, USA

An Objective Structured Clinical Examination (OSCE) is an assessment approach employed in medical education, in which residents rotate through multiple stations of standardized clinical tasks to evaluate their clinical competence. Because items used to evaluate residents' performance in each OSCE station are linked to the same task and are rated by the same rater, their ratings may be dependent on one another, violating the assumption of conditional item independence that underlies the multi-faceted Rasch measurement (MFRM) model. We employed a MFRM model to analyze a communication skills assessment of 79 residents, using 6 OSCE stations, each scored on 18 five-point rating scale items. When we treated the rating on each item as a separate scoring unit, MFRM analyses showed item dependency in 65% of item pairs within an OSCE station according to Fisher's Z statistic, a modification of Yen's Q3 index of item dependency. This resulted in overestimation of resident separation reliability and inaccurate parameter estimation. Combining item scores in each OSCE station into a station score and using station as a scoring unit reduced the amount of item dependency to 27%. This approach produced more realistic reliability estimates and helped improve the fit of the data to the model.

Click here for slides.
Click here for handout.

Investigation of local item dependence in scenario-based science assessment
Hong Jiao, Shudong Wang, Zarko Vukmirovic, Harcourt Assessment, Inc., USA

To meet the NCLB requirements for science, some states used scenarios to design their NCLB science assessments. A scenario may incorporate inquiry-based learning and hands-on strategies to present a scientific experiment or phenomenon that addresses assessment strands. The presentation of a scenario gives students an opportunity to observe the process of science and the results of an investigation or phenomenon. A group of items is created based on one scenario. This testing situation is similar to a reading comprehension test where multiple items are constructed for one reading passage. Scenarios and reading passages are testlets (Wainer & Kiely, 1987). When testlets are used in a test, it is likely that local item independence is violated (Yen, 1984, 1993; Wainer & Thissen, 1996; Lee & Frisbie, 1999; Lee, Dunbar, & Frisbie, 1999). This study investigates local item dependence in scenario-based or testlet-based science assessment. A multilevel testlet model will be illustrated for detecting testlet effects by using a one-parameter hierarchical generalized linear modeling approach. Yen's Q3 index (Yen, 1984) will be used to detect local item dependence in the test data as well. This study will provide information regarding local item dependence in scenario-based assessments, and help to build technically sound science assessments.

Click here for slides.
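
For readers unfamiliar with Yen's Q3 index mentioned in the abstract above, the following is a minimal sketch of how it is commonly computed for a pair of dichotomous items: the correlation, across persons, of each item's residual (observed response minus model-expected probability). The function names, the toy data, and the use of a dichotomous Rasch model with already-estimated abilities and difficulties are illustrative assumptions, not the authors' procedure.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def q3(responses, thetas, difficulties, i, j):
    """Q3 for items i and j: correlation of their residuals across persons.

    responses    -- list of per-person 0/1 response lists (hypothetical layout)
    thetas       -- estimated person abilities, one per person
    difficulties -- estimated item difficulties, one per item
    """
    res_i = [row[i] - rasch_p(t, difficulties[i]) for row, t in zip(responses, thetas)]
    res_j = [row[j] - rasch_p(t, difficulties[j]) for row, t in zip(responses, thetas)]
    return pearson(res_i, res_j)

# Toy data (made up for illustration): items 0 and 1 always agree,
# so their residuals are perfectly correlated.
toy_responses = [[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
toy_thetas = [0.5, -0.5, 1.0, -1.0]
toy_difficulties = [0.0, 0.0, 0.2]
print(q3(toy_responses, toy_thetas, toy_difficulties, 0, 1))
```

Large positive Q3 values for items that share a scenario would point toward a testlet effect; what counts as "large" is a judgment call that this sketch does not settle.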


Perceived importance of employees' traits and abilities for performance in hospitality jobs
Rense Lange, Integrated Knowledge Systems, Inc., USA ; James Houran, 20 20 Skills Test - HVS International, USA

Literature in the hospitality industry (e.g., hotels, restaurants, gambling, cruises) shows that a wide variety of traits, abilities, and other characteristics have been linked to employment success and performance. Unfortunately, there is little information about which traits, abilities, and characteristics are valued most highly by Human Resource (HR) professionals in the hospitality industry. Yet, such information is potentially highly relevant to those designing selection and evaluation instruments. Starting from a literature review and interviews with HR professionals, we identified 31 traits and abilities as crucial. Next, a total of 108 HR professionals rated these traits in terms of their importance using four-point rating scales. Additionally, three levels of employees were distinguished: Line, Middle, and Senior. Analyses using the Facets software revealed that age and gender interacted weakly with some items' locations. Most importantly, the importance attached to the traits varied with employees' organizational levels (χ²(93) = 538.2, p < .001), and the locations of 19 of the 31 traits varied significantly (p < .05) across Line, Middle, and Senior level employees. Thus, in contrast to the impression created in the literature, there is no single hierarchy of traits and abilities of importance across all jobs in the hospitality industry.


Creating Equivalent Groups for Equating with Bootstrap and Matched Samples
Anli Lin, Don Meagher, Eugene Bowles, Christina P. Stellato, Harcourt Assessment, Inc., USA

The purpose of this research was to determine an efficient method for creating equivalent groups for the accurate equating of two parallel forms of the Miller Analogies Test (MAT). The bootstrap and matched-sample methods are discussed as ways to obtain equivalent groups. Basically, equivalent groups can be selected by reasonably matching two samples according to key demographic variables (including ethnic identification, educational level, age, and sex). This paper also discusses ways to check the equivalence of the two groups selected by using the bootstrap and matched samples. When the equating that results from these methods is compared to results obtained in previous applications of the common-person method, it is seen that the different approaches produce very similar results.


Rasch Analysis Assists a Hospital with Salary Allocation for Physicians in Emergency Department
Hung-Jung Lin, Tsair-Wei Chien, Chi-Mei Medical Center, Taiwan ; Wen-Chung Wang, National Chung Cheng University, Taiwan

The purpose of this study was to evaluate person and item performance in order to assess the strengths and weaknesses of 17 visiting doctors, using a 44-item multiple-choice performance inventory with ratings from four groups: physicians (mutual self-assessment), residents, type-I nurses, and type-II nurses. We collected evaluation data on a 5-point Likert-type scale from the 4 groups and analyzed them with the Winsteps computer software after transforming the averaged scores onto a 0-9 rating scale. The unidimensionality of the physician evaluation inventory was examined using the Rasch partial credit model. The items that fit the Rasch model were separated into 3 single-construct Factors composed of 24, 7, and 6 items, respectively. The remaining 7 items were deemed to misfit the Rasch model. We summed each physician's measures on the 3 Factors into a total logit score and then transformed each summed logit score into individual odds with the exponential function. The salary allocation for each physician could then be easily determined from a fixed budget and the individual odds. Many kinds of information derived from Rasch analysis are valuable in comparing physician abilities between items and assessing change over time within items.
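
The logit-to-odds step described above can be sketched as follows. The abstract does not state the allocation rule, so this sketch assumes that each physician receives a share of the fixed budget proportional to his or her odds; the function name, the physician labels, and the numbers are hypothetical.

```python
import math

def allocate_budget(total_logits, budget):
    """Split a fixed budget in proportion to exp(total logit score).

    total_logits -- dict mapping physician label to summed logit score
    budget       -- the fixed amount to distribute
    Assumption: shares are proportional to odds; the abstract does not
    specify the exact rule.
    """
    odds = {name: math.exp(logit) for name, logit in total_logits.items()}
    total_odds = sum(odds.values())
    return {name: budget * o / total_odds for name, o in odds.items()}

# Hypothetical example: physician B's odds are three times physician A's.
shares = allocate_budget({"A": 0.0, "B": math.log(3.0)}, budget=1000.0)
print(shares)
```

Because odds grow exponentially with the logit score, a proportional rule of this kind rewards differences at the top of the scale much more strongly than a rule that is linear in logits would.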


Gender Similarities or Differences: Analysis based on PISA mathematics 2003
Ou Lydia Liu, UC Berkeley, USA

This study takes advantage of Rasch analysis using multidimensional models and latent regression models, based on the Program for International Student Assessment (PISA) 2003 mathematics data. The multidimensional models are more efficient than the unidimensional models in that they factor in the correlations between dimensions and increase the reliability of estimation. Furthermore, they offer the possibility of direct estimation of the variance-covariance matrix and correlation matrix. Compared with regular regression models, the latent regression models control for measurement error and provide more reliable standard errors, and thus more accurate hypothesis testing. Results revealed that overall there is no striking evidence of gender differences, which supports the conclusion that gender differences have been diminishing over the last four decades and that there are more gender similarities than differences (Hyde, 2005). However, multiple lines of evidence suggest that, on average, boys performed significantly better than girls on items measuring space and shape ability. This study illustrates how advanced measurement models can be utilized to provide insights into achieving gender equity in math learning.


Rater stability and applicant pool quality across successive applicant pools: A many-faceted Rasch rating scale analysis
Peter D. MacMillan, University of Northern British Columbia, Canada

BEd program applicants are assessed on criteria that are rated on a series of 5-point scales by two raters from a pool of raters. A previous study illustrated that any encounter with one severe rater was so damaging as to cause these applicants to fail to make the acceptance list. When a many-faceted Rasch analysis was used, the situation was corrected. The following year, the Education program improved rater training, followed by discrepant-rater detection and rescoring of the affected applicant files. The many-faceted Rasch model has now been used to assess the effectiveness of this method of correction. This third closely related study examines the stability of the raters that were common across the years of rating. Rating scales that were common across the two rating periods were also assessed. Finally, an anchoring procedure is employed to allow comparison of the applicant pools from one year to the next.

Click here for paper.


A Measurement Theoretic Version of the Ohta-Kimura Stepwise-Mutation (Ladder) Model
Nathan J. Markward, Pennington Biomedical Research Center, USA; William P. Fisher, Jr., Avatar International Inc., USA; Bronya J. B. Keats, Louisiana State University Health Sciences Center, USA

What would a measurement theoretic approach to modeling genetic frequencies look like? Drawing on available data from published sources, we demonstrate how the Ohta-Kimura (1973, 1975) stepwise-mutation (ladder) model can be parameterized as a probabilistic conjoint measurement (PCM) model and implemented to measure genomic variation at "short tandem repeat" (STR, microsatellite) loci in the frataxin gene (X25) region. X25 is of particular interest, because it harbors a GAA triplet-repeat expansion that has been linked to Friedreich ataxia--a severe autosomal recessive disorder with onset in early childhood--and whose size directly correlates with disease severity. Our results indicate that fundamental measurement theory's requirements of conjoint additivity and statistical sufficiency are, indeed, useful for 1) linearizing genomic data, 2) evaluating the internal consistency of person and locus observation patterns, and 3) relating "consistently inconsistent" patterns of model misfit to person-specific expansion dynamics. We conclude with a discussion of the possible advantages of Rasch theory and methods over traditional quantitative methods employed in genetic epidemiology, as well as practical implementations that might be of use to clinical geneticists and genetic counselors.


Separating the Parameters of Genealogy and Mutation: Violations of Local Independence as Deviations from Genetic Equilibrium
Nathan J. Markward, Pennington Biomedical Research Center, USA

This paper explores how the Rasch model can be operationalized as a mathematical statement of (perfect) genetic equilibrium and subsequently implemented to evaluate the joint effects of random reproduction and random neutral mutations in a sample of unrelated individuals. What emerges is an elegant approach to constructing gene genealogies and interpreting the various forces of evolution--selection, mutation, recombination, etc.--as systematic and diagnosable violations of the local independence assumption. Key points to be highlighted include construct theory, "answer key" development, model parameterization, and the interpretation of measures and model fit statistics from the perspectives of both educational testing and population genomics.


A Simulation Study of Rasch Measurement Precision For Dichotomous Items
Anatoli A. Maslak, Slavyansk-on-Kuban State Pedagogical Institute, Russia

The purpose of this study is to investigate latent variable measurement precision when form length varies by number of dichotomous items. This topic has important practical significance because testing costs increase substantially as the number of items increases. Therefore, test developers need to know the minimum number of items required for desired measurement precision of the latent variable, as well as contributions to precision from additional items. This investigation was based on a simulation experiment. Simulation data for this experiment were generated a priori to fit the Rasch model. Item difficulty values on the latent trait varied over a wide range from -4.0 to +4.0 logits, a difficulty range that is sufficient for most practical measurements. Results show that only 30 dichotomous items are needed to achieve measurement precision of .5 logits. Simulations show that increasing the number of items does increase measurement precision but with diminishing returns. A test with 100 dichotomous items, for example, does not establish measurement precision greater than .3 logits.

Click here for slides.
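
The diminishing returns described in the abstract above can be illustrated with the standard Rasch result that the standard error of a person measure is the reciprocal square root of the test information, where each dichotomous item contributes p(1 - p). The sketch below is an analytic calculation, not the author's simulation; the even spacing of item difficulties over -4.0 to +4.0 logits and the choice of a person located at 0 logits are assumptions made for illustration.

```python
import math

def rasch_p(theta, b):
    """Probability of success under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def standard_error(theta, difficulties):
    """Model standard error of the person measure at theta.

    Test information is the sum of p * (1 - p) over items;
    the standard error is 1 / sqrt(information).
    """
    info = sum(rasch_p(theta, b) * (1.0 - rasch_p(theta, b)) for b in difficulties)
    return 1.0 / math.sqrt(info)

def evenly_spaced(n, low=-4.0, high=4.0):
    """n item difficulties spread evenly from low to high (assumed design)."""
    step = (high - low) / (n - 1)
    return [low + k * step for k in range(n)]

for n_items in (10, 30, 50, 100):
    se = standard_error(0.0, evenly_spaced(n_items))
    print(n_items, round(se, 2))
```

Under these assumptions the standard error for a centrally located person comes out at roughly 0.53 logits for 30 items and roughly 0.29 logits for 100 items, which is in line with the figures quoted in the abstract. Because the standard error shrinks with the square root of the information, each added item buys less precision than the one before.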


The Theoretical Difference Between IRT and Rasch Models (It's Not What You Think!)
Robert W. Massof, Johns Hopkins University, USA

We have often heard that Rasch models are just "one-parameter IRT models" and we have heard that IRT models are over-parameterized ad hoc statistical models. We have heard that IRT models violate the basic premises of axiomatic conjoint measurement theory and we have heard that Rasch models violate the principle of measurement invariance when neighboring response categories are merged. But, both types of models estimate person measures and item measures from a matrix of observations consisting of responses of each person to each item. Most often, both types of models employ logistic equations to represent the probability of the difference between the person measure and item measure exceeding a response threshold. On the surface, one could easily argue that the differences between Rasch and IRT models are trivial, or at least not substantive enough to sustain a 40+ year debate. To outside parties who are still using raw scores, the debate appears to be mainly polemical. One of the problems in understanding and resolving the differences between models is that there has been no general explanatory theory from which both models can be derived. Such a theory would provide a common starting point built on assumptions that all parties can accept. Then the derivation of different models would require specific explicit assumptions that potentially could be tested. The present paper offers a general probabilistic item response theory from which both traditional IRT and Rasch models can be derived given opposite assumptions that are mathematically and conceptually explicit.

Click here for slides.


Multidimensional Equating: Linking Multidimensional Test Forms by Constructing an Objective n-Space
Mark H. Moulton, Educational Data Systems, USA ; Howard A. Silsdorf, The Andrea O'Brennan Foundation, USA

Form equating methods have proceeded under the assumption that test forms should be unidimensional, both across forms and within each form. This assumption is necessary when the data are fit to a unidimensional model, such as Rasch. Otherwise, variations in the dimensional mix of the items on each test form, as well as in the mix of skills in the student population, can lead to problematic testing anomalies. The assumption ceases to be necessary, however, when data are fit to an appropriate multidimensional model. In such a scenario, it becomes possible to reproduce the same composite dimension rigorously across multiple test forms, even when the relative mix of dimensions embodied in the items on each form varies substantially. This paper applies one such multidimensional model, NOUS, to simulated and actual multidimensional datasets and shows how it avoids the pitfalls that can arise when fitting the same data to a single dimension. Some implications of equating multidimensional forms are discussed.

Click here for paper.


Impact of Altering Randomization Intervals on Precision of Measurement and Item Exposure
Timothy Muckle, Betty Bergstrom, Kirk Becker, John Stahl, Promissor, Inc., USA

Item exposure and precision of measurement are critical aspects of computer adaptive testing. The purpose of this study is to evaluate the effects on bank usage (item exposure) and precision of measurement (standard error of measurement) when randomization intervals are relaxed. Recent studies suggest the efficacy of using an item selection procedure in which items are randomly selected from a group of items within a logit interval surrounding the estimated ability (also known as a randomesque procedure) for adaptive testing. Each subsequent item is chosen from all items within a certain specified logit distance of the target information value (Laughlin Davis, Dodd, et al., 2003). Presently, no study has been conducted to determine the optimum size of the logit interval from which items are selected. This project consists of studying the effects, specifically the impact on item exposure and measurement precision, of altering the width of this random selection interval.

Click here for paper.
Click here for slides.
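
A minimal sketch of a randomesque selection step of the kind studied above: instead of always administering the single best-targeted item, the next item is drawn at random from the unused items that lie within a chosen logit interval. Here the interval is centred on the current ability estimate and applied to item difficulties, which is an assumption; the function name, the fallback rule, and the toy item bank are also hypothetical and are not taken from the paper.

```python
import random

def select_item(theta_hat, item_bank, used, half_width, rng=random):
    """Pick the next item at random from those near the ability estimate.

    theta_hat  -- current ability estimate in logits
    item_bank  -- dict mapping item id to item difficulty in logits
    used       -- set of item ids already administered
    half_width -- half the width of the randomization interval in logits
    Returns an item id, or None if every item has been used.
    """
    remaining = [i for i in item_bank if i not in used]
    if not remaining:
        return None
    candidates = [i for i in remaining if abs(item_bank[i] - theta_hat) <= half_width]
    if not candidates:
        # Assumed fallback: take the closest unused item when the interval is empty.
        return min(remaining, key=lambda i: abs(item_bank[i] - theta_hat))
    return rng.choice(candidates)

bank = {"a": -1.0, "b": 0.1, "c": 0.3, "d": 2.0}
print(select_item(0.0, bank, used=set(), half_width=0.5, rng=random.Random(0)))
```

Widening `half_width` spreads exposure over more of the bank but admits items that are less well targeted, which is the trade-off between item exposure and measurement precision that the study examines.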


Random Parameter Structure and Testlet Model: Extension of the Rasch Testlet Model
Insu Paek, Haniza Yon, Educational Testing Service, USA; Mark Wilson, UC Berkeley, USA

The current random effect Rasch testlet model assumes the independence of the testlet effect and the target dimension. This study investigated the impact of violating that assumption on the model parameters, and the performance of an extended Rasch testlet model in which the random parameter variance-covariance matrix is freely estimated without any constraints on its elements. A simulation study was conducted, and its results showed that the extended testlet model performed as well as or better than the regular testlet model. The target dimension variance in the regular testlet model was the most affected parameter when the independence of the random parameter distribution was violated, and its bias became largest when the dimension correlation was high and the size of the true testlet effect was large. This raises a concern that in some real data applications, the degree of misinterpretation of the size of the testlet effect could become large as well, because the testlet effect size is interpreted relative to the size of the target variance. The regular testlet model showed robustness in its estimation of other model parameters such as testlet effect and item parameters, performing close to the extended testlet model.


Estimating the Rasch Model with Block-Diagonal Item Response Matrix: An Exploration of Winsteps Software with Implications for Equivalent-Groups Equating
Richard Patz, R.J. Patz, Inc., USA; Venessa Lall, Educational Testing Service, USA; Christopher Domaleski, Georgia Department of Education, USA

This paper examines how equivalent-groups equating has been implemented in a number of testing programs. In particular, it focuses on the question of how the computer program Winsteps derives Rasch model item and person parameter estimates when the item response data matrix has a block-diagonal form. We describe the model, the data collection design, maximum likelihood estimation of model parameters, and the conditions under which the parameters are uniquely identified. Through these investigations and the application of Winsteps to contrived and simulated data sets, we are able to infer indirectly some characteristics of Winsteps' estimation approach. Implications for equating forms under an equivalent-groups design are discussed, and recommendations for practice are offered.


Linking 2005 NAEP Science Assessments through Bridge Samples
Jiahe Qian, Educational Testing Service, USA

The 2005 NAEP Science Assessment initiated two major alterations, 1) changing the format of test booklets to that suggested by the National Assessment Governing Board (NAGB), and 2) creating three new blocks of items to replace the released blocks. To reduce the possible errors due to the changes and obtain valid linking, the NAEP science assessment comprised an operational sample and an added national-only bridge sample. The format of booklets used in the bridge sample was the same as in the previous (2000) assessment. Therefore, the 2005 NAEP science assessment employed two linking processes. The first process set the bridge sample and previous assessment (2000) sample on the same scale by concurrent calibration and then used a linear transformation to link the scale to the reporting scale. Based on the common population between bridge and operational samples, the second process aligned the scale of current operational samples with that of the bridge samples by matching the means and standard deviations of the distributions. Several statistical issues arose during linking, including identifying the sub-samples for common populations and adjusting weights. The effects of using experiment kits on the constructed-response items in concurrent calibration were also analyzed.

Click here for paper.
Click here for slides.
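
The second linking process described above, aligning one scale with another by matching means and standard deviations, amounts to a linear transformation. The sketch below shows the unweighted version of that calculation; the function names and toy scores are hypothetical, and the operational NAEP procedure (which involves sampling weights, as the abstract notes) is not reproduced here.

```python
import statistics

def mean_sigma_link(source_scores, target_scores):
    """Slope and intercept that map source scores onto the target scale.

    After the transformation a * x + b, the source distribution has the
    same mean and standard deviation as the target distribution.
    Unweighted sketch: real survey data would need sampling weights.
    """
    a = statistics.pstdev(target_scores) / statistics.pstdev(source_scores)
    b = statistics.mean(target_scores) - a * statistics.mean(source_scores)
    return a, b

def apply_link(scores, a, b):
    """Apply the linear transformation to a list of scores."""
    return [a * x + b for x in scores]

# Toy example (made-up numbers): the target scale is 2 * x + 10.
a, b = mean_sigma_link([0.0, 1.0, 2.0], [10.0, 12.0, 14.0])
print(a, b)
```

Matching only the first two moments is justified when the two samples represent a common population, which is the premise the abstract states for the bridge and operational samples.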


Differential Item Functioning in the SAT I Reasoning Test
Maria Veronica Santelices, UC Berkeley, USA

The purpose of this research proposal is to study differential item functioning (DIF) in the SAT I Reasoning Test. In particular, I plan to compare the DIF results obtained when using the standardization approach to those obtained when analyzing DIF with an Item Response Theory (IRT) approach. This research proposal is inspired by the work conducted by Roy Freedle regarding the role of cultural and linguistic differences in explaining the relationship between item difficulty and DIF results in the responses to various SAT items. My research project aims to replicate Freedle's analysis in a newer dataset, taking into consideration the technical criticisms made by ETS researchers of his application of the standardization methodology, and to compare those results to findings obtained from the analysis of DIF under the 1 Parameter Logistic Model.

Evaluative Implementations: Meaning Equivalence Instructional and Assessment Methodology for Deep Understanding
Kavita L. Seeratan, Ph.D., University of Toronto, Canada

Current methods (e.g., multiple-choice) available for the assessment of learning are plagued with problems and often do not reveal depth of comprehension of the learned material. Frequent criticisms regarding their heavy reliance on memory-based or procedural knowledge have provided impetus for the development of alternative, theoretically valid, psychometrically authenticated methods for probing understanding. This paper describes the design, development, application, and various evaluative implementations of a research-based assessment and instructional methodology designed to evaluate and enhance learners' deep understanding of learned material. Discourse theory researchers propose that the ability to mentally represent a given meaning in a variety of ways is a prerequisite for, and a marker of, deep comprehension. It is this experimental paradigm that underpins the new Meaning Equivalence (ME) methodology. The ME method evaluates deep comprehension of newly acquired concepts in a given content area by appraising learners' ability to recognize and produce multiple representations of content that encode equivalence-of-meaning (Shafrir, 1999; Sigel, 1999). In addition, ME is also believed to allow learners to begin thinking in ways that promote deep processing and understanding, prerequisites for successful knowledge transfer (Seeratan, 2006). Other pedagogical advantages of the method include its capacity to identify learners' strengths and weaknesses; facilitate instruction and remediation; and be objective, practical, and technologically scalable. In addition to describing the design, development, application, and evaluation of the ME method, I will discuss underlying theoretical principles, empirical claims, as well as implications and next steps.


Mindfulness Practice: A Rasch Variable Construct Innovation
Sharon G. Solloway, Bloomsburg University, USA; W. P. Fisher, Jr., Avatar International, Inc., USA

Is it possible to establish a consistent, stable relationship between the structure of number and additive amounts of mindfulness practice? A bank of thirty items, constructed from novice practitioners' journal responses to mindfulness practice and augmented by the literature, comprised the instrument. A convenience sample of students in a teacher education program participated, none of whom were exposed to mindfulness training before taking the survey the first time. The WINSTEPS Rasch measurement software was used for all analyses. Measurement separation reliability was 0.92 and item separation reliability was 0.98, with satisfactory model fit. The 30 items measure a single construct of mindfulness practice. Construct validity was supported by the meaningfulness of the items' ordering from easy to hard. The same scale was produced when the items were calibrated separately on the T1 and T2 groups (Rsq = 0.83). The experimental group's T2 measures were significantly different from both its own T1 measures and the control group's T1 and T2 measures. ANOVA showed a significant difference between the experimental and control groups at T2 (F = 43.66, 151 d.f., p < .001), a nearly two-logit (20 unit) difference (48.9 vs. 68.0). The study is innovative in its demonstration of mindfulness practice as a measurable variable.


On the Estimation of Classification Consistency Indexes for Complex Assessments
Matthew Stearns, Richard Smith, Data Recognition Corporation, USA

This article presents and reviews several methods for estimating the consistency of classifications based on test scores from complex assessments. We first review the three strong true score models most widely used to estimate decision consistency indexes: the two-parameter beta binomial, the four-parameter beta binomial, and the four-parameter beta compound binomial. To use these models with complex assessments, an ad hoc method must be applied in order to satisfy the dichotomy condition of strong true score theory. We review the most commonly used, and criticized, method for doing this. We also introduce an alternative, yet computationally elementary, method which we feel is just as effective. Two recent procedures that do not make assumptions about the true score distribution are then implemented. Software written by the first author is used to calculate the classification consistency indexes for all of the methods mentioned above. Data from an actual 2004 administration are used along with data simulated to fit the actual model. The indexes are then compared and analyzed.


Does the Reader Comprehend the Text Because the Reader Is Able or Because the Text Is Easy?
A. Jackson Stenner, MetaMetrics, Inc., USA; Mark H. Stone, Adler School of Professional Psychology, USA

Does the reader comprehend the text because the reader is able or because the text is easy? Localizing the cause of comprehension in either the reader or the text is fraught with contradictions. A proposed solution models comprehension as the difference between reader ability and text readability. Computing such a difference requires that reader ability and text readability be measured on a common scale. Thus, the puzzle is solved by positing a single continuum along which texts and readers can be conjointly ordered. A reader's comprehension of a text is a function of the difference between reader ability and text readability. This solution forces recognition that generalizations about reader performance can be text independent (reader ability) or text dependent (comprehension). The article explores how reader ability and text readability can be measured on a single continuum, and the implications that this formulation holds for reading theory, the teaching of reading, and the testing of reading.
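
The abstract says only that comprehension is a function of the difference between reader ability and text readability. One common way to write such a function, assumed here purely for illustration, is the dichotomous Rasch form, with the symbols \theta_r (reader ability) and \delta_t (text readability) introduced for this sketch and both expressed in logits on the common scale:

```latex
P(\text{reader } r \text{ comprehends text } t)
  = \frac{\exp(\theta_r - \delta_t)}{1 + \exp(\theta_r - \delta_t)}
```

Under this form, a reader whose ability equals the text's readability has a probability of 0.5, and the probability depends on reader and text only through the difference \theta_r - \delta_t, which is why both must sit on a single continuum.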


Using Guttman's Facet Theory to Develop an Instrument that Examines the Grading Practices of Teachers
Jennifer Randall Thomas, George Engelhard, Jr., Emory University, USA

Wilson's framework for instrument building, referred to as construct modeling, includes four major components: the construct map, items design, outcome space, and the measurement model. The primary purpose of this study is to illustrate how Guttman's Facet Design (Guttman, 1982), specifically Guttman's mapping sentences, can be used in the items design stage to develop an instrument. A secondary purpose of this study is to clarify teacher grades as a measurement construct, and the various facets teachers consider when assigning final grades. Specifically, this paper explores four facets commonly considered when teachers assign final grades: ability, classroom achievement, behavior, and effort. The instrument in this study was designed specifically to answer the following research questions: 1. Are teacher assigned grades influenced by the classroom achievement of students? 2. Are teacher assigned grades influenced by the ability of students? 3. Are teacher assigned grades influenced by the behavior of students? 4. Are teacher assigned grades influenced by the effort of students? A questionnaire was developed using Guttman's mapping sentences. Three focus groups (elementary, middle, and high school teachers) were formed to discuss the clarity, comprehensiveness, and ease of use of the questionnaire. Preliminary data for the revised questionnaire are available for 46 teachers (elementary, middle school, and high school) from a metropolitan city in the southeastern United States. The questionnaire data were analyzed using Facets, a Rasch measurement computer program.

Click here for slides.


Quantum Tutors: Matching instructional goals with assessment in e-learning
Mike Timms, WestEd, USA

This second case study and paper for the proposed e-learning assessment symposium involves the product Quantum Tutors, developed by Quantum Simulations, Inc. (http://www.quantumsimulations.com/). This company develops artificial intelligence (AI) tutoring, assessment and professional development software. The Quantum Tutors are e-learning products designed for students from middle school through college to improve their knowledge of and appreciation for the sciences, and are funded by NSF, NIH and the U.S. Department of Education. Quantum Tutors will be considered as an example of good approaches to matching instructional goals with assessment in e-learning. Using rule-based measurement methods and a cognitive apprenticeship learning theory model, tutors are designed to allow the student to direct his/her own learning and inquiry within challenge levels ascertained by embedded measurement approaches. The underlying rule-based schema employs "model tracing" as developed by Carnegie Mellon University's PACT group (ref.), where production rules represent the domain knowledge in the system and thus represent a strong match between instruction and assessment.


An Analysis of the Latent Structure of the DSM IV Criteria for Major Depressive Disorder
George Jay Unick, UC Berkeley, USA

While the Diagnostic and Statistical Manual IV (DSM IV) category of Major Depressive Disorder (MDD) represents one of the most common and costly diagnostic categories, controversy still surrounds the measurement of this latent construct. This study uses DimCat, an Item Response Theory-based method, to investigate whether depressed individuals are categorically distinct from individuals with sub-threshold depression or whether these groups differ only in their location on a continuous distribution. Using a sample of individuals from the National Comorbidity Survey who endorsed at least one depression symptom (N = 4207), a set of 9 dichotomous symptoms was used to mimic the DSM IV criteria for MDD. Three nested models were tested for differences in their log likelihood. The first model allowed for differences in location on the continuous depression distribution between those diagnosed with MDD and those with sub-threshold symptoms. The second model also allowed for differences in symptom prevalence, and the third model allowed for the previous two differences as well as differences in symptom discrimination. The third model, allowing for all three differences, was the best fitting. These findings differ from previous research using taxometric methods in suggesting that individuals diagnosed with MDD are categorically distinct from individuals with sub-threshold depression.
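
Comparisons of nested models by log likelihood are commonly carried out with a likelihood ratio test. The sketch below shows that generic test, not DimCat itself; the log likelihoods and the number of extra parameters are hypothetical values chosen for illustration, and the chi-square reference distribution is assumed to apply.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Start from the closed form for df = 1 or df = 2, then raise df by 2
    # per step using Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    a = 0.5 if df % 2 else 1.0
    q = math.erfc(math.sqrt(z)) if df % 2 else math.exp(-z)
    while a < df / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q


def likelihood_ratio_test(loglik_restricted, loglik_general, extra_params):
    """Return the LR statistic and its p-value for two nested models."""
    statistic = 2.0 * (loglik_general - loglik_restricted)
    return statistic, chi2_sf(statistic, extra_params)


# Hypothetical log likelihoods; the abstract reports no numerical values.
stat, p_value = likelihood_ratio_test(-21250.4, -21231.9, extra_params=9)
print(f"LR statistic = {stat:.1f}, p = {p_value:.5f}")
```

A small p-value would favor the more general model, which is the pattern the abstract describes for its third model.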

Click here for slides.


The Effects of Linking Designs in Vertical Scaling on the Growth Patterns of Student Achievement
Shudong Wang, Hong Jiao, Michael J. Young, Harcourt Assessment, Inc., USA; Ying Jin, American Institutes for Research, USA

When constructing a vertical scale, linking needs to be set up between adjacent grades. The purpose of this study is to investigate the effects of different vertical linking designs on the recovery of growth patterns of student achievement. Three vertical linking designs are compared: 1) using above-grade items, 2) using below-grade items, and 3) using both below-grade and above-grade items. Student growth is simulated using a unidimensional Rasch model with a common-person design. The accuracy of recovering the true growth is evaluated in terms of bias, standard error (SE), and root mean squared error (RMSE), as well as the correlations between the estimates and their true parameters.
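
To make the simulation setup concrete, here is a minimal sketch of generating Rasch responses when the same persons take two adjacent-grade forms (a common-person design). The sample size, ability distribution, and item difficulties are assumptions for illustration and are not taken from the study.

```python
import math
import random


def rasch_prob(theta, b):
    """Rasch probability of a correct response for ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def simulate_responses(thetas, difficulties, rng):
    """Return a persons-by-items matrix of 0/1 responses."""
    return [
        [1 if rng.random() < rasch_prob(theta, b) else 0 for b in difficulties]
        for theta in thetas
    ]


def proportion_correct(matrix):
    """Overall proportion of correct responses in a response matrix."""
    total = sum(sum(row) for row in matrix)
    return total / (len(matrix) * len(matrix[0]))


rng = random.Random(2006)

# Hypothetical values: 500 common persons and two 11-item forms whose
# difficulties differ by one logit on average.
persons = [rng.gauss(0.0, 1.0) for _ in range(500)]
lower_form = [-1.5 + 0.2 * i for i in range(11)]
upper_form = [-0.5 + 0.2 * i for i in range(11)]

# Common-person design: the same persons respond to both forms.
lower = simulate_responses(persons, lower_form, rng)
upper = simulate_responses(persons, upper_form, rng)

print(proportion_correct(lower), proportion_correct(upper))
```

Because the same persons answer both forms, differences between the two response matrices reflect the forms rather than the samples, which is what allows adjacent grades to be linked.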

Click here for slides.


The robustness of unidimensional Rasch measurement model for multidimensional data to construct vertical scale
Shudong Wang, Hong Jiao, Michael J. Young, Harcourt Assessment, Inc., USA

In practice, when a vertical scale is constructed, unidimensional IRT models are always used under the assumption that similar content and the same construct are assessed across grades. However, unidimensionality is often difficult to maintain when multiple grades are scaled. Even for two adjacent grades, it cannot be guaranteed that the same science construct is assessed. Using a unidimensional model to construct a vertical scale when the data are multidimensional may pose problems. Nevertheless, because multidimensional vertical scales are well known to be difficult to create, owing to technical limitations and the difficulty of interpreting the results, unidimensional IRT models are still used in practice to construct vertical scales.
The purpose of this study is to investigate the robustness of the unidimensional Rasch measurement model in constructing a vertical scale that captures the simulated growth patterns of student science achievement across grades 4 to 11, using both unidimensional and multidimensional logistic item response theory (MIRT) models, when a common-person design is used to link adjacent grades. The accuracy of growth pattern recovery will be evaluated in terms of three error indices, namely bias, standard error (SE), and root mean squared error (RMSE), together with the correlations between the estimates and their true ability parameters.
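
The three error indices can be computed as in the sketch below. It assumes one common set of definitions (SE taken as the standard deviation of the estimates across replications, so that RMSE squared equals bias squared plus SE squared); the abstract does not state which definitions the study uses, and the numbers are illustrative only.

```python
import math


def recovery_indices(estimates, true_value):
    """Return (bias, SE, RMSE) of replicated estimates of one parameter."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - true_value
    # SE: spread of the estimates around their own mean.
    se = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / n)
    # RMSE: spread of the estimates around the true value.
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    return bias, se, rmse


def pearson(xs, ys):
    """Pearson correlation between estimates and their true parameters."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical replicated estimates of a parameter whose true value is 1.0.
bias, se, rmse = recovery_indices([0.9, 1.1, 1.2, 1.0], true_value=1.0)
print(bias, se, rmse)
```

Under these definitions the indices separate systematic error (bias) from sampling variability (SE), with RMSE summarizing both.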

Click here for slides.


Optimizing the compatibility between rating scales and product measures of second language competence
Christopher Weaver, Tokyo University of Agriculture and Technology, Japan

This paper presents a systematic investigation concerning the performance of different rating scales used in the English section of a university entrance examination to assess 1,287 Japanese test-takers' ability to write a third-person introduction speech. The thrust of this analysis lies in the always interesting intersection between the expectations of Rasch measurement theory for rating scales and the actual performance of rating scales in a specific testing situation. Although the rating scales used in the entrance examination did not conform to all of the expectations of the Rasch model, they successfully defined a meaningful continuum of English communicative competence. In some cases, the expectations of the Rasch model needed to be prioritized in order to meet the specific assessment needs of the university entrance examination. This investigation also found that the degree of compatibility between the number of points allotted to different rating scales and the various demands required in an introduction speech played a considerable role in determining the extent to which the different rating scales conformed to the expectations of the Rasch model. This finding in turn provides useful suggestions on how to optimize rating scales for future entrance examinations.

Click here for slides.


To guess or not to guess? It is a student's choice
Yiyu Xie, UC Berkeley, USA

Most international assessments make extensive use of the multiple-choice item format. Students tend to guess on items that they do not know how to solve or when they run out of time. Existing research shows that the tendency to guess is an interaction between students and items, and does not appear to be uniform across persons. Over the decades, researchers have worked on various models to deal with the occurrence of guessing. However, no analysis has been carried out to examine the degree of bias in item and person parameter estimation caused by this guessing behavior in international assessments. This study adopts a mixture item response model estimated with Bayesian techniques, in particular the Markov chain Monte Carlo method. Both simulated and empirical data are used to investigate the impact that random responses from guessing have on parameter estimation. The results show that inferences drawn from conventionally used item response analyses can be biased when guessing is not modeled or is modeled incorrectly. The mixture model is capable of detecting random guessers based on their random response patterns. The proportion of random guessers varies by country. The bias in estimation grows as the proportion of random guessers increases.
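
The idea of detecting random guessers from their response patterns can be sketched with a simple two-class mixture and Bayes' rule. This is not the study's MCMC estimation: the sketch treats ability, item difficulties, the number of answer options, and the prior proportion of guessers as known, and all of those values are assumptions for illustration.

```python
import math


def rasch_prob(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def pattern_likelihood(responses, probs):
    """Likelihood of a 0/1 response pattern given per-item success probabilities."""
    likelihood = 1.0
    for x, p in zip(responses, probs):
        likelihood *= p if x == 1 else 1.0 - p
    return likelihood


def posterior_guesser(responses, theta, difficulties, n_options, prior_guesser):
    """Posterior probability that a pattern came from the random-guessing class."""
    guess_probs = [1.0 / n_options] * len(responses)
    model_probs = [rasch_prob(theta, b) for b in difficulties]
    weight_guess = prior_guesser * pattern_likelihood(responses, guess_probs)
    weight_model = (1.0 - prior_guesser) * pattern_likelihood(responses, model_probs)
    return weight_guess / (weight_guess + weight_model)


# Hypothetical setup: five four-option items, ability 1.0, 10% guessers.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
consistent = [1, 1, 1, 0, 0]   # easy items right, hard items wrong
erratic = [0, 0, 1, 0, 1]      # easy items wrong, hardest item right

p_consistent = posterior_guesser(consistent, 1.0, difficulties, 4, 0.1)
p_erratic = posterior_guesser(erratic, 1.0, difficulties, 4, 0.1)
print(p_consistent, p_erratic)
```

The erratic pattern receives a much higher posterior probability of random guessing than the consistent one, since missing easy items while answering the hardest item correctly is unlikely under the Rasch class.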

Click here for slides.
Click here for scatterplots.


Imperial vs. Metric Study (IMS)
Yiyu Xie, Mark Wilson, UC Berkeley, USA

The purpose of the IMS is to investigate whether the use of metric measurement units (e.g., meters, liters, etc.) in items affects American students' performance compared to the more familiar imperial units (e.g., feet, gallons, etc.). No research has been undertaken in the past 20 years to study whether this unfamiliarity with the metric system might put American students at a disadvantage on international assessments. The IMS was implemented within the PISA (Programme for International Student Assessment) 2003 study. Matching imperial versions of mathematics items containing metric units were made, and data on both imperial and metric samples from the U.S. participants were collected. Item response analyses using all math items (metric and imperial) simultaneously were conducted to see whether there is any significant difference between the item sets and between the two samples, respectively. The study shows no evidence that the choice of measurement unit has a deleterious effect on the mathematics performance of American 15-year-olds in the PISA sample. With just a few exceptions, items in the two forms yielded similar difficulty estimates. Some discrepancies are observed, possibly due to: (a) differences in the nature of the two systems, and (b) difficulties in the modification process.

Comparison of dimension-aligning techniques in a multidimensional IRT context
Hiroyuki Yamada, Karen Draney, Tzur Karelitz, Stephen Moore, Mark Wilson, UC Berkeley, USA

The purpose of this study is to investigate dimension-equating techniques that allow us to make scale scores comparable across dimensions. We used a real data set to compare these techniques in a 3-dimensional IRT context. There were two main approaches: a person or theta approach and an item or delta approach. Results indicated that differences in the number of items per dimension had some impact upon the theta approach. It is suggested that the delta approach is preferred to the theta approach when correlations between dimensions are relatively high.

Click here for slides.


A Comparison Study of IRT Item Parameter Scaling Methods in Common-Item Nonequivalent Groups Equating
Xiaohui Zheng, UC Berkeley, USA

Test equating is defined as a process that produces comparable scores on different test forms. Test equating is important to ensure consistency and fairness of assessments under different administration conditions. In IRT item parameter scaling, there are four commonly used methods of estimating the linking constants: the Mean/Mean and Mean/Sigma transformation methods, which are referred to as the moment methods, and the Stocking-Lord and Haebara transformation methods, which are referred to as the characteristic curve methods. The four methods are based on separate estimations of item parameters. In addition, there is concurrent estimation, in which item parameters are estimated at the same time and are assumed to be on the same scale. The study intends to compare the relative effectiveness of the five IRT scaling methods in test equating and scaling using simulated data.
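
The two moment methods have simple closed forms, sketched below using the usual textbook formulation in which the linking constants A and B satisfy b_to = A * b_from + B and a_to = a_from / A for the common items. The item parameter values are made up for illustration; the characteristic curve methods and concurrent estimation require iterative fitting and are not shown.

```python
from statistics import mean, pstdev


def mean_sigma(b_from, b_to):
    """Mean/Sigma linking constants from common-item difficulties."""
    slope = pstdev(b_to) / pstdev(b_from)
    intercept = mean(b_to) - slope * mean(b_from)
    return slope, intercept


def mean_mean(a_from, a_to, b_from, b_to):
    """Mean/Mean linking constants from common-item discriminations and difficulties."""
    slope = mean(a_from) / mean(a_to)
    intercept = mean(b_to) - slope * mean(b_from)
    return slope, intercept


def rescale(a, b, slope, intercept):
    """Place an item's (a, b) parameters on the target scale."""
    return a / slope, slope * b + intercept


# Hypothetical common-item parameters on the "from" scale, and the same
# items expressed on a "to" scale that differs by A = 1.2 and B = 0.5.
a_from = [0.8, 1.0, 1.3, 1.1]
b_from = [-1.0, -0.2, 0.4, 1.1]
a_to = [a / 1.2 for a in a_from]
b_to = [1.2 * b + 0.5 for b in b_from]

print(mean_sigma(b_from, b_to))
print(mean_mean(a_from, a_to, b_from, b_to))
```

With error-free parameters both methods recover the same constants; with estimated parameters they generally differ, which is one reason such methods are compared in simulation.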


Pre-organized Symposia

Current Critical Issues Related to Science Assessments
Hong Jiao (Organizer), Shudong Wang (Organizer), Harcourt Assessment, Inc., USA ; Richard Patz (Discussant), R.J. Patz, Inc., USA

The symposium discusses several critical issues related to science assessment. It consists of five papers. The first paper illustrates how to write construct-valid items to improve alignment between targeted cognitive demands and the knowledge and skills elicited from examinees. The second paper investigates the robustness of unidimensional Rasch measurement in constructing a vertical scale when response data are multidimensional. The third study examines linking NAEP science assessments through bridge samples. It offers insights for states to link current science assessments to those meeting NCLB requirements. The fourth paper examines local item dependence in a scenario-based science assessment from a multilevel perspective. The last study introduces the use and application of a learning progression model in science assessment and the use of this model for supporting diagnostic reporting. It provides insights on analyzing science assessments to provide individual diagnostic information. This symposium will provide test developers, state assessment directors, and researchers with insights and more detailed evidence and guidance on some of the important technical issues related to designing state NCLB science assessments.


Shudong Wang, Hong Jiao, Michael J. Young, Lihua Yao - The effect of construct shift in science achievement across grades on a science vertical scale
Jiahe Qian - Linking 2005 NAEP science assessments through bridge samples
Hong Jiao, Shudong Wang, Zarko Vukmirovic - Investigation of local item dependence in scenario-based science assessment
Nathaniel J. S. Brown, Cathleen Kennedy, Karen Draney, Mark Wilson - Assessing a learning progression in science: Solving psychometric issues




Assessment for e-Learning: Case studies of an emerging field
Kathleen Scalise (Organizer), University of Oregon, USA; Cathleen A. Kennedy (Discussant), UC Berkeley, USA

This symposium will discuss the rapidly emerging field of computer-based assessment in e-learning (National Research Council, 2002). In e-learning products, a variety of assessment approaches are being used for such diverse purposes as adaptive delivery of content, individualizing learning materials, dynamic feedback, cognitive diagnosis, score reporting and course placement (Gifford, 2001). A recent paper at the latest General Teaching Council Conference in London, England, on Teaching, Learning and Accountability described assessment for personalized learning through e-learning products as a "quiet revolution" taking place in education (Hopkins, 2004). This symposium discusses evidence-based assessment principles that need to become standard in e-learning. Four case studies will be presented of e-learning products with assessment components. The products in the case studies were selected for exhibiting at least one exemplary aspect regarding assessment and measurement. The principles of the BEAR Assessment System (Wilson & Sloane, 2000) will be used as a framework of analysis for these products with respect to key measurement principles, such as evidence identification and accumulation. The use of Rasch family models will be featured in one of the case studies.

Click here for slides.

Diana J. Bernbaum - NetPASS: Construct and content validity in e-learning products
Mike Timms - Quantum Tutors: Matching instructional goals with assessment in e-learning
S. Veeragoudar Harrell - Cognitive Diagnostics: Using Rasch family models to map student understanding
Kristen Burmester - ALEKS: Making e-learning assessment reports useful in classroom instruction

Workshops

Interactive analysis of data using RUMM2020
David Andrich, Murdoch University, Australia
This workshop demonstrates the use of features of the RUMM2020 computer software program, which can be used for analysing data according to the Rasch Unidimensional Measurement Models. The features include fully interactive checks of various aspects of item fit, including DIF and principal component analysis of residuals. All statistical output has a graphical counterpart, and vice versa, which can be used in reports. On the basis of any analysis, the program then enables the user to delete items, delete subsamples or individuals, select random samples, form subtests (testlets or bundles), rescore items, split items by person groups, anchor items, and target the analysis, all interactively, and then rerun the analysis. The perspective that analyses should give the user the opportunity to consider multiple evidence in making any decision, rather than using any statistical criterion mechanically, is emphasised as a basis for the writing of RUMM2020. For handling costs of $US20.00, participants can receive a full version of RUMM2020 which expires in June 2006. For more information about RUMM2020 software, please visit: http://www.rummlab.com/.

ConQuest
Karen Draney, Hiroyuki Yamada, UC Berkeley, USA; Ray Adams, University of Melbourne, Australia
This workshop will cover the basics of using ConQuest for a wide variety of models, including basic Rasch, Rating Scale, and Partial Credit models, more complex LLTM and FACETS style models, multidimensional models, and latent regression.  Menu-driven and command line style versions of the program will be shown, and the interpretation of output files will be discussed.  In addition, Ray Adams will be present, and will demonstrate some of the newest features of his versatile software. For more information about ConQuest software, please visit http://www.assess.com/Software/ConQuest.htm

GradeMap
Cathleen A. Kennedy, Sevan Tutunciyan, Richard Vorp, UC Berkeley, USA
GradeMap is a graphical, menu-driven software package combining a multidimensional IRT engine for estimating item and person parameters from the multidimensional random coefficients multinomial logit (MRCML) model, with tools for managing cross-sectional and longitudinal response data. The MRCML model is used to fit a variety of one-parameter logistic models to dichotomous or polytomous data. Graphical maps and reports are designed for use in settings in which respondent progress on multiple measures can be examined and analyzed. Users can select expected a posteriori (EAP), maximum likelihood, or plausible value estimates of multivariate proficiencies. In this workshop we will introduce GradeMap model fitting and reporting features through examples. We will show how multidimensional instruments can be linked through common item equating and how student progress can subsequently be mapped and interpreted. Instrument calibration and reporting features designed to assist teachers in interpreting and using diagnostic and formative assessment data will be highlighted. For more information about GradeMap, please visit /GradeMap/.

Modeling Many-Facet Data Using Facets: An Introduction
John M. Linacre, University of Sydney, Australia; William Bonk, University of Colorado at Boulder, USA
The use of the Facets software package as part of a research project is explained. Topics addressed include identifying the types of research questions appropriate for Facets analysis, typically, but not exclusively, those involving judge- or rater-intermediation; planning your experimental design and data collection; setting up your data; constructing your Facets command files; and interpreting Facets output. The workshop demonstrations include datasets from education, psychology, and health care. For more information about Facets software, please visit: http://www.winsteps.com/facets.htm.

Winsteps Applied to Messy Data
John M. Linacre, University of Sydney, Australia; Mark H. Moulton, Educational Data Systems, USA
The capabilities of Winsteps are demonstrated by means of the analysis of data from a recently-conducted large scale assessment with multiple forms and item types. The analysis includes: keying raw response data to create a score file, calculating item weights and assigning them to items, processing rating scales, investigating dimensionality with principal components analysis of residuals, analyzing item, person and category misfit, equating across forms and administrations using anchor files, creating meaningful subscale measures, interpreting reports, and producing Excel-friendly output files. During this demonstration, issues relating to multi-dimensionality will receive special attention. For more information about Winsteps software, please visit: http://www.winsteps.com/winsteps.htm.


 

Please contact the conference organizers, Nathaniel Brown and Brent Duckor, with any questions.

 

Berkeley Evaluation & Assessment Research Center