Raters’ Assessment Quality in Measuring Teachers’ Competency in Classroom Assessment: Application of Many Facet Rasch Model
Keywords: Many Facet Rasch Model, Competency, Classroom Assessment, Rater Severity, Multi-rater Analysis
This study examines raters' assessment quality when measuring teachers' competency in Classroom Assessment (CA) using Many Facet Rasch Model (MFRM) analysis. The instrument consists of 56 items built on 3 main constructs: knowledge in CA, skills in CA, and attitude towards CA. The study employed a quantitative design with a multi-rater approach, using a questionnaire distributed to the raters. The respondents were 68 raters, comprising Heads of the Mathematics and Science Department, Heads of the Mathematics Panel, and Mathematics Teachers, who assessed 27 ratees. The ratees were 27 secondary school Mathematics teachers from Selangor. The results show that among the advantages of MFRM are its ability to determine the severity and consistency level of raters and to detect bias interactions between raters and ratees. Although all raters were given the same instrument, the same aspects of evaluation, and the same scale categories, MFRM can compare the severity level of each rater individually. Furthermore, MFRM can detect measurement biases and makes it easier for researchers to communicate research findings. MFRM has the advantage of providing complete information and contributes to the understanding of the consistency analysis of raters' judgements with supporting quantitative evidence. This indicates that MFRM is a suitable alternative model for overcoming the limitations of Classical Test Theory (CTT) statistical models in multi-rater analysis.
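The core of MFRM is that the log-odds of a rating decompose additively across facets: a ratee's ability minus a rater's severity (and, in fuller models, item difficulty and scale thresholds). The following is a minimal illustrative sketch, not the study's actual analysis: it uses synthetic dichotomous data (the study used a rating scale and the Facets software) and simple joint maximum-likelihood estimation to show how individual rater severities become separately estimable even though every rater scores the same ratees.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data: 27 ratees judged by 68 raters on a
# dichotomous outcome (the actual study used a polytomous scale).
n_ratees, n_raters = 27, 68
true_ability = rng.normal(0.0, 1.0, n_ratees)    # ratee competency (logits)
true_severity = rng.normal(0.0, 0.5, n_raters)   # rater severity (logits)

# Two-facet Rasch model: log-odds of a positive rating
#   = ability of ratee n - severity of rater j
logit = true_ability[:, None] - true_severity[None, :]
X = (rng.random((n_ratees, n_raters)) < 1 / (1 + np.exp(-logit))).astype(float)

# Joint maximum-likelihood estimation by alternating Newton steps.
ability = np.zeros(n_ratees)
severity = np.zeros(n_raters)
for _ in range(100):
    P = 1 / (1 + np.exp(-(ability[:, None] - severity[None, :])))
    W = P * (1 - P)
    ability += (X - P).sum(axis=1) / W.sum(axis=1)
    ability = np.clip(ability, -5, 5)
    P = 1 / (1 + np.exp(-(ability[:, None] - severity[None, :])))
    W = P * (1 - P)
    # Severity enters the logit with a minus sign, hence the subtraction.
    severity -= (X - P).sum(axis=0) / W.sum(axis=0)
    severity -= severity.mean()   # anchor raters at mean 0 for identifiability
    severity = np.clip(severity, -5, 5)

# A rater with higher estimated severity awards fewer positive ratings,
# even though all raters saw the same ratees, aspects, and scale.
```

Because severity is estimated per rater on a common logit scale, severity estimates for different raters can be compared directly; residuals from the fitted model are what fit statistics and rater-by-ratee bias analyses are built on.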
Copyright (c) 2022 Rosyafinaz Mohamat, Bambang Sumintono, Harris Shah Abd Hamid
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.