The impact of rater training on rater reliability in an English oral test


  • Shiknesvary Karuppaiah SMK Canossian Convent, Kluang, Johor, Malaysia
  • Abdul Halim Abdul Raof Language Academy, Faculty of Social Sciences and Humanities, Universiti Teknologi Malaysia, Skudai, Johor Bahru, Johor, Malaysia



Rater Training, Rater Reliability, Oral Interaction Test, Speaking Skill, Rater


Speaking skill assessment is attracting growing interest in the field of assessment. The literature highlights the reliability of raters in rating a speaking performance as one of the challenges, owing to the subjective nature of human judgement. This study explored the influence of rater training on rater reliability in the assessment of a spoken task. A qualitative research design was used, and semi-structured interviews were employed to obtain data. A total of 21 secondary school teachers participated in the study; all were raters trained to assess an oral English interaction test. Data were analyzed using thematic content analysis, which resulted in three main categories: the importance of rater training, the effects of rater training on rater reliability, and the improvement of rater training. The results show that rater training is essential before any rating is done, and that its effects include, among others, maintaining rating consistency and exposure to the test task and to the criteria for grading. Suggestions to improve rater training sessions relate to the length, frequency, and quality of training.








How to Cite

Karuppaiah, S., & Abdul Raof, A. H. (2020). The impact of rater training on rater reliability in an English oral test. Asian Journal of Assessment in Teaching and Learning, 10(2), 94–105.