What do the L2 Generalizability Studies Tell Us?


  • James Dean Brown, University of Hawai‘i at Manoa, Campus Rd, Honolulu, HI, United States of America


Keywords: generalizability theory, norm-referenced relative decisions, measurement facets, variance components


This research synthesis examines the relative magnitudes of the variance components found in 44 generalizability (G) theory studies in L2 testing. I begin by explaining what G theory is and how it works. In the process, I explain the differences between relative and absolute decisions, between crossed and nested facets, and between random and fixed facets, as well as what variance components (VCs) are and how they are calculated. Next, I provide an overview of G-theory studies in L2 testing and discuss the purposes of this research synthesis. In the methods section, I describe the materials used in this research synthesis in terms of the samples of students, the tests, and the G-study designs used. I also present the analyses in terms of how the data were compiled and analyzed. The results are sorted and displayed to reveal patterns in the relative contributions to test variance of individual facets, as well as of interactions between and among facets, for different types of tests. I then discuss these patterns and put them into perspective. I conclude by exploring what I think the results mean for L2 testing in general.
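As an illustration of what calculating VCs involves, the following is a minimal sketch and not the procedure used in any of the studies synthesized. It assumes the simplest possible G-study design, a single random facet with persons fully crossed with items (p × i), and estimates VCs from ANOVA mean squares. The function names, the rule of setting negative estimates to zero, and the score matrix are all assumptions made for this example. The second function shows how the same VCs yield different coefficients for relative (norm-referenced) and absolute decisions.

```python
import numpy as np


def g_study_p_by_i(scores):
    """Estimate variance components for a crossed persons x items design.

    `scores` is a 2-D array with one row per person and one column per item.
    Hypothetical helper written for this illustration only.
    """
    x = np.asarray(scores, dtype=float)
    n_p, n_i = x.shape
    grand_mean = x.mean()

    # Sums of squares for persons, items, and the residual (pi,e).
    ss_p = n_i * np.sum((x.mean(axis=1) - grand_mean) ** 2)
    ss_i = n_p * np.sum((x.mean(axis=0) - grand_mean) ** 2)
    ss_pi = np.sum((x - grand_mean) ** 2) - ss_p - ss_i

    # Mean squares.
    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_pi = ss_pi / ((n_p - 1) * (n_i - 1))

    # Solve the expected-mean-square equations for the variance components.
    # Assumption: negative estimates are set to zero.
    return {
        "p": max((ms_p - ms_pi) / n_i, 0.0),
        "i": max((ms_i - ms_pi) / n_p, 0.0),
        "pi,e": ms_pi,
    }


def d_study(vc, n_items):
    """Return (G coefficient, phi) for a test of `n_items` items."""
    relative_error = vc["pi,e"] / n_items                # relative decisions
    absolute_error = (vc["i"] + vc["pi,e"]) / n_items    # absolute decisions
    g_coefficient = vc["p"] / (vc["p"] + relative_error)
    phi = vc["p"] / (vc["p"] + absolute_error)
    return g_coefficient, phi


# Invented right/wrong scores for 5 persons on 4 items (illustrative only).
example_scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
]
components = g_study_p_by_i(example_scores)
total = sum(components.values())
for name, value in components.items():
    print(f"{name}: {value:.4f} ({100 * value / total:.1f}% of total variance)")
print("G coefficient, phi:", d_study(components, n_items=4))
```

The percentage printed for each VC is the kind of quantity the synthesis compares across studies. Because the item VC enters only the absolute error term, phi cannot exceed the G coefficient in this design.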









How to Cite

Brown, J. D. (2011). What do the L2 Generalizability Studies Tell Us? Asian Journal of Assessment in Teaching and Learning, 1, 1–37. Retrieved from https://ojs.upsi.edu.my/index.php/AJATeL/article/view/1895