AUTTAPON PALADPHOM, PRAKITTIYA TUKSINO, ANUCHA SOMABUT

DOI: https://doi.org/

Rater-induced error poses a significant challenge to the scoring reliability of creative mathematical problem-solving assessments. This study applied Generalizability Theory to analyze score variance from 140 students and 3 raters across three scoring designs. The Generalizability (G) study identified the person-by-rater interaction as the largest source of error variance (35.50-35.90%), indicating inconsistent rater judgments. A Decision (D) study showed that increasing the number of raters from one to three substantially improved reliability, raising the relative G-coefficient from .45 to .71. Notably, a design in which each rater scores a specific subset of items (p × (i:r)) yielded the highest absolute G-coefficient (.69). These findings provide empirical guidance for designing scoring procedures that enhance the reliability of complex skill assessments.
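As a rough illustration of the D-study projection, consider a simplified single-facet sketch in which persons are fully crossed with raters (p × r) and the relative error is attributed entirely to the person-by-rater interaction plus residual; the actual analysis also involves the item facet, so this is only an approximation of the reported design. The standard D-study formula for the relative G-coefficient is

\[
E\hat{\rho}^2 \;=\; \frac{\hat{\sigma}^2_p}{\hat{\sigma}^2_p + \hat{\sigma}^2_{pr,e}/n'_r},
\]

where \(\hat{\sigma}^2_p\) is the person variance, \(\hat{\sigma}^2_{pr,e}\) the person-by-rater interaction (with residual) variance, and \(n'_r\) the number of raters averaged over. With one rater, \(E\hat{\rho}^2 = .45\) implies \(\hat{\sigma}^2_{pr,e} \approx 1.22\,\hat{\sigma}^2_p\); averaging over \(n'_r = 3\) raters then gives \(\hat{\sigma}^2_p / (\hat{\sigma}^2_p + 1.22\,\hat{\sigma}^2_p/3) \approx .71\), consistent with the reported improvement in the relative G-coefficient.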