Scoring an Oral Language Test using Automatic Speech Recognition
Pre-CALICO Workshop on Automatic Analysis of Learner Language
Deryle Lonsdale, C. Ray Graham, Casey Kennington, Aaron Johnson, and Jeremiah McGhee
With the advent of the proficiency movement in foreign language instruction in the United States and of communicative language teaching worldwide, the measurement of oral language skills has taken on ever greater importance. Conventional methods of testing speaking skills in second language acquisition (SLA) are expensive and time-consuming. This paper reports the results of research aimed at creating a computer-administered test that can be given to large numbers of learners in fifteen minutes or less and scored automatically using automatic speech recognition (ASR).
The oral language testing method we chose, elicited imitation (EI), has been used for decades to measure normal and abnormal native language development in children and second language development in children and adults. In an EI task, learners repeat, word for word, sentences of increasing length until they can no longer repeat them accurately. There is substantial evidence that the interlanguage competencies underlying sentence imitation are similar to those underlying other forms of language use, and that EI provides a reasonable approximation of global oral language proficiency (Bley-Vroman & Chaudron 1994). A resurgence of interest in EI as a tool for exploring oral language development in second language learners (Erlam 2006; Chaudron, Prior, & Kozok 2005) prompted us to pursue this research.
In this workshop we present the results of a study using EI to predict oral language proficiency scores and of a second study using ASR to create a computer-based scoring mechanism for the imitated sentences. In the first study, three 60-item EI instruments were administered to 232 ESL learners and scored independently by two raters. Rasch analyses were performed on the resulting data, and a fourth EI instrument was created from the best-fitting items of the three original forms. The new instrument was then administered to 156 additional students along with four other measures of oral language proficiency, including an ACTFL Oral Proficiency Interview. Correlations as high as .69 were achieved among the instruments.
In the second study, the major focus of this presentation, we employed the widely used Sphinx ASR engine to score the imitated sentences. We began by using the approximately 150 stimulus sentences, spoken by native speakers, as the training set. In the presentation we discuss how we gradually improved recognition accuracy until it approached 90% correct recognition of all sentences. We then used the hand-scored second language imitations of the 232 ESL learners as training data to adapt the recognizer until we achieved high levels of accuracy on second language speech, and we used the responses of the 156 learners from the second administration of the EI as a test set. We will discuss the process by which we adjusted the recognition engine to achieve acceptable levels of scoring compared with the two native speaker judges who scored each item by hand, present the promising results of the Sphinx engine's processing of these files, and indicate how the integrated tool will be used in the future.
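The scoring step can be illustrated with a minimal sketch. The scheme below is an assumption for illustration, not the study's actual rubric or the Sphinx pipeline: it aligns the recognizer's transcript of the learner's imitation against the stimulus sentence and credits the fraction of stimulus words that were repeated correctly, roughly as a human rater awarding word-level credit would.

```python
from difflib import SequenceMatcher

def score_imitation(stimulus: str, hypothesis: str) -> float:
    """Fraction of stimulus words credited as correctly repeated.

    `stimulus` is the prompt sentence; `hypothesis` is the ASR
    transcript of the learner's imitation. The word-overlap scoring
    here is illustrative only -- the hand-scoring rubric used in the
    study may weight errors differently.
    """
    ref = stimulus.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0
    # Sum the sizes of all matching word blocks between the two
    # sequences (an order-preserving alignment, not a bag of words).
    matcher = SequenceMatcher(None, ref, hyp)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)

# Example: a learner drops one word from a six-word stimulus.
print(score_imitation("the boy who ran was tired",
                      "the boy ran was tired"))  # → 0.8333...
```

A per-item threshold on this score (or the raw matched-word count) could then mark each imitation as passed or failed, mirroring the binary judgments of the human raters.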
We have identified several directions for future work. First, we are refining the EI instrument, culling out sentences that do not perform well from an ASR perspective. We are also assessing the interrelationships between student responses and EI item variables such as sentence length, complexity, and vocabulary, and we will examine responder variables such as working memory, native language, and age. Several hundred more EI tests still need to be scored. To enhance our diagnostic capabilities we are working toward forced alignment (Li et al. 2005) to better identify passages where scoring was not successful. We are also considering training non-native acoustic models, though aligned corpora for this task are scarce. Finally, we intend to develop EI instruments and ASR models for testing L2 learner abilities in languages other than English. Our eventual goal is a run-time adaptive speaking test similar to those currently used for reading and listening.
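As a rough illustration of how alignment can localize scoring failures, the sketch below diffs a recognizer transcript against the stimulus and reports the divergent word spans. This is a transcript-level approximation we supply for exposition only; true forced alignment (as in Li et al. 2005) operates on the audio signal and produces time-stamped phone or word boundaries.

```python
from difflib import SequenceMatcher

def mismatch_spans(stimulus: str, hypothesis: str):
    """Return (expected_words, heard_words) pairs where the ASR
    transcript diverges from the stimulus. An empty string on either
    side marks a deletion or insertion. Illustrative only."""
    ref = stimulus.lower().split()
    hyp = hypothesis.lower().split()
    spans = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if tag != "equal":
            spans.append((" ".join(ref[i1:i2]), " ".join(hyp[j1:j2])))
    return spans

# A substituted word and a dropped word are flagged separately.
print(mismatch_spans("she has lived here for ten years",
                     "she has live here ten years"))
# → [('lived', 'live'), ('for', '')]
```

Flagged spans like these could direct a human rater, or a later diagnostic module, to the exact passages where automatic scoring disagreed with the stimulus.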