Modern technologies in teaching FLT

C. Lexicon

The lexicon, or dictionary, contains the phonetic spelling of every word the recognizer is expected to encounter. It serves as a reference for converting the phone sequence determined by the search algorithm into a word. It must be carefully designed to cover the entire lexical domain in which the system is expected to perform. If the recognizer encounters a word it does not "know" (i.e., a word not defined in the lexicon), it will either choose the closest match or return an out-of-vocabulary recognition error. Whether a recognition error is registered as a misrecognition or an out-of-vocabulary error depends in part on the vocabulary size. If, for example, the vocabulary is too small for an unrestricted dictation task--say, fewer than 3K words--the out-of-vocabulary error rate is likely to be very high. If the vocabulary is too large, the chance of misrecognition increases, because the more similar-sounding words the lexicon contains, the more easily they are confused. The vocabulary size in most commercial dictation systems tends to vary between 5K and 60K words.
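The lookup behavior described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four-word lexicon, the ARPAbet-like phone symbols, the use of edit distance as the notion of "closest match," and the `max_distance` threshold that separates a closest match from an out-of-vocabulary result. Real recognizers score matches probabilistically rather than with a fixed distance.

```python
# Toy lexicon: word -> phone sequence (illustrative ARPAbet-like symbols).
LEXICON = {
    "bear": ("B", "EH", "R"),
    "bare": ("B", "EH", "R"),
    "beer": ("B", "IH", "R"),
    "him": ("HH", "IH", "M"),
}


def phone_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def lookup(phones, max_distance=1):
    """Return the closest lexicon words (sorted), or None for out-of-vocabulary."""
    phones = tuple(phones)
    scored = {word: phone_distance(phones, p) for word, p in LEXICON.items()}
    best = min(scored.values())
    if best > max_distance:
        return None  # nothing close enough: report an out-of-vocabulary error
    return sorted(word for word, d in scored.items() if d == best)


print(lookup(["B", "EH", "R"]))  # homophones tie: both "bare" and "bear" match
print(lookup(["K", "AE", "T"]))  # no entry within the threshold
```

The tie between the two homophones shows why the lexicon alone cannot settle every decision and why a larger vocabulary raises confusability.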

D. The Language Model

The language model predicts the most likely continuation of an utterance on the basis of statistical information about the frequency with which word sequences occur on average in the language to be recognized. For example, the word sequence "A bare attacked him" will have a very low probability in any language model based on standard English usage, whereas the sequence "A bear attacked him" will have a higher probability of occurring. Thus the language model helps constrain the recognition hypothesis produced on the basis of the acoustic decoding, just as the context helps decipher an unintelligible word in a handwritten note. Like the HMMs, an efficient language model must be trained on large amounts of data, in this case texts collected from the target domain.
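As a rough illustration of the bear/bare example, the sketch below trains a bigram model with add-one smoothing on a five-sentence toy corpus. The corpus, the choice of a bigram model, and the smoothing scheme are all assumptions made for the example; the text does not say which kind of statistical model a given recognizer uses, and a real model is trained on far more data.

```python
import math
from collections import Counter

# Tiny made-up training corpus standing in for "texts from the target domain".
CORPUS = [
    "a bear attacked him",
    "the bear ran away",
    "a bear ate the fish",
    "his bare hands were cold",
    "the bare wall was white",
]


def train_bigram(sentences):
    """Count word histories and word pairs, with sentence boundary markers."""
    histories, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words[1:])
        histories.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return histories, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log probability of a sentence under the add-one smoothed bigram model."""
    histories, bigrams, vocab_size = model
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(words[:-1], words[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (histories[prev] + vocab_size))
    return total


model = train_bigram(CORPUS)
print(log_prob("a bear attacked him", model))
print(log_prob("a bare attacked him", model))
```

Under this toy model the first sentence receives the higher score, because "a bear" and "bear attacked" both occur in the training text while "a bare" and "bare attacked" do not.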

In ASR applications with a constrained lexical domain and/or a simple task definition, the language model consists of a grammatical network that defines the possible word sequences to be accepted by the system without providing any statistical information. This type of design is suitable for CALL applications in which the possible word combinations and phrases are known in advance and can be easily anticipated (e.g., based on user data collected with a system pre-prototype). Because of the a priori constraining function of a grammar network, applications with clearly defined task grammars tend to perform at much higher accuracy rates than the quality of the acoustic recognition would suggest.
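A grammar network of this kind can be pictured as a small finite-state machine. The sketch below is a hypothetical example: the ordering phrase, the state names, and the optional "please" are invented for illustration and are not taken from any of the systems discussed here.

```python
# Hypothetical task grammar: "i would like (a | some) (coffee | tea) [please]".
# Each state maps an accepted word to the next state; no probabilities are stored.
GRAMMAR = {
    "start": {"i": "s1"},
    "s1": {"would": "s2"},
    "s2": {"like": "s3"},
    "s3": {"a": "s4", "some": "s4"},
    "s4": {"coffee": "s5", "tea": "s5"},
    "s5": {"please": "end"},
}
FINAL_STATES = {"s5", "end"}


def accepts(utterance):
    """Return True if the word sequence follows a complete path through the network."""
    state = "start"
    for word in utterance.lower().split():
        state = GRAMMAR.get(state, {}).get(word)
        if state is None:
            return False  # the word is not allowed at this point
    return state in FINAL_STATES


print(accepts("I would like some tea please"))
print(accepts("I would like tea"))
```

Because only a handful of words are possible at each state, the recognizer has far fewer alternatives to confuse, which is the constraining effect described above.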

E. Decoder

Simply put, the decoder is an algorithm that tries to find the utterance that maximizes the probability that a given sequence of speech sounds corresponds to that utterance. This is a search problem, and especially in large-vocabulary systems careful consideration must be given to questions of efficiency and optimization, for example to whether the decoder should pursue only the most likely hypothesis or a number of them in parallel (Young, 1996). An exhaustive search of all possible completions of an utterance might ultimately be more accurate but of questionable value if one has to wait two days to get a result. Trade-offs are therefore necessary to maximize the quality of the search results while minimizing CPU and recognition time.
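The choice between pursuing one hypothesis and pursuing several in parallel can be sketched with a simple beam search. This is an illustrative toy, not the decoder of any particular system: the acoustic scores per word slot, the bigram probabilities in `LM`, and the floor value for unseen word pairs are all invented numbers.

```python
import math

# Hypothetical bigram probabilities; unseen pairs fall back to a small floor.
LM = {
    ("<s>", "a"): 0.9,
    ("a", "bare"): 0.5,
    ("a", "bear"): 0.5,
    ("bear", "attacked"): 0.5,
    ("attacked", "him"): 0.9,
}

# Hypothetical acoustic log-probabilities for the candidate words in each slot.
SLOTS = [
    {"a": math.log(1.0)},
    {"bare": math.log(0.55), "bear": math.log(0.45)},
    {"attacked": math.log(1.0)},
    {"him": math.log(1.0)},
]


def lm_logprob(prev, word, floor=0.01):
    return math.log(LM.get((prev, word), floor))


def beam_search(slots, beam_width):
    """Keep the beam_width best partial hypotheses after each word slot."""
    beam = [(0.0, ["<s>"])]
    for slot in slots:
        expanded = []
        for score, words in beam:
            for word, acoustic in slot.items():
                new_score = score + acoustic + lm_logprob(words[-1], word)
                expanded.append((new_score, words + [word]))
        expanded.sort(key=lambda hyp: hyp[0], reverse=True)
        beam = expanded[:beam_width]  # pruning: the speed/accuracy trade-off
    best_score, best_words = beam[0]
    return best_words[1:], best_score


print(beam_search(SLOTS, beam_width=1)[0])
print(beam_search(SLOTS, beam_width=2)[0])
```

With a beam width of 1 the search commits early to the acoustically stronger "bare" and cannot recover. With a width of 2 the "bear" hypothesis survives long enough for the following word to favor it, at the cost of scoring twice as many hypotheses per step.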

PERFORMANCE AND DESIGN ISSUES IN SPEECH APPLICATIONS

For educators and developers interested in deploying ASR in CALL applications, perhaps the most important consideration is recognition performance: How good is the technology? Is it ready to be deployed in language learning? These questions cannot be answered except with reference to particular applications of the technology, and therefore touch on a key issue in ASR development: the issue of human-machine interface design.

As we recall, speech recognition performance is always domain specific--a machine can only do what it is programmed to do, and a recognizer with models trained to recognize business news dictation under laboratory conditions will be unable to handle spontaneous conversational speech transmitted over noisy telephone channels. The question that needs to be answered is therefore not simply "How good is ASR technology?" but rather, "What do we want to use it for?" and "How do we get it to perform the task?"

In the following section, we will address the issue of system performance as it relates to a number of successful commercial speech applications. By emphasizing the distinction between recognizer performance on the one hand--understood in terms of "raw" recognition accuracy--and system performance on the other, we suggest how the latter can be optimized within an overall design that takes into account not only the factors that affect recognizer performance as such, but also, and perhaps even more importantly, considerations of human-machine interface design.

Historically, basic speech recognition research has focused almost exclusively on optimizing large-vocabulary speaker-independent recognition of continuous dictation. A major impetus for this research has come from US government-sponsored competitions held annually by the Defense Advanced Research Projects Agency (DARPA). The main emphasis of these competitions has been on improving the "raw" recognition accuracy--calculated in terms of average omissions, insertions, and substitutions--of large-vocabulary continuous speech recognizers (LVCSRs) in the task of recognizing read sentence material from a number of standard sources (e.g., The Wall Street Journal or The New York Times). The best laboratory systems that participated in the WSJ large-vocabulary continuous dictation task have achieved word error rates as low as 5%, that is, on average, one recognition error in every twenty words (Pallet, 1994).
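Word error rate is conventionally computed as the minimum number of omissions, insertions, and substitutions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The sketch below uses a standard edit-distance alignment; the function name and the synthetic twenty-word test sentence are illustrative assumptions, not part of any official scoring tool.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + omissions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i omissions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            omission = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, omission, insertion)
    return d[len(ref)][len(hyp)] / len(ref)


# Twenty reference words with one word misrecognized.
reference = " ".join(f"w{i}" for i in range(20))
hypothesis = reference.replace("w7 ", "x ")
print(word_error_rate(reference, hypothesis))
```

One substituted word in a twenty-word reference gives a rate of 1/20, the 5% figure cited above.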

CURRENT TRENDS IN VOICE-INTERACTIVE CALL

In recent years, an increasing number of speech laboratories have begun deploying speech technology in CALL applications. Results include voice-interactive prototype systems for teaching pronunciation, reading, and limited conversational skills in semi-constrained contexts. Our review of these applications is far from exhaustive. It covers a select number of mostly experimental systems that explore paths we found promising and worth pursuing. We will discuss the range of voice interactions these systems offer for practicing certain language skills, explain their technical implementation, and comment on the pedagogical value of these implementations. Apart from giving a brief system overview, we report experimental results if available and provide an assessment of how far the technology is from being deployed in commercial and educational environments.

Pronunciation Training

A useful and remarkably successful application of speech recognition and processing technology has been demonstrated by a number of research and commercial laboratories in the area of pronunciation training. Voice-interactive pronunciation tutors prompt students to repeat spoken words and phrases or to read aloud sentences in the target language for the purpose of practicing both the sounds and the intonation of the language. The key to teaching pronunciation successfully is corrective feedback, more specifically, a type of feedback that does not rely on the student's own perception. A number of experimental systems have implemented automatic pronunciation scoring as a means to evaluate spoken learner productions in terms of fluency, segmental quality (phonemes) and supra-segmental features (intonation). The automatically generated proficiency score can then be used as a basis for providing other modes of corrective feedback. We discuss segmental and supra-segmental feedback in more detail below.

Segmental Feedback. Technically, designing a voice-interactive pronunciation tutor goes beyond the state of the art required by commercial dictation systems. While the grammar and vocabulary of a pronunciation tutor are comparatively simple, the underlying speech processing technology tends to be complex since it must be customized to recognize and evaluate the disfluent speech of language learners. A conventional speech recognizer is designed to generate the most charitable reading of a speaker's utterance. Acoustic models are generalized so as to accept and recognize correctly a wide range of different accents and pronunciations. A pronunciation tutor, by contrast, must be trained to both recognize and correct subtle deviations from standard native pronunciations.

A number of techniques have been suggested for automatic recognition and scoring of non-native speech (Bernstein, 1997; Franco, Neumeyer, Kim, & Ronen, 1997; Kim, Franco, & Neumeyer, 1997; Witt & Young, 1997). In general terms, the procedure consists of building native pronunciation models and then measuring the non-native responses against the native models. This requires models trained on both native and non-native speech data in the target language, and supplemented by a set of algorithms for measuring acoustic variables that have proven useful in distinguishing native from non-native speech. These variables include response latency, segment duration, inter-word pauses (in phrases), spectral likelihood, and fundamental frequency (F0). Machine scores are calculated from statistics derived from comparing non-native values for these variables to the native models.
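The general idea of measuring a learner response against native models can be sketched as follows. This is a deliberately simplified assumption, not the method of any of the cited studies: each native "model" is reduced to a mean and standard deviation per variable, the score is the average absolute z-score, and all timing values are made up.

```python
from statistics import mean, stdev


def fit_native_model(native_samples):
    """Summarize native speech as (mean, standard deviation) per variable."""
    features = native_samples[0].keys()
    return {
        f: (mean(s[f] for s in native_samples), stdev(s[f] for s in native_samples))
        for f in features
    }


def deviation_score(model, response):
    """Average absolute z-score; 0 means the response sits on the native means."""
    return mean(abs(response[f] - mu) / sigma for f, (mu, sigma) in model.items())


# Made-up native measurements in seconds (segment duration, inter-word pause).
native = [
    {"segment_duration": 0.08, "pause": 0.10},
    {"segment_duration": 0.10, "pause": 0.14},
    {"segment_duration": 0.12, "pause": 0.12},
]
model = fit_native_model(native)

near_native = {"segment_duration": 0.10, "pause": 0.12}
learner = {"segment_duration": 0.18, "pause": 0.30}
print(deviation_score(model, near_native))
print(deviation_score(model, learner))
```

A larger score indicates a response further from the native distribution. Published systems combine several such measures and also use spectral likelihoods from the acoustic models, which this sketch leaves out.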

In a final step, machine-generated pronunciation scores are validated by correlating these scores with the judgment of human expert listeners. As one would expect, the accuracy of scores increases with the duration of the utterance to be evaluated. Stanford Research Institute (SRI) has demonstrated a 0.44 correlation between machine scores and human scores at the phone level. At the sentence level, the machine-human correlation was 0.58, and at the speaker level it was 0.72 for a total of 50 utterances per speaker (Franco et al., 1997; Kim et al., 1997). These results compare with correlations of 0.55, 0.65, and 0.80 between human graders at the phone, utterance, and speaker levels, respectively. A study conducted at Entropic shows that based on about 20 to 30 utterances per speaker and on a linear combination of the above techniques, it is possible to obtain machine-human grader correlation levels as high as 0.85 (Bernstein, 1997).
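The validation step amounts to computing a correlation coefficient between two lists of scores. The sketch below assumes the Pearson coefficient, which the text does not name explicitly, and uses five invented score pairs rather than data from the cited studies.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented scores for five speakers on a 1-5 scale.
machine_scores = [1, 2, 3, 4, 5]
human_scores = [2, 1, 4, 3, 5]
print(pearson(machine_scores, human_scores))
```

For these invented scores the coefficient works out to 0.8: the machine and the human rater broadly agree on the ranking but swap two pairs of neighboring speakers.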

Others have used expert knowledge about systematic pronunciation errors made by L2 adult learners in order to diagnose and correct such errors. One such system is the European Community project SPELL for automated assessment and improvement of foreign language pronunciation (Hiller, Rooney, Vaughan, Eckert, Laver, & Jack, 1994). This system uses advanced speech processing and recognition technologies to assess pronunciation errors by L2 learners of English (French or Italian speakers) and provide immediate corrective feedback. One technique for detecting consonant errors induced by inter-language transfer was to include students' L1 pronunciations in the grammar network. In addition to the English /th/ sound, for example, the grammar network also includes /t/ or /s/, that is, errors typical of non-native Italian speakers of English. This system, although quite simple in its use of ASR technology, can be very effective in diagnosing and correcting known problems of L1 interference. However, it is less effective in detecting rare and more idiosyncratic pronunciation errors. Furthermore, it assumes that the phonetic system of the target language (e.g., English) can be accurately mapped to the learners' native language (e.g., Italian). While this assumption may work well for an Italian learner of English, it certainly does not for a Chinese learner; that is, there are sounds in Chinese that do not resemble any sounds in English.
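The technique of adding expected L1 error variants to the network can be sketched as a lookup from pronunciation variants to diagnoses. The word "think," the phone symbols, and the feedback strings below are hypothetical and are not taken from SPELL; the sketch also assumes the recognizer has already produced a phone sequence.

```python
# For each target word: pronunciation variant -> diagnosis.
# None marks the target pronunciation; other entries are anticipated L1 errors.
VARIANTS = {
    "think": {
        ("TH", "IH", "NG", "K"): None,
        ("T", "IH", "NG", "K"): "/th/ replaced by /t/",
        ("S", "IH", "NG", "K"): "/th/ replaced by /s/",
    },
}


def diagnose(word, recognized_phones):
    """Report which path of the network the learner's production matched."""
    variants = VARIANTS[word]
    key = tuple(recognized_phones)
    if key not in variants:
        return "unrecognized"  # an error nobody anticipated
    error = variants[key]
    return "correct" if error is None else error


print(diagnose("think", ["T", "IH", "NG", "K"]))
print(diagnose("think", ["F", "IH", "NG", "K"]))
```

The second call shows the limitation noted above: a substitution that was not listed in advance, here /f/ for /th/, cannot be diagnosed because the network has no path for it.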
