In today’s world, information and communication technology has become
integrated with almost every field of human work. The computer, along with the
internet, has become one of the most important tools; it has sparked a
revolution and made the current era a digital age.
The integration of computers into language learning and teaching has become
one of the most commonly practiced modes of language education worldwide.
Computer-assisted language testing (CALT) employs computer applications
for eliciting and evaluating test takers’ performance in a second language. CALT
encompasses computer-adaptive testing (CAT), the use of multimedia in language
test tasks, and automatic response analysis (Chapelle & Douglas, 2006).
In learning and teaching a language, especially a foreign language, testing
becomes an essential part. The three main motives for using technology in
language testing are efficiency, equivalence, and innovation.
The paper aims to give a detailed description of CALT along
with its various dimensions, its applications, and the methods involved. It will
also throw light on assessing English for Specific Purposes (ESP). Since the use
of technology, particularly computers, in language teaching and learning is a
challenging task, the paper will explore the challenges of CALT with reference
to Indian classrooms.
Keywords: CALT, Learning, Teaching, Issues and Challenges
José Noijons (1994) defines CALT as an integrated
procedure in which language performance is elicited and
assessed with the help of a computer. CALT
encompasses computer-adaptive testing (CAT), the use of multimedia in language
test tasks, and automatic response analysis (Chapelle & Douglas, 2006).
(2010) distinguishes three main motives for using technology in language
testing: efficiency, equivalence, and innovation.
Efficiency is achieved through computer adaptive testing and analysis-based
assessment that utilizes automated writing evaluation (AWE) or automated speech
evaluation (ASE) systems.
Equivalence refers to research on making computerized tests equivalent to paper
tests that are considered to be “the gold standard” in language testing.
Innovation, where technology can create a true transformation of language testing, is
revealed in the reconceptualization of the L2 ability construct in CALT as “the
ability to select and deploy appropriate language through the technologies that
are appropriate for a situation” (Chapelle & Douglas, 2006, p. 107).
Table 1.1 Framework for Computer Assisted Language Tests
Linear, adaptive, and semi-adaptive testing
Computer-based and Web-based testing
Single medium and multimedia
Single language skill and integrated skills
Human-based, exact answer matching, and analysis-based
Low stakes, medium stakes, and high stakes
Curriculum-related (achievement, admission, diagnosis,
placement, progress) and non-curriculum-related (proficiency
Selected response and constructed response
Selective (e.g., multiple choice), productive (e.g., short
cloze task, written and oral narratives), and interactive (e.g.,
matching, drag and drop)
The use of computers in the field of assessment and testing practice
dates back to 1935, when the IBM Model 805 was used for scoring objective tests
in the United States of America to reduce the labour-intensive and costly
business of scoring millions of tests taken each year. But the year 1980 was a
crucial year which led to many advancements in the area of CALT.
In the 1980s, microcomputers came within reach of many
applied linguists, and item response theory (IRT) appeared at around the same
time, making it possible to use this new technology to innovate existing
assessment and testing practice. In 1985, Larson and Madsen developed the first
CAT at Brigham Young University in the USA, which was
technologically advanced assessment measures (Dunkel, 1999). They developed
a large pool of test items for test delivery using computers. In the
computer-adaptive test they designed, the program selected and presented items in a
sequence based on the test taker’s response to each item. If a student answered
an item correctly, a more difficult item was presented; and conversely, if an
item was answered incorrectly, an easier item was given. In short, the test
“adapted” to the examinee’s level of ability. The computer’s role was to
evaluate the student’s response, select an appropriate succeeding item and
display it on the screen. The computer also notified the examinee of the end of
the test and of his or her level of performance (Larson 1989: 278). Larson and
Madsen’s (1985) CAT served as an impetus for the construction
and development of many more computer-adaptive tests throughout the 1990s (e.g.,
Kaya-Carton, Carton & Dandonoli, 1991; Burston & Monville-Burston,
1995; Brown & Iwashita, 1996; Young, Shermis, Brutten & Perkins, 1996),
which helped language teachers make more accurate assessments of test
takers’ language ability and attracted many because of its immense
potential for both language teachers and learners.
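The item-selection behaviour described above can be illustrated with a minimal sketch. It is not Larson and Madsen's program and it does not use IRT; it is a simple up/down rule written under stated assumptions. The names `Item` and `run_adaptive_test`, the integer difficulty scale, the exact-match scoring, the starting level, and the fixed test length are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    prompt: str
    answer: str
    difficulty: int  # hypothetical scale: 1 = easiest, larger = harder


def run_adaptive_test(pool, respond, start_level, max_items):
    """Administer up to max_items items using a simple up/down rule.

    `respond` is any callable that takes an Item and returns the test
    taker's answer as a string. Returns (final_level, history), where
    history is a list of (item, was_correct) pairs.
    """
    lowest = min(item.difficulty for item in pool)
    highest = max(item.difficulty for item in pool)
    unused = list(pool)
    level = start_level
    history = []
    while unused and len(history) < max_items:
        # Select the unused item whose difficulty is closest to the current level.
        item = min(unused, key=lambda it: abs(it.difficulty - level))
        unused.remove(item)
        # Score by exact answer matching (case-insensitive); an assumption.
        correct = respond(item).strip().lower() == item.answer.strip().lower()
        history.append((item, correct))
        # Correct answer: present a harder item next; incorrect: an easier one.
        level = min(level + 1, highest) if correct else max(level - 1, lowest)
    return level, history


# Made-up pool (two items per difficulty level) and a simulated test taker
# who answers correctly only up to difficulty 3.
pool = [Item(f"item {d}{s}", "yes", d) for d in range(1, 6) for s in "ab"]
final_level, history = run_adaptive_test(
    pool,
    lambda item: "yes" if item.difficulty <= 3 else "no",
    start_level=3,
    max_items=4,
)
print(final_level, [(it.difficulty, ok) for it, ok in history])
```

In this sketch the test simply stops after a fixed number of items and reports the last level reached. An operational CAT would normally estimate ability with an IRT model and apply a statistically motivated stopping rule.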
As item response theory advanced, along with computer software for
calculating item statistics and providing adaptive control of item
selection, presentation, and evaluation, the use of
computer technology in language assessment and testing became
an inevitable reality, although the limited availability of
infrastructure and the cross-disciplinary knowledge required in the field
hampered its progress for some time in its early stage.
Today the use of computer technology in language
assessment and testing has become so widespread and inclusive that it is
regarded as an inseparable part of today’s education system. The number of
useful computer-adaptive tests (CATs) and web-based tests (WBTs) is
constantly growing, and computers are used not only for test delivery but also
for the evaluation of complex types of test responses. Even the large testing
companies, which showed little interest in the field in its early stage, have
stepped in and are producing and administering CATs as well as WBTs. The
administration and delivery of highly popular and useful tests such as TOEFL,
IELTS, and DIALANG, to mention a few, speak volumes about the role played by
computer technology in the field of language assessment today.
Prominent Testing Services
The realm of CALT is constantly expanding and now encompasses
scoring and rating as well. Today computers are used not just to score
objective test tasks but also to assess and rate much more complex task
types such as essays and spoken English. Examples include the Educational
Testing Service’s (see http://www.ets.org) automated systems Criterion (see
http://www.criterion.ets.org) and e-rater (see http://www.ets.org/erater), which
rate extended written responses based on aspects of NLP analysis; Vantage
Laboratories’ (see http://www.vantage.com) IntelliMetric; Pearson Knowledge
Technologies’ (see http://www.knowledge-technologies.com) Intelligent Essay
Assessor (IEA); and Pearson’s Versant (see http://www.versanttest.com), a
computer-scored test of spoken English for non-native speakers that uses NLP
technology. These systems indicate how rapidly the realm of CALT is growing and
how it is reshaping and innovating the field of language assessment and
testing by adapting itself to new challenges in technology
and assessment practice.
Testing and evaluation form the most important part of language
learning, because without a learning process there can be no test. Systematic
evaluation can be done by recognising the influence on learning of three main
perspectives (software designer, teacher and student) and taking into account
three sets of interactions between them:
Teacher and student: a two-way direct interaction. One of the main variables
here is the teacher’s role, which may be ‘resource provider’, ‘manager’,
‘coach’, ‘researcher’ or ‘facilitator’.
Designer and student: primarily a one-way influence, although the designer’s
perception of the student’s learning characteristics will implicitly be of help.
Designer and teacher: primarily a one-way influence, with the designer’s
perception of the teacher having some influence.
This framework assists the evaluator to identify the key issues on
which judgements must be made in the particular context of the proposed use
(predictive evaluation) or actual use (interpretive evaluation). (Soromic,
CALT in ESP Classrooms
The application of
technology in the realm of English for Specific Purposes (ESP) has gained
tremendous popularity among English as a Foreign Language (EFL) researchers and
scholars (Arno, 2012; Butler-Pascoe, 2009; Jarvis, 2009; Plastina, 2003).
ESP instruction is goal-oriented and based on the specific needs of
students (Robinson, 2003).
Corpora help to
test communicative ability and efficiency. Content, language, grammar, and
vocabulary knowledge are assessed. The curriculum and
instructional materials are also constantly assessed. The most important part of
testing involves language usage for specific purposes (e.g., business,
medicine, law, science and technology) and the usage of vocabulary. The
assessment of curriculum development is the primary task.
Challenges in CALT
The views regarding the current status and the future
of CALT vary slightly
among researchers, with some being more concerned about the severity of
problems than others. Ockey (2009), for instance, believes that due to numerous
limitations and problems “CBT has failed to realize its anticipated potential”
(p. 836), while Chalhoub-Deville (2010) contends that “L2 CBTs, as currently
conceived, fall short in providing any radical transformation of assessment
(p. 522). In the meantime, other researchers (e.g., Chapelle, 2010; Douglas,
appear to be somewhat more positive about the transformative role of CALT and
stress that despite existing unresolved issues technology remains “an
aspect of modern language testing” and its use in language assessment “really
isn’t an issue we can reasonably reject—technology is being used and will
continue to be used” (Douglas, 2010, p. 139).
Still, everyone seems to acknowledge the existence of challenges in CALT,
maintaining that more work is necessary to solve the persisting problems. In
particular, a noticeable amount of discussion in the literature has been devoted
to the issues plaguing computer-adaptive testing, which, according to some
researchers, led to the decline of its popularity, especially in large-scale
assessment (e.g., Douglas & Hegelheimer, 2007; Ockey, 2009). Of primary
concern for CATs is the security of test items (Wainer & Eignor, 2000).
Unlike a linear CBT that presents the same set of tasks to a group of test
takers, a computer adaptive language test provides different questions to test
takers. To limit the exposure of items, CATs require a significantly larger item
pool, which makes the construction of such tests more costly and time-consuming.
suggests that one way to avoid problems associated with test takers’
memorization of test items is to create computer programs that would generate
Some test developers suggest starting a CAT with easy items, whereas others
recommend beginning with items of average difficulty. Additionally, no consensus
has been reached on how the algorithm should proceed with the selection of
items once a test taker has responded to the first question, nor are there
agreed-upon rules on when exactly an adaptive test should stop (Thissen &
Mislevy, 2000). Nonetheless, research is being carried out to address this
issue, and new methods of item selection in computer-adaptive testing such as
the Weighted Penalty Model (see Shin, Chien, Way, & Swanson, 2009) have recently
Another major problem with computer-adaptive tests concerns their reductionist
approach to the measured L2 constructs. Canale (1986) was one of the first to argue
that the unidimensionality assumption deriving from the IRT models used in CATs
poses a threat to the L2 ability construct, making it unidimensional as well. The
main argument suggests that the L2 ability construct should be multidimensional
and consist of multiple constituents that represent not only the cognitive aspects
of language use, but also knowledge of language discourse and the norms of social
interaction, the ability to use language in context, the ability to use metacognitive
strategies, and, in the case of CALT, the ability to use technology. Hence, Chalhoub-Deville
(2010) asserts that, because of the multidimensional nature of the L2 ability
construct, measurement models employed in CBTs must be multidimensional as well—a
requirement that many adaptive language tests do not meet. Finally, the
unidimensionality assumption of IRT also precludes the use of integrated
language tasks in computer-adaptive assessment (Jamieson, 2005). As a result of
some of these problems,
ETS, for instance, decided to abandon the computer-adaptive mode that was
employed in TOEFL CBT and instead return to the linear approach in the newer
The limitations of the adaptive approach prompted some researchers to move
toward semi-adaptive assessment (e.g., Winke, 2006). The advantages of this type
of assessment include a smaller number of items (compared to linear tests) and
no need to satisfy IRT assumptions. Thus, Ockey (2009) argues
that semi-adaptive tests can be the best compromise between adaptive and linear
approaches and predicts that they will become more widespread in medium-scale
Automated scoring is another contentious area of CALT. One of the main issues
with automated scoring of constructed responses, both for writing and for
speaking assessment, is related to the fact that computers look only at a limited set
of features in test takers’ output. Even though research studies report
high correlation indices between the scores assigned by AWE systems and human
raters (e.g., Attali & Burstein, 2006), Douglas (2010) points out that it
is not clear
whether the underlying basis for these scores is the same. Specifically, he asks,
“are humans and computers giving the same score to an essay but for different
reasons, and if so, how does it affect our interpretations of the scores?”
(Douglas, 2010, p. 119). He thus concludes that although “techniques of computer-assisted
natural language processing become more and more sophisticated, . . . we are
still some years, perhaps decades, away from being able to rely wholly on such
systems in language assessment” (Douglas, 2010, p. 119). Since machines do not
understand ideas and concepts and are not able to evaluate meaningful
writing, critics contend that AWE “dehumanizes the writing situation, discounts
the complexity of written communication” (Ziegler, 2006, p. 139) and “strikes a
death blow to the understanding of writing and composing as a meaning-making
activity” (Ericsson, 2006, p. 37).
Automatic scoring of speaking skills is even more problematic than that of
writing. In particular, speaking assessment involves an extra step which writing
assessment does not have: recognition of the input (i.e., speech). Unlike
writing assessment, the assessment of speaking also requires the evaluation of
segmental features (e.g., individual sounds and phonemes) and suprasegmental
features (e.g., tone, stress, and prosody). Automated evaluation systems
cannot perform at the level of human raters and cannot evaluate coherence,
content, and logic the way humans do. Other challenges faced by CALT are
related to task types and design, namely
the use of multimedia and integrated tasks. Although the use of multimedia
is believed to result in a greater level of authenticity in test tasks by providing
more realistic content and contextualization cues, it remains unclear how the
inclusion of multimedia affects the L2 construct being measured by CBTs
(Jamieson, 2005). Some researchers even question the extent to which multimedia
enhances the authenticity of tests (e.g., Douglas & Hegelheimer, 2007)
since comparative studies on the role of multimedia in language assessment have
yielded mixed results (see Ginther, 2002; Wagner, 2007; Suvorov, 2009). With regard to
integrated tasks, their implementation in CBTs is generally viewed favourably
because such tasks seem to better reflect what test takers would be required to
do in real-life situations. The use of integrated tasks is therefore believed to
enhance the authenticity of language tests (Fulcher & Davidson, 2007). However, Douglas
(2010) warns that the interpretation of integrated tasks can be problematic:
if the test taker’s performance is inadequate, it is virtually impossible to find
out whether such performance is caused by one of the target skills or their
combination. This concern appears to be more relevant in high-stakes testing
than in low-stakes testing.
To sum up, all
the negative aspects and caveats associated with CALT mentioned so far are
worthy of concern and research, but they should not lead to suspicion
of CALT. Technology can be instrumental in expanding and innovating
language testing. Since its advent, CALT has changed existing
testing practices to bring them in line with the needs of the 21st-century
e-generation of second language learners, making them more flexible,
innovative, individualized, efficient, and fast. The realization of these
benefits and their implications is making CALT an integral part of
today’s education system and is helping to enhance the quality and
standard of education. In CALT, we are witnessing
opportunities that we need to reflect on and capitalize on.
References
Alderson, J. C. (1988). Innovations in language testing: Can the microcomputer
help? Special Report No. 1, Language Testing Update. Lancaster, UK: University
of Lancaster.
Alderson, J. C. (1990). Learner-centered testing through computers:
Institutional issues in individual assessment. In J. de Jong & D. K. Stevenson
(Eds.), Individualizing the assessment of language abilities. Clevedon, UK:
Multilingual Matters.
Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press.
& Monville-Burston, M. (1995). Practical design and implementation
considerations of a computer-adaptive foreign language test: The Monash/
Melbourne French CAT. CALICO Journal, 13(1), 26-46
Chapelle, C. A., & Douglas, D. (2006). Assessing language through computer
technology. Cambridge: Cambridge University Press.
Center for Applied Linguistics: www.cal.org (Accessed 10 June 2012).
and Hricko, M. (eds.): 2006, Online Assessment and Measurement: Case Studies
from Higher Education, K-12 and Corporate. Idea Group, Hershey, PA
Noijons, J. (1994). Testing computer assisted language tests: Towards a
checklist for CALT. CALICO Journal, 12(1), 37-58.
www.market-leader.net), www.ecollege.com, & www.myenglishlab.com (Accessed
15 June 2012)
(1986). Using the Writer’s Workbench in composition teaching and testing. In C.
Stansfield (ed.), Technology and language testing (pp. 167—88). Washington, DC: