What Aspects of Speaking Can AI Assess?

The Future of AI Graders

In AI, the movie by Steven Spielberg, the robot child named David attains what most experts consider almost impossible for AI to achieve, the ability to love. With this level of technology, it would be possible for AI graders to assess every aspect of test takers’ speaking skills.
On the other hand, in the movie Blade Runner, the protagonist, played by the beloved Harrison Ford, distinguishes a replicant from a human through conversation. One of his questions goes as follows: “One more question. You’re watching a stage play. A banquet is in progress. The guests are enjoying an appetizer of raw oysters. The entree consists of boiled dog.” When the subject responds normally to this question, the protagonist knows she is a replicant. In this example, the AI is not good at judging how coherent or logical a conversation or passage is.
These movie scenes aside, we ran an experiment to find out which aspects of speaking AI is good at assessing. We used two subjects. One was a C1-plus speaker (Subject A). His ability had been confirmed by many tests, including iTEP, IELTS, TEAP, EIKEN, and G-CAS, whose results placed him at C1-plus on average. The other subject (Subject B) was a B2 speaker, confirmed in almost the same way.


These two subjects took a widely accepted AI speaking assessment created by a British company five times each over two days. This test assesses test takers’ speaking skills in about a dozen minutes and returns the score within a few minutes. The subjects took the attempts several hours apart, and Subject A changed the way he spoke slightly each time. On the first attempt, Subject A got 68 out of 80, a mediocre score considering his overall speaking skills. On the fifth attempt, he got 77, close to the maximum score. Subject B’s first score was 62, and her last score was 63. Her best score was 66. Her scores were mostly stable.
Subject A’s rapid rise in score and Subject B’s stability suggested that the AI grader concentrates on only some aspects of speaking. Subject A is an international businessperson and educator and is accustomed to enunciating his words so that non-native speakers understand what he says. In other tests, graders rated his pronunciation, intonation, and fluency highly, while this AI test did not do so on the first attempt. So he adjusted his speaking style, linking more words and speaking faster, which resulted in a gradual rise in his score to nearly the maximum. The rise is primarily due to Subject A’s ability to speak two types of English: 1) intentional enunciation and 2) native speaker-like speech. The AI grader scored the second type much higher. Here we should note that, generally, the faster a person speaks, the more the coherence of the content deteriorates. Subject A was no exception, yet his score went up drastically.
Subject B was a young learner of English who gave her best effort every time she took the test. Her pronunciation, intonation, and speed stayed mostly the same across attempts, as did her accuracy and content. After all, as readers know, it is virtually impossible to improve one’s speaking skills drastically in two days.
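To put the two score changes side by side, here is a minimal sketch in Python. It uses only the first and last scores reported above and the 80-point maximum. The function name score_gain is our own, chosen for illustration, and is not part of the test.

```python
MAX_SCORE = 80  # maximum score on the test, as reported above


def score_gain(first: int, last: int, max_score: int = MAX_SCORE) -> tuple[int, float]:
    """Return the change in points and that change as a percentage of the scale."""
    points = last - first
    return points, 100 * points / max_score


# First and last attempts only; the intermediate scores are not reported here.
subject_a = score_gain(68, 77)
subject_b = score_gain(62, 63)

print("Subject A:", subject_a)
print("Subject B:", subject_b)
```

Subject A gained 9 points, or a little over 11 percent of the scale, while Subject B gained 1 point.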
Although this was a small personal experiment, it suggested some essential things about assessment by AI: it is still in the developmental stage and has both advantages and disadvantages. Unfortunately, assessment robots have not reached the level we see in the movies.
Here we want to share what we found about AI assessment. (Please note that these are still only hypotheses and need more experiments to confirm.) We assume AI graders measure how close test takers’ speech is to native speakers’ sound data, drawn from large datasets collected in English-speaking countries. They use state-of-the-art speech recognition technology to transcribe test takers’ answers into text and assess their grammar and vocabulary. We also assume that AI graders measure response time and speaking speed.
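To make the last of these hypotheses concrete, here is a minimal sketch in Python of how response time and speaking speed could be computed from a transcript and a few timestamps. It illustrates our assumption only and is not the method of any actual test. The function names, the whitespace-based word count, and the sample values are all hypothetical.

```python
def words_per_minute(transcript: str, speech_seconds: float) -> float:
    """Speaking speed: whitespace-separated words per minute of speech."""
    if speech_seconds <= 0:
        raise ValueError("speech_seconds must be positive")
    return len(transcript.split()) * 60.0 / speech_seconds


def response_latency(prompt_end: float, speech_start: float) -> float:
    """Response time: seconds between the end of the prompt and the first word."""
    return max(0.0, speech_start - prompt_end)


# Hypothetical sample: a six-word answer spoken in three seconds,
# starting 1.5 seconds after the prompt ends.
answer = "I usually commute to work by train"
print(words_per_minute(answer[2:], 3.0))  # drops the leading "I " to leave six words
print(response_latency(10.0, 11.5))
```

A grader that rewards higher values of the first measure and lower values of the second would favor fast, linked speech over careful enunciation, which is consistent with what Subject A experienced.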
The advantages are: 1) grading is 100 percent objective, with no room for human subjectivity; and 2) results arrive within several minutes, which can open up many opportunities for educators. The disadvantages are: 1) AI graders probably cannot assess the coherence and logic of speech; and 2) they do not value enunciation or slow, easy-to-follow speech, which is commonly appreciated in global communication.
All in all, considering these advantages and disadvantages, it is crucial to know where to use AI graders and where to use human graders. For example, we would use AI graders to screen for a particular level of pronunciation and grammar, and human graders to select a candidate for an academic or business endeavor.
Let us see how far AI can go by 2030. Will it reach the level we see in the movie AI or in Blade Runner?
