Robo-graders like long words, not so big on intellectual coherence

When I glanced at the title of a recent New York Times piece on automated essay grading, “Facing a Robo-Grader? Just Keep Obfuscating Mellifluously,” I assumed it was just another fluffy popular science article. Surely no serious organization would use a computer program to grade essays. Not long into the article, however, I discovered that the “robo-grader,” named the E-rater, was developed not by university scientists but by the Educational Testing Service — the organization that administers the GRE and the TOEFL, among other exams.

For now, E-rater only grades essays that are also read by a human grader. Though the grades given by humans and E-rater have been remarkably similar, Les Perelman, an MIT professor, has his reservations about the software. After a month of testing, he has determined that E-rater favors long paragraphs and sentences, connecting words like “moreover,” and words with many syllables. Most troubling, E-rater can’t judge the truth or intellectual coherence of the statements in an essay, a weakness Perelman exploits to hilarious effect in an example essay.
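To make those preferences concrete, here is a small Python sketch that measures the same kinds of surface features: sentence length, connecting words, and syllables per word. It is only an illustration of why such features are easy to game. The word list, the function names, and the vowel-counting syllable estimate are all my own assumptions, not anything taken from E-rater.

```python
import re

# Hypothetical list of connecting words; not E-rater's actual lexicon.
CONNECTIVES = {"moreover", "however", "furthermore", "therefore", "consequently"}


def count_syllables(word):
    """Rough estimate: count runs of vowels (not linguistically exact)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def surface_features(essay):
    """Return crude length and vocabulary statistics for an essay."""
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[A-Za-z']+", essay)
    if not sentences or not words:
        return {"words_per_sentence": 0.0, "syllables_per_word": 0.0, "connectives": 0}
    return {
        "words_per_sentence": len(words) / len(sentences),
        "syllables_per_word": sum(count_syllables(w) for w in words) / len(words),
        "connectives": sum(w.lower() in CONNECTIVES for w in words),
    }


plain = surface_features("It rains. We go home.")
padded = surface_features("Moreover, precipitation continues; consequently, we relocate homeward.")
print(plain)
print(padded)
```

Nothing in these numbers depends on whether the essay says anything true, which is the gap Perelman's testing points to.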

Learning about these loopholes made me curious to find out how the E-rater algorithm works. ETS is fairly transparent about the research behind its computer program. Much of that research has been published in peer-reviewed journals, and the full bibliography is available on the ETS website. The heart of the ETS algorithm is “natural language processing” (NLP), the details of which can be found here.

Other scientists are conducting research to try to close some of the obvious loopholes of the E-rater. Some of the most interesting work has to do with coherence (the logical flow from one sentence to another) and topicality (the relationship of key words and phrases to the essay prompt). Researchers at the University of Pennsylvania incorporated what’s known as centering theory into the E-rater algorithm in order to grade coherence. In this implementation, an essay is broken into phrases, and then the logical connection between adjacent phrases is scored. When this information was included in E-rater’s scoring, the computer-generated scores more closely matched those of human graders.
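The general shape of that idea, splitting the text into units and scoring each adjacent pair, can be sketched in a few lines of Python. To be clear, this is not centering theory and not the Penn researchers' method: as a stand-in I use plain word overlap between neighboring sentences, and the stopword list and function names are my own assumptions.

```python
import re

# Hypothetical, deliberately tiny stopword list for the illustration.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "on"}


def content_words(sentence):
    """Lowercased words in a sentence, minus the stopwords."""
    return {w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS}


def adjacent_coherence(essay):
    """Average word overlap (Jaccard) between each pair of neighboring sentences."""
    units = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    scores = []
    for left, right in zip(units, units[1:]):
        a, b = content_words(left), content_words(right)
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    # An essay with fewer than two sentences has no adjacent pairs to score.
    return sum(scores) / len(scores) if scores else 0.0


print(adjacent_coherence("The dog chased the cat. The cat climbed a tree."))
print(adjacent_coherence("The dog barked. Taxes rose sharply."))
```

The first pair shares a word and scores above zero, while the second pair shares nothing and scores zero. A real coherence model would track what each sentence is about rather than which words repeat, so treat this as a toy.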

It’s interesting research on human language, to be sure, and possibly a useful tool for some teachers. A replacement for a person? Probably not anytime soon.

