|
function showContent(){
?>
Grades and Grading
Researchers have attempted to automate the grading of student essays since the 1960s. The early approach was to define a large number of objectively measurable features of the essays, such as essay length and average word length, and to use multiple linear regression to predict the scores that human graders would give these essays. Even in this early work, results were surprisingly good. The scores assigned by computer correlated at around .50 with those of the English teachers who provided the manually assigned grades. This was about as well as the English teachers correlated with each other. More recent systems consider more complex features of essays. For example, work at ETS (Educational Testing Service) has attempted to simulate criteria similar to those a human judge would use, emphasizing sophisticated techniques from computational linguistics to extract syntactic, rhetorical, and content features. The Intelligent Essay Assessor (IEA) attempts to represent the semantic content of essays by using features that group associated words together via singular value decomposition (SVD).
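The early feature-based approach can be illustrated with a minimal sketch. It assumes only the two surface features named above (essay length in words and average word length) plus an intercept, and fits ordinary least squares through the normal equations. The function names `surface_features`, `solve`, `fit`, and `predict` are hypothetical, chosen for illustration, and are not taken from any of the systems mentioned.

```python
def surface_features(essay):
    """Return [1.0, word_count, average_word_length] for one essay.

    The leading 1.0 is the intercept term. The two features are the
    examples named in the text; real systems used many more.
    """
    words = essay.split()
    n = len(words)
    avg_len = sum(len(w) for w in words) / n if n else 0.0
    return [1.0, float(n), avg_len]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination
    with partial pivoting. Assumes the system is non-singular."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit(essays, scores):
    """Least-squares weights from the normal equations (X'X) w = X'y."""
    rows = [surface_features(e) for e in essays]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * s for r, s in zip(rows, scores)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, essay):
    """Predicted score: dot product of the weights and the essay's features."""
    return sum(w * f for w, f in zip(weights, surface_features(essay)))
```

In practice the weights would be fitted on essays that already have human-assigned grades, and `predict` would then be applied to ungraded essays.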
The present approach to automated essay grading involves statistical classifiers. Although this approach was a new way to attack the essay grading problem when we first reported it, it is widely used in information retrieval and text categorization applications. Binary classifiers were trained to distinguish "good" from "bad" essays, and the scores output by these classifiers were used to rank essays and assign grades to them. The grades based on these classifiers can either be used alone or combined with other simple variables in a linear regression.
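As a rough sketch of the idea of ranking essays by a binary classifier's output score, the following uses a multinomial naive Bayes model with add-one smoothing. This choice of classifier is an assumption made for illustration; the text above says only that statistical binary classifiers were used. The names `train`, `score`, and `rank` are likewise hypothetical.

```python
import math
from collections import Counter


def train(good_essays, bad_essays):
    """Count word frequencies per class for a multinomial naive Bayes model.

    Assumes both lists are non-empty. Returns the per-class counts,
    the shared vocabulary, and the log prior odds of 'good'.
    """
    good = Counter(w for e in good_essays for w in e.lower().split())
    bad = Counter(w for e in bad_essays for w in e.lower().split())
    vocab = set(good) | set(bad)
    prior = math.log(len(good_essays) / len(bad_essays))
    return good, bad, vocab, prior


def score(model, essay):
    """Return the log-odds that the essay is 'good'.

    Higher values mean the essay looks more like the 'good' training
    essays. Words never seen in training are ignored.
    """
    good, bad, vocab, prior = model
    g_total = sum(good.values()) + len(vocab)  # add-one smoothing
    b_total = sum(bad.values()) + len(vocab)
    total = prior
    for w in essay.lower().split():
        if w in vocab:
            total += math.log((good[w] + 1) / g_total)
            total -= math.log((bad[w] + 1) / b_total)
    return total


def rank(model, essays):
    """Order essays from most to least 'good' by classifier score."""
    return sorted(essays, key=lambda e: score(model, e), reverse=True)
```

The ranked scores could then be mapped to grades directly, or entered as one predictor alongside other simple variables in a linear regression, as described above. How that mapping is done is not specified here, so the sketch stops at the ranking step.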
It may seem strange to treat grading as a binary classification problem ("good" versus "bad") rather than as a choice among n > 2 alternatives, with a class for each possible numeric grade, 1 through 6. However, poorly written essays with the same grade do not necessarily resemble each other. Pilot studies performed for this project showed better performance from training a classifier to recognize good essays than from training classifiers to identify bad, fair, mediocre, etc. essays.
Five data sets were obtained from Educational Testing Service. The sets varied in the number of points in their grading scales and in their size. They covered widely different content areas and were aimed at different age groups. The first set, Soc, was a social studies question where certain facts were expected to be covered. The second set, Phys, was a physics question requiring an enumeration and discussion of different kinds of energy transformations in a particular situation. The third set, Law, required the evaluation of a legal argument presented in the question. The last two question sets, G1 and G2, were general questions from an exam for college students who want to pursue graduate studies. G1 was a very general opinion question intended to evaluate how well the student could present a logical argument. G2 presented a specific scenario with an argument the student had to evaluate. All the questions except G1 required the student to cover certain points. In contrast, a good answer to G1 would be judged less by what was covered than by how it was expressed.
}
function inThisSection() {
global $switchInThisSection;
if ($switchInThisSection == 1){
include('sub_menu_1_2.php');
}
}
?> |