The automated evaluation of student assignments is a significant challenge in education. Traditional grading is labor-intensive and susceptible to biases and inconsistencies arising from grader fatigue and subjective interpretation. This blog post introduces an AI-assisted grading system built on the CrewAI framework to address these limitations. By employing a multi-agent approach, we harness Large Language Models (LLMs) to streamline the grading process, enhance objectivity, and give students more detailed feedback. CrewAI lets us decompose the complex grading task into manageable sub-tasks, each handled by a specialized AI agent. This makes it possible to pick the LLM best suited to each aspect of the evaluation and to use targeted prompts that keep assessments accurate and relevant.
Our AI-assisted grading system is built upon the CrewAI framework, which enables the creation of collaborative multi-agent systems. The architecture comprises two distinct agents: a Grader and a Checker. The Grader agent is responsible for the initial evaluation of student submissions based on a predefined rubric. The Checker agent then reviews the Grader's assessment, ensuring accuracy, consistency, and alignment with the rubric's learning outcomes.
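To make this concrete, here is a minimal sketch of how such a pair of agents can be declared in CrewAI. The `role`, `goal`, and `backstory` strings are illustrative placeholders, not the exact prompts our system uses:

```python
from crewai import Agent

# Illustrative agent definitions; the wording below is assumed.
grader = Agent(
    role="Grader",
    goal="Evaluate a student submission against every learning outcome in the rubric",
    backstory="An experienced instructor who grades strictly according to the rubric.",
)

checker = Agent(
    role="Checker",
    goal="Review the Grader's evaluation for accuracy, consistency, and rubric alignment",
    backstory="A meticulous moderator who double-checks assigned scores and feedback.",
)
```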
The grading process is divided into two primary tasks: Evaluation and Reviewing. The Evaluation task involves the Grader agent analyzing the student's submission, comparing it against the rubric, and assigning scores and feedback for each learning outcome. The Reviewing task is performed by the Checker agent, who examines the Grader's evaluation to verify the accuracy of the assigned scores and the quality of the feedback provided.
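The two tasks map naturally onto CrewAI `Task` objects. In the sketch below, the `description` and `expected_output` strings are simplified stand-ins for our actual prompts; the `context` argument is what hands the Grader's output to the Checker:

```python
from crewai import Task

# Curly-brace placeholders are filled in when the crew is kicked off (see below).
evaluation = Task(
    description=(
        "Assignment: {assignment}\n"
        "Rubric: {rubric}\n"
        "Submission: {submission}\n"
        "Grade the submission against each learning outcome in the rubric, "
        "assigning a score and written feedback per outcome."
    ),
    expected_output="A score and a feedback paragraph for every learning outcome.",
    agent=grader,
)

reviewing = Task(
    description=(
        "Review the Grader's evaluation. Verify that each score is justified by "
        "the rubric ({rubric}) and that the feedback is accurate and specific."
    ),
    expected_output="A verification report, including any suggested mark adjustments.",
    agent=checker,
    context=[evaluation],  # passes the Grader's output to the Checker
)
```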
Both agents receive the following inputs: the assignment description, the grading rubric, and the student's submission. These inputs, combined with carefully crafted prompts, guide the LLMs in performing their respective tasks. The prompts are designed to provide clear instructions and context, ensuring that the agents focus on the key aspects of the assignment and the learning outcomes defined in the rubric.
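Assembling the crew and supplying the three inputs then looks roughly like this. The file names are hypothetical; `kickoff` interpolates the inputs into the `{placeholder}` fields of the task descriptions:

```python
from pathlib import Path
from crewai import Crew, Process

crew = Crew(
    agents=[grader, checker],
    tasks=[evaluation, reviewing],
    process=Process.sequential,  # grade first, then review
)

result = crew.kickoff(inputs={
    "assignment": Path("assignment.md").read_text(),        # hypothetical paths
    "rubric": Path("rubric.md").read_text(),
    "submission": Path("student_submission.py").read_text(),
})
print(result)
```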
To evaluate the system, we used programming assignments and their corresponding rubrics, testing various combinations of LLMs across the two tasks. We assessed the results by hand, qualitatively reviewing both the grade assigned and the grade review generated by the crew.
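Because each CrewAI agent accepts its own `llm`, trying a different grader/checker pairing is a one-line change per agent. A sketch, using model identifiers from the tables below (depending on your setup, LiteLLM may require a provider prefix for some of them):

```python
from crewai import LLM

# Example pairing; any model from the tables below can fill either role.
# The low temperature is our assumption here, chosen to make grading
# more deterministic; it is not necessarily the original setting.
grader_llm = LLM(model="groq/llama-3.3-70b-versatile", temperature=0.1)
checker_llm = LLM(model="gpt-4o-mini", temperature=0.1)

# Bound to the agents at construction time: Agent(..., llm=grader_llm)
```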
Large Language Model | As grader | As reviewer | Human assessment of crew |
---|---|---|---|
groq/llama-3.3-70b-versatile | In one case the feedback was limited to the rubric's level; in another it provided detailed feedback. Provided good overall feedback. | Accurate assessment of the assignments | Very good |
grok-2-1212 | In both cases it provided accurate feedback for each learning outcome. Provided good overall feedback. | Accurate assessment of the assignments | Very good |
gpt-4o-mini | Provided very good feedback and suggestions. Provided good overall feedback. | The verification accurately reflects the correspondence between the evaluation and the rubric | Very good |
gemini/gemini-2.0-flash-exp | In one case the feedback was limited to the rubric's level. Provided good overall feedback. | The verification accurately reflects the correspondence between the evaluation and the rubric | Very good |
An additional overall observation is that in most cases the output of a single LLM varied between assignments, sometimes arriving as a table and other times as plain text. A more detailed prompt could help reduce this variability.
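One sketch of such a tightening, which we have not tested, is to pin the output format in the Evaluation task's `expected_output` contract:

```python
# Hypothetical, stricter output contract intended to force the same
# table format on every run; the wording is illustrative.
evaluation.expected_output = (
    "A Markdown table with exactly one row per learning outcome and the "
    "columns: Learning Outcome | Score | Feedback, followed by a single "
    "overall-summary paragraph. Output no other format."
)
```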
A second set of submissions was evaluated in the same way:

Large Language Model | As grader | As reviewer | Human assessment of crew |
---|---|---|---|
groq/llama-3.3-70b-versatile | Perfectly identified the limitations and errors in one submission. | The verifier made a good assessment and highlighted the generous score given for one problem in the submission. For one submission, the verifier did not provide a detailed explanation of the verification. | Very good |
grok-2-1212 | Identified errors and limitations for one submission. For one of the submissions, it did not provide detailed feedback. | The verification was accurate with respect to the rubric. | Very good |
gpt-4o-mini | Provided accurate feedback for the submissions. | Accurately highlighted that some marks needed to be downgraded. | Excellent |
gemini/gemini-2.0-flash-exp | Identified errors and limitations in one submission. Did not provide detailed feedback for the good submission. | Suggested adjusting the mark for a poorly satisfied learning outcome. | Excellent |
In addition, all the models provided a summary of the overall evaluation.
The experiments demonstrate the potential of AI-assisted grading using the CrewAI framework. Overall, the evaluations produced by the AI agents were of high quality, providing valuable feedback to students. However, there was some variability in the marks assigned by different models, highlighting the need for careful selection and configuration of LLMs for specific grading tasks.
One notable observation is the effectiveness of the verification agent in identifying and correcting inconsistencies in the initial evaluations. In several instances, the verification agent proposed revised marks with clear justifications, demonstrating its ability to ensure fairness and accuracy in the grading process.
While the grading agents generally provided detailed and constructive feedback, there were a few cases where the feedback was limited or lacked specific details. Similarly, there was one instance where the reviewer's assessment was not as comprehensive as expected. These instances underscore the inherent variability in LLM outputs, even when using the same models and prompts across different assignments. This variability suggests that further refinement of prompts and agent configurations may be necessary to improve the consistency and reliability of the AI-assisted grading system.
In summary, our results indicate that CrewAI can be a valuable tool for automating and enhancing the grading process. The multi-agent architecture allows for the decomposition of the grading task into smaller, more manageable sub-tasks, enabling the use of specialized LLMs and targeted prompts. However, it is important to acknowledge the variability in LLM outputs and to continuously monitor and refine the system to ensure consistent and accurate evaluations. Future research should focus on developing more robust prompts, exploring different agent configurations, and incorporating mechanisms for detecting and mitigating potential biases in LLM-generated feedback.