A Comparative Empirical Evaluation of Single-Agent and Multi-Agent LLM Prompting Strategies for Automated Formative Feedback in Education
DOI: https://doi.org/10.70393/6a696574.343135
ARK: https://n2t.net/ark:/40704/JIET.v1n2a05
Disciplines: Intelligent Systems
Subjects: Other
References: 28
Keywords: Large Language Model, AI Agent, Formative Feedback, Multi-Agent Prompting
Abstract
Automated formative feedback has emerged as a focal point in educational technology research, as large language models (LLMs) offer the prospect of providing personalized commentary on student writing at a scale that human instructors alone cannot match. What is less well examined, however, is how the underlying prompting design—particularly the choice between single-agent and multi-agent setups—shapes the pedagogical value of the feedback produced. To examine this question, we conducted a controlled comparison across four prompting configurations on a corpus of 200 undergraduate argumentative essays: a zero-shot single-agent baseline, a chain-of-thought single-agent variant, a dual-role multi-agent pipeline in which one model drafts feedback and another critiques it, and a tri-role multi-agent pipeline that introduces a dedicated revision stage on top of the draft-and-critique loop. Each set of feedback outputs was assessed along a multi-dimensional rubric covering accuracy, specificity, constructiveness, and tone, with three trained raters scoring independently. We also computed automated textual similarity metrics against expert-authored reference feedback to complement the human ratings and provide a more independent check. The tri-role multi-agent configuration produced the highest composite quality scores and, notably, the lowest rates of over-praise and hallucinated claims about essay content. The chain-of-thought single-agent variant, while not topping the rankings, delivered surprisingly close quality at a fraction of the inference cost, making it an attractive option when computational budget or latency matters. We close by discussing what these patterns mean in practice for educators and developers looking to integrate LLM-based feedback agents into higher-education writing workflows at scale.
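The tri-role configuration described in the abstract (draft, critique, then a dedicated revision stage) can be pictured as three chained model calls. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function `tri_role_feedback`, the `FeedbackTrace` container, the role labels, and the prompt wording are all hypothetical, and the model is represented by any callable that maps a prompt string to a completion string. A stub model stands in for a real LLM so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: an "LLM" here is any function from prompt text to completion text.
LLM = Callable[[str], str]


@dataclass
class FeedbackTrace:
    """Intermediate and final outputs of one pipeline run (hypothetical container)."""
    draft: str
    critique: str
    final: str


def tri_role_feedback(essay: str, llm: LLM) -> FeedbackTrace:
    """Run draft -> critique -> revise; prompts are illustrative, not from the paper."""
    # Stage 1: the drafter sees only the essay.
    draft = llm(
        "ROLE: drafter\n"
        "Write formative feedback on the essay below.\n\n"
        f"ESSAY:\n{essay}"
    )
    # Stage 2: the critic sees the essay and the draft feedback.
    critique = llm(
        "ROLE: critic\n"
        "Check the draft feedback for over-praise and for claims "
        "that the essay does not support.\n\n"
        f"ESSAY:\n{essay}\n\nDRAFT FEEDBACK:\n{draft}"
    )
    # Stage 3: the reviser sees the essay, the draft, and the critique.
    final = llm(
        "ROLE: reviser\n"
        "Rewrite the draft feedback so that it addresses the critique.\n\n"
        f"ESSAY:\n{essay}\n\nDRAFT FEEDBACK:\n{draft}\n\nCRITIQUE:\n{critique}"
    )
    return FeedbackTrace(draft=draft, critique=critique, final=final)


def echo_llm(prompt: str) -> str:
    """Stub model: returns a tag naming the role found on the prompt's first line."""
    role = prompt.splitlines()[0].split(": ", 1)[1]
    return f"[{role} output]"


trace = tri_role_feedback("Essay text goes here.", echo_llm)
print(trace.final)
```

Dropping the third call and returning the critique alongside the draft would give a dual-role variant in the same style. Each added stage is one more model call per essay, which is consistent with the cost trade-off the abstract notes for the cheaper single-agent chain-of-thought variant.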
Copyright (c) 2026 The author retains copyright and grants the journal the right of first publication.

This work is licensed under a Creative Commons Attribution 4.0 International License.