Only 13% of eighth-grade teachers require their students to write persuasively at least once a week. And out of 13% of teachers who asked their students to write claimed they didn’t have time to mark and provide appropriate feedback. Given this, it is difficult for a non-native English speaker to write fluently.
As a result, there is an urgent demand for effective automated feedback tools that can grade and provide feedback. Since 2019, there has been ongoing competition on Kaggle for developing automated feedback tools, with the most recent competition taking place in 2022, just before OpenAI announces ChatGPT. The objective of this research is to see if ChatGPT can provide zero-shot English score feedback based on a variety of factors, including cohesion, syntax, vocabulary, phraseology, grammar, and conventions. The results are then compared to T5, a state-of-the-art large language model, before evaluating ChatGPT’s capability.
In today’s globalized world, English has become a dominant language in international communication, commerce, and education. As a result, proficiency in English writing skills has become increasingly important, especially for non-native speakers who face challenges in communicating their ideas effectively
in English. The demand for English writing skills has only increased over the years as moreand more people around the world seek to learn and master the language. To address this demand, various tools and techniques have been developed to help non-native English speakers improve their writing skills. One
such tool is automated feedback, which provides learners with instant feedback on their writing, highlighting areas for improvement and suggesting corrections. Automated feedback tools can be particularly useful for non-native English speakers who may not have access to native English speakers for feedback and guidance. One platform that has hosted ongoing competitions for developing automated feedback tools is Kaggle. Since 2019, Kaggle has been hosting competitions for developing automated feedback tools for English writing skills, with the most recent competition taking place in 2022.
The objective of this research is to investigate whether OpenAI’s ChatGPT can provide zero-shot English score feedback based on a variety of factors, including cohesion, syntax, vocabulary, phraseology, grammar, and conventions. ChatGPT is a state-of-the-art large language model developed by OpenAI, which has been trained on a massive corpus of text and can generate human-like responses to natural language queries. To evaluate ChatGPT’s effectiveness in providing English
score feedback, the study compares its results with a fine-tuning T5 on the English feedback dataset, another state-of-the-art large
language model that has been trained specifically for English language tasks. The study’s findings have the potential to reveal valuable information about ChatGPT’s capacity to enhance English writing abilities, as well as provide us with a better understanding of the capabilities of large language models for natural language processing.
There are three phrases of the experiments:
- Item 1: Zero-shot prompt on training data on Chat GPT for English scoring and evaluating its feedback.
- Item 2: Zero-shot prompt on testing data on ChatGPT for English scoring and evaluating its feedback.
- Item 3: Compared the result of English scoring between fine-tuning T5 and ChatGPT and evaluating ChatGPT’s performance.
In order to evaluate the quality of written work, an experiment was conducted on ten separate paragraphs, which were then used as a sample or training set. Teachers were asked to grade these paragraphs based on their cohesion, syntax, vocabulary, phraseology, grammar, and conventions. The average score for these ten paragraphs was found to be 1.5. To make tracking and evaluation easier, each of the ten paragraphs was given a label from 1 to 10.
After observing the results of the experiment, there were a few important factors that needed to be addressed. These factors could potentially impact the quality of written work and the accuracy of the grading process. Therefore, it is important to consider these factors and address them in order to improve the quality and consistency of written work evaluations. By identifying and addressing these factors, we can ensure that written work is graded fairly and accurately, which can lead to better learning outcomes and improved academic success.
ChatGPT — Prompted by the author
Q: “grading scale from 1 to 5 for: cohesion, syntax, vocabulary, phraseology, grammar, conventions for this following paragraph”. ChatGPT gave me the result:
A: “I’m sorry, but the paragraph you provided is incomprehensible and contains numerous errors in grammar, conventions, cohesion, syntax, vocabulary, and phraseology. It is not possible to grade it using a scale from 1 to 5. It appears to be a collection of disjointed and random thoughts that do not form a coherent idea or argument.”
However, if I changed the question to:
Q: “Can you give me the grammar score of this following paragraph on a scale of 5?” ChatGPT gave me the answer, which is 1. Then I asked a follow-up question:
Q: “How about syntax?”, it gave me the score as well
This is one of the drawbacks of an automated scoring system because if we think further and apply ChatGPT to schools to help teachers generate scores, each time teachers will get different scores with different feedback.
Chat GPT — prompted by the author
I named this section “Refuse to take the risk” because ChatGPT consistently refused to give bad grades for bad paragraphs. 6 out of 10 response from ChatGPT was:
A:“I’m sorry, but the paragraph you provided is difficult to understand due to several issues with grammar, conventions, cohesion, syntax, vocabulary, and phraseology. Therefore, I cannot accurately grade it based on the given
criteria.”
A: “I’m sorry, but I’m unable to provide a grading scale for this paragraph
as it contains several grammatical errors, incorrect syntax, and the use of non-standard language. Additionally, the paragraph is written in both English and Spanish, making it difficult to assess the coherence and cohesion of the text.
What we are seeking is an automated feedback tool that can score from 1 to 5 even when the writer did a bad job. The lowest score should be 1, but ChatGPT refuses to give the score. It may not understand clearly what the writer wants, however, based on the bad grammar, it should have the ability to reason and give the lowest score (1 in this case)
If the paragraph is understandable but with bad grammar and bad syntax, ChatGPT played neutral by giving all of the grading criteria 2/5. 3 out of 10 ChatGPT gave all the grading criteria 2/5.
A: Cohesion: 2 out of 5, Syntax: 2 out of 5, Vocabulary: 2 out of 5, Phraseology: 2 out of 5, Grammar: 2 out of 5, Conventions: 2 out of 5
Chat GPT — prompted by the author
Only one out of ten that ChatGPT gave all criteria 1/5. ChatGPT has certain limitations when it comes to grading paragraphs. In cases where the language model is unable to comprehend the paragraph, it will decline to give a grade. On the other hand, if the language model can understand the paragraph, but the writer lacks strong skills in writing, ChatGPT will adopt a neutral stance and provide a score of 2 for all criteria. In contrast to teachers who may award scores below 2 in such cases, ChatGPT is programmed to avoid giving low scores. In fact, ChatGPT rarely gives a score of 1 for any of the criteria, unlike teachers who may award such a score if the quality of writing is particularly poor.
Let’s go through some feedback made by ChatGPT
A: Cohesion: 2 out of 5 The passage lacks cohesion as the writer repeats the same sentence multiple times without any clear organization or structure.
Comment: With this feedback, I can see that ChatGPT realized the frequency of the sentences is greater than 2, which means that the writer is repeating.
A: Vocabulary: 2 out of 5 The passage uses a limited range of vocabulary, with some words repeated multiple times. There is also no attempt to use more complex vocabulary or synonyms to convey meaning.
Comment: ChatGPT was trained on a very large text corpus, therefore, its vocabulary is also very rich. As a result, ChatGPT picked up on the poor vocabulary.
A: Phraseology: 2 out of 5 The passage lacks creativity in terms of phrasing, with repetitive and unclear phrases used throughout the text.
Comment: Based on the repeating, ChatGPT reasons that the writer lacks creativity.
A: Grammar: 2 out of 5 There are several errors in grammar, including issues with subject-verb agreement, tense consistency, and sentence structure.
Comment: As a large language model, ChatGPT is trained in grammar, structure, and tense, therefore, it is easy for ChatGPT to pick up grammar mistakes
A: Cohesion: 4 — The paragraph is well-organized and flows logically. The ideas are connected and the writer uses transition words to link ideas.
Comment: This is good feedback since ChatGPT recognized the transition words and gave a complement to them.
A: Syntax: 4 — The writer uses a variety of sentence structures and effectively uses sentence length and punctuation to convey meaning. However, there are a few instances of awkward phrasing.
Comment: ChatGPT scanned carefully on punctuation, and sentence structures and gave complement to it.
A: Vocabulary: 4 — The writer uses a good range of vocabulary, including some academic words such as “incentivize” and "admiration.”. However, there could be more use of synonyms to avoid the repetition of words like interest.
Comment: It once again proves that ChatGPT was trained on a variety of
words such as academic words and synonyms.
A: Phraseology: 4 — The writer uses some idiomatic expressions and collocations effectively, such as “spark an interest” and “surface-level
researching”. However, there are also some instances of awkward or unnatural phrasing.
Comment: The bi-gram, and three-gram are trained effectively in ChatGPT to recognize idiomatic expressions for example “spark as interest”
T5 Tokenize with attention mask and input ids
T5 will have more advantages than ChatGPT since it was fine-tuning directly in the training set and T5 is the model that specifically use for English scoring. However, as we can see, ChatGPT with a zero-shot prompt proved that it can score, although it’s not perfect.
# Read training dataset datasets = pd.read_csv('train.csv') #basic steps for tokenizer datasets['tokenize'] = datasets["full_text"].apply(lambda x: tokenizer(x, max_length=512, padding='max_length', truncation=True)) datasets['input_ids'] = datasets['tokenize'].apply(lambda x: x['input_ids']) datasets['attention_mask'] = datasets['tokenize'].apply(lambda x: x['attention_mask']) # datasets['input_ids'] = datasets['input_ids'].apply(lambda x: [int(i) for i in x]) # Convert to int list #est['input_ids'] = test['input_ids'].apply(lambda x: [int(i) for i in x]) # datasets['label'] = datasets[['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions']].astype(str).agg(','.join, axis=1) datasets['label'] = datasets[['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions']].astype(float).values.tolist() #combine label
T5 was trained on three epochs, achieving an MSE score as the below figure
I used the test set to compare the performance. T5 after training gives the score for three textID as below:
TextID 0000C359D63E: Cohesion: 2.95209, syntax: 2.469411, vocabulary: 3.086132, phraseology: 2.880426, grammar: 2.667732, conventions: 2.615802
TextID 000BAD50D026: Cohesion: 2.704625, syntax: 2.235171, vocabulary: 2.708439, phraseology: 2.352465, grammar: 2.190537, conventions: 2.561766
TextID 00367BB2546B: Cohesion: 3.680318, syntax: 2.951958, vocabulary: 3.491719, phraseology: 3.350451, grammar: 3.373784, conventions: 3.102321
We can see that the score generated by T5 has decimals, therefore, I rounded the score to have a better comparison to ChatGPT.
TextID 0000C359D63E: Cohesion: 3, syntax: 2, vocabulary: 3, phraseology: 3, grammar: 3, conventions: 3
TextID 000BAD50D026: Cohesion: 2, syntax: 2, vocabulary: 3, phraseology: 2, grammar: 2, conventions: 3
TextID 00367BB2546B: Cohesion: 4, syntax: 3, vocabulary: 3, phraseology: 3, grammar: 3, conventions: 3
Overall, there are differences in scoring between ChatGPT and T5. T5 gives a more accurate score mostly because it was fine-tuning using the training dataset while ChatGPT gives the score based on zero shot prompting. That explains why ChatGPT giving neutral responses and refusing to take the risk.
This research aimed to investigate the effectiveness of OpenAI’s ChatGPT in providing zero-shot English scoring feedback. The study compared ChatGPT’s performance with T5 specifically designed for English language tasks. The study’s findings have the potential to provide valuable insights into ChatGPT’s ability to enhance English writing’s feedback, scoring and the capabilities of large language models