Martin Krizan
02.10.2025
Objective: To evaluate the instruction adherence, accuracy, usability, and reliability of an AI chatbot through structured defect tracking.
Issue: Instruction Handling
Severity: Major
Description: The AI frequently ignores user instructions and performs actions contrary to the request.
Recommendation: Improve prompt parsing logic to ensure user commands are followed precisely.
Issue: Overpromising
Severity: Critical
Description: The AI claimed it could provide real-world data but used fabricated data without disclosure.
Recommendation: Implement a fact-checking mechanism before response generation, and add clear disclaimers before a project begins whenever synthetic data is used.
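One lightweight form of such a disclaimer is sketched below. It assumes the response pipeline already tracks whether a reply relied on synthetic data; the function name and `usedSyntheticData` flag are hypothetical, not part of any actual chatbot API.

```javascript
// Hypothetical sketch: prepend an explicit disclaimer whenever a response
// was generated from synthetic rather than verified real-world data.
function withDataDisclaimer(responseText, usedSyntheticData) {
  if (!usedSyntheticData) {
    return responseText; // real data: pass the response through unchanged
  }
  return (
    "[Note: the data below is synthetic/illustrative, not verified real-world data.]\n" +
    responseText
  );
}
```

A wrapper like this makes the disclosure a structural guarantee rather than something the model must remember to mention.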
Issue: UI Usability
Severity: Minor
Description: The chat send button lags, leading to accidental double taps that interrupt responses.
Recommendation: Add real-time button feedback and implement a debounce function to prevent duplicate sends.
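A fix along these lines could look like the leading-edge debounce sketched below: the first tap fires immediately and any taps inside the wait window are ignored. Names such as `debounceLeading` are illustrative, not taken from the chatbot's actual code.

```javascript
// Sketch of a leading-edge debounce guard against accidental double taps.
function debounceLeading(fn, waitMs) {
  let lastCall = 0;
  return function (...args) {
    const now = Date.now();
    if (now - lastCall >= waitMs) {
      lastCall = now; // open a new suppression window
      return fn.apply(this, args);
    }
    // Calls inside the wait window are ignored (duplicate tap suppressed).
  };
}

let sent = 0;
const send = debounceLeading(() => { sent += 1; }, 300);
send(); // fires
send(); // suppressed: arrives within the 300 ms window
```

Leading-edge (fire first, suppress repeats) is usually the right shape for a send button, since the user expects the first tap to act instantly.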
Issue: False Sense of Security
Severity: Critical
Description: The AI provided an incomplete exam study plan, creating false confidence and leaving the user unprepared for the real exam.
Recommendation: Add automatic disclaimers when providing study plans, and cross-check plans against official exam syllabi.
Issue: Convoluted Code Output
Severity: Major
Description: AI-generated code was more complex than necessary, making it harder to understand and maintain.
Recommendation: Optimize code suggestions by prioritizing readability and efficiency over complexity.
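As an illustration of the kind of simplification meant here (the snippet is invented for this report, not AI output quoted verbatim), compare a dense one-liner with a plainly readable loop that computes the same result:

```javascript
// Both versions sum the even numbers in a list.
const nums = [1, 2, 3, 4, 5, 6];

// Harder to scan: condition, accumulator, and ternary packed into one expression.
const convoluted = nums.reduce((a, n) => (n % 2 === 0 ? a + n : a), 0);

// Clearer intent, same result:
let readable = 0;
for (const n of nums) {
  if (n % 2 === 0) readable += n;
}
// convoluted === readable === 12
```

Neither version is faster in any meaningful way; the second simply costs the next maintainer less effort to verify.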
Issue: Overzealous Content Moderation
Severity: Major
Description: The AI incorrectly flags historical discussions, music lyrics, and film analysis as potential violations, even though the discussions are non-violent.
Recommendation: Improve semantic understanding of context to distinguish academic discussion from harmful intent.
ChatGPT shows significant, recurring flaws in semantic understanding, adherence to instructions, and programmed overconfidence in its own abilities and accuracy, often without sufficient disclaimers, all of which can degrade the user experience. The errors are consistent and reproducible, which indicates that they are also fixable. In the author's opinion, issues of user confidence in AI output should be prioritized over a proverbial capability arms race.