With AI advancing rapidly and new versions of AI agents launching every week, it has become difficult to be sure that we are using the best technology available. Companies across the IT and software industry are leveraging AI for software development, and a few claim to already do 50% of their development through AI. Setting aside prompting methods and assuming all resources are available, the question remains: which LLM works best?
In this project, the goal is to determine which available model is best for developing a simple website.
The original plan was to use Cursor AI with Claude 3.5 and Claude 3 as models and compare their output on certain parameters. After the launch of DeepSeek R1, that model was added to the comparison as well.
The first task in this comparison was to create a structured method for comparing the outputs of LLMs.
Cursor, an AI IDE, was used to develop a basic website project. The LLMs were made available by adding their APIs in the Cursor settings.
The test procedure is available on GitHub.
It was decided to provide 5 prompts in the Cursor chat and then evaluate the code produced against pre-defined metrics. The table below lists the prompts and the metrics for each prompt.
S No. | Prompt | Metric |
---|---|---|
1 | I want to create a modern responsive webpage with a navigation bar, hero section, features section, and contact form. Please generate the initial project structure and HTML boilerplate. Use semantic HTML5 elements and set up the file structure for CSS and JavaScript. | 1. Time to generate response, 2.Completeness of structure, 3. Additional clarifications needed |
2 | Create modern CSS styles for the webpage structure. Include: - Responsive navigation that becomes a hamburger menu on mobile - Hero section with a gradient background - Card-based features section - Styled contact form Use Flexbox/Grid and implement a mobile-first approach. | 1. Quality of responsive design, 2. Modern CSS practices used, 3. Browser compatibility considerations |
3 | Add JavaScript functionality to the page: 1. Smooth scrolling navigation 2. Form validation with error messages 3. Mobile menu toggle 4. Simple animation for feature cards on scroll Use modern JavaScript (ES6+) and implement error handling. | 1. Code modularity, 2. Error handling implementation, 3. Modern JS features usage |
4 | The feature cards section needs improvement. Each card should: - Have hover effects - Include an icon - Support an image - Have a "Learn More" button Update the HTML, CSS, and JavaScript accordingly. | Number of iterations required to get the fixes implemented |
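To keep the same metric definitions across models, the prompt and metric pairs in the table above could be recorded as data rather than as free text. The following is only a minimal sketch: the names `PromptSpec`, `TEST_PLAN`, and `metrics_for` are hypothetical, the task labels are shortened paraphrases of the prompts, and only the four rows listed in the table are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSpec:
    """One row of the prompt/metric table (hypothetical structure)."""
    number: int
    task: str                 # shortened paraphrase of the prompt text
    metrics: tuple[str, ...]  # metrics evaluated for this prompt


# The four rows listed in the table, with metric names copied from it.
TEST_PLAN = (
    PromptSpec(1, "Project structure and HTML boilerplate", (
        "Time to generate response",
        "Completeness of structure",
        "Additional clarifications needed",
    )),
    PromptSpec(2, "Modern responsive CSS styles", (
        "Quality of responsive design",
        "Modern CSS practices used",
        "Browser compatibility considerations",
    )),
    PromptSpec(3, "JavaScript functionality", (
        "Code modularity",
        "Error handling implementation",
        "Modern JS features usage",
    )),
    PromptSpec(4, "Feature card improvements", (
        "Number of iterations required to get the fixes implemented",
    )),
)


def metrics_for(number: int) -> tuple[str, ...]:
    """Return the metrics to evaluate for a given prompt number."""
    for spec in TEST_PLAN:
        if spec.number == number:
            return spec.metrics
    raise KeyError(f"no prompt numbered {number}")
```

Calling `metrics_for(3)` returns the three JavaScript metrics, so every model's output for that prompt is judged against the same list.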
A detailed comparison of each code snippet was carried out using the free version of Claude 3.5 Sonnet. The code was supplied along with instructions on the metrics, and the same metric definitions were used for every comparison.
The code snippets can be found below:
At the end of the comparison, the scores were as follows:
### Overall Comparison Summary
Metric | V1 (3.5) | V2 (3) | V3 (R1) |
---|---|---|---|
Total Dev Time | 4 hours | 3 hours | 3 hours |
# of iterations required | 49 | 42 | 45 |
Bug fix success rate | 100% | <100% | 100% |
Code Quality Metrics | 8/10 | 5.5/10 | 7.4/10 |
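The summary table can be loaded into a small script to rank the models or derive simple ratios. The sketch below uses only the numeric rows from the table; the bug fix success rate is left out because "<100%" is not an exact value. The names `RESULTS`, `rank_by`, and `iterations_per_hour` are hypothetical, and iterations per hour is a derived figure, not one reported in the study.

```python
# Numeric rows of the summary table, keyed by the table's column headers.
RESULTS = {
    "V1 (3.5)": {"dev_hours": 4, "iterations": 49, "quality": 8.0},
    "V2 (3)": {"dev_hours": 3, "iterations": 42, "quality": 5.5},
    "V3 (R1)": {"dev_hours": 3, "iterations": 45, "quality": 7.4},
}


def rank_by(metric: str, higher_is_better: bool) -> list[str]:
    """Return model names ordered from best to worst on one metric."""
    return sorted(
        RESULTS,
        key=lambda model: RESULTS[model][metric],
        reverse=higher_is_better,
    )


def iterations_per_hour(model: str) -> float:
    """Derived ratio: iterations divided by total development hours."""
    row = RESULTS[model]
    return row["iterations"] / row["dev_hours"]


if __name__ == "__main__":
    print("By code quality:", rank_by("quality", higher_is_better=True))
    print("By iterations:", rank_by("iterations", higher_is_better=False))
    for name in RESULTS:
        print(name, iterations_per_hour(name))
```

Ranking on a single metric at a time keeps the comparison transparent; combining the metrics into one score would require weights that the study does not define.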
The detailed snippet-by-snippet code analysis can be found below:
This study provides a framework for benchmarking available LLMs and their versions on end-to-end software development tasks. The author also shows the current standing of three of the most highly touted LLMs for software development, evaluated using Cursor AI.
All prompts and code results are available through the inline links (public GitHub repositories).