Leveraging Large Language Models for Software Engineering

Overview

At Trestle, we continually strive to enhance our software development practices by adopting cutting-edge technologies. Recognizing the potential of advanced language models, we began evaluating tools built on state-of-the-art AI models like GPT-4 and Claude 3 Opus to identify how they could best support our engineering team. This blog shares our insights and experiences with these models, detailing their practical applications in our workflows and evaluating their effectiveness across software engineering tasks. It also supports our objective of offering AI-enhanced services to our clients, further embedding cutting-edge technology into our solutions.

Our aim is not only to determine which tools and models best suit our needs but also to guide other organizations considering similar integrations. As a technology-driven company committed to continuous improvement, we explore these innovations to enhance efficiency and foster a culture of innovation, as part of our broader commitment to leveraging technology for organizational advancement.

Design

HLD

High-level design is a crucial aspect of software engineering that involves defining a software system’s overall structure, components, and interactions. It requires a deep understanding of the project requirements, existing components, and their relationships. Unfortunately, current language models often struggle to provide accurate and meaningful assistance.

One major challenge is that these models lack context about the specific components and architectures used in a project. They are trained on vast amounts of general data but do not know an individual software system’s unique setup and constraints. Without this context, their suggestions for high-level design are generic, irrelevant, or even incompatible with the existing codebase.

While ChatGPT Plus does offer capabilities for generating component diagrams, flowcharts, and sequence diagrams, the quality of these outputs is subpar, and they rarely incorporate the requirements accurately. The generated diagrams often lack precision, clarity, and adherence to standard conventions. They also fail to capture the nuances and complexities of real-world software systems, leading to confusing or misleading representations.

LLD

Once the high-level design is established and the overall architecture is in place, the work boils down to implementation details. This is where these models excel: they can generate well-structured code from detailed requirements.

With a clear and comprehensive set of requirements, these models can produce boilerplate code that adheres to good coding standards and incorporates appropriate design patterns, providing a solid foundation to build upon. The generated code follows best practices, such as proper encapsulation and a modular, extensible design. This usually saves significant time and effort, allowing us to focus on more complex and project-specific tasks.
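As an illustration, prompting for a small persistence layer tends to come back as scaffolding along these lines. This is a minimal sketch of that kind of output; the User entity and UserRepository names are hypothetical, not taken from an actual Trestle codebase:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    user_id: str
    email: str


class UserRepository(ABC):
    """Abstract interface so storage backends stay swappable."""

    @abstractmethod
    def find_by_id(self, user_id: str) -> Optional[User]: ...

    @abstractmethod
    def save(self, user: User) -> None: ...


class InMemoryUserRepository(UserRepository):
    """Dict-backed implementation, handy for tests; a DB-backed one can follow."""

    def __init__(self) -> None:
        self._users: dict[str, User] = {}

    def find_by_id(self, user_id: str) -> Optional[User]:
        return self._users.get(user_id)

    def save(self, user: User) -> None:
        self._users[user.user_id] = user
```

The abstract interface separating the contract from the storage backend is exactly the kind of extensibility the generated boilerplate tends to get right.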

Another strength of these models in low-level design is their ability to suggest multiple approaches to solving a problem. These models can generate several potential solutions when presented with a specific coding challenge. They can provide a detailed analysis of the pros and cons of each approach, considering factors such as performance, readability, and maintainability, which sometimes surfaces scenarios we hadn’t anticipated.
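For example, asked to deduplicate a list while preserving order, a model will usually lay out more than one option and weigh them. The sketch below is a hypothetical reconstruction of that kind of answer, not a verbatim session:

```python
def dedupe_with_dict(items: list) -> list:
    # Approach 1: dict.fromkeys preserves insertion order (Python 3.7+).
    # Pro: one line, O(n). Con: requires hashable items.
    return list(dict.fromkeys(items))


def dedupe_with_seen_set(items: list) -> list:
    # Approach 2: explicit loop with a "seen" set.
    # Pro: easy to extend (e.g., dedupe by a key function). Con: more verbose.
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


print(dedupe_with_dict([3, 1, 3, 2, 1]))      # [3, 1, 2]
print(dedupe_with_seen_set([3, 1, 3, 2, 1]))  # [3, 1, 2]
```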

Coding

Personally, the biggest productivity unlock is their ability to assist with not-so-complex coding tasks. I’ve broken down their usefulness into categories that don’t involve generating thousands of lines of code, but rather areas where short, concise answers turn out to be more valuable.

Snippet Generation

They excel at generating concise and efficient code snippets for specific requirements. Whether you need a method to perform a particular calculation, manipulate data structures, or implement a specific algorithm, these models can quickly provide high-quality code. This saves significant time and effort, freeing us to focus on higher-level tasks and problem-solving.
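A typical request might be “write a function that splits an iterable into fixed-size batches.” The snippet below is illustrative of what comes back, not a verbatim model output:

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


print(list(batched(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```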

Refactoring

Sometimes, I paste a method I have written or even an entire class to obtain suggestions for improving code quality, readability, and efficiency. These models can identify potential issues, such as redundant code, suboptimal algorithms, or violations of design principles. They can then propose alternative implementations or modifications to enhance the code. This iterative process of refactoring can lead to cleaner, more maintainable codebases.
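A hypothetical before/after of the kind of refactor these models suggest: repeated branches collapsed into a lookup table, so adding a case becomes a data change rather than a code change. The shipping-cost example and its names are invented for illustration:

```python
# Before: repetitive branching a model might flag as redundant.
def shipping_cost_before(region: str, weight_kg: float) -> float:
    if region == "US":
        return 5.0 + 1.2 * weight_kg
    elif region == "EU":
        return 7.5 + 1.5 * weight_kg
    elif region == "APAC":
        return 9.0 + 2.0 * weight_kg
    else:
        raise ValueError(f"Unknown region: {region}")


# After: the suggested refactor replaces the branches with a rate table.
_RATES = {"US": (5.0, 1.2), "EU": (7.5, 1.5), "APAC": (9.0, 2.0)}


def shipping_cost_after(region: str, weight_kg: float) -> float:
    try:
        base, per_kg = _RATES[region]
    except KeyError:
        raise ValueError(f"Unknown region: {region}") from None
    return base + per_kg * weight_kg


print(shipping_cost_after("EU", 2.0))  # 10.5
```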

Query Generation

We use technologies like SQL, Elasticsearch, and AWS CloudWatch, which generate terabytes of data. Analyzing this data and understanding its patterns requires efficient queries. I use LLMs to generate boilerplate queries by providing the schema and some data samples. Although the generated queries are not perfect right away, they can be tuned to meet our needs pretty quickly.
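As a sketch of that workflow, here is the kind of aggregation query a model produces when given a table schema and a few sample rows. The events table and its columns are hypothetical, and SQLite stands in for our actual data stores:

```python
import sqlite3

# Hypothetical schema and samples, like what we paste into the prompt.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event_type TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("u1", "click", "2024-05-01"), ("u1", "click", "2024-05-02"),
     ("u2", "view", "2024-05-01")],
)

# The boilerplate a model returns from schema plus samples: event counts
# per day, ready to be tuned with filters, time ranges, or indexes.
query = """
    SELECT ts, event_type, COUNT(*) AS n
    FROM events
    GROUP BY ts, event_type
    ORDER BY ts, n DESC
"""
for row in conn.execute(query):
    print(row)
```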

Python Scripting

These models are highly effective for ad hoc scripting needs. They can assist in reading and parsing CSV files, making batched API calls, or extracting information from documents. Providing high-level instructions or requirements is usually enough to get a working script, eliminating the need to start from scratch and enabling quick prototyping.
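For instance, asking for “a script that validates lead emails in a CSV and summarizes them by domain” typically yields something immediately usable. This sketch inlines sample data to stay self-contained; the file contents and column names are hypothetical:

```python
import csv
import io

# Hypothetical CSV of leads; in practice this would be open("leads.csv").
sample = io.StringIO(
    "name,email\n"
    "Ann,ann@example.com\n"
    "Bob,not-an-email\n"
    "Cy,cy@example.org\n"
)

rows = list(csv.DictReader(sample))
valid = [r for r in rows if "@" in r.get("email", "")]

# Summarize the usable leads by email domain.
domains: dict[str, int] = {}
for row in valid:
    domain = row["email"].split("@", 1)[1].lower()
    domains[domain] = domains.get(domain, 0) + 1

print(f"{len(valid)}/{len(rows)} rows had a usable email")
for domain, count in sorted(domains.items(), key=lambda kv: -kv[1]):
    print(f"{domain}: {count}")
```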

Integrated IDE Assistants

The primary programming language at Trestle is Java, so we use IntelliJ heavily. Plugins like Tabnine integrate seamlessly with these models and are genuinely useful. Although its on-the-fly code completion (autocomplete) can be intrusive and its suggestions are often mediocre, it learns our coding style well and produces personalized code, sometimes saving time. The most helpful feature of Tabnine is its chat window, available right inside the IDE, which lets you talk to several models from the same chat, so you don’t even have to switch to a browser to use ChatGPT or Claude.

Debugging

Debugging is another area where the models’ effectiveness is minimal. Debugging often requires a deep understanding of the specific features and intricacies of a particular product or codebase, which these models lack. In many cases, debugging involves more than fixing syntax errors or resolving known issues that a quick online search would surface. It often requires analyzing log files, tracing program execution through stack traces, and understanding how data flows through various components. These models do not have the specialized knowledge necessary for effective debugging in complex, product-specific scenarios. They can help identify common pitfalls or suggest best practices, but familiarity with the codebase becomes crucial when it comes to deep, intricate bugs.

Trestle Copilot

We recently developed a custom GPT, Trestle Copilot, using ChatGPT Plus. This AI-powered tool is designed for our external clients, providing them with comprehensive and accurate lead verification capabilities. Trestle Copilot utilizes our authoritative identity database to offer graded assessments, with explainability, on lead prioritization and contact worthiness. Its natural-language interface allows clients to access these verification capabilities without technical assistance, streamlining their processes and improving overall productivity.

GPT-4 vs. Claude 3 Opus

In my extensive testing of GPT-4 and Claude 3 Opus, it was hard to identify a clear winner. Both models are good enough for all the points mentioned above. One advantage of Opus over GPT-4 is that it tends to provide more complete answers and rarely asks me to fill in or finish something myself. GPT-4 has been infamous for being “too lazy,” which matches my experience, although its latest update (4/9) has largely addressed this. On the other hand, GPT-4 is faster and more responsive than Opus, with higher usage limits.

What’s intriguing is how remarkably similar Opus’s and GPT-4’s performance is. If I were shown outputs from either model without any indication of which generated them, it would be challenging to tell them apart beyond the differences noted above.

Conclusion

The integration of state-of-the-art AI models into our workflows has been effective. These tools have proven especially beneficial in low-level design, code generation, and refactoring, where they augment our team’s capabilities with speed and efficiency. We’ve found that the most significant impact of these models comes from their ability to streamline routine tasks, allowing engineers to devote more time to complex problem-solving and creative endeavors. The recent development of Trestle Copilot further demonstrated our commitment to leveraging AI to enhance client services, providing them with sophisticated tools powered by AI.

Looking ahead, we are excited to explore additional AI functionalities to boost productivity further. A key area of interest is the potential of GitHub Copilot’s AI-powered features, such as AI-assisted PR Reviews and PR Summaries. These tools promise to optimize our development workflow by providing automated code reviews and concise summaries of pull requests. This can significantly speed up our code integration process, enhance code quality, and maintain high development standards.

Artificial Intelligence is a critical component of our strategic vision. Our commitment to integrating cutting-edge AI technologies into our operations reflects our drive to stay aligned with industry advancements and to pioneer new standards in software engineering. This strategic incorporation of AI is fundamental to our mission of developing more sophisticated, efficient, and innovative software solutions.

This blog post was written by Deepak Kumar Ganesan, Software Engineer at Trestle.