Developers dedicate 30% of their time to code reviews. Although time-consuming, it's a crucial step. After all, fixing bugs in a production environment can be costly.
Here at Watermelon, we're tackling this issue head-on, using LLMs to speed up code reviews with an AI-driven approach.
In this post, we'll unveil the inner workings of Watermelon's RAG-powered GitHub application for code review, and how we want PRs to have more context than an LGTM comment.
In the realm of AI engineering, Retrieval Augmented Generation (RAG) refers to pulling information from external sources to enhance the outputs of LLMs.
Take, for instance, ChatGPT, which now uses RAG. With function calling, you can prompt it to "browse the web and fetch San Francisco’s current temperature." It retrieves information from Bing and presents you with the fact, bypassing the usual limitation of not having current data.
In our case, we’re doing RAG not by browsing the web with function calling, but by building OAuth integrations with different company information sources, which let us retrieve information by calling their APIs. We're using RAG to provide richer context to pull requests, going beyond the all-too-common "LGTM". Let's dive into some interesting LGTM facts.
The first action we take towards contextualizing PRs beyond lgtm is tracing the code context linked to a new PR.
In our case the first step in our RAG process is retrieving a new PR’s metadata. This is not only the first step in the process but the one that influences the most how the rest of the retrieval flow will behave. For a PR we retrieve:
Octokit has a limitation: we can send at most six words as search parameters to the endpoint that retrieves PRs. Therefore, we apply a simple mechanical filter and remove generic, duplicate, and stop words. Stop words include "and" and "for"; generic words include "development" and "removed"; we also drop tokens that come from PR templates, such as [x].
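The filtering described above could be sketched roughly as follows. This is a minimal illustration, not our production code: the word lists, the tokenizing regex, and the function name `select_search_words` are all assumptions made for the example.

```python
import re

# Hypothetical word lists for illustration; the real lists are longer.
STOP_WORDS = {"and", "for", "the", "a", "an", "to", "of", "in", "on", "with"}
GENERIC_WORDS = {"development", "removed"}
MAX_SEARCH_WORDS = 6  # the six-word search limit mentioned above


def select_search_words(text: str) -> list[str]:
    """Return up to six distinct, non-generic words from a PR title or body."""
    # Drop template checkboxes such as "[x]" or "[ ]" before tokenizing.
    text = re.sub(r"\[[ xX]?\]", " ", text)
    selected: list[str] = []
    seen: set[str] = set()
    for word in re.findall(r"[A-Za-z][A-Za-z0-9_-]*", text.lower()):
        if word in STOP_WORDS or word in GENERIC_WORDS or word in seen:
            continue
        seen.add(word)
        selected.append(word)
        if len(selected) == MAX_SEARCH_WORDS:
            break
    return selected
```

This version simply keeps the first six surviving words in order of appearance, which is exactly the part a smarter heuristic could replace.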
An area for improvement here is to select these six words more effectively. Could an LLM assist in this? Maybe we should consider a heuristic like word frequency. Also, it's worth debating whether lgtm should be classified as a stopword. Regardless, this approach provides us with an array of PRs that share some context with the new PR. Next, we need to sort them by relevance.
We've considered various heuristics, such as the number of line changes or lines added, but this can be misleading for several reasons. For example, moving a file might result in many line changes without signifying a major alteration. Additionally, the sheer volume of line changes doesn't always reflect the significance of the change.
Regarding the date of PRs, we've observed significant discrepancies among teams. Opinions vary: some believe older PRs provide more relevant context, while others argue that newer PRs are more pertinent.
Ultimately, we've settled on the number of comments as our primary metric. The number of comments in a PR correlates not only with the amount of context provided but also with the extent of debate around the business logic, which is crucial for Watermelon to index. PRs that involve more than just 'lgtm' responses are most beneficial for our purposes.
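As a rough sketch, ranking by comment count comes down to a single sort. The `RelatedPR` shape and the `limit` parameter below are hypothetical and only serve the example.

```python
from dataclasses import dataclass


@dataclass
class RelatedPR:
    # Hypothetical shape of a retrieved PR; field names are assumptions.
    number: int
    title: str
    comments: int


def rank_by_comments(prs: list[RelatedPR], limit: int = 3) -> list[RelatedPR]:
    """Return the most-discussed PRs first, keeping at most `limit` of them."""
    # sorted() is stable, so PRs with equal comment counts keep their input order.
    return sorted(prs, key=lambda pr: pr.comments, reverse=True)[:limit]
```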
While latency is often cited as the major limitation of RAG-powered applications, it's not our primary concern. Code review is computationally expensive, and it doesn't need to be a real-time operation. We respond within a few seconds, whereas a human reviewer typically takes hours or even days to give an initial response. Our more pressing constraint is rate-limiting on GitHub’s API (which we call through Octokit), which requires us to respond in no more than 3 seconds. If this becomes an issue, we have a strategy to run the algorithm on a separate thread, but that's not a priority right now.
The technical challenge in this part of the flow is accuracy. That's not necessarily because of hallucination (after all, the killer use case for LLMs is summarization) but because building a search algorithm is genuinely hard.
This means we have to execute an additional step when hitting certain APIs. For instance, when hitting the Jira API we have to run a JQL query that searches for issues containing the random words in either their summary or their description, and sort the results by descending date to improve accuracy.
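A query of that kind could be assembled along these lines. This is a sketch under assumptions: the helper name `build_jql` is made up for the example, and sorting on the `created` field is one possible reading of "descending date".

```python
def build_jql(words: list[str]) -> str:
    """Build a JQL query matching any word in an issue's summary or description."""
    clauses = []
    for word in words:
        # Simple precaution: drop double quotes so the word cannot break the query.
        cleaned = word.replace('"', "")
        if cleaned:
            clauses.append(f'summary ~ "{cleaned}" OR description ~ "{cleaned}"')
    if not clauses:
        raise ValueError("at least one search word is required")
    joined = " OR ".join(clauses)
    # Assumption: "descending date" is interpreted as newest-created first.
    return f"({joined}) ORDER BY created DESC"
```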
Detecting code smells and security vulnerabilities is also part of an effective code review process. Beginning with the detection of console logs (and their equivalents across various major programming languages), Watermelon comments on PR line diffs whenever an error is detected. We plan to expand this soon to identify a broader range of errors.
To do this we run 2 sub-steps:
First, we ignore comments, that is, lines that start with // or ## (multi-line comments remain our point of failure).
Then, we parse the line diff of the PR through an LLM with the following prompt:
The alternative is RegEx: to make it accurate, we would have to build one RegEx per language to detect console logs. It could work if we invested the hours into building very good RegExes. However, parsing the code in the line diffs of a PR with an LLM could allow us to do more ambitious things in the future, such as more accurately comparing intent to implementation (more on that later in this blog post).
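For comparison, here is roughly what the comment-skipping step combined with a per-language RegEx would look like. The patterns, the language keys, and the choice to inspect only added lines are assumptions for this sketch; the `#` prefix is used so that lines starting with `##` are covered too. Like our current flow, it does not handle multi-line comments.

```python
import re

# Hypothetical per-language patterns; a real table would need many more entries.
LOG_PATTERNS = {
    "javascript": re.compile(r"\bconsole\.(log|debug|info|warn|error)\s*\("),
    "python": re.compile(r"\bprint\s*\("),
    "java": re.compile(r"\bSystem\.out\.print(ln)?\s*\("),
}
COMMENT_PREFIXES = ("//", "#")


def find_console_logs(diff: str, language: str) -> list[str]:
    """Return added, non-comment lines of a unified diff that look like log calls."""
    pattern = LOG_PATTERNS[language]
    hits: list[str] = []
    for line in diff.splitlines():
        # Assumption: only added lines matter; "+++" marks a file header, not code.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        code = line[1:].strip()
        # Skip single-line comments; multi-line comments are not handled here.
        if code.startswith(COMMENT_PREFIXES):
            continue
        if pattern.search(code):
            hits.append(code)
    return hits
```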
Tree-sitter also comes into play in our vision. Not only would it allow us to filter out multi-line comments very easily, but it would also let us better understand what a line diff is doing from a semantic point of view by parsing the PR’s AST (Abstract Syntax Tree). Again, this could also allow us to compare intent to implementation in a better way.
The answer could also be a combination of these techniques. This flow is still very exploratory.
The purpose of this feature is to streamline the code review process. Its goal is to help developers identify which PRs need more thorough review and to encourage more meaningful comments than just lgtm.
Currently, we're focusing on two key areas: comparing the PR's stated intent with its actual implementation, and detecting console logs. We assign a rating from 1 to 10 to each PR based on these criteria. Based on the score, we categorize the PRs as "Don't Merge," "Take a Deeper Dive," or "Safe to Merge."
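Mapping a score to a label can be as simple as a pair of thresholds. The cut-off values below are placeholders chosen for illustration, and the sketch assumes that a higher score means a safer PR.

```python
def label_for_score(score: int) -> str:
    """Map a 1-10 PR score to one of the three review labels."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    # Hypothetical thresholds; the real cut-offs may differ.
    if score <= 3:
        return "Don't Merge"
    if score <= 7:
        return "Take a Deeper Dive"
    return "Safe to Merge"
```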
We assess the alignment of intent with implementation by semantically comparing the PR’s title, which represents the intent, with the aggregate content of all the PR’s commits, which constitutes the implementation.
This is an emerging area for us. As previously mentioned, by parsing the PR’s AST we anticipate significantly advancing this capability.
As huge believers in open source, we want to support running Watermelon with an open-source LLM. Llama and Mistral are candidates we’re taking a look at. In the longer term we also want to support a self-hosted version of Watermelon, which would imply adding support for Ollama as well.
We also want to help companies do more than detect console logs to maintain their SOC 2 Compliance. For instance, helping companies not push PII data to production. There’s much more to comment than an lgtm on a PR after all ;)
To improve how we score and label PRs, we also want to bring in author behavior. We want to measure the PR comment / approval ratio. A PR author who scores high on this ratio is someone who is both reviewing PRs and sharing context via their comments (an lgtm doesn’t count). Great! What this person sends has a higher chance of meeting the business requirements. The person isn’t approving PRs? That person isn’t doing code review and therefore isn’t learning about the codebase, so the score for a PR sent by that person should be penalized. A high number of approvals but no comments? The person is just rubber-stamping (perhaps with an lgtm), and the same score penalty should apply.
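Since this is still an idea rather than a shipped feature, the following is only a sketch of how such a penalty might work. The `AuthorActivity` shape, the penalty size, and the function names are all hypothetical.

```python
from dataclasses import dataclass

PENALTY = 2  # hypothetical number of points removed from a 1-10 score


@dataclass
class AuthorActivity:
    approvals: int             # PRs the author has approved as a reviewer
    substantive_comments: int  # review comments, not counting bare "lgtm"


def is_substantive(comment: str) -> bool:
    """A comment counts only if it says more than 'lgtm'."""
    return comment.strip().lower().rstrip("!. ") not in {"", "lgtm"}


def comment_approval_ratio(activity: AuthorActivity) -> float:
    """Comments per approval; 0.0 when the author has approved nothing."""
    if activity.approvals == 0:
        return 0.0
    return activity.substantive_comments / activity.approvals


def adjust_score(score: int, activity: AuthorActivity) -> int:
    """Penalize PRs from authors who never review or who only rubber-stamp."""
    # A ratio of zero covers both cases: no approvals at all, or approvals
    # without a single substantive comment.
    if comment_approval_ratio(activity) == 0.0:
        return max(1, score - PENALTY)
    return score
```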