Building a Code Archeology Toolbox Without Storing Your Code


- Watermelon is a VS Code extension that explains code context: the information about a block of code that goes beyond its syntax

- We aggregate info for a given block of code from different sources: Git, Slack, Jira, and finally provide a GPT-based summary

- Your code never passes through our server, and our open source repo proves it

The need for immediate code context

Remote work is here to stay. Here are a few statistics for 2023 according to a study run by

- 86% of software devs work remotely

- 90% feel as productive or more productive than they were onsite

- 33% would consider quitting if they could no longer be remote

Companies need to adapt to this reality, particularly in a time of massive layoffs and a VC winter. We’re seeing it already: companies are not renewing office leases, and startups are hiring remotely to be more responsible with their funds.

Remote work is awesome and it’s here to stay, but it has huge caveats: team culture suffers, and information osmosis disappears. To expand on the latter: not being able to tap a teammate on the shoulder to ask about the code she wrote is a huge trade-off that remote and distributed engineering teams have to make. Losing communication bandwidth in work as abstract as writing software is a huge loss, we have to admit.

We don’t think that endlessly hopping on Zoom calls is a solution either. Zoom fatigue is real. And for remote teams distributed across time zones, calls can’t be scheduled without burning people out (that is, making them join at unreasonable hours such as midnight).

So we thought… What if we could explain everything about the context of a block of code asynchronously? The same way we like to say “this meeting could have been an email”, but for code. “This pair programming session could’ve been a Watermelon query”.

What if we could bring code context immediately?

We aggregate data from different sources to explain code context

There are 3 places where conversations around code (aka passive documentation) are held: Your company’s git repository, Slack workspace, and ticketing system. 

For a given block of code, we run git blame in the background, obtain the commit hashes, use those hashes to find the associated Pull Requests, and then pass the info of the most relevant PR as part of the prompt to GPT.
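As a rough illustration of the first step, here is a minimal sketch of how commit hashes could be collected from git blame output. It assumes the output was produced with `git blame --porcelain`, where each blamed line starts with a header of the form `<40-hex hash> <original line> <final line> [<group size>]`. The function name `commitHashesFromBlame` is hypothetical and is not taken from the Watermelon codebase.

```typescript
// Hypothetical helper (not Watermelon's actual code): collect the unique
// commit hashes that appear in `git blame --porcelain` output.
function commitHashesFromBlame(porcelain: string): string[] {
  // Header lines look like: "<40-hex sha> <orig line> <final line> [<count>]"
  const header = /^([0-9a-f]{40}) \d+ \d+(?: \d+)?$/;
  const seen = new Set<string>();

  for (const line of porcelain.split(/\r?\n/)) {
    const match = header.exec(line);
    if (!match) continue; // metadata lines ("author ...") and tab-prefixed code lines
    const sha = match[1];
    if (/^0+$/.test(sha)) continue; // all-zero hash marks uncommitted changes
    seen.add(sha);
  }

  // A Set preserves insertion order, so hashes come back in first-seen order.
  return Array.from(seen);
}
```

Each returned hash could then be used to look up the Pull Requests that contain that commit. Note that the helper only reads hashes and skips the code lines themselves.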

The heuristic we use to find the most relevant PR is to sort the PRs by number of comments. A large number of comments indicates that the discussion was rich. We’ve received suggestions to use a heuristic such as the number of changed lines, but that’s a weaker signal: a change can be huge in terms of lines of code, but small in terms of changes to the business logic.
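The comment-count heuristic can be sketched in a few lines. This is an illustrative version under stated assumptions: the `RankablePR` shape and its `comments` field are hypothetical stand-ins for whatever the Git provider returns, and `sortByCommentCount` is not the name used in our repo.

```typescript
// Hypothetical PR shape; `comments` is assumed to hold the comment count.
interface RankablePR {
  title: string;
  body: string;
  comments: number;
}

// Return a new array ordered from most-discussed to least-discussed PR.
// The input array is copied first so the caller's list is not mutated.
function sortByCommentCount(prs: RankablePR[]): RankablePR[] {
  return prs.slice().sort((a, b) => b.comments - a.comments);
}
```

With this ordering, the first element of the sorted list (for example `sortedPRs[0]`) is the PR treated as most relevant.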

With the title of the most relevant PR, we’re able to get the Jira ticket and the Slack thread that are most closely related to the PR, and therefore to the block of code in question. We also add the info indexed from Slack and Jira to the GPT prompt.

NOTE: Slack and Jira context passed to the GPT prompt is in experimental mode as of March 2023

We don’t store your code

- Your code never passes through our server

- You can take a look at a specific file, and a specific line of code in that file, in our public GitHub repo to verify that

- We do this because we’re not an evil corporation

- And because our goal is to explain code context. Explaining syntax is low-hanging fruit and it’s not that useful. Explaining code context replaces the need for a Zoom call

We’ve never talked about passing code as part of the prompt to generate code context summaries with GPT. First and foremost, that’s because we understand that code is intellectual property. Second, and perhaps counterintuitively, passing code to the GPT prompt doesn’t improve the quality of the generated context summary.

Watermelon's code is open source (or open core to be very precise with the terminology), which means that anyone can take a look at the code and verify that it is not doing anything nefarious. This is our commitment to transparency and data privacy.

Here’s a link to the specific file and line of code in our public GitHub repository that proves it.

We send to our server the PR's title and body, as well as the authenticated user's email. We don't send the code.

The getCodeContext function receives:

- sortedPRs[0]?.title || parsedMessage,

- sortedPRs[0]?.body || parsedCommitObject.body,

- Session.account.label

We pass the title and body of the most relevant PR, and we also pass the email of the authenticated user for tracking purposes (we like asking those who are most engaged what they like, so that we can double down on that and build something even cooler for them). 
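To make the fallback logic in those three arguments concrete, here is a minimal sketch of how the values could be assembled before calling getCodeContext. The `PRSummary` and `ContextRequest` types, the `buildContextRequest` name, and the `userEmail` field name are assumptions made for illustration; only the three expressions themselves come from our code.

```typescript
// Hypothetical types for illustration; the real objects carry more fields.
interface PRSummary {
  title?: string;
  body?: string;
}

interface ContextRequest {
  title: string;
  body: string;
  userEmail: string;
}

// Assemble the only three values that leave the editor.
// If no PR was found (or its title/body is empty), fall back to the
// commit message and commit body parsed from git.
function buildContextRequest(
  sortedPRs: PRSummary[],
  parsedMessage: string,
  parsedCommitBody: string,
  accountLabel: string
): ContextRequest {
  return {
    title: sortedPRs[0]?.title || parsedMessage,
    body: sortedPRs[0]?.body || parsedCommitBody,
    userEmail: accountLabel,
  };
}
```

The point of the sketch is what is absent: there is no field for the selected source code, so nothing in the request can carry it.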

This is how we’re building a code archeology toolbox without passing your code through our server. If we can make this work, we will enable something that we love at scale: Remote asynchronous work. 

We hope that this blog post makes it clear that we don’t store your code.

And if you liked this blog post, please star us on GitHub and install us on VS Code.