Awaiting Responses in RAG: Errors We’ve Seen

Intro to async/await

If you're a JavaScript or TypeScript developer, you've probably written the logic to make an HTTP request to an API endpoint countless times.

Async/await helps here. More than syntactic sugar, it’s a feature that makes a block of code non-blocking. That is, an asynchronous function can await the response of an HTTP request to an endpoint while the rest of the code keeps running. For example, this lets us render a UI while the data it needs is still being loaded.
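As a minimal sketch (the endpoint here is made up), awaiting only pauses the async function itself, not the rest of the program:

```ts
// Hypothetical endpoint; the point is only that await pauses this function,
// not everything else.
async function loadDashboardData(): Promise<unknown> {
  const response = await fetch("https://api.example.com/dashboard");
  return response.json();
}

loadDashboardData().then((data) => console.log("data arrived:", data));
console.log("this line runs right away, so the UI can keep rendering while the request is in flight");
```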

Async/await provides the JS ecosystem with concurrency capabilities. It’s also important to note that concurrency and parallelism are different: concurrency is when multiple computations run in overlapping time periods, while parallelism is when multiple computations run at the same time. This diagram illustrates it well.

When async/await isn’t enough

When you only need to async/await a single promise, there isn’t much to worry about. It’s straightforward.
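For instance, a single awaited request wrapped in a try/catch (the GitHub endpoint here is just for illustration) is the whole story:

```ts
// One promise, one try/catch: nothing else to coordinate.
async function getPullRequest(owner: string, repo: string, prNumber: number) {
  try {
    const response = await fetch(
      `https://api.github.com/repos/${owner}/${repo}/pulls/${prNumber}`
    );
    return await response.json();
  } catch (error) {
    console.error("request failed:", error);
    return null;
  }
}
```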

Cool. Now, what if you need to await a bunch of promises that depend on each other? As an example, take a look at this code snippet, previously in place in our GitHub application’s repo.
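The exact code matters less than its shape. As a simplified sketch (the function names below are illustrative stand-ins, not our actual helpers), each await depends on the result of the previous one:

```ts
// Illustrative signatures only, not our actual code.
declare function fetchDiff(prNumber: number): Promise<string>;
declare function traceContext(diff: string): Promise<string[]>;
declare function summarizeContext(context: string[]): Promise<string>;
declare function scorePullRequest(diff: string, summary: string): Promise<number>;
declare function labelPullRequest(prNumber: number, score: number): Promise<void>;

// Each await depends on the previous result, so the requests run one after another.
async function preReviewPullRequest(prNumber: number): Promise<number> {
  const diff = await fetchDiff(prNumber);
  const context = await traceContext(diff);
  const summary = await summarizeContext(context);
  const score = await scorePullRequest(diff, summary);
  await labelPullRequest(prNumber, score);
  return score;
}
```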

The final result of this code snippet

The final result for our GitHub application is using AI to give a PR a score from 1 to 10. Based on that score, we add a label to the PR saying it’s “safe to merge”, “don’t merge” or “take a deeper dive”. This step involves 2 API calls: one to the LLM API to assign a score to the PR based on the previous steps (we’ll talk about those in a moment), and a second to Octokit to assign the label itself.
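The second call is a single Octokit request. A rough sketch (the thresholds and label strings are illustrative; PR labels go through the issues endpoint of the GitHub REST API):

```ts
import { Octokit } from "@octokit/rest";

// Illustrative thresholds and label names.
function labelForScore(score: number): string {
  if (score >= 8) return "safe to merge";
  if (score <= 3) return "don't merge";
  return "take a deeper dive";
}

async function applyScoreLabel(
  octokit: Octokit,
  owner: string,
  repo: string,
  prNumber: number,
  score: number
): Promise<void> {
  // PRs are labeled through the issues endpoint of the GitHub REST API.
  await octokit.rest.issues.addLabels({
    owner,
    repo,
    issue_number: prNumber,
    labels: [labelForScore(score)],
  });
}
```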

The score is assigned by analyzing the quality of the code diffs and assessing how closely the implementation matches the intent.

Some technical constraints

Analyzing code quality is subdivided into 2 steps. The first one is just running a bunch of RegExes. Depending on the RegEx, complexity can vary, but it’s O(N) on average; because we run the RegExes over each line, the whole pass is O(N^2). However, these are code diffs, and even for a large PR (for practical purposes) N is small enough that the complexity isn’t worth worrying about. We do comment on the code diff in the GitHub UI when a RegEx matches a code smell, though, so that’s potentially a series of HTTP requests inside that function (sketched below).

Among other things, most reviewers would reject a PR much bigger than a few dozen lines and ask for stacked PRs instead.
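Here’s what that RegEx-and-comment step could look like. The patterns, the pre-parsed addedLines input, and the wiring are all illustrative rather than our exact code:

```ts
import { Octokit } from "@octokit/rest";

// Illustrative patterns, not our actual list.
const codeSmells: { pattern: RegExp; message: string }[] = [
  { pattern: /console\.log\(/, message: "Leftover console.log?" },
  { pattern: /:\s*any\b/, message: "Avoid `any` here if possible." },
];

// Runs every pattern over every changed line; cheap in practice because diffs are small.
async function commentOnCodeSmells(
  octokit: Octokit,
  params: { owner: string; repo: string; pull_number: number; commit_id: string },
  addedLines: { path: string; line: number; content: string }[]
): Promise<void> {
  for (const { path, line, content } of addedLines) {
    for (const { pattern, message } of codeSmells) {
      if (pattern.test(content)) {
        // Each match turns into another HTTP request to GitHub.
        await octokit.rest.pulls.createReviewComment({
          ...params,
          body: message,
          path,
          line,
          side: "RIGHT",
        });
      }
    }
  }
}
```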

Assessing how closely the implementation matches the intent is much more complex in terms of API calls. We have written a blog post detailing this, but TLDR: we trace the code context associated with a new PR (older PRs, as well as Slack threads, Linear tickets, Notion docs, etc.). Then we aggregate the most relevant traced context and generate a natural-language summary of it; that summary is what we call the implementation. The intent of the PR is its title. We then ask the LLM “how similar are the intent and the implementation?” to get an initial assessment. The point here is that there are even more HTTP requests.
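A minimal sketch of that last question, assuming a generic callLlm client (the names and prompt wording are illustrative):

```ts
// callLlm is a stand-in for whichever LLM client/provider is in use.
declare function callLlm(prompt: string): Promise<string>;

// Intent = the PR title; implementation = the summary of the traced context.
async function assessIntentVsImplementation(
  prTitle: string,
  contextSummary: string
): Promise<string> {
  const prompt =
    `Intent of the PR: ${prTitle}\n` +
    `Implementation (summary of traced context): ${contextSummary}\n` +
    `How similar are the intent and the implementation?`;
  return callLlm(prompt);
}
```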

We need something more than async/awaiting the resolution of a single promise here. We need Promise.all.

When Promise.all isn’t enough

Promise.all lets us fire all the promises inside it, but as soon as any one of them rejects, it rejects too, and the catch block is the code that runs.

In the snippet below, you can see how we chain the series of functions (each containing async HTTP requests) mentioned above.
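(As before, the function names are illustrative stand-ins for the real ones, each wrapping one or more HTTP requests.)

```ts
// Illustrative signatures; each function wraps one or more HTTP requests.
declare function fetchGithubContext(prNumber: number): Promise<string>;
declare function fetchSlackThreads(prNumber: number): Promise<string>;
declare function fetchNotionDocs(prNumber: number): Promise<string>;
declare function scoreWithLlm(prNumber: number): Promise<number>;

async function preReview(prNumber: number) {
  try {
    // Promise.all fires all four requests concurrently...
    const [githubContext, slackThreads, notionDocs, score] = await Promise.all([
      fetchGithubContext(prNumber),
      fetchSlackThreads(prNumber),
      fetchNotionDocs(prNumber),
      scoreWithLlm(prNumber),
    ]);
    return { githubContext, slackThreads, notionDocs, score };
  } catch (error) {
    // ...but a single rejection lands here and discards every other result.
    console.error("pre-review failed:", error);
    return null;
  }
}
```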

Promise.all is dangerous because it only resolves if all of its promises resolve. In the example above, if our LLM provider is down, everything else fails, even though calling GitHub’s, Slack’s, Notion’s, etc. APIs has nothing to do with calling the LLM.

This is where Promise.allSettled() comes into play.

Promise.allSettled() is the solution

Promise.allSettled() doesn’t depend on all promises being successful. If, in the example above, the LLM provider is down, we can still run the rest of the workflow and provide a (not as complete, but still valuable) PR pre-review that includes commenting on the code diffs.

In the very simple example below, you can see how they behave differently.
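(A minimal, self-contained comparison; the rejection message is just illustrative.)

```ts
const ok = Promise.resolve("context from GitHub");
const bad = Promise.reject(new Error("LLM provider is down"));

// Promise.all: one rejection rejects the whole batch.
Promise.all([ok, bad])
  .then((results) => console.log("all:", results))
  .catch((error) => console.error("all rejected:", error.message));
// -> all rejected: LLM provider is down

// Promise.allSettled: always resolves, reporting each outcome individually.
Promise.allSettled([ok, bad]).then((results) => {
  for (const result of results) {
    if (result.status === "fulfilled") {
      console.log("fulfilled:", result.value);
    } else {
      console.log("rejected:", result.reason.message);
    }
  }
});
// -> fulfilled: context from GitHub
// -> rejected: LLM provider is down
```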

However, we have an additional technical constraint here: rate limits. Not an issue for the immediate future, but something we need to think about for the medium term.

Octokit forces us to respond within 3 seconds at most. In a world where LLM context windows are growing very fast and the cost of processing each token keeps falling, two things happen.

First, we can imagine new use cases; second, we can build workflows that are cost-prohibitive today but will become affordable in a few months. So there’s a huge incentive to keep adding more tokens to the context window. The tradeoff is that this ultimately makes our Octokit response slower, potentially pushing it above the 3-second threshold.