Building document Q&A apps using LLMs has become mainstream now, and I built one a few months ago. Constructing a prompt with a context and getting a response from an LLM is pretty simple; tools like the OpenAI library and the huggingface transformers library have made it a walk in the park. The fun part of building such apps is putting together the different layers of the application to generate a proper prompt with a context that gives you a near-accurate response out of the LLM. So, this article will focus more on those layers and techniques than on the complete source code for every operation, because the code is pretty simple and it is the interaction between the different layers that makes this interesting. Anyway, the entire project is available on Github for you to explore and hack on something creative.
I have used supabase extensively for building this application, and being a super awesome platform, it has done a lot of the heavy lifting with clean abstractions. The project uses supabase for storage & retrieval, authentication, real-time notifications and on-the-fly embedding generation.
Apart from that, it is a simple react app making a bunch of api calls and mutating the components on the screen.
Prerequisites
A react app bootstrapped using vite
A Supabase account with a new project set up. The project should include a few tables which we will discuss in the upcoming sections
A huggingface account to download the LLAMA2 model
A little bit of Python knowledge to write the inference API
What does this app do?
The app is a simple search tool for extracting content from the documentation files in your github repos. If you work for an organization that uses github, there are probably a bunch of documents stored as Markdown (.md), .mdx or .rst files. These files contain valuable information, such as the setup instructions for a project or a warning that running a single script within the project could drop the entire production database. This app lets you pull a specific piece of information out of such documents using conversational search queries. You can ask it "What is the request schema for creating a new user?" and it will give you the relevant output by gathering information from your documents.
Below is a sample
Github REST API
After going through all of the above, you would have gotten a rough idea that we need to collect information from your github repositories. The content we need includes the repository info, the list of files in the repo and the actual content of those files.
You would have already guessed that we are not going to do all of this manually by opening every markdown file and copying the content out of it. As fun as that sounds, it would be time-consuming. So we will use the REST api provided by github to do everything, just like normal devs would. If you want to access private repos, you need to be authenticated.
If you are just testing out the API or using it for a simple use case, a simple Personal Access Token would suffice, but for an app that could be used by multiple users, a common auth token is not recommended. The clean way is to use a proper auth setup, and that is where supabase authentication comes into play.
Supabase authentication
Using Supabase Authentication, we can set up a new github oauth provider that can be used to authenticate the users. This lets you log in to the app using your github credentials and grant access to the selected orgs that you are part of. Based on your selection, only the granted org repositories will be available to you when you use the REST api.
Setting up the github provider is pretty simple and the following are the steps,
The first step is to create a new Github oauth app
On creating the oauth app, you will get a client ID and a secret, and the same needs to be configured under the supabase auth provider section
The following documentation explains the entire process clearly and can be referred to for a full picture:
https://supabase.com/docs/guides/auth/social-login/auth-github
In the react app, you can use the signInWithOAuth function to log in with the github oauth app. We specify the scope as "repo", which limits the user's access to repository-oriented operations alone. On signing in, supabase will automatically persist the session information in the browser's localStorage, and you can fetch the session anytime you want using the supabaseClient.auth.getSession() method.
supabaseClient.auth.signInWithOAuth({
provider: "github",
options: { scopes: "repo" },
});
The session data includes the user and identity information, and the important thing that we need is the provider_token. This is the temporary auth token provided by github, and we can pass it as the bearer token to access the Github REST API endpoints. Without this token, you will get access only to the public repos of the org and the private repos will be out of scope.
In the app, I have created a react hook that calls the getSession() method to fetch the provider_token and return it to the components/hooks that invoke it. You can use your favorite HTTP client for calling the api endpoints. I use axios in this project, with a common instance containing the baseURL and all the required request headers so that I need not duplicate them in multiple places. Below is a sample code to fetch the existing user session.
return supabaseClient.auth.getSession().then(({ data }) => {
if (!data.session) {
return null;
}
if (!data.session.provider_token) {
return null;
}
return {
accessToken: data?.session.access_token,
providerToken: data?.session.provider_token,
refreshToken: data?.session.refresh_token,
userName: data?.session.user.user_metadata.user_name,
};
});
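The shared axios instance itself is not shown above; a minimal sketch of what it could look like follows (the createGithubClient helper name is my own, and the headers are the usual ones for the Github REST API, not code lifted from the project):
import axios from "axios";

// Hypothetical helper: a pre-configured axios instance for the Github REST API,
// built from the provider_token returned by the session hook above
const createGithubClient = (providerToken) =>
  axios.create({
    baseURL: "https://api.github.com",
    headers: {
      Accept: "application/vnd.github+json",
      Authorization: `Bearer ${providerToken}`,
    },
  });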
Ingesting the document content
Now that we are done with the auth setup, the next stage is to ingest the content from all the valid documents into the Database. This content will be supplied as the context in the prompt for chatting with LLAMA2
Table creation
We need to create 3 tables in supabase, and their purposes are as follows:
repositories - To store the repository and the org data
documents - To store the document path, content and the related repository ID
document_embeddings - To store the embedding vector and the content of the documents. The embedding column is a vector field, and this needs the pgvector extension to be enabled for your supabase project.
When the user lands on the application, they will get the following page. This shows all the repositories that are part of the org and the ones that the user is authorized to access
Below are the steps to get the repo list like the one we have above,
Fetching the repos
As the repos are part of the org, we must fetch the list of orgs first and then use the org names to fetch the repo list
/users/{user_name}/orgs - This endpoint returns the org list. The user name in the path can be obtained from the supabase session data (session.user.user_metadata.user_name). The response will be an array of orgs, and the login field in each JSON object denotes the name of the org.
/orgs/{org_name}/repos - Invoke this endpoint iteratively with the org names to fetch the repositories and extract only the required fields from the response. For this application, we need just the repository name, which will be available as the name field in the response body. To keep track of the repositories, we check the repositories table in the DB and, if a record does not exist, we insert it into the table.
const getAllRepositories = async () => {
const { data, error } = await supabaseClient.from("repositories").select("*");
if (error) {
console.error(error);
return null;
}
if (!data || data.length === 0) {
return null;
}
return data;
};
// Compare the repo list response from the Github api
// If the table does not have a matching record, then insert the repo into the table
const insertNewRepository = async ({ org, repo }) => {
const { data, error } = await supabaseClient
.from("repositories")
.insert({ name: repo, org })
.select("*");
if (error) {
console.error(error);
return null;
}
if (!data || data.length === 0) {
return null;
}
return data[0];
};
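Putting the Github API calls and the table helpers together, a rough sketch of the sync flow could look like this (the syncRepositories name and the githubClient argument are placeholders of mine, assuming the axios instance sketched earlier):
// Fetch the orgs, then the repos of each org, and insert any repo that is not in the table yet
const syncRepositories = async (githubClient, userName) => {
  const existing = (await getAllRepositories()) ?? [];
  const { data: orgs } = await githubClient.get(`/users/${userName}/orgs`);

  for (const org of orgs) {
    const { data: repos } = await githubClient.get(`/orgs/${org.login}/repos`);
    for (const repo of repos) {
      const known = existing.some(
        (row) => row.org === org.login && row.name === repo.name
      );
      if (!known) {
        await insertNewRepository({ org: org.login, repo: repo.name });
      }
    }
  }
};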
The repository details that we are adding to the DB will be used in the actual document ingestion stage
Once all the record collection is done, you can render a clean view like the one above. In the actual application, I have used @tanstack/react-query and a few react hooks to fetch the response from the API. The table is rendered using @mui/x-data-grid.
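As a small illustration of the react-query part, the hook wrapping the repository list could look roughly like this (the useRepositories name is my own, not taken from the project):
import { useQuery } from "@tanstack/react-query";

// Hypothetical hook: cache and refetch the repository list with react-query
const useRepositories = () =>
  useQuery({
    queryKey: ["repositories"],
    queryFn: getAllRepositories,
  });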
Fetching the list of files
The next step is to get the document files from the repository. When the user clicks on the repo from the above table, the app navigates the user to a new page. This page includes the basic repo details and the filtered documents
The repo details that you see on the page are fetched from the DB table and the documents are fetched from the Github REST api
To get the files from a github repo,
You need to fetch the default branch of the repo
Use the branch to fetch the files from the HEAD of the repo
/repos/${org}/${repo_name} - This will give the repo data, and the response includes a field called default_branch which is the default branch of the repository.
/repos/${org}/${repo_name}/git/trees/${default_branch}?recursive=1 - This endpoint lists the latest version of all the files in the repo's default branch. The response includes a field called tree, which is an array of all the files in the repo, and the path field within that array has the path of each file. This array needs to be looped through to filter only the document files such as *.md, *.mdx etc. It is just a matter of splitting out the path extension and including the file only if it matches the allowed extensions.
const allowedExtensions = ["md", "mdx", "rst"];
const filteredFiles = response.data.tree
.filter((item) => {
const extension = item.path.split(".").pop();
return allowedExtensions.includes(extension);
})
.map((item) => {
return {
documentPath: item.path,
};
});
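For context, here is a sketch of how the tree response used above could be fetched (the fetchRepoTree name is mine, and it assumes the pre-configured githubClient instance from earlier):
// Fetch the default branch first, then the full file tree of that branch
const fetchRepoTree = async (githubClient, org, repoName) => {
  const repoResponse = await githubClient.get(`/repos/${org}/${repoName}`);
  const defaultBranch = repoResponse.data.default_branch;

  // recursive=1 returns every file in the repo, not just the top-level entries
  return githubClient.get(
    `/repos/${org}/${repoName}/git/trees/${defaultBranch}?recursive=1`
  );
};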
Now you can use the filtered document names to render a table with the status and an action button to ingest it
Downloading and ingesting the documents
Now that we have all the documents, we can manually ingest a single document or all the documents from the repo to the Database. Ingesting is nothing but downloading the content of the document, inserting it into the DB and generating the embedding vector for the content
We will use the serverless prowess of supabase to handle all of this.
Downloading the document content - The content of a file can be downloaded using the /repos/${org}/${repo_name}/contents/${path} endpoint. This returns the content of the file in base64 format and we can convert it into utf-8 using Buffer.from(response.data.content, "base64").toString()
Saving the content - Store the entire document content in the documents table along with the other required fields (document_name and repo_id). A sketch of these two steps follows after this list.
Generating Embeddings - If you are new to LLMs or embeddings, then I suggest going through this article from supabase. It explains the concepts clearly and how you can use supabase to store the embedding vector. To generate the embeddings for the documents, we will make use of supabase edge functions and webhooks to do it in an event-driven fashion
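Before getting to the embedding part, here is a hedged sketch of the download and save steps above (the ingestDocument helper and its argument names are mine, not the project's exact code):
// Download a document from Github, decode it and store it in the documents table
const ingestDocument = async (githubClient, { org, repoName, repoId, path }) => {
  const response = await githubClient.get(
    `/repos/${org}/${repoName}/contents/${path}`
  );

  // The Github contents API returns the file content as base64
  const content = Buffer.from(response.data.content, "base64").toString();

  return supabaseClient.from("documents").insert({
    document_name: path,
    document_content: content,
    repo_id: repoId,
  });
};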
Supabase edge functions use deno as the runtime and you can learn more about it here. We create a new function using the supabase CLI, and it generates an index.ts file which serves as the entry point for the function. Inside this file, we will write the logic to generate the embeddings using the huggingface transformers library. A basic version of the function will look something like this. If you want the entire code, then it is available here.
import { serve } from "https://deno.land/std@0.168.0/http/server.ts";
import {
env,
pipeline,
} from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.5.0";
import { createClient } from "https://esm.sh/@supabase/supabase-js@2";
env.useBrowserCache = false;
env.allowLocalModels = false;
const supabaseClient = createClient(
Deno.env.get("SUPABASE_URL") ?? "",
Deno.env.get("SUPABASE_ANON_KEY") ?? ""
);
const pipe = await pipeline("feature-extraction", "Supabase/gte-small");
serve(async (req) => {
const data = await req.json();
const {
record: { id, document_name, document_content, repo_id },
} = data; // This payload will be sent by the webhook
const output = await pipe(document_content, {
pooling: "mean",
normalize: true,
});
const embeddings = Array.from(output.data);
const payload = {
document_id: id,
document_name,
embeddings,
document_content,
};
// Inserting the generated embeddings into the table
const res = await supabaseClient
.from("document_embeddings")
.insert(payload);
return new Response(JSON.stringify({ payload }), {
headers: { "Content-Type": "application/json" },
});
});
This documentation from supabase explains the process clearly, along with the purpose of every line in the function.
After deploying the edge function, we need to set up a webhook, and that can be done from Supabase project -> Database -> Webhooks. The webhook will be set for all INSERT and UPDATE events on the documents table, and the trigger will be the edge function that we have just created. Whenever we insert a new document or update an existing document, the webhook will trigger the edge function to generate the embedding vector for the document content.
Once the edge function completes the process, we will have the table with the embedding vector, thus completing the entire ingestion flow. To ingest all the documents from a repo, we just need to repeat this process iteratively, and the app includes an "INGEST ALL" button which does exactly that.
As an extra flair, I have used supabase realtime to notify the UI about the ingestion status. On clicking the "INGEST" action button, the status gets toggled to "Processing", and once the edge function completes, the UI is notified so that the status can be updated to "Ingested".
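As a rough illustration of that notification (the channel name and the markAsIngested callback are placeholders of mine; the project may wire this up differently), a realtime subscription on the embeddings table could look like this:
// Listen for new rows in document_embeddings and flip the matching document's status in the UI
const channel = supabaseClient
  .channel("document-ingestion")
  .on(
    "postgres_changes",
    { event: "INSERT", schema: "public", table: "document_embeddings" },
    (payload) => {
      // payload.new holds the inserted row; markAsIngested is a placeholder
      // for whatever state update the UI needs
      markAsIngested(payload.new.document_id);
    }
  )
  .subscribe();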
Searching the documents
We have everything in place now, which means we can pass the ingested content to the LLM and get a response.
Why llama2? Well, the first reason is that I wanted to experiment with the model, and the next is that I thought it would be cool to tell people that I have an LLM running on my personal computer. In addition to that, if you are building a similar application for your team and your team is skeptical about passing sensitive data to an externally hosted LLM service, then using an open LLM model hosted on your org's infrastructure is a viable alternative. Keep in mind that this is a Large Language Model (emphasis on the word "Large") and running it even for inference alone is computationally intensive; you need a powerful machine to pull it off. Let's just say that you won't get favorable results by running the model on a t2.micro instance.
To be specific, the project uses the GGUF variant of LLAMA2, which is a quantized version of the original model. In theory, quantized models are lightweight and hence less resource-intensive. You can read more about it here.
The steps involved in chatting with the model are as follows,
Similarity search - When the user enters the conversational query to search the document, we need to perform a vector similarity search to fetch only the relevant document content. This can be done using supabase DB functions.
You can execute the following query in the supabase SQL editor and it will create a new function for you that will perform the similarity search on the embeddings to filter only the closely related content
create or replace function match_documents (
query_embedding vector(384),
match_threshold float,
match_count int
)
returns table (
document_id uuid,
document_content text,
similarity float
)
language sql stable
as $$
select
document_embeddings.document_id,
document_embeddings.document_content,
1 - (document_embeddings.embeddings <=> query_embedding) as similarity
from document_embeddings
where 1 - (document_embeddings.embeddings <=> query_embedding) > match_threshold
order by similarity desc
limit match_count;
$$;
From the UI we need to generate the embedding vector for the user query and the above function will perform the similarity search based on the generated vector to return the relevant content
import { pipeline } from "@xenova/transformers";
const pipe = await pipeline("feature-extraction", "Supabase/gte-small");
const generateEmbeddings = async (query) => {
const output = await pipe(query, {
pooling: "mean",
normalize: true,
});
return Array.from(output.data);
};
const search = async (query) => {
  const { data, error } = await supabaseClient.rpc("match_documents", {
    // generateEmbeddings is async, so the result must be awaited
    query_embedding: await generateEmbeddings(query),
    match_threshold: 0.5,
    match_count: 2,
  });
  if (error || !data) {
    console.error(error);
    return "";
  }
  // Join the matched document chunks into a single context string
  return data.map((item) => item.document_content).join(" ");
};
Inference API - This is the only layer where we burst out of the JS bubble and take a stroll into Python land. We will create a very simple Python script to run the inference on LLAMA2 and expose it as a REST endpoint using FastAPI.
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn
from ctransformers import AutoModelForCausalLM
app = FastAPI()
model_id = "TheBloke/Llama-2-7b-Chat-GGUF"
llm = AutoModelForCausalLM.from_pretrained(
model_id, context_length=4096, model_type="llama", gpu_layers=120
)
class InferenceRequest(BaseModel):
query: str
context: str
class InferenceResponse(BaseModel):
response: str
@app.post("/api/infer")
async def inference(request: InferenceRequest) -> InferenceResponse:
prompt = f"""
[INST] <<SYS>>
Use the following pieces of context to answer the question at the end. If the question asks to list something, then list it as Markdown bullet points with a newline character. If the answer includes source code or commands, then format it using Markdown notation. If you don't know the answer, truthfully say "I don't know". If the question is out of the context, then say "The query is not related to the context".\n\nContext:
{request.context}
<</SYS>>
User: {request.query} [/INST]
"""
generated_text = llm(prompt, max_new_tokens=4096)
return InferenceResponse(response=generated_text)
if __name__ == "__main__":
uvicorn.run(app, port=8000)
The above api accepts the user query and the filtered document content as the request payload, using which we generate a prompt. Notice that we have a bunch of gibberish in the template like [INST] and <<SYS>>. This is the suggested way of writing a prompt for LLAMA2, and more details about the purpose of these notations can be found in this blog.
Once this API is invoked, the prompt is generated and passed to the pipeline. It takes a few seconds or minutes to respond with the answer depending on the system specifications. The model runs better on a GPU, and the gpu_layers=n parameter offloads the specified number of layers to the GPU.
I haven't run a thorough benchmark, but to give a rough estimate, the inference takes somewhere around 20 to 30 seconds to complete on my machine powered by an RTX 3060 12G GPU. If you have a more powerful GPU, then the performance will be even better
By invoking this API from the react app, we can get the relevant results as the response.
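A sketch of that final call, combining the similarity search output with the user's query (the askLlama name is mine, and I am assuming the inference API is reachable at localhost:8000 as in the uvicorn snippet above):
import axios from "axios";

// Send the user query plus the matched document content to the LLAMA2 inference API
const askLlama = async (query) => {
  const context = await search(query); // similarity search from the previous section
  const { data } = await axios.post("http://localhost:8000/api/infer", {
    query,
    context,
  });
  return data.response;
};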
That's all, we did it
That's all there is to it. This is how you build a document search tool for your github organization using React + Supabase + LLAMA2. This use case is not specific to github alone; you can use a similar setup for platforms like Confluence, Sharepoint or Google Docs. The API you use to fetch the content will differ, but the embedding generation, similarity search and LLM inference will remain the same.
I initially tried to run the LLM within the browser inside a worker thread, and transformers.js enables this using ONNX models, but I couldn't get it to work with LLAMA2. That is something I will be working on in the future.
The entire project is available on Github. Configure the auth tokens for Supabase & Huggingface, and try it out
Thanks for reading and Happy Hacking!