I have been trying to build a document Q&A chatbot for the past two months. I initially started off with an application that doesn't rely on GPT, but uses the impira/layoutlm-document-qa model. It didn't work out, as the model was not suitable for carrying out casual conversations.
My next attempt involved a Python application that extracts the content from the PDF document using tesseract-ocr + pytesseract and passes it to the OpenAI API to drive the entire Q&A process. I was just getting started with langchain, so there was a lot of boilerplate and I was not using it right. That's when I decided to scrap it all and start afresh.
Third time is a charm
With the learnings from the second try, I stuck to GPT as the core LLM and chose Next.js 13 as the one-stop solution for building both the UI and the backend of the application.
Before getting into the details, here are the core tools used to build the application: Next.js 13, langchain, the OpenAI API (GPT-3.5), supabase (database, storage and auth), socket.io and react-dropzone.
And here is a sneak peek of everything put together
The flow
The user journey of the application and what goes on behind the scenes for each user action are as follows:
⬆️ Uploading the document
The journey starts with the user uploading the PDF document. For this, I used the react-dropzone library. The library supports both drag-and-drop and click-based uploads. Once the user drops the file into the input, the file is sent to the backend for processing
import { useDropzone } from 'react-dropzone';
import axios from 'axios';

const { getRootProps, getInputProps, isDragActive } = useDropzone({
  onDrop: (acceptedFiles) => {
    // check if acceptedFiles array is empty and proceed
    // call the API with the file as the payload
    const formData = new FormData();
    formData.append('file', acceptedFiles[0]);
    axios.post('/api/upload', formData);
  },
  multiple: false, // to prevent multi file upload
  accept: {
    'application/pdf': ['.pdf'] // if the file is not a PDF, then the list will be empty
  }
});
The /api/upload route takes care of three things:
Uploading the document to supabase storage bucket
Extracting the content from the document
Persisting the document details in the supabase Database
// Route to handle the upload and document processing
import { createHash } from 'crypto';
import { NextResponse } from 'next/server';
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { supabase } from '../supabase';

const bucket = 'documents'; // illustrative bucket name

export const POST = async (req) => {
  const form = await req.formData();
  const file = form.get('file');

  // Using langchain PDFLoader to extract the content of the document
  const docContent = await new PDFLoader(file, { splitPages: false })
    .load()
    .then((docs) => {
      return docs
        .map((page) => {
          // It is recommended to use the context string with no new lines
          return page.pageContent.replace(/\n/g, ' ');
        })
        .join(' ');
    });

  const fileBlob = await file.arrayBuffer();
  const fileBuffer = Buffer.from(fileBlob);

  // A checksum of the file acts as a unique identifier for the document
  const checksum = createHash('sha256').update(fileBuffer).digest('hex');

  // Uploading the document to supabase storage bucket
  await supabase
    .storage.from(bucket)
    .upload(`${checksum}.pdf`, fileBuffer, {
      cacheControl: '3600',
      upsert: true,
      contentType: file.type
    });

  // storing the document details to supabase DB
  await supabase
    .from('documents_table')
    .insert({
      document_checksum: checksum,
      document_content: docContent
      // insert other relevant document details
    });

  // The extracted content is returned so the UI can include it in the chat payload
  return NextResponse.json({ message: 'success', content: docContent }, { status: 200 });
};
🔃 Initialising Socket.io
The application uses socket.io to send and receive messages in a non-blocking way. After uploading the document successfully, the UI invokes an API - /api/socket - to open a socket server connection.
Setting up a socket.io server is usually easy, but it was a bit challenging with Next.js 13. Recent versions of Next.js introduced a new paradigm called the App Router, and setting up a socket server is not possible with it (or at least I couldn't find any documentation for it anywhere). So I had to use the old Pages Router paradigm to initialize the socket server connection.
Handler file => src/pages/api/socket.js
Dependencies required => yarn add socket.io socket.io-client
import { Server } from 'socket.io';

export default function handler(req, res) {
  // Only create the socket server once; reuse it on subsequent calls to /api/socket
  if (!res.socket.server.io) {
    const io = new Server(res.socket.server, {
      path: '/api/socket_io',
      addTrailingSlash: false
    });
    res.socket.server.io = io;

    // When the UI invokes the /api/socket endpoint, it opens a new socket connection
    io.on('connection', (socket) => {
      socket.on('message', async (data) => {
        // For every user message from the UI, this event will be triggered
        const { message } = data;
        // pass on the question and content from the message to langchain
      });
    });
  }
  res.end();
}
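For completeness, a minimal sketch of the client side might look like the following: the UI first hits /api/socket to bootstrap the server, then connects with socket.io-client using the same custom path. The hook and variable names here are illustrative and not taken from the actual project.
import { useEffect, useState } from 'react';
import io from 'socket.io-client';

export const useChatSocket = () => {
  const [socket, setSocket] = useState(null);

  useEffect(() => {
    let s;
    const connect = async () => {
      // Bootstrap the socket server (the Pages Router handler shown above)
      await fetch('/api/socket');
      // Connect using the same custom path configured on the server
      s = io({ path: '/api/socket_io' });
      setSocket(s);
    };
    connect();
    return () => s?.disconnect();
  }, []);

  return socket;
};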
💬 The real chatting
Now that we have the document content handy and the socket open to receive events, it's time to do some chatting.
The UI emits a socket event called message every time the user enters a new message. This message will include 2 important things in the payload:
The actual question
The content extracted from the document (we will get this as the response from the /api/upload route)
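As an illustration, emitting that payload from the UI could look like the sketch below. It assumes the socket from the connection sketch earlier and a documentContent value kept in React state from the /api/upload response; both names are made up for this example.
const sendMessage = (question) => {
  // `documentContent` is assumed to hold the content returned by the /api/upload route
  socket.emit('message', {
    question,
    content: documentContent
  });

  // Optimistically add the user's own message to the conversation list
  setConversations((prev) => [...prev, { user: 'human', message: question }]);
};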
We have already set up an event listener when we initialized the socket server, and we will do the actual LLM work within this listener to get the answer from GPT.
Dependencies required => yarn add openai pdf-parse langchain
import { Server } from 'socket.io';
import { Document } from 'langchain/document';
import { loadQAStuffChain } from 'langchain/chains';
import { OpenAI } from 'langchain/llms/openai';

export default function handler(req, res) {
  // Only create the socket server once; reuse it on subsequent calls to /api/socket
  if (!res.socket.server.io) {
    const io = new Server(res.socket.server, {
      path: '/api/socket_io',
      addTrailingSlash: false
    });
    res.socket.server.io = io;

    // When the UI invokes the /api/socket endpoint, it opens a new socket connection
    io.on('connection', (socket) => {
      socket.on('message', async (data) => {
        // For every user message from the UI, this event will be triggered
        const { question, content } = data;

        const llm = new OpenAI({
          openAIApiKey: process.env.OPENAI_API_KEY,
          modelName: 'gpt-3.5-turbo'
        });

        // We will be using the stuff QA chain
        // This is a very simple chain that sets the entire doc content as the context
        // Will be suitable for smaller documents
        const chain = loadQAStuffChain(llm, { verbose: true });
        const docs = [new Document({ pageContent: content })];
        const { text } = await chain.call({
          input_documents: docs,
          question
        });

        // Emitting the response from GPT back to the UI
        socket.emit('ai_message', { message: text });
      });
    });
  }
  res.end();
}
In the above snippet, we use the Stuff QA chain, which is a simple chain suitable for smaller documents. We generate a new Document object array with the content of our target document. The final step is invoking the call method to pass on the payload to the OpenAI API and get the response.
Behind the scenes, langchain generates a prompt in the following format and sends it to the OpenAI API to get the answer:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
<the_entire_document_content>
Question: <some question>?
Helpful Answer:
The above prompt will be the same for every question you ask and the document content will be passed over as the context.
On receiving the response from the API, we emit a new event called ai_message. The UI has an event listener for this event, and we handle the logic of displaying the chat bubbles based on the message:
socket?.on("ai_message", async (data) => {
  setConversations((prev) => {
    return [
      ...prev,
      {
        user: "ai",
        message: data.message,
      },
    ];
  });
});
This concludes the entire flow
⚙ Behind The Scenes
The code snippets mentioned above are trimmed versions with only the key items required for this article. Below are some of the things that the application handles along with the document QA; a rough sketch of these supporting pieces follows the list.
Persisting the document details: The document details such as the checksum of the document, original document name and the content of the document are stored in the supabase DB. This data will be used to display a chat history on the UI and enables the user to return to their conversations anytime
Persisting the conversations: The chat messages sent by the user and generated by the AI are also persisted in the database. The user can click on any document's chat history and see all the exchanged conversations
Storing the actual document: The original PDF document uploaded by the user is stored in supabase storage bucket. This is for enabling the user to download the document from the chat section to see the originally uploaded content
User authentication: I wanted to try supabase authentication, so I added a login flow to the application. With supabase's Row Level Security (RLS) and policies, the chat history and the conversations will be shown only to the authenticated users
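To make these concrete, here is a minimal sketch of the supporting supabase calls mentioned in the list above. All table, column and bucket names here are assumptions for illustration and may differ from the actual project.
import { createHash } from 'crypto';
import { supabase } from '../supabase';

// A checksum of the uploaded file doubles as a stable identifier for the document
export const getChecksum = (fileBuffer) =>
  createHash('sha256').update(fileBuffer).digest('hex');

// Persist a single chat message (from the user or the AI) against a document
export const saveMessage = async (documentChecksum, user, message) => {
  return supabase.from('conversations').insert({
    document_checksum: documentChecksum,
    user, // 'human' or 'ai'
    message
  });
};

// Generate a short-lived signed URL so the user can download the original PDF
export const getDownloadUrl = async (documentChecksum) => {
  const { data, error } = await supabase.storage
    .from('documents')
    .createSignedUrl(`${documentChecksum}.pdf`, 3600); // valid for one hour
  if (error) return { error };
  return { url: data.signedUrl };
};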
🙋🏻 One more thing...
If your aim is just to chat with a small document with just a couple of pages, then you can skip this section
Whatever we have seen until this point works well for documents with no more than 4 pages of textual content. For instance, if you want to chat with a research paper, which often runs over 10 pages, it becomes a scalability issue to send the entire content of the document back and forth for every question. I initially tested it out with the well-known research paper about transformers, which has over 40k characters, and the application choked on that content. So I had to rethink the solution.
Embeddings to the rescue... Embeddings are numeric vectors that represent a piece of text, such that words or sentences with similar meaning end up with vectors that are close to each other. To tackle the problem, I did the following:
Split the extracted document content into smaller chunks
Generate OpenAI embeddings for each chunk
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { CharacterTextSplitter } from 'langchain/text_splitter';
import { MemoryVectorStore } from 'langchain/vectorstores/memory';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';

export const extractDocumentContent = async (file) => {
  // Split the document into overlapping chunks of ~5000 characters
  const chunks = await new PDFLoader(file).loadAndSplit(
    new CharacterTextSplitter({
      chunkSize: 5000,
      chunkOverlap: 100,
      separator: ' '
    })
  );

  const openAIEmbedding = new OpenAIEmbeddings({
    openAIApiKey: process.env.OPENAI_API_KEY,
    modelName: 'text-embedding-ada-002'
  });

  const store = await MemoryVectorStore.fromDocuments(chunks, openAIEmbedding);

  const splitDocs = chunks.map((doc) => {
    return doc.pageContent.replace(/\n/g, ' ');
  });

  return {
    wholeContent: splitDocs.join(''),
    chunks: {
      content: splitDocs,
      embeddings: await store.embeddings.embedDocuments(splitDocs)
    }
  };
};
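As a quick aside, the "closeness" between two embeddings is usually measured with cosine similarity, which is essentially what the match_documents function further below does in SQL (pgvector's <=> operator returns cosine distance, and the function converts it to similarity as 1 - distance). Here is a small illustrative sketch of the same idea in JavaScript; it is not part of the actual project.
const cosineSimilarity = (a, b) => {
  // Both `a` and `b` are embedding vectors of the same length (1536 for text-embedding-ada-002)
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};

// e.g. cosineSimilarity(questionEmbedding, chunkEmbedding) close to 1 means the chunk is highly relevant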
Store each chunk and its respective embeddings in the supabase DB. The supabase blog post on pgvector (linked in the references) explains how to work with vector data types in supabase.
const saveDocumentChunks = async (file) => {
  // Invoke the function from above to get the chunks
  const { chunks } = await extractDocumentContent(file);
  const { content, embeddings } = chunks;

  // Store the content of each chunk and its respective embedding in the DB
  // For context, a single embedding vector will look something like this:
  // [-0.021596793,0.0027229148,0.019078722,-0.019771526, ...]
  for (let i = 0; i < content.length; i++) {
    const { error } = await supabase
      .from('document_chunks')
      .insert({
        chunk_number: i + 1,
        chunk_content: content[i],
        chunk_embedding: embeddings[i] // chunk_embedding is of type `vector` in the Database
      });
    if (error) {
      // Be mindful of implementing a rollback strategy even if storing a single chunk fails
      return { error };
    }
  }
  return { error: null };
};
Do a similarity search on the vector Database using the user's question to filter only the relevant chunks (we need to set up a supabase plpgsql function to rank the chunks based on similarity).
create function match_documents (
  query_embedding vector(1536),
  match_count int default null,
  filter_checksum varchar default ''
) returns table (
  document_checksum varchar,
  chunk_content text,
  similarity float
)
language plpgsql
as $$
#variable_conflict use_column
begin
  return query
  select
    document_checksum,
    chunk_content,
    1 - (document_chunks.chunk_embedding <=> query_embedding) as similarity
  from document_chunks
  where document_checksum = filter_checksum
  order by document_chunks.chunk_embedding <=> query_embedding
  limit match_count;
end;
$$;
Use the filtered chunks as the context for answering the questions
import { Document } from 'langchain/document';
import { loadQAStuffChain } from 'langchain/chains';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { OpenAI } from 'langchain/llms/openai';
import { supabase } from '../supabase';

const inference = async (question) => {
  const openAIEmbedding = new OpenAIEmbeddings({
    openAIApiKey: process.env.OPENAI_API_KEY,
    modelName: 'text-embedding-ada-002'
  });

  // Rank the stored chunks against the question using the match_documents function
  const { data, error } = await supabase.rpc('match_documents', {
    query_embedding: await openAIEmbedding.embedQuery(question),
    match_count: 5,
    filter_checksum: "unique_document_checksum"
  });
  if (error) return { error };

  // Join the matching chunks into a single context string
  const content = data.map((v) => v.chunk_content).join(' ');

  // set the `content` as the context of the prompt and ask the questions
  const llm = new OpenAI({
    openAIApiKey: process.env.OPENAI_API_KEY,
    modelName: 'gpt-3.5-turbo'
  });
  const chain = loadQAStuffChain(llm, { verbose: true });
  const docs = [new Document({ pageContent: content })];
  const { text } = await chain.call({
    input_documents: docs,
    question
  });
  return { answer: text };
};
With these improvements in place, I uploaded the same research paper and started a conversation. To my surprise, it worked on the first try and gave back the answers with ease
In this approach, you need not pass the document content with every message, because the relevant content is fetched from the DB based on the question's similarity to the stored chunks.
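Tying this back to the socket handler, in the embeddings-based approach the message payload only needs the question and some identifier of the document (the checksum, for instance), and the handler hands the lookup over to the inference function. A rough sketch, assuming inference is extended to accept the document checksum:
// Inside the socket 'message' handler from earlier, the document content
// no longer travels with the payload; only the question and the checksum do
socket.on('message', async (data) => {
  const { question, checksum } = data;

  // `inference` is assumed to be extended to take the checksum and use it
  // as the `filter_checksum` argument of the match_documents RPC
  const { answer, error } = await inference(question, checksum);

  if (!error) {
    socket.emit('ai_message', { message: answer });
  }
});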
🚀 Where is the code?
I have published the entire project on GitHub with instructions to set up the project locally and to set up the supabase project.
🏁 Conclusion
Like they say, "third time is a charm". Finally, after two failed attempts, I have built an application that works fine with all the integrations and the accuracy with which GPT handles the questions feels mystical sometimes. You can try out the application locally by cloning the repository and running it. Just ensure that you have set up a working supabase project and are ready to spend a few bucks on OpenAI.
Happy hacking!
📚 References
https://supabase.com/blog/openai-embeddings-postgres-vector
https://js.langchain.com/docs/modules/data_connection/vectorstores/integrations/supabase
https://js.langchain.com/docs/api/document_loaders_fs_pdf/classes/PDFLoader
https://js.langchain.com/docs/api/text_splitter/classes/CharacterTextSplitter