Ragbits is a project from the trenches. It's an open-source toolbox for building production-ready AI applications, specifically focused on retrieval-augmented generation, or RAG, systems. In this video, we are going to install Ragbits with Ollama and check out how exactly it works. This is Faz Mida and I welcome you to the channel. Ragbits uses Python generics and Pydantic models for strict typing. You can switch between 100-plus LLMs via LiteLLM, or you can even use your own full-blown local models. It has built-in observability, testing and monitoring tools. Plus, it handles 20-plus document formats with multiple parsing engines. You can also have Ray-based parallel processing for large datasets, and it covers everything from document ingestion to chat UI deployment. So, let's get started.

As I said, I'm going to use Ollama. If you don't know what Ollama is, it is one of the easiest tools to run large language models locally. I have done hundreds of videos on it, so if you're interested, check my channel out. If you want to install it, just click on this download. For Linux, run this command; for Mac and Windows, simply download the executable and run it, and you should be all set to go. I already have Ollama installed on this Ubuntu system, and I have one GPU card, an NVIDIA RTX 6000 with 48 GB of VRAM. If I do ollama list, you can see that I already have this large language model installed. Not only do we need a large language model to answer users' queries, but we also need an embedding model, and for that I am just pulling nomic-embed-text from Ollama. You can use any model of your choice; as I said, there are 100-plus available. If you want to use any model other than Qwen, all you need to do is go to the models page and pick whatever model you would like to use. Also, if you are looking to rent a GPU, VM or CPU at very good prices, you can go to Mass Compute's website; the link is in the video description along with a 50% discount coupon code for a range of GPUs. Okay, so our Ollama is good and our models are ready, and I will be talking more about why exactly we need two models. These are the two models which we are going to use in this video.

Okay, so Ollama is ready. The next step is to install Ragbits, and the installation is fairly simple: you just use pip install ragbits. While it installs, let me also introduce you to the sponsors of the video, Camel AI. Camel is a very interesting tool focused on building multi-agent infrastructures for finding the scaling laws of agents, with applications in data generation, task automation and world simulation. You can find the link in the video's description. The install pulls in a lot of things, so we just have to be patient. Almost there. And that's done. Let's also install LiteLLM just to make sure; it's already there, so we don't even have to run this again, which is cool. So that is all the installation we needed to do, and now let me show you some coding examples of how exactly you can use Ragbits. I have put these coding examples in my VS Code. In this first one, I'm going to show you a simple use of Ragbits: basic LLM usage, and then we will move on to the more complex ones. All we are doing here is importing all of these libraries.
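To make that concrete, here is a minimal sketch of what such a basic LLM script can look like. This is my own reconstruction following the Ragbits quickstart pattern, not the exact code shown on screen: the import paths, the Prompt generic signature and the api_base parameter are assumptions that may differ slightly between Ragbits versions, and the model name and port simply reflect a local Ollama setup.

```python
import asyncio

from pydantic import BaseModel

# Import paths follow the Ragbits quickstart; they may differ slightly between versions.
from ragbits.core.llms import LiteLLM
from ragbits.core.prompt import Prompt


class QuestionInput(BaseModel):
    question: str


class QAPrompt(Prompt[QuestionInput, str]):
    # Jinja templates: Ragbits fills {{ question }} from the Pydantic input model.
    system_prompt = "You are an intelligent assistant. Answer clearly and concisely."
    user_prompt = "Question: {{ question }}"


# Point LiteLLM at the local Ollama server (default port 11434).
# Assumption: api_base is passed through to LiteLLM; the model is whatever you pulled with Ollama.
llm = LiteLLM(model_name="ollama/qwen3", api_base="http://localhost:11434")


async def main() -> None:
    prompt = QAPrompt(QuestionInput(question="What does 2 + 2 mean?"))
    response = await llm.generate(prompt)
    print(response)


if __name__ == "__main__":
    asyncio.run(main())
```

Swapping Qwen 3 for Gemma 3 later is then just a one-line change to the model name, which is exactly the kind of flexibility the LiteLLM routing is meant to give you.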
Now this might seem like a lot of things being imported, but that is a good thing: we don't have to create them from scratch with our own manual effort. This is what makes it really cool; it has a lot of things built in. You can see it is even using an in-memory vector store, and I have just put in some logging here. Then, if you remember, I told you that we are going to use Pydantic, which enables you to have a structured format for your data, and that is where we are creating these classes. I'm just going to go with this Q&A sort of setup, where I'm telling it that it is an intelligent assistant and this will be the question. This is a basic LLM example where I am using my local Qwen 3 model on localhost at port 11434, and then I am asking it a question. There is no RAG involved yet; I just wanted to show you the building blocks of Ragbits. So let me run this quickly from my terminal.

Okay, so let me run this. You see it has selected our Ollama model, and I'll just scroll up to show you. First there is a simple connection test to make sure everything is running fine, which it is. Then it has given us a response about what 2 + 2 means, reasoning through it, and that looks pretty good. It has connected to our Ollama-based model and given us an answer to what we asked in the actual prompt. There you go, looks really good. This is a reasoning model, though, so I think I should have gone with some other model, because these Qwen 3 models normally wrap their reasoning in a thinking tag, which could make things a bit harder since you would have to parse it out. So maybe I will just move to another model to make sure it doesn't do the reasoning. But anyway, you can see that it is working. Now I have replaced the Qwen model with the Gemma 3 model, and you can see it's a different response, because it is not reasoning through and it has given us a very grounded answer. I'm not doing exception handling, so it is also telling us that there is an unclosed client session; you can fix that easily. If I quickly show you my new model, there you go: I just replaced Qwen 3 with Gemma 3. This is how easy it makes it to switch between models.

Okay, next up, let me show you an example of how you can do document search with embeddings. If you remember, we downloaded two models: one was an embedding model, the other was a large language model. The role of the embedding model is this: when you are doing retrieval-augmented generation, all of these models have been trained on a huge set of data, but they don't know about your own data. That is where you need to provide the context of your own data to your models. So you take your documents and first convert them into numerical representations, or embeddings, with the help of the embedding model. You store them in a vector store, and from there, whenever a user makes a query, that query is matched with similar results in the vector store, the prompt gets appended with the retrieved data, and that is how the model becomes more grounded and in context with your own data. That is why we use this embedding model. So in this one, you can see that we are creating an embedder with the help of nomic-embed-text. Then we are just adding three sample texts about Python, vectors and ML. Each gets converted to a vector and stored in the in-memory vector store. And then we are asking it a question; a rough sketch of this example follows below.
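Here is roughly what that document-search example looks like. Again, treat it as a sketch rather than the exact on-screen code: the class names (LiteLLMEmbedder, InMemoryVectorStore, DocumentSearch, DocumentMeta) follow the Ragbits examples, but exact import paths and constructor parameters can vary between versions, and the three sample sentences are just my stand-ins.

```python
import asyncio

# Class names and paths follow the Ragbits examples; adjust for your installed version.
from ragbits.core.embeddings import LiteLLMEmbedder
from ragbits.core.vector_stores import InMemoryVectorStore
from ragbits.document_search import DocumentSearch
from ragbits.document_search.documents.document import DocumentMeta

# Embeddings come from the nomic-embed-text model served by the local Ollama instance.
embedder = LiteLLMEmbedder(model_name="ollama/nomic-embed-text", api_base="http://localhost:11434")

# Depending on the Ragbits version, the embedder is attached to the vector store
# or passed to DocumentSearch directly.
vector_store = InMemoryVectorStore(embedder=embedder)
document_search = DocumentSearch(vector_store=vector_store)


async def main() -> None:
    # Three small sample texts (my stand-ins) about Python, vectors and ML.
    await document_search.ingest([
        DocumentMeta.create_text_document_from_literal("Python is a popular, readable programming language."),
        DocumentMeta.create_text_document_from_literal("A vector store holds numerical embeddings of text chunks."),
        DocumentMeta.create_text_document_from_literal("Machine learning models learn patterns from example data."),
    ])

    # The question is embedded with the same model and matched against the stored vectors.
    results = await document_search.search("What is Python?")
    for element in results:
        print(element)


if __name__ == "__main__":
    asyncio.run(main())
```

Because the store is in-memory, everything disappears when the script exits; Ragbits also supports pluggable vector store backends if you need persistence.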
When we run this, it converts our question to a vector and finds the most similar stored vectors with the help of Ragbits. So let me run it. Okay, sorry, I didn't give it the variable, so let me fix the code. The typo is fixed, and you can see that it is now working. The document search has found three relevant text chunks, ranked by similarity to the question "what is Python". Okay, so that is easy enough.

Now let me show you a full RAG example. To put it all together, let me show you the complete RAG pipeline and how easy it is to build. Just import the libraries, define your classes, and set up your models, both the embedding model and your language model; I'm just going to go with Gemma 3. This is what the user is asking, and this is our own data: let's say we just want to search through this PDF doc. It's a small document, and I'm simply asking it what the key findings are. That's pretty much it, and then we print the response. Now let's go back and run this. All this code is doing is setting up the whole infrastructure for RAG: creating an embedder to convert text to vectors, a vector store to hold them (you can pick any one you like), and a language model for generation. Then it downloads the paper, chunks it into searchable pieces, and stores vector embeddings of each chunk in the database. From there, when we ask what the key findings are, it searches through the whole paper, which it is doing right now, to find the most semantically relevant chunks, and then feeds those specific chunks as context to the language model along with your question. A rough sketch of this end-to-end pipeline is included after the wrap-up below. So let's wait for it. There you go. You can see it has found all the key findings, and you can just ignore a few of these warnings, which are primarily around parsing. So, all in all, a pretty decent tool that makes things quite easy. I think RAG has come a long way, maybe, you know, touching a plateau, but it is good to see Ragbits making it even easier to build and to knit all of these components together. Let me know your thoughts. If you like the content, please like the video and share it. And if you haven't already subscribed, please do so, as it helps a lot. Thank you for all the support.
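As promised, here is a hedged sketch of that end-to-end RAG pipeline: embedder, vector store, document ingestion and a Gemma 3 generation step, all backed by the local Ollama server. It is a reconstruction in the spirit of the Ragbits examples rather than the exact script from the video; the arXiv URL is only a placeholder paper, and details such as the text_representation attribute, the web:// ingest scheme and the api_base parameter are assumptions that may need adjusting for your Ragbits version.

```python
import asyncio

from pydantic import BaseModel

# Same caveat as before: paths and class names follow the Ragbits examples and may vary by version.
from ragbits.core.embeddings import LiteLLMEmbedder
from ragbits.core.llms import LiteLLM
from ragbits.core.prompt import Prompt
from ragbits.core.vector_stores import InMemoryVectorStore
from ragbits.document_search import DocumentSearch


class QueryWithContext(BaseModel):
    query: str
    context: list[str]


class RAGPrompt(Prompt[QueryWithContext, str]):
    system_prompt = "Answer the question using only the provided context. If the context is insufficient, say so."
    user_prompt = (
        "Question: {{ query }}\n\n"
        "Context:\n"
        "{% for chunk in context %}- {{ chunk }}\n{% endfor %}"
    )


# Both models are served by the local Ollama instance.
embedder = LiteLLMEmbedder(model_name="ollama/nomic-embed-text", api_base="http://localhost:11434")
vector_store = InMemoryVectorStore(embedder=embedder)
document_search = DocumentSearch(vector_store=vector_store)
llm = LiteLLM(model_name="ollama/gemma3", api_base="http://localhost:11434")


async def main() -> None:
    # Ingest a PDF. The URL below is only a placeholder paper, not the one used in the video.
    await document_search.ingest("web://https://arxiv.org/pdf/1706.03762")

    question = "What are the key findings?"

    # Retrieve the most semantically relevant chunks for the question.
    results = await document_search.search(question)
    # Assumption: retrieved elements expose their text via text_representation.
    context = [element.text_representation for element in results]

    # Feed the retrieved chunks plus the question to the language model.
    response = await llm.generate(RAGPrompt(QueryWithContext(query=question, context=context)))
    print(response)


if __name__ == "__main__":
    asyncio.run(main())
```

If you use a reasoning model like Qwen 3 for the generation step, keep the earlier caveat about thinking tags in mind; Gemma 3 keeps the output plain.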