.env + LlamaCpp + PDF/CSV + Ingest All
- .env: Added an env file to make configuration easier.
- LlamaCpp: Added support for LlamaCpp in .env (MODEL_TYPE=LlamaCpp).
- PDF/CSV: Added support for PDF and CSV files.
- Ingest All: All files in source_documents are automatically stored in the vector store according to their file type when running ingest; a path argument is no longer needed.
parent 60225698b6
commit 52ae6c0866

README.md (21 lines changed)
@@ -1,7 +1,7 @@
 # privateGPT
 Ask questions to your documents without an internet connection, using the power of LLMs. 100% private, no data leaves your execution environment at any point. You can ingest documents and ask questions without an internet connection!

-Built with [LangChain](https://github.com/hwchase17/langchain) and [GPT4All](https://github.com/nomic-ai/gpt4all)
+Built with [LangChain](https://github.com/hwchase17/langchain) and [GPT4All](https://github.com/nomic-ai/gpt4all) and [LlamaCpp](https://github.com/ggerganov/llama.cpp)

 <img width="902" alt="demo" src="https://user-images.githubusercontent.com/721666/236942256-985801c9-25b9-48ef-80be-3acbb4575164.png">

@@ -13,26 +13,29 @@ In order to set your environment up to run the code here, first install all requ
 pip install -r requirements.txt
 ```

+Rename example.env to .env and edit the variables appropriately.
+MODEL_TYPE supports LlamaCpp or GPT4All
+
 Then, download the 2 models and place them in a folder called `./models`:
 - LLM: default to [ggml-gpt4all-j-v1.3-groovy.bin](https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin). If you prefer a different GPT4All-J compatible model, just download it and reference it in `privateGPT.py`.
-- Embedding: default to [ggml-model-q4_0.bin](https://huggingface.co/Pi3141/alpaca-native-7B-ggml/resolve/397e872bf4c83f4c642317a5bf65ce84a105786e/ggml-model-q4_0.bin). If you prefer a different compatible Embeddings model, just download it and reference it in `privateGPT.py` and `ingest.py`.
+- Embedding: default to [ggml-model-q4_0.bin](https://huggingface.co/Pi3141/alpaca-native-7B-ggml/resolve/397e872bf4c83f4c642317a5bf65ce84a105786e/ggml-model-q4_0.bin). If you prefer a different compatible Embeddings model, just download it and reference it in `.env`.

 ## Test dataset
 This repo uses a [state of the union transcript](https://github.com/imartinez/privateGPT/blob/main/source_documents/state_of_the_union.txt) as an example.

 ## Instructions for ingesting your own dataset

-Get your .txt file ready.
+Put any and all of your .txt, .pdf, or .csv files into the source_documents directory

-Run the following command to ingest the data.
+Run the following command to ingest all the data.

 ```shell
-python ingest.py <path_to_your_txt_file>
+python ingest.py
 ```

-It will create a `db` folder containing the local vectorstore. Will take time, depending on the size of your document.
-You can ingest as many documents as you want by running `ingest`, and all will be accumulated in the local embeddings database.
-If you want to start from scratch, delete the `db` folder.
+It will create a `db` folder containing the local vectorstore. Will take time, depending on the size of your documents.
+You can ingest as many documents as you want, and all will be accumulated in the local embeddings database.
+If you want to start from an empty database, delete the `db` folder.

 Note: during the ingest process no data leaves your local environment. You could ingest without an internet connection.

@@ -59,7 +62,7 @@ Type `exit` to finish the script.
 Selecting the right local models and the power of `LangChain` you can run the entire pipeline locally, without any data leaving your environment, and with reasonable performance.

 - `ingest.py` uses `LangChain` tools to parse the document and create embeddings locally using `LlamaCppEmbeddings`. It then stores the result in a local vector database using `Chroma` vector store.
-- `privateGPT.py` uses a local LLM based on `GPT4All-J` to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the docs.
+- `privateGPT.py` uses a local LLM based on `GPT4All-J` or `LlamaCpp` to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the docs.
 - `GPT4All-J` wrapper was introduced in LangChain 0.0.162.

 # Disclaimer

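Taken together, the README changes describe a retrieval pipeline: `ingest.py` builds a local Chroma store from source_documents, and `privateGPT.py` answers questions against it. A minimal sketch of how the retrieval step alone could be inspected, assuming the default paths and models from this commit's .env and that ingestion has already been run (illustrative only, not part of the commit):

```python
# Hypothetical check of the retrieval step: open the persisted Chroma store and
# run a similarity search directly, without invoking the LLM. Paths and model
# names are assumed from this commit's example.env defaults.
from langchain.embeddings import LlamaCppEmbeddings
from langchain.vectorstores import Chroma

llama = LlamaCppEmbeddings(model_path="models/ggml-model-q4_0.bin", n_ctx=1000)
db = Chroma(persist_directory="db", embedding_function=llama)

# Show the four chunks most similar to the question; privateGPT.py feeds
# chunks like these to the LLM as context.
for doc in db.similarity_search("What did the president say about inflation?", k=4):
    print(doc.metadata.get("source"), doc.page_content[:80])
```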
example.env (new file)

@@ -0,0 +1,5 @@
+PERSIST_DIRECTORY=db
+LLAMA_EMBEDDINGS_MODEL=models/ggml-model-q4_0.bin
+MODEL_TYPE=GPT4All
+MODEL_PATH=models/ggml-gpt4all-j-v1.3-groovy.bin
+MODEL_N_CTX=1000
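The scripts changed below read these settings with `os.environ.get`, so the values have to reach the process environment somehow, for example by exporting them in the shell. A minimal sketch of one way to do that from Python, assuming the python-dotenv package is installed; this loader is not part of the diffs in this commit:

```python
# Hypothetical helper: load .env into os.environ before ingest.py or
# privateGPT.py read their settings. Assumes python-dotenv is installed.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env in the current directory

persist_directory = os.environ.get("PERSIST_DIRECTORY", "db")
model_type = os.environ.get("MODEL_TYPE", "GPT4All")
model_path = os.environ.get("MODEL_PATH")
model_n_ctx = int(os.environ.get("MODEL_N_CTX", "1000"))

print(f"Using {model_type} model at {model_path} (n_ctx={model_n_ctx})")
```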
ingest.py (19 lines changed)
@@ -1,19 +1,28 @@
-from langchain.document_loaders import TextLoader
+import os
+from langchain.document_loaders import TextLoader, PDFMinerLoader, CSVLoader
 from langchain.text_splitter import RecursiveCharacterTextSplitter
 from langchain.vectorstores import Chroma
 from langchain.embeddings import LlamaCppEmbeddings
-from sys import argv

 def main():
+    llama_embeddings_model = os.environ.get('LLAMA_EMBEDDINGS_MODEL')
+    persist_directory = os.environ.get('PERSIST_DIRECTORY')
+    model_n_ctx = os.environ.get('MODEL_N_CTX')
     # Load document and split in chunks
-    loader = TextLoader(argv[1], encoding="utf8")
+    for root, dirs, files in os.walk("source_documents"):
+        for file in files:
+            if file.endswith(".txt"):
+                loader = TextLoader(os.path.join(root, file), encoding="utf8")
+            elif file.endswith(".pdf"):
+                loader = PDFMinerLoader(os.path.join(root, file))
+            elif file.endswith(".csv"):
+                loader = CSVLoader(os.path.join(root, file))
     documents = loader.load()
     text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
     texts = text_splitter.split_documents(documents)
     # Create embeddings
-    llama = LlamaCppEmbeddings(model_path="./models/ggml-model-q4_0.bin")
+    llama = LlamaCppEmbeddings(model_path=llama_embeddings_model, n_ctx=model_n_ctx)
     # Create and store locally vectorstore
-    persist_directory = 'db'
     db = Chroma.from_documents(texts, llama, persist_directory=persist_directory)
     db.persist()
     db = None

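One thing to note in the new ingest.py: `documents = loader.load()` sits outside the `os.walk` loop, so only the last loader assigned during the walk is actually read. A hedged sketch of how the loop could accumulate every supported file before splitting; the loader classes and directory name follow the diff, but this variant is not part of the commit:

```python
# Hypothetical variant: collect documents from every supported file instead of
# only the most recently assigned loader. Loader classes as in the diff above.
import os
from langchain.document_loaders import TextLoader, PDFMinerLoader, CSVLoader

LOADERS = {
    ".txt": lambda path: TextLoader(path, encoding="utf8"),
    ".pdf": PDFMinerLoader,
    ".csv": CSVLoader,
}

documents = []
for root, dirs, files in os.walk("source_documents"):
    for file in files:
        ext = os.path.splitext(file)[1].lower()
        if ext in LOADERS:
            loader = LOADERS[ext](os.path.join(root, file))
            documents.extend(loader.load())  # accumulate across all files
```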
privateGPT.py

@@ -2,17 +2,31 @@ from langchain.chains import RetrievalQA
 from langchain.embeddings import LlamaCppEmbeddings
 from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
 from langchain.vectorstores import Chroma
-from langchain.llms import GPT4All
+from langchain.llms import GPT4All, LlamaCpp
+import os
+
+llama_embeddings_model = os.environ.get("LLAMA_EMBEDDINGS_MODEL")
+persist_directory = os.environ.get('PERSIST_DIRECTORY')
+
+model_type = os.environ.get('MODEL_TYPE')
+model_path = os.environ.get('MODEL_PATH')
+model_n_ctx = os.environ.get('MODEL_N_CTX')

 def main():
     # Load stored vectorstore
-    llama = LlamaCppEmbeddings(model_path="./models/ggml-model-q4_0.bin")
-    persist_directory = 'db'
+    llama = LlamaCppEmbeddings(model_path=llama_embeddings_model, n_ctx=model_n_ctx)
     db = Chroma(persist_directory=persist_directory, embedding_function=llama)
     retriever = db.as_retriever()
     # Prepare the LLM
     callbacks = [StreamingStdOutCallbackHandler()]
-    llm = GPT4All(model='./models/ggml-gpt4all-j-v1.3-groovy.bin', backend='gptj', callbacks=callbacks, verbose=False)
+    match model_type:
+        case "LlamaCpp":
+            llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False)
+        case "GPT4All":
+            llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False)
+        case _default:
+            print(f"Model {model_type} not supported!")
+            exit;
     qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)
     # Interactive questions and answers
     while True:
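Two small caveats about the model selection above: the `match` statement requires Python 3.10 or newer, and the bare `exit;` in the fallback case evaluates the `exit` builtin without calling it, so execution continues with `llm` undefined. A hedged if/elif equivalent that exits explicitly, using the same variables and imports as the diff (illustrative only, not part of the commit):

```python
# Hypothetical replacement for the match block above, usable on older Pythons.
# Assumes model_type, model_path, model_n_ctx and callbacks are defined as in
# privateGPT.py from this commit.
import sys
from langchain.llms import GPT4All, LlamaCpp

if model_type == "LlamaCpp":
    llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False)
elif model_type == "GPT4All":
    llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False)
else:
    sys.exit(f"Model {model_type} not supported!")  # stop instead of falling through
```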