35 Commits

Author SHA1 Message Date
Iván Martínez 0db5aebf2f Use chromadb max_batch_size public attribute 2023-09-25 11:42:16 +02:00
Iván Martínez 91163a247b Batch embeddings to be processed by chromadb 2023-08-31 16:36:19 +02:00
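The batching described in the two commits above can be sketched roughly as follows. This is a minimal illustration, not the code from the commits: `split_into_batches` is a hypothetical helper name, and `max_batch_size` is passed in as a plain integer here, on the assumption that the real value would be read from the chromadb client.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(items: Sequence[T], max_batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `max_batch_size` long.

    Hypothetical helper: the real limit is assumed to come from the
    vector store client rather than being hard-coded.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be a positive integer")
    for start in range(0, len(items), max_batch_size):
        # Slicing past the end is safe; the last batch is simply shorter.
        yield list(items[start:start + max_batch_size])
```

Each yielded batch would then be handed to the vector store in a separate call, so that no single call exceeds the store's limit.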
Iván Martínez 2940f987c0
Merge pull request #822 from VaiTon/fix/env-not-existing
Better error message if .env is empty/does not exist.
2023-08-28 17:41:47 +02:00
Iván Martínez 7b294ed31f Update dependencies. Upgrade chromadb integration. 2023-08-28 17:32:56 +02:00
parampavar 8f369dd2b9
Adding support to ingest files with extensions in uppercase
Files in the source_directory were ignored if their extensions were in uppercase (e.g. *.PDF).
This change supports ingestion of files that match either lowercase or uppercase extensions, such as *.pdf or *.PDF.
This can be enhanced at a later stage to support mixed-case extensions like *.Pdf. The assumption is that this scenario probably covers less than 5% of cases.
2023-08-16 16:03:56 -07:00
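The matching rule described in the commit above (all-lowercase or all-uppercase extensions, but not mixed case) might look something like this sketch. `find_files` is a hypothetical name and the directory walk is an assumption; the actual commit may build its file list differently.

```python
from pathlib import Path
from typing import Iterable, List


def find_files(source_dir: str, extensions: Iterable[str]) -> List[Path]:
    """Return files whose suffix is an all-lowercase or all-uppercase match.

    Hypothetical helper. Mixed-case suffixes such as ".Pdf" are
    deliberately NOT matched, mirroring the limitation the commit states.
    """
    wanted = set()
    for ext in extensions:
        wanted.add(ext.lower())
        wanted.add(ext.upper())
    return sorted(
        path
        for path in Path(source_dir).rglob("*")
        if path.is_file() and path.suffix in wanted
    )
```

For example, `find_files("source_documents", [".pdf", ".csv"])` would pick up `report.pdf` and `REPORT.PDF` but skip `Report.Pdf`.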
VaiTon 28537b6a84 Better error message if .env is empty/does not exist. 2023-07-06 00:16:11 +02:00
sj 05c7330643 Enhancement: better performance for PDF loader 2023-06-07 23:51:05 +08:00
Ravi e9b31f7dd9
Update ingest.py
Co-authored-by: Bailey Matthews <bailey@hey.com>
2023-05-31 22:42:10 +05:30
Ravindra Prasad db341e2a40 Fixed the CSV file reading issue 2023-05-31 00:04:56 +05:30
Iván Martínez 80b9b1d03e Better logs during ingestion 2023-05-20 12:11:21 +02:00
Iván Martínez 4a0e0d2e70 Use chunk_size variable in logs. Make vectorstore check more flexible 2023-05-20 12:02:40 +02:00
Iván Martínez 7180d4386b Merge branch 'main' of https://github.com/maozdemir/privateGPT into maozdemir-main 2023-05-20 11:48:29 +02:00
Iván Martínez 20554a7c9d
Merge pull request #292 from jiangzhuo/feature/multiprocessing-for-document-loading
Optimize load_documents function with multiprocessing
2023-05-20 10:57:42 +02:00
MDW 7f918a9fa1 Make scripts executable, add basic pre-commit setup 2023-05-19 23:21:39 +02:00
MDW 4cda348cf8 Fix #294 (tested) 2023-05-19 16:23:09 +02:00
jiangzhuo ba0dbe8d1c Add progress bar to load_documents function
Enhanced the load_documents() function by adding a progress bar using the tqdm library. This change improves user experience by providing real-time feedback on the progress of document loading. Now, users can easily track the progress of this operation, especially when loading a large number of documents.
2023-05-19 10:59:38 +09:00
jiangzhuo 81b221bccb Optimize load_documents function with multiprocessing 2023-05-19 10:58:28 +09:00
MDW a862ff2be6 Add fallback for plain eml #294 #290 2023-05-19 01:04:42 +02:00
Iván Martínez b9f8dc312f
Merge pull request #254 from Fabio3rs/formatOffice97-2003
Add .doc .ppt (Word and PowerPoint 97/2003 formats)
2023-05-18 23:49:40 +02:00
impulsivus 7844553ca1
Implement a way of ingesting more documents
Move environment variables to the global scope
Add a better check for vectorstore existence
Introduced a new function for better readability
Co-authored-by: Pulp <51127079+PulpCattel@users.noreply.github.com>
2023-05-18 17:45:38 +03:00
Fabio Rossini Sluzala ec126b51d8
Fix loader mapping order 2023-05-17 22:38:30 -03:00
vilaca 79a3c00313 remove duplicate 2023-05-17 23:45:27 +01:00
Fabio Rossini Sluzala 66a9f9cde0
Add .doc .ppt (Word and PowerPoint 97/2003 formats) 2023-05-17 12:04:16 -03:00
Iván Martínez bf3bddfbb6 More loaders, generic method
- Update the README with extra formats
- Add PowerPoint, requested in #138
- Add ePub requested in #138 comment - https://github.com/imartinez/privateGPT/pull/138#issuecomment-1549564535
- Update requirements
2023-05-17 00:55:21 +02:00
Iván Martínez 23d24c88e9 Update code to use sentence-transformers through huggingfaceembeddings 2023-05-17 00:32:41 +02:00
Andrea Pinto d0aa57178a ingest unlimited number of documents 2023-05-12 15:36:20 +02:00
Andrea Pinto 01f55441e7 fix persist db directory at ingestion 2023-05-12 10:37:10 +02:00
Sorin Neacsu 544ddd9631
load .env 2023-05-11 15:34:17 -07:00
alxspiker f60dbb520e
Merge branch 'main' into main 2023-05-11 14:34:13 -06:00
alxspiker 52ae6c0866 .env + LlamaCpp + PDF/CSV + Ingest All
.env

Added an env file to make configuration easier

LlamaCpp

Added support for LlamaCpp in .env (MODEL_TYPE=LlamaCpp)

PDF/CSV

Added support for PDF and CSV files.

Ingest All

All files in source_documents are now stored in the vector store automatically, based on their file type, when running ingest; a path argument is no longer needed.
2023-05-11 14:24:39 -06:00
R-Y-M-R f12ea568e5 Use constants.py file 2023-05-11 10:29:07 -04:00
R-Y-M-R 8c6a81a07f Fix: Disable Chroma Telemetry
Opts out of anonymized telemetry being tracked in Chroma.

See: https://docs.trychroma.com/telemetry
2023-05-11 10:17:18 -04:00
Iván Martínez 026b9f895c Use RecursiveCharacterTextSplitter to avoid llama_tokenize: too many tokens error during ingestion 2023-05-09 00:21:02 +02:00
Iván Martínez 92244a90b4 Use a different text splitter to improve results. Ingest takes an argument pointing to the doc to ingest. 2023-05-05 17:32:31 +02:00
Iván martínez 55338b8f6e End-to-end working version 2023-05-02 20:32:28 +02:00