Allow indexing of large files #1040

rsdy · 2023-10-11T14:31:29Z

Allow indexing of large files, and also make a best effort at lazy loading these.

The hypothesis is that files in Git are generally not too large. Because of the iteration logic, we have always read large files from Git in their entirety and then discarded them early in the process. That does not change here; we just discard them later.

For files from the file system, the process never reads the contents until it knows it needs them. This may affect the accuracy of language detection for unindexed files, but otherwise it should be a safe change.
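
As a rough illustration of the lazy-loading idea for file-system files, here is a minimal sketch. `LazyFile` and its methods (`size`, `contents`, `reads`) are hypothetical names made up for this example and are not taken from the actual change; the sketch only assumes that size checks can be answered from metadata and that the bytes are read once, on first use.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical wrapper around a file on disk whose contents are
/// only read the first time something actually asks for them.
pub struct LazyFile {
    path: PathBuf,
    /// Cached bytes; `None` until `contents()` is first called.
    contents: Option<Vec<u8>>,
    /// Number of times the file was read from disk (for illustration).
    reads: usize,
}

impl LazyFile {
    /// Record the path only; nothing is read from disk here.
    pub fn new(path: impl Into<PathBuf>) -> Self {
        Self {
            path: path.into(),
            contents: None,
            reads: 0,
        }
    }

    pub fn path(&self) -> &Path {
        &self.path
    }

    /// File size taken from metadata, without reading the contents.
    pub fn size(&self) -> io::Result<u64> {
        Ok(fs::metadata(&self.path)?.len())
    }

    /// Read the file on first call and serve the cached bytes afterwards.
    pub fn contents(&mut self) -> io::Result<&[u8]> {
        if self.contents.is_none() {
            let bytes = fs::read(&self.path)?;
            self.contents = Some(bytes);
            self.reads += 1;
        }
        Ok(self.contents.as_deref().unwrap_or_default())
    }

    /// How many disk reads have happened so far.
    pub fn reads(&self) -> usize {
        self.reads
    }
}
```

With a wrapper like this, anything that only needs the path or the size (for example, a decision to skip a file) never touches the contents. The trade-off matches the caveat above: any step that would normally inspect the bytes, such as content-based language detection, has less to work with for files that are never loaded.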

* Use accurate token counts (BloopAI#1024)
  * Use accurate token counts

    The `total` token count is now based upon an identical format conversation sent to the LLM during response generation. This results in counts that should be accurate, and prevent token limit errors entirely.
  * Match token counts precisely, and add baseline count
  * Calculate `total` as summation of other section token counts
  * add message to response if it ended when token limit exceeded, no translations

  Co-authored-by: anastasiia <anastasiya1155@gmail.com>
* Send full redirect target to cognito (BloopAI#1034)
* Update Helm chart (BloopAI#1033)
  * update helm chart
  * remove secret.yaml
  * put back secret.yaml
* mandatory Semantic in /search (BloopAI#1039)
* Upgrade qdrant version to 1.6 (BloopAI#1037)
  * Upgrade qdrant version
  * Update qdrant binaries
* Fix race in credentials polling (BloopAI#1042)

  Previously the assumption here was that this path is locked & safe when there is a furnished github cred in the system. However, when the user logs out, the `unwrap()` call can blow up the task, and this may cause issues.
* Enable paid features for desktop users (BloopAI#1038)
  * Add pro features to default builds
  * Check user's status through `User` object
  * Adapt webserver layer for paid feature gate
  * Just enforce schema right out of the gate
  * Fix date parsing logic
  * allow paid users sync branches
  * Fix clippy features
  * Disable branch switching for local repos
  * cloud implies pro
  * Update dockerfile

  Co-authored-by: anastasiia <anastasiya1155@gmail.com>
* Update flake (BloopAI#1044)
* Debug logs for initialisation (BloopAI#1036)
  * debug logs for initialisation
  * Scope logging
  * DB too
  * Clippy error

  Co-authored-by: rsdy <p@symmetree.dev>
* Allow indexing of large files (BloopAI#1040)
  * WIP
  * Support indexing large files, but lazy load them from local file system
* bump tantivy to v0.21 (BloopAI#1043)
  * bump tantivy to v0.22
  * address clippy
  * fix broken tests
* bump version to 0.5.6 (BloopAI#1047)

---------

Co-authored-by: calyptobai <111788964+calyptobai@users.noreply.github.com>
Co-authored-by: anastasiia <anastasiya1155@gmail.com>
Co-authored-by: rsdy <rsdy@users.noreply.github.com>
Co-authored-by: Gabriel Gordon-Hall <ggordonhall@gmail.com>
Co-authored-by: rsdy <p@symmetree.dev>
Co-authored-by: akshay <nerdy@peppe.rs>
Co-authored-by: Anastasiia Solop <35258279+anastasiya1155@users.noreply.github.com>
Co-authored-by: Ilya Zedgenizov <izedgenizov@saber.games>

rsdy added 2 commits October 11, 2023 15:25

WIP

229d46d

Support indexing large files, but lazy load them from local file system

1d3ec82

rsdy added the backend label Oct 11, 2023

rsdy requested review from ggordonhall and oppiliappan October 11, 2023 14:31

oppiliappan approved these changes Oct 13, 2023

rsdy merged commit d9d747b into main Oct 13, 2023
3 checks passed

rsdy deleted the rsdy/blo-1747-allow-indexing-of-large-files branch October 13, 2023 07:35
