Commit 719521c

Document Processor v2 (#442)
* wip: init refactor of document processor to JS
* add NodeJs PDF support
* wip: partity with python processor
  feat: add pptx support
* fix: forgot files
* Remove python scripts totally
* wip:update docker to boot new collector
* add package.json support
* update dockerfile for new build
* update gitignore and linting
* add more protections on file lookup
* update package.json
* test build
* update docker commands to use cap-add=SYS_ADMIN so web scraper can run
  update all scripts to reflect this
  remove docker build for branch
1 parent 5f6a013 commit 719521c

69 files changed: +3682 -1925 lines changed

README.md

+3 -9
@@ -74,10 +74,10 @@ Some cool features of AnythingLLM
 
 ### Technical Overview
 This monorepo consists of three main sections:
-- `collector`: Python tools that enable you to quickly convert online resources or local documents into LLM useable format.
 - `frontend`: A viteJS + React frontend that you can run to easily create and manage all your content the LLM can use.
-- `server`: A nodeJS + express server to handle all the interactions and do all the vectorDB management and LLM interactions.
+- `server`: A NodeJS express server to handle all the interactions and do all the vectorDB management and LLM interactions.
 - `docker`: Docker instructions and build process + information for building from source.
+- `collector`: NodeJS express server that process and parses documents from the UI.
 
 ### Minimum Requirements
 > [!TIP]
@@ -86,7 +86,6 @@ This monorepo consists of three main sections:
 > you will be storing (documents, vectors, models, etc). Minimum 10GB recommended.
 
 - `yarn` and `node` on your machine
-- `python` 3.9+ for running scripts in `collector/`.
 - access to an LLM running locally or remotely.
 
 *AnythingLLM by default uses a built-in vector database powered by [LanceDB](https://github.com/lancedb/lancedb)
@@ -112,6 +111,7 @@ export STORAGE_LOCATION="/var/lib/anythingllm" && \
 mkdir -p $STORAGE_LOCATION && \
 touch "$STORAGE_LOCATION/.env" && \
 docker run -d -p 3001:3001 \
+--cap-add SYS_ADMIN \
 -v ${STORAGE_LOCATION}:/app/server/storage \
 -v ${STORAGE_LOCATION}/.env:/app/server/.env \
 -e STORAGE_DIR="/app/server/storage" \
@@ -141,12 +141,6 @@ To boot the frontend locally (run commands from root of repo):
 
 [Learn about vector caching](./server/storage/vector-cache/VECTOR_CACHE.md)
 
-## Standalone scripts
-
-This repo contains standalone scripts you can run to collect data from a Youtube Channel, Medium articles, local text files, word documents, and the list goes on. This is where you will use the `collector/` part of the repo.
-
-[Go set up and run collector scripts](./collector/README.md)
-
 ## Contributing
 - create issue
 - create PR with branch name format of `<issue number>-<short name>`

cloud-deployments/aws/cloudformation/DEPLOY.md

+2 -3
@@ -1,6 +1,6 @@
 # How to deploy a private AnythingLLM instance on AWS
 
-With an AWS account you can easily deploy a private AnythingLLM instance on AWS. This will create a url that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however if you want your instance to be protected it is highly recommend that you set the `AUTH_TOKEN` and `JWT_SECRET` variables in the `docker/` ENV.
+With an AWS account you can easily deploy a private AnythingLLM instance on AWS. This will create a url that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however if you want your instance to be protected it is highly recommend that you set a password one setup is complete.
 
 **Quick Launch (EASY)**
 1. Log in to your AWS account
@@ -30,12 +30,11 @@ The output of this cloudformation stack will be:
 
 **Requirements**
 - An AWS account with billing information.
-- AnythingLLM (GUI + document processor) must use a t2.small minimum and 10Gib SSD hard disk volume
 
 ## Please read this notice before submitting issues about your deployment
 
 **Note:**
-Your instance will not be available instantly. Depending on the instance size you launched with it can take varying amounts of time to fully boot up.
+Your instance will not be available instantly. Depending on the instance size you launched with it can take 5-10 minutes to fully boot up.
 
 If you want to check the instance's progress, navigate to [your deployed EC2 instances](https://us-west-1.console.aws.amazon.com/ec2/home) and connect to your instance via SSH in browser.

cloud-deployments/aws/cloudformation/cloudformation_create_anythingllm.json

+1 -1
@@ -89,7 +89,7 @@
 "touch /home/ec2-user/anythingllm/.env\n",
 "sudo chown ec2-user:ec2-user -R /home/ec2-user/anythingllm\n",
 "docker pull mintplexlabs/anythingllm:master\n",
-"docker run -d -p 3001:3001 -v /home/ec2-user/anythingllm:/app/server/storage -v /home/ec2-user/anythingllm/.env:/app/server/.env -e STORAGE_DIR=\"/app/server/storage\" mintplexlabs/anythingllm:master\n",
+"docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v /home/ec2-user/anythingllm:/app/server/storage -v /home/ec2-user/anythingllm/.env:/app/server/.env -e STORAGE_DIR=\"/app/server/storage\" mintplexlabs/anythingllm:master\n",
 "echo \"Container ID: $(sudo docker ps --latest --quiet)\"\n",
 "export ONLINE=$(curl -Is http://localhost:3001/api/ping | head -n 1|cut -d$' ' -f2)\n",
 "echo \"Health check: $ONLINE\"\n",
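
Reviewer note: the health check in this user-data script takes the status code from the first line of the HTTP response headers. Below is a minimal sketch of that parsing step, split into a helper so it can be tried without a running container. The function name `status_from_headers` is hypothetical and does not exist in the repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): print the HTTP status code
# found in raw response headers read from stdin.
status_from_headers() {
  # The first header line looks like "HTTP/1.1 200 OK"; field 2 is the code.
  head -n 1 | cut -d ' ' -f 2
}

# Against a live instance it would be fed by curl, mirroring the deploy script:
#   curl -Is http://localhost:3001/api/ping | status_from_headers
```

A value of `200` suggests the server is answering on `/api/ping`. Empty output usually means curl could not connect yet.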

cloud-deployments/digitalocean/terraform/DEPLOY.md

+2 -6
@@ -1,8 +1,6 @@
 # How to deploy a private AnythingLLM instance on DigitalOcean using Terraform
 
-With a DigitalOcean account, you can easily deploy a private AnythingLLM instance using Terraform. This will create a URL that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys, and they will not be exposed. However, if you want your instance to be protected, it is highly recommended that you set the `AUTH_TOKEN` and `JWT_SECRET` variables in the `docker/` ENV.
-
-[Refer to .env.example](../../../docker/HOW_TO_USE_DOCKER.md) for data format.
+With a DigitalOcean account, you can easily deploy a private AnythingLLM instance using Terraform. This will create a URL that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys, and they will not be exposed. However, if you want your instance to be protected, it is highly recommended that you set a password one setup is complete.
 
 The output of this Terraform configuration will be:
 - 1 DigitalOcean Droplet
@@ -12,8 +10,6 @@ The output of this Terraform configuration will be:
 - An DigitalOcean account with billing information
 - Terraform installed on your local machine
 - Follow the instructions in the [official Terraform documentation](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli) for your operating system.
-- `.env` file that is filled out with your settings and set up in the `docker/` folder
-
 
 ## How to deploy on DigitalOcean
 Open your terminal and navigate to the `digitalocean/terraform` folder
@@ -36,7 +32,7 @@ terraform destroy
 ## Please read this notice before submitting issues about your deployment
 
 **Note:**
-Your instance will not be available instantly. Depending on the instance size you launched with it can take anywhere from 10-20 minutes to fully boot up.
+Your instance will not be available instantly. Depending on the instance size you launched with it can take anywhere from 5-10 minutes to fully boot up.
 
 If you want to check the instances progress, navigate to [your deployed instances](https://cloud.digitalocean.com/droplets) and connect to your instance via SSH in browser.

cloud-deployments/digitalocean/terraform/user_data.tp1

+1 -1
@@ -12,7 +12,7 @@ mkdir -p /home/anythingllm
 touch /home/anythingllm/.env
 
 sudo docker pull mintplexlabs/anythingllm:master
-sudo docker run -d -p 3001:3001 -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm:master
+sudo docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm:master
 echo "Container ID: $(sudo docker ps --latest --quiet)"
 
 export ONLINE=$(curl -Is http://localhost:3001/api/ping | head -n 1|cut -d$' ' -f2)

cloud-deployments/gcp/deployment/DEPLOY.md

+4 -11
@@ -1,8 +1,6 @@
 # How to deploy a private AnythingLLM instance on GCP
 
-With a GCP account you can easily deploy a private AnythingLLM instance on GCP. This will create a url that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however if you want your instance to be protected it is highly recommend that you set the `AUTH_TOKEN` and `JWT_SECRET` variables in the `docker/` ENV.
-
-[Refer to .env.example](../../../docker/HOW_TO_USE_DOCKER.md) for data format.
+With a GCP account you can easily deploy a private AnythingLLM instance on GCP. This will create a url that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however if you want your instance to be protected it is highly recommend that you set a password one setup is complete.
 
 The output of this cloudformation stack will be:
 - 1 GCP VM
@@ -11,19 +9,15 @@ The output of this cloudformation stack will be:
 
 **Requirements**
 - An GCP account with billing information.
-- AnythingLLM (GUI + document processor) must use a n1-standard-1 minimum and 10Gib SSD hard disk volume
-- `.env` file that is filled out with your settings and set up in the `docker/` folder
 
 ## How to deploy on GCP
 Open your terminal
-1. Generate your specific cloudformation document by running `yarn generate:gcp_deployment` from the project root directory.
-2. This will create a new file (`gcp_deploy_anything_llm_with_env.yaml`) in the `gcp/deployment` folder.
-3. Log in to your GCP account using the following command:
+1. Log in to your GCP account using the following command:
 ```
 gcloud auth login
 ```
 
-4. After successful login, Run the following command to create a deployment using the Deployment Manager CLI:
+2. After successful login, Run the following command to create a deployment using the Deployment Manager CLI:
 
 ```

@@ -57,5 +51,4 @@ If you want to check the instances progress, navigate to [your deployed instance
 
 Once connected run `sudo tail -f /var/log/cloud-init-output.log` and wait for the file to conclude deployment of the docker image.
 
-
-Additionally, your use of this deployment process means you are responsible for any costs of these GCP resources fully.
+Additionally, your use of this deployment process means you are responsible for any costs of these GCP resources fully.

cloud-deployments/gcp/deployment/gcp_deploy_anything_llm.yaml

+1 -1
@@ -34,7 +34,7 @@ resources:
 touch /home/anythingllm/.env
 
 sudo docker pull mintplexlabs/anythingllm:master
-sudo docker run -d -p 3001:3001 -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm:master
+sudo docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm:master
 echo "Container ID: $(sudo docker ps --latest --quiet)"
 
 export ONLINE=$(curl -Is http://localhost:3001/api/ping | head -n 1|cut -d$' ' -f2)

cloud-deployments/gcp/deployment/generate.mjs

-61
This file was deleted.

collector/.env.example

-1
This file was deleted.

collector/.gitignore

+4 -6
@@ -1,8 +1,6 @@
-outputs/*/*.json
 hotdir/*
-hotdir/processed/*
-hotdir/failed/*
 !hotdir/__HOTDIR__.md
-!hotdir/processed
-!hotdir/failed
-
+yarn-error.log
+!yarn.lock
+outputs
+scripts

collector/.nvmrc

+1
@@ -0,0 +1 @@
+v18.13.0

collector/README.md

-62
This file was deleted.

collector/api.py

-32
This file was deleted.

collector/hotdir/__HOTDIR__.md

+1 -15
@@ -1,17 +1,3 @@
 ### What is the "Hot directory"
 
-This is the location where you can dump all supported file types and have them automatically converted and prepared to be digested by the vectorizing service and selected from the AnythingLLM frontend.
-
-Files dropped in here will only be processed when you are running `python watch.py` from the `collector` directory.
-
-Once converted the original file will be moved to the `hotdir/processed` folder so that the original document is still able to be linked to when referenced when attached as a source document during chatting.
-
-**Supported File types**
-- `.md`
-- `.txt`
-- `.pdf`
-
-__requires more development__
-- `.png .jpg etc`
-- `.mp3`
-- `.mp4`
+This is a pre-set file location that documents will be written to when uploaded by AnythingLLM. There is really no need to touch it.
