firecrawl-to-trieve

Demonstration of a Firecrawl-to-Trieve crawling-to-search pipeline.

Here is general approach:

get the results from Firecrawl
transform the results into chunks
load the chunks into Trieve
tentative: suggestions.py to pull suggested queries from Trieve—via /chunk/suggestions—and explore the retrieval results and the data (not discussed in the blog)

Setup

Setup your environment variables
Firecral API key
Trieve API key and dataset ID

cp .env.dist .env

Python

Setup your virtual environment

python3 -m venv .venv
source .venv/bin/activate

Install requirements

pip install -r requirements.txt

Freeze requirements

pip freeze > requirements.txt

Node

Install dependencies

yarn install

Running the scripts

Firecrawl

requires: FIRECRAWL_API_KEY in .env

Use Firecrawl to get the results of a crawl on the crawl_url, here: https://signoz.io/docs/.

Python in python/

python run_firecrawl.py

Node in node/

yarn crawl

This writes a json file (with a timestamp in the name) with the crawl results in a list. Key fields are the markdown itself, and then various metadata fields, including ogUrl, ogTitle, description, pageStatusCode, etc.

Example filename: crawl_results_2024-08-20T16-35-59.json

See the example: example_crawl_results_2024-08-20T16-35-59.json

Transform: Cleaning, Chunking, and Configuring

See cleaning scripts: python/cleaners.py and node/cleaners.js

Run the transform scripts:

In python/

python transform.py

Or in node/

yarn transform

Warning: While exploring the data to determine the chunking approach we noted it had a button click that toggles between contexts, so half the content so half of the content for the page is not in the markdown. We will just flag this for now, and we'll have to see if this issue appears elsewhere.

Loading

We can run it with -c to create chunks and -u to upsert chunks (update by tracking_id, ex. if you want to add chunks with a different split or revise your cleaning approach).

In python/

python load.py [-c | -u]

In node/

yarn load [-c | -u]

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
node		node
python		python
.env.dist		.env.dist
.gitignore		.gitignore
README.md		README.md
example_crawl_results_2024-08-20T16-35-59.json		example_crawl_results_2024-08-20T16-35-59.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

firecrawl-to-trieve

Setup

Python

Node

Running the scripts

Firecrawl

Transform: Cleaning, Chunking, and Configuring

Loading

About

Uh oh!

Releases

Packages

Uh oh!

Languages

devflowinc/firecrawl-to-trieve

Folders and files

Latest commit

History

Repository files navigation

firecrawl-to-trieve

Setup

Python

Node

Running the scripts

Firecrawl

Transform: Cleaning, Chunking, and Configuring

Loading

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages