Web Crawl is a web application that allows you to scrape web pages and use Retrieval-Augmented Generation (RAG) to answer questions based on the scraped content. The application is built with Streamlit and utilizes OpenAI's language models for text generation.
- Scrape web pages in parallel
- Store scraped content in a knowledge base
- Perform similarity search on the stored content
- Use OpenAI's language models to answer questions based on the scraped content
- Configuration options for chunk size, overlap, and crawling depth
- View and manage scraped JSON data and FAISS vector stores
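As a rough illustration of the parallel-scraping feature listed above, the sketch below fetches several pages concurrently and extracts their text. It is only a sketch under assumed libraries (`requests`, `BeautifulSoup`, and a thread pool); the actual `crawler.py` may be implemented differently and also handles crawling depth.

```python
# Minimal sketch of parallel page scraping (illustrative only; the real
# crawler.py may use different libraries and also handle depth, retries, etc.).
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup


def fetch_text(url: str) -> str:
    """Download one page and return its visible text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.get_text(separator=" ", strip=True)


def scrape_parallel(urls: list[str], max_workers: int = 8) -> dict[str, str]:
    """Fetch several URLs concurrently and map each URL to its extracted text."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = pool.map(fetch_text, urls)
    return dict(zip(urls, texts))
```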
- Clone the repository: `git clone https://github.com/mhadeli/web-crawler.git`, then `cd deep-crawl`
- Create a virtual environment and install the dependencies: `python -m venv env`, `source env/bin/activate` (on Windows, use `env\Scripts\activate`), then `pip install -r requirements.txt`
- Set up the configuration by editing the `settings.json` file or using the settings page in the Streamlit app.
- Run the Streamlit app: `streamlit run chat.py`
- Enter your OpenAI API key in the sidebar.
- Enter the URLs to scrape in the sidebar and click "Scrape and Add to Knowledge Base".
- Ask questions about the scraped content in the chat interface (the retrieval-and-answer flow behind this step is sketched below).
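Answering a question follows the usual RAG pattern: embed the question, retrieve the `top_k` most similar chunks from the FAISS index, and pass them to the OpenAI chat model as context. The sketch below assumes the `openai` Python client (v1+) and the `faiss` library; the helper names and prompt are illustrative, not the exact code in `chat.py`.

```python
# Illustrative RAG query flow: retrieve similar chunks, then ask the model.
import numpy as np
import faiss
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts into float32 vectors for FAISS."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data], dtype="float32")


def answer(question: str, index: faiss.Index, chunks: list[str], top_k: int = 4) -> str:
    """Retrieve the top_k most similar chunks and answer from that context."""
    _, ids = index.search(embed([question]), top_k)
    context = "\n\n".join(chunks[i] for i in ids[0])
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```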
The configuration options can be set in `settings.json` or through the Streamlit settings page:
- `model`: The OpenAI model to use (e.g., "gpt-3.5-turbo", "gpt-4o")
- `top_k`: The number of similar documents to retrieve
- `chunk_size`: The size of text chunks for processing
- `chunk_overlap`: The overlap between consecutive text chunks
- `min_content_length`: The minimum length of HTML content to consider
- `max_depth`: The maximum crawling depth
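For reference, a `settings.json` using these keys might look like the example below; the values are illustrative, not necessarily the repository's defaults.

```json
{
  "model": "gpt-4o",
  "top_k": 4,
  "chunk_size": 1000,
  "chunk_overlap": 200,
  "min_content_length": 100,
  "max_depth": 2
}
```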
- `chat.py`: Main Streamlit app script
- `crawler.py`: Script for scraping web pages
- `settings.py`: Script for configuring the settings
- `knowledge_base.py`: Script for managing the knowledge base
- `settings.json`: JSON file for storing configuration settings
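As a companion to the query sketch above, the indexing side that `knowledge_base.py` is responsible for could be approximated as follows. This is an illustration only: it reuses the hypothetical `embed` helper from the earlier sketch and a simple fixed-size chunker, and the real chunking and FAISS handling may differ.

```python
# Illustrative chunking and FAISS index construction (not the actual knowledge_base.py).
import faiss


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into overlapping fixed-size chunks."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def build_index(pages: dict[str, str]) -> tuple[faiss.IndexFlatL2, list[str]]:
    """Chunk every scraped page, embed the chunks, and add them to a FAISS index."""
    chunks = [chunk for text in pages.values() for chunk in chunk_text(text)]
    vectors = embed(chunks)  # hypothetical embed() helper from the query sketch above
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)
    return index, chunks
```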
This project is licensed under the MIT License.
Contributions are welcome! Please open an issue or submit a pull request.