Combine and summarize multiple webpages into a single short article.
This script extracts the main text content from a set of web pages and combines them into a single article using ChatGPT's API. There are two ways to provide the desired pages:
- Manually enter the page URLs
- Enter a search query and the script will automatically grab the top 10 Google search results.
This script extracts the main content of web pages using a customized fork of the jusText project.
The customized jusText fork currently supports the following languages:
- English (EN)
- Persian (FA)
To add support for a new language, follow these steps:
- Create a new file named new_lang.txt in the /src/justext/stoplists/ directory, where new_lang is the lower-case two-letter language code (ISO 639-1) for the new language, for example fr.txt for French.
- Add your desired stop-words to the new_lang.txt file. Stop-words are words that are commonly used in a language but typically do not carry much meaning (such as "the", "and", or "in").
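The two steps above could be scripted roughly as follows. This is only a sketch: the helper name write_stoplist is hypothetical and not part of this project, and it assumes the fork reads stoplists as UTF-8 text with one word per line, which you should confirm against the existing files in the stoplists directory.

```python
from pathlib import Path


def write_stoplist(stoplists_dir, lang_code, words):
    """Write stop-words to <stoplists_dir>/<lang_code>.txt, one per line.

    Hypothetical helper; the one-word-per-line UTF-8 layout is an assumption.
    """
    if len(lang_code) != 2 or not lang_code.isalpha():
        raise ValueError("expected a two-letter language code such as 'fr'")

    # Trim whitespace, lower-case, drop empty entries and duplicates.
    unique_words = sorted({w.strip().lower() for w in words if w.strip()})

    path = Path(stoplists_dir) / f"{lang_code.lower()}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(unique_words) + "\n", encoding="utf-8")
    return path
```

For example, write_stoplist("src/justext/stoplists", "fr", ["le", "la", "et"]) would create src/justext/stoplists/fr.txt.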
The configuration file for this project is located at src/configs.yml. The following options can be configured in this file:
- openaiapikey (Required): Your OpenAI API key. This is required to use the ChatGPT API.
- exclude: A list of websites or domains to exclude from Google search results. By default, this list contains only Reddit.com.
- lang: The default language to use for text extraction. This should be a two-letter language code (such as en for English or fr for French). By default, the language is set to en.
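To illustrate how these options might be applied once the file has been loaded into a dictionary, here is a minimal sketch. The function names apply_defaults, is_excluded, and filter_results are hypothetical, and the domain matching rule (exact host or any subdomain) is an assumption; the script's actual behavior may differ.

```python
from urllib.parse import urlparse


def apply_defaults(config):
    """Return a copy of config with the documented defaults filled in."""
    if not config.get("openaiapikey"):
        raise ValueError("openaiapikey is required")
    # Defaults described above: exclude Reddit.com, extract English text.
    merged = {"exclude": ["reddit.com"], "lang": "en"}
    merged.update({k: v for k, v in config.items() if v is not None})
    return merged


def is_excluded(url, exclude):
    """True if the URL's host equals, or is a subdomain of, an excluded domain."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in exclude
    )


def filter_results(urls, exclude):
    """Drop search results that point at excluded domains."""
    return [u for u in urls if not is_excluded(u, exclude)]
```

Matching on the parsed hostname rather than on a substring of the URL avoids dropping unrelated sites whose names merely contain an excluded domain.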
To install and run this project, you'll need to have Python 3.x installed on your computer. You can download the latest version of Python from the official website: www.python.org/downloads.
Clone this repository to your local machine using the following command:
git clone https://github.com/srezasm/articlegen
It's recommended to set up a virtual environment for running this project. This will ensure that the project dependencies don't interfere with other Python installations on your system. To set up a virtual environment, navigate to the project directory and run the following commands:
python3 -m venv env
source env/bin/activate
This will create a virtual environment named env and activate it. If you're on Windows, you can activate the virtual environment using the following command instead:
env\Scripts\activate
To install the project dependencies, navigate to the project directory and run the following command:
pip install -r requirements.txt
This will install all the required packages listed in the requirements.txt file.
Once the dependencies are installed, you can run the project using the following command:
python3 ./src/main.py
- Terminal argument support
- Option to lengthen the generated article