Open the Conda terminal. (After Miniconda is installed, it appears in the Start menu.) Run the following commands in the Conda terminal:
git clone https://github.com/showlab/GUI-Thinker.git
cd GUI-Thinker
To create a Conda virtual environment and activate it, follow these steps:
Create a new Conda environment named guithinker with Python 3.11 installed:
conda create -n guithinker python=3.11
conda activate guithinker
Install the dependencies:
pip install -r requirements.txt
You can also refer to the files under the .log folder to manually install the corresponding modules.
We recommend running one or more of the following commands to set your API keys as environment variables. In Windows PowerShell (use the set command instead if you are in cmd), replace each placeholder with your own key:
$env:ANTHROPIC_API_KEY="sk-xxxxx"
$env:GEMINI_API_KEY="sk-xxxxx"
$env:OPENAI_API_KEY="sk-xxxxx"
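Before starting the agent, it can help to confirm which keys are actually visible to Python. The following is a minimal sketch, not part of the repository: the helper name configured_api_keys is hypothetical, and only the three variable names come from the commands above.

```python
import os

# Variable names taken from the commands above.
API_KEY_VARS = ("ANTHROPIC_API_KEY", "GEMINI_API_KEY", "OPENAI_API_KEY")


def configured_api_keys(env=None):
    """Return the names of the API key variables that are set and non-empty.

    Hypothetical helper for illustration; it is not provided by GUI-Thinker.
    """
    env = os.environ if env is None else env
    return [name for name in API_KEY_VARS if env.get(name, "").strip()]


if __name__ == "__main__":
    found = configured_api_keys()
    if found:
        print("API keys found for:", ", ".join(found))
    else:
        print("No API keys found; set at least one before starting the agent.")
```

Run it in the same terminal session in which the variables were set, since $env: assignments only apply to the current PowerShell session.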
We implement our GUI parser with the help of the Google Cloud Vision service. We recommend following this guide to save a local key file for identity verification.
gcloud auth activate-service-account --key-file KEY_FILE
$env:GOOGLE_APPLICATION_CREDENTIALS="PATH_TO_KEY_FILE"
(Optional) Set the path of KEY_FILE in agent/gui_parser/server.py#L18
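A quick way to verify the credentials setting is to check that the environment variable points to an existing file. This is a small sketch under that assumption; the function name credentials_status is hypothetical, and it only checks that the file exists, not that the key is valid.

```python
import os
from pathlib import Path


def credentials_status(env=None):
    """Report whether GOOGLE_APPLICATION_CREDENTIALS points to an existing file.

    Hypothetical helper for illustration; returns "unset", "missing file", or "ok".
    """
    env = os.environ if env is None else env
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    if not value:
        return "unset"
    return "ok" if Path(value).is_file() else "missing file"


if __name__ == "__main__":
    print("GOOGLE_APPLICATION_CREDENTIALS:", credentials_status())
```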
Start with your own query or one of the included queries in the data folder.
We implemented a backend and frontend system that separates screenshot capture from agent execution, enabling remote deployment of the agent via API calls. The frontend can run on Windows or other platforms (e.g., mobile devices).
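To illustrate the idea of separating screenshot capture from agent execution, here is a minimal sketch using only the Python standard library. It is not GUI-Thinker's actual API: the /step path, the JSON field names, AgentHandler, and send_step are all hypothetical, and the backend returns a placeholder action instead of running an agent.

```python
import base64
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class AgentHandler(BaseHTTPRequestHandler):
    """Hypothetical backend: receives a query plus screenshot, returns an action."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        screenshot = base64.b64decode(payload["screenshot_b64"])
        # Placeholder for agent execution; a real backend would plan an action here.
        reply = {
            "query": payload["query"],
            "screenshot_bytes": len(screenshot),
            "action": "noop",
        }
        body = json.dumps(reply).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def send_step(url, query, screenshot):
    """Hypothetical frontend call: post the screenshot captured on the device."""
    data = json.dumps({
        "query": query,
        "screenshot_b64": base64.b64encode(screenshot).decode("ascii"),
    }).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    # Local demo: bypass any system proxy settings.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return json.loads(response.read())


def demo():
    """Start the backend on a free local port and send one frontend request."""
    server = HTTPServer(("127.0.0.1", 0), AgentHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/step"
        return send_step(url, "Open the video", b"fake-png-bytes")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

Because the frontend only needs to capture a screenshot and issue an HTTP request, the same pattern lets the capture side run on a different machine or platform than the agent.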
For Windows:
.\shells\start_server.bat
You can track the status by checking the files under the .log folder. Every time you change the files under the agent folder, you need to restart the server.
For Windows:
.\shells\end_server.bat
.\shells\start_server.bat
Here, we provide a straightforward example demonstrating how to operate a YouTube video using Claude-3.5-Sonnet as the base model. To change the base model, edit the configuration file agent\config\basic.yaml.
Command:
python test_guithinker_custom.py --userquery "Open the video 'https://www.youtube.com/watch?v=uTuuz__8gUM', add to watch later and create a watch list 'work & jazz'." --projfile_path "" --software_name "Youtube"
Initial Screenshot:
(Milestone 1) Task 1: Add video to Watch Later
Subtask 1: Click on "More actions" button [1231, 936]
Subtask 2: Click on "Save" option in the menu that appears
Subtask 3: Click on "Watch Later" option in the save menu
(Milestone 2) Task 2: Create new playlist "work & jazz"
Subtask 1: Click on "More actions" button again if the menu closed
Output of Step-Check: <Pass>. Therefore, no change is made to the current step. (See the detailed design of Step-Check in the paper.)
Subtask 2: Click on "Save" option if the save menu is not open
Output of Step-Check: <Pass>. Therefore, no change is made to the current step.
Subtask 3: Click on "+ Create new playlist" option at the bottom of the save menu
Subtask 4: Type "work & jazz" in the playlist name field
Subtask 5: Click "Create" button to confirm the new playlist creation
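The plan above has a simple two-level shape: milestones, each holding an ordered list of subtasks, some of which carry a Step-Check result. The sketch below shows one way to represent and print such a plan. The Subtask and Milestone classes and the render function are hypothetical names chosen for illustration and may not match the project's internal data structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Subtask:
    description: str
    step_check: Optional[str] = None  # e.g. "<Pass>", as recorded in the trace above


@dataclass
class Milestone:
    title: str
    subtasks: List[Subtask] = field(default_factory=list)


# The plan from the example, transcribed by hand.
PLAN = [
    Milestone("Task 1: Add video to Watch Later", [
        Subtask('Click on "More actions" button'),
        Subtask('Click on "Save" option in the menu that appears'),
        Subtask('Click on "Watch Later" option in the save menu'),
    ]),
    Milestone('Task 2: Create new playlist "work & jazz"', [
        Subtask('Click on "More actions" button again if the menu closed', "<Pass>"),
        Subtask('Click on "Save" option if the save menu is not open', "<Pass>"),
        Subtask('Click on "+ Create new playlist" option at the bottom of the save menu'),
        Subtask('Type "work & jazz" in the playlist name field'),
        Subtask('Click "Create" button to confirm the new playlist creation'),
    ]),
]


def render(plan):
    """Return the plan as a list of printable lines."""
    lines = []
    for m_index, milestone in enumerate(plan, start=1):
        lines.append(f"(Milestone {m_index}) {milestone.title}")
        for s_index, sub in enumerate(milestone.subtasks, start=1):
            suffix = f" [Step-Check: {sub.step_check}]" if sub.step_check else ""
            lines.append(f"  Subtask {s_index}: {sub.description}{suffix}")
    return lines


if __name__ == "__main__":
    print("\n".join(render(PLAN)))
```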
python test_guithinker_demo.py
User Query: Select all text and apply numbered list for them. Use '1, 2, 3' symbol of numbered list.
Initial Screenshot:
Intermediate Screenshot:
Invoke the Region Search component in the Step-Check Module, which yields the following image:
Reducing the resolution and directing the agent's focus toward highly relevant regions will enhance its critique decisions.
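As a rough illustration of focusing on a relevant region, the sketch below computes a fixed-size crop window around a point of interest and clamps it to the screen. It is not the Region Search algorithm from the paper: the function name focus_region, the 1920x1080 screen size, and the 640x360 window are assumptions, and the point [1231, 936] is borrowed from the earlier YouTube example on the assumption that it is a screen coordinate.

```python
def focus_region(center, screen_size, region_size):
    """Return (left, top, right, bottom) of a window around `center`.

    The window has size `region_size` (shrunk if larger than the screen)
    and is shifted as needed so that it stays inside the screen bounds.
    """
    cx, cy = center
    width, height = screen_size
    region_w = min(region_size[0], width)
    region_h = min(region_size[1], height)
    left = min(max(cx - region_w // 2, 0), width - region_w)
    top = min(max(cy - region_h // 2, 0), height - region_h)
    return (left, top, left + region_w, top + region_h)


if __name__ == "__main__":
    box = focus_region((1231, 936), (1920, 1080), (640, 360))
    print("crop box:", box)
    full_pixels = 1920 * 1080
    crop_pixels = (box[2] - box[0]) * (box[3] - box[1])
    print(f"pixels passed on: {crop_pixels} of {full_pixels}")
```

The returned box can be passed to an image library's crop function; the cropped image contains far fewer pixels than the full screenshot while keeping the area around the target.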
Final Screenshot: