
Get Started.md


1. Clone the Repository 📂

Open the Conda terminal. (After you install Miniconda, it appears in the Start menu.) Run the following commands in the Conda terminal:

git clone https://github.com/showlab/GUI-Thinker.git
cd GUI-Thinker

2. Env setup 🔨

To create a Conda virtual environment and activate it, follow these steps:

Create a new Conda environment named guithinker with Python 3.11 installed:

conda create -n guithinker python=3.11
conda activate guithinker

Install the dependencies:

pip install -r requirements.txt

You can also refer to the files under the .log folder to identify and manually install any missing modules.

3. Set API Key ✏️

We recommend setting one or more of the following API keys as environment variables. In Windows PowerShell (use the set command instead if you are in cmd):

$env:ANTHROPIC_API_KEY="sk-xxxxx"  # Replace with your own key
$env:GEMINI_API_KEY="sk-xxxxx"
$env:OPENAI_API_KEY="sk-xxxxx"
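Before starting the server, it can help to confirm which keys are actually visible to Python. The snippet below is a minimal sketch and not part of the repository: the helper name `find_missing_keys` is hypothetical, and it only checks the three variable names used in the commands above.

```python
import os

# Variable names taken from the commands above.
API_KEY_VARS = ("ANTHROPIC_API_KEY", "GEMINI_API_KEY", "OPENAI_API_KEY")


def find_missing_keys(env=None, names=API_KEY_VARS):
    """Return the names in `names` that are unset or blank in `env`.

    `env` defaults to the current process environment (hypothetical helper).
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = find_missing_keys()
    if len(missing) == len(API_KEY_VARS):
        print("No API key is set; set at least one before running the agent.")
    else:
        configured = [name for name in API_KEY_VARS if name not in missing]
        print("Configured:", ", ".join(configured))
```

Run it in the same terminal session in which you set the variables, since `$env:` assignments only apply to the current PowerShell session.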

4. Set Google Cloud Vision API 🔧

We implement our GUI parser with the help of the Google Cloud Vision service. We recommend following this guidance to save a local key file for identity verification.

gcloud auth activate-service-account --key-file KEY_FILE

$env:GOOGLE_APPLICATION_CREDENTIALS="PATH_TO_KEY_FILE"

(Optional) Set the path of KEY_FILE in agent/gui_parser/server.py#L18
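A quick sanity check of the key file can save a failed server start. The following is a sketch under stated assumptions: the function names are hypothetical, and the field list reflects what a Google Cloud service account key file normally contains (`type`, `client_email`, `private_key`). It does not contact Google or verify that the key is valid.

```python
import json
import os
from pathlib import Path

# Fields a service account key file normally contains (assumption for this check).
REQUIRED_FIELDS = ("type", "client_email", "private_key")


def validate_key_data(data):
    """Return a list of problems found in the parsed key file content."""
    if not isinstance(data, dict):
        return ["key file does not contain a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]
    if "type" in data and data["type"] != "service_account":
        problems.append(f"unexpected type: {data['type']!r}")
    return problems


def check_key_file(path):
    """Return a list of problems with the key file at `path` (empty if none)."""
    if not path:
        return ["GOOGLE_APPLICATION_CREDENTIALS is not set"]
    key_path = Path(path)
    if not key_path.is_file():
        return [f"key file not found: {key_path}"]
    try:
        data = json.loads(key_path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        return [f"key file is not readable JSON: {exc}"]
    return validate_key_data(data)


if __name__ == "__main__":
    issues = check_key_file(os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"))
    print("Key file looks plausible." if not issues else "\n".join(issues))
```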

5. Quick Start ⭐

Start with your own query or with one of the included queries in the data folder.

5.1 Start the server

We implemented a backend and frontend system that separates screenshot capture from agent execution, enabling remote deployment of the agent via API calls. The frontend can run on Windows or other platforms (e.g., mobile devices).

For Windows:

.\shells\start_server.bat

You can track the server status by checking the files under the .log folder. Every time you change files under the agent folder, you need to restart the server.
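To check the server status without opening the log files by hand, a small helper can print the end of the most recently modified one. This is a minimal sketch: the function name is hypothetical, and because the file names inside .log are not documented here, it picks the file by modification time.

```python
from pathlib import Path


def tail_latest_log(log_dir=".log", lines=20):
    """Return (path, last `lines` lines) for the newest file in `log_dir`.

    Returns (None, []) when the folder is missing or holds no files.
    """
    folder = Path(log_dir)
    if not folder.is_dir():
        return None, []
    files = [p for p in folder.iterdir() if p.is_file()]
    if not files:
        return None, []
    # Pick the file that was written to most recently.
    latest = max(files, key=lambda p: p.stat().st_mtime)
    text = latest.read_text(encoding="utf-8", errors="replace")
    tail = text.splitlines()[-lines:] if lines > 0 else []
    return latest, tail


if __name__ == "__main__":
    path, tail = tail_latest_log()
    if path is None:
        print("No log files found; has the server been started?")
    else:
        print(f"--- {path} ---")
        print("\n".join(tail))
```

Run it from the repository root so that the relative `.log` path resolves.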

Restart the server

For Windows:

.\shells\end_server.bat
.\shells\start_server.bat

🎈 5.2 Test with your own user query

Here, we provide a straightforward example demonstrating how to operate on a YouTube video using Claude-3.5-Sonnet as the base model. To change the base model, edit the configuration file agent\config\basic.yaml.

Command:

python test_guithinker_custom.py --userquery "Open the video https://www.youtube.com/watch?v=uTuuz__8gUM, add to watch later and create a watch list 'work & jazz'." --projfile_path "" --software_name "Youtube"
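Queries that contain quotes, ampersands, or URLs are easy to break with shell quoting. One way around this is to launch the script from Python with an argument list, so that no shell parsing is involved. The sketch below assumes only the three flags shown in the command above; the helper names `build_command` and `run_custom_query` are hypothetical.

```python
import subprocess
import sys


def build_command(user_query, software_name, projfile_path=""):
    """Build the argument list for test_guithinker_custom.py.

    Each value is passed as a separate list item, so it reaches the script
    unchanged regardless of the quotes or symbols it contains.
    """
    return [
        sys.executable,
        "test_guithinker_custom.py",
        "--userquery", user_query,
        "--projfile_path", projfile_path,
        "--software_name", software_name,
    ]


def run_custom_query(user_query, software_name, projfile_path=""):
    """Run the script from the repository root and raise if it exits with an error."""
    command = build_command(user_query, software_name, projfile_path)
    return subprocess.run(command, check=True)
```

For example, calling `run_custom_query("Open the video \"https://www.youtube.com/watch?v=uTuuz__8gUM\", add to watch later and create a watch list 'work & jazz'.", "Youtube")` from the repository root, with the server running, passes the query with its inner quotes intact.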

Initial Screenshot:

(Milestone 1) Task 1: Add video to Watch Later

Subtask 1: Click on "More actions" button [1231, 936]

Subtask 2: Click on "Save" option in the menu that appears

Subtask 3: Click on "Watch Later" option in the save menu

(Milestone 2) Task 2: Create new playlist "work & jazz"

Subtask 1: Click on "More actions" button again if the menu closed

Output of Step-Check: <Pass>. Therefore no change in current step. (See the detailed design of Step-Check in the paper.)

Subtask 2: Click on "Save" option if the save menu is not open

Output of Step-Check: <Pass>. Therefore no change in current step.

Subtask 3: Click on "+ Create new playlist" option at the bottom of the save menu

Subtask 4: Type "work & jazz" in the playlist name field

Subtask 5: Click "Create" button to confirm the new playlist creation

💻 5.3 Test with a simple demo case under the folder data:

python test_guithinker_demo.py

User Query: Select all text and apply numbered list for them. Use '1, 2, 3' symbol of numbered list.

Initial Screenshot:

Intermediate Screenshot:

Invoke the Region Search component in the Step-Check Module, which yields the following image:

Reducing the resolution and directing the agent's focus toward highly relevant regions will enhance its critique decisions.
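The geometry behind "focus on a region, then reduce the resolution" can be illustrated with plain arithmetic. The sketch below is not the repository's Region Search implementation: the function names, the region size, and the 1920x1080 screen size are all assumptions chosen for illustration. It computes a crop box centred on a point of interest, clamped to the screen, and a downscaled size that keeps the aspect ratio.

```python
def region_box(center, region_size, screen_size):
    """Return (left, top, right, bottom) of a region centred on `center`.

    The box is shifted as needed so that it stays inside the screen.
    """
    cx, cy = center
    sw, sh = screen_size
    # The region can never be larger than the screen itself.
    rw, rh = min(region_size[0], sw), min(region_size[1], sh)
    left = min(max(cx - rw // 2, 0), sw - rw)
    top = min(max(cy - rh // 2, 0), sh - rh)
    return left, top, left + rw, top + rh


def downscaled_size(size, max_side):
    """Scale `size` so its longer side is at most `max_side`, keeping aspect ratio."""
    width, height = size
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    # Point taken from the "More actions" click in section 5.2; screen size assumed.
    box = region_box((1231, 936), (400, 300), (1920, 1080))
    print(box)                                  # (1031, 780, 1431, 1080)
    print(downscaled_size((1920, 1080), 960))   # (960, 540)
```

The resulting box could then be passed to an image library's crop function, and the downscaled size to its resize function.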

Final Screenshot: