Add support for scheduled task from CLI in studio #1243

Merged
amritghimire merged 2 commits into main from amrit/create-job-api on Jul 16, 2025

Conversation

amritghimire
Contributor

@amritghimire amritghimire commented Jul 16, 2025

This adds two arguments, --start-time and --cron, to datachain job run.
These two arguments are used to schedule a task in Studio.

Related to https://github.com/iterative/studio/pull/11886

Summary by Sourcery

Add scheduling support to the CLI by introducing start-time and cron options for one-off and recurring tasks, parsing flexible date/time expressions with dateparser, and forwarding scheduling parameters to the Studio API.

New Features:

  • Add --start-time and --cron options to datachain job run to schedule one-off and recurring tasks

Enhancements:

  • Parse natural language and standard date/time strings using dateparser and convert them to ISO format
  • Set start_after to the current time when only a cron expression is provided
  • Print a scheduling confirmation and exit without streaming logs when a task is scheduled

Build:

  • Add dateparser and types-dateparser dependencies for parsing date/time expressions

Documentation:

  • Update job run documentation with new scheduling flags, usage examples, and behavioral notes

Tests:

  • Add unit tests for parse_start_time covering various formats and error conditions
  • Update CLI integration tests to verify start_after and cron_expression are included in API requests

cloudflare-workers-and-pages bot commented Jul 16, 2025

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: dbd0732
Status: ✅  Deploy successful!
Preview URL: https://10d09159.datachain-documentation.pages.dev
Branch Preview URL: https://amrit-create-job-api.datachain-documentation.pages.dev

@amritghimire amritghimire self-assigned this Jul 16, 2025
Contributor

sourcery-ai bot commented Jul 16, 2025

Reviewer's Guide

This PR integrates scheduling support into the Studio CLI by adding --start-time and --cron flags, implementing natural‐language datetime parsing, updating the job creation flow to include scheduling parameters, and extending documentation and tests accordingly.

Sequence diagram for job scheduling with --start-time and --cron

sequenceDiagram
    actor User
    participant CLI as CLI Parser
    participant Studio as Studio Logic
    participant Remote as Remote Studio API
    User->>CLI: datachain job run --start-time/--cron ...
    CLI->>Studio: process_jobs_args(args)
    Studio->>Studio: parse_start_time(args.start_time)
    Studio->>Remote: create_job(..., start_time, cron)
    Remote->>Remote: Prepare job data with start_after/cron_expression
    Remote-->>Studio: Response (job scheduled)
    Studio-->>CLI: Print scheduling confirmation
    CLI-->>User: Output: Job scheduled as a task

Class diagram for updated job creation flow

classDiagram
    class StudioClient {
        +create_job(query, query_type, ... , start_time, cron)
    }
    class process_jobs_args {
        +process_jobs_args(args)
    }
    class create_job {
        +create_job(query_file, ..., start_time, cron)
    }
    class parse_start_time {
        +parse_start_time(start_time_str)
    }
    StudioClient <|-- process_jobs_args
    process_jobs_args o-- create_job
    create_job o-- parse_start_time
    StudioClient <.. create_job : calls create_job()
    create_job <.. process_jobs_args : called by
    parse_start_time <.. create_job : called by

Class diagram for Remote StudioClient changes

classDiagram
    class StudioClient {
        +create_job(query, query_type, ..., start_time, cron)
    }
    class Response {
        +ok
        +message
        +data
    }
    StudioClient --> Response : returns

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| CLI integration of scheduling flags | Added --start-time and --cron arguments to the job run parser; forwarded new flags through process_jobs_args to create_job; updated function signatures and dispatch to include start_time and cron | src/datachain/cli/parser/job.py, src/datachain/studio.py |
| Natural-language datetime parsing utility | Introduced parse_start_time to parse various formats via dateparser; handled invalid inputs with DataChainError; converted parsed datetimes to ISO strings | src/datachain/studio.py, tests/func/test_studio_datetime_parsing.py |
| Enhanced create_job logic for scheduled tasks | Integrated parsed start_time and cron into job payload; defaulted start_after to now when cron is provided without start_time; printed a scheduling confirmation and returned exit code 0 | src/datachain/studio.py |
| Remote API mapping for scheduling parameters | Mapped start_time to start_after and cron to cron_expression in payload | src/datachain/remote/studio.py |
| Documentation updates for scheduling support | Added descriptions of --start-time and --cron flags; provided usage examples for one-off, recurring, and delayed cron jobs; clarified notes on scheduling behavior and format support | docs/commands/job/run.md |
| Dependency additions | Added dateparser and its typing stub to project dependencies | pyproject.toml |
| Extended CLI tests for scheduling | Added test_studio_run_task to verify start_after and cron_expression in requests | tests/test_cli_studio.py |

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @amritghimire - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/datachain/studio.py:269` </location>
<code_context>
+def parse_start_time(start_time_str: Optional[str]) -> Optional[str]:
</code_context>

<issue_to_address>
Consider returning a timezone-aware ISO string for consistency.

If the input lacks timezone info, consider defaulting to UTC to avoid ambiguity in downstream processing.
</issue_to_address>

### Comment 2
<location> `src/datachain/studio.py:364` </location>
<code_context>

+    # Parse start_time if provided
+    parsed_start_time = parse_start_time(start_time)
+    if cron and parsed_start_time is None:
+        parsed_start_time = datetime.now(timezone.utc).isoformat()
+
     response = client.create_job(
</code_context>

<issue_to_address>
Defaulting start_time to now when cron is set may be surprising.

This behavior could cause jobs to start immediately without user intent. Consider requiring explicit start_time input or documenting this default clearly.
</issue_to_address>

### Comment 3
<location> `src/datachain/remote/studio.py:435` </location>
<code_context>
+        start_time: Optional[str] = None,
+        cron: Optional[str] = None,
     ) -> Response[JobData]:
         data = {
             "query": query,
</code_context>

<issue_to_address>
Including None values in the payload may cause issues with some APIs.

Omit start_after and cron_expression from the payload when their values are None to prevent potential API compatibility issues.
</issue_to_address>
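
One way to address Comment 3 is to add the scheduling keys only when they are set. The sketch below is a minimal illustration, not the code in src/datachain/remote/studio.py: `build_job_payload` is a hypothetical helper, and only the `start_after` and `cron_expression` field names come from this PR.

```python
from typing import Any, Dict, Optional


def build_job_payload(
    query: str,
    start_time: Optional[str] = None,
    cron: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build a request body without unset scheduling fields."""
    data: Dict[str, Any] = {"query": query}
    # Mapping described in this PR: start_time -> start_after, cron -> cron_expression.
    optional_fields = {"start_after": start_time, "cron_expression": cron}
    # Keep only the keys whose value was actually provided.
    data.update({key: value for key, value in optional_fields.items() if value is not None})
    return data
```

With this shape, a plain `datachain job run` request would carry no scheduling keys at all, while a cron-only request would carry just `cron_expression` plus whatever start value the caller resolved.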

### Comment 4
<location> `tests/test_cli_studio.py:448` </location>
<code_context>
+def test_studio_run_task(capsys, mocker, tmp_dir, studio_token):
</code_context>

<issue_to_address>
Missing test for edge cases: invalid or missing --start-time and --cron combinations.

Please add tests for cases where only --start-time, only --cron, neither, or invalid values are provided to ensure correct CLI behavior in all scenarios.
</issue_to_address>

Comment on lines +269 to +278
def parse_start_time(start_time_str: Optional[str]) -> Optional[str]:
    if not start_time_str:
        return None

    try:
        # Parse the datetime string using dateparser
        parsed_datetime = dateparser.parse(start_time_str)

        if parsed_datetime is None:
            raise DataChainError(
Contributor

suggestion: Consider returning a timezone-aware ISO string for consistency.

If the input lacks timezone info, consider defaulting to UTC to avoid ambiguity in downstream processing.
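
The suggestion above could look roughly like the following. This is a sketch using only the standard library; `to_utc_iso` is a hypothetical name, and treating a naive datetime as UTC is the assumption proposed in the comment, not confirmed behavior of the merged code.

```python
from datetime import datetime, timezone


def to_utc_iso(parsed: datetime) -> str:
    """Return an ISO 8601 string that always carries an explicit UTC offset."""
    if parsed.tzinfo is None:
        # Assumption: a datetime without timezone info is interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Aware datetimes are converted to UTC so every result has the same offset.
    return parsed.astimezone(timezone.utc).isoformat()
```

For example, `to_utc_iso(datetime(2024, 1, 15, 14, 30))` yields `2024-01-15T14:30:00+00:00`, and an input with a `+02:00` offset is shifted to the equivalent UTC time.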

Comment on lines +364 to +365
    if cron and parsed_start_time is None:
        parsed_start_time = datetime.now(timezone.utc).isoformat()
Contributor

question: Defaulting start_time to now when cron is set may be surprising.

This behavior could cause jobs to start immediately without user intent. Consider requiring explicit start_time input or documenting this default clearly.
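
To make the default under discussion easier to reason about, the two lines from the diff can be isolated in a small function. `resolve_start_after` is a hypothetical name used only for this sketch; the logic mirrors the snippet above.

```python
from datetime import datetime, timezone
from typing import Optional


def resolve_start_after(parsed_start_time: Optional[str], cron: Optional[str]) -> Optional[str]:
    """Hypothetical helper mirroring the default shown in the diff."""
    if cron and parsed_start_time is None:
        # Only a cron expression was given: start_after falls back to the current UTC time.
        return datetime.now(timezone.utc).isoformat()
    # An explicit start time wins; with neither option set, nothing is scheduled.
    return parsed_start_time
```

Isolating the rule this way also makes the three cases (cron only, explicit start time, neither) straightforward to cover in unit tests.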

Member

@0x2b3bfa0 0x2b3bfa0 left a comment

Thanks!

Comment on lines +59 to +69
"2024-01-15 14:30:00",
"2024-01-15T14:30:00Z",
"2024-01-15T14:30:00+00:00",
"Jan 15, 2024 2:30 PM",
"15/01/2024 14:30",
"2024-01-15",
"tomorrow",
"next week",
"in 2 hours",
"monday 9am",
"tomorrow 3pm",
Member

Not sure how useful tests for dateparser functionality are. I'm inclined to think we shouldn't test functionality already provided by a third-party library with a solid test suite.

(I guess this test may have been generated by a language model)

@amritghimire amritghimire force-pushed the amrit/create-job-api branch from 145a2af to 2668b30 Compare July 16, 2025 05:49
codecov bot commented Jul 16, 2025

Codecov Report

Attention: Patch coverage is 90.90909% with 2 lines in your changes missing coverage. Please review.

Project coverage is 88.71%. Comparing base (eb6253d) to head (dbd0732).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/datachain/studio.py | 88.88% | 1 Missing and 1 partial ⚠️ |

@@            Coverage Diff             @@
##             main    #1243      +/-   ##
==========================================
- Coverage   88.71%   88.71%   -0.01%     
==========================================
  Files         153      153              
  Lines       13820    13841      +21     
  Branches     1932     1936       +4     
==========================================
+ Hits        12261    12279      +18     
- Misses       1104     1105       +1     
- Partials      455      457       +2     
| Flag | Coverage Δ |
| --- | --- |
| datachain | 88.64% <90.90%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/datachain/cli/parser/job.py | 100.00% <100.00%> (ø) |
| src/datachain/remote/studio.py | 80.80% <ø> (ø) |
| src/datachain/studio.py | 68.21% <88.88%> (+1.55%) ⬆️ |

... and 2 files with indirect coverage changes

@amritghimire amritghimire merged commit 5a5bc3c into main Jul 16, 2025
35 checks passed
@amritghimire amritghimire deleted the amrit/create-job-api branch July 16, 2025 07:04
usage: datachain job run [-h] [-v] [-q] [--team TEAM] [--env-file ENV_FILE] [--env ENV [ENV ...]]
[--workers WORKERS] [--files FILES [FILES ...]] [--python-version PYTHON_VERSION]
[--req-file REQ_FILE] [--req REQ [REQ ...]]
usage: datachain job run [-h] [-v] [-q] [--team TEAM] [--env-file ENV_FILE] [--env ENV [ENV ...]] [--cluster CLUSTER] [--workers WORKERS]
Member

this has become unreadable

Screen.Recording.2025-07-16.at.9.59.18.AM.mov

* `--req-file REQ_FILE` - Python requirements file
* `--req REQ` - Python package requirements
* `--priority PRIORITY` - Priority for the job in range 0-5. Lower value is higher priority (default: 5)
* `--repository URL` - Repository URL to clone before running the job.
* `--start-time START_TIME` - Start time in ISO format or natural language for the cron task.
Member

nobody knows what ISO format is

say exactly like MMDDYYY or something, but keep it short
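
For reference, "ISO format" in this PR corresponds to what Python's `datetime.isoformat()` returns, since `parse_start_time` ends with `parsed_datetime.isoformat()`. The snippet below only illustrates the string shapes involved; the shorter pattern is one possible way to spell the format out in help text, not wording taken from the docs.

```python
from datetime import datetime, timezone

example = datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc)

# Full ISO 8601 form, as produced by isoformat().
iso_example = example.isoformat()
# A shorter spelled-out pattern (YYYY-MM-DD HH:MM) that help text could quote instead.
short_example = example.strftime("%Y-%m-%d %H:%M")

print(iso_example)    # 2024-01-15T14:30:00+00:00
print(short_example)  # 2024-01-15 14:30
```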

Member

this is still very confusing - what is the start time for the cron task? cron is defined by cron expression has exact start time ...

Member

so, it can be run once it seems ... so it's not only about cron then? (the fact that we still call all these tasks cron is confusing and wrong ... cron - it seems only a particular subtype?)

@@ -99,3 +131,14 @@ datachain job run --cluster-id 1 query.py
* To cancel a running job, use the `datachain job cancel` command
* The job will continue running in Studio even after you stop viewing the logs
* You can get the list of compute clusters using `datachain job clusters` command.
* When using `--start-time` or `--cron` options, the job is scheduled as a task and will not show logs immediately. The job will be executed according to the schedule.
* The `--start-time` option supports natural language parsing using the dateparser library, allowing flexible time expressions like "tomorrow 3pm", "in 2 hours", "monday 9am", etc.
Member

put a link to it


        # Convert to ISO format string
        return parsed_datetime.isoformat()
    except Exception as e:
Member

why do we need to have this along with if parsed_datetime is None: ? and repeat the message twice? let's refactor it please ...
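
One possible shape for that refactor is sketched below, with a single place where the error is raised. To keep the sketch runnable without third-party packages, `DataChainError` is redefined locally and `_iso_parser` stands in for `dateparser.parse`; both stand-ins and the error text are assumptions. It relies on the parser signalling failure by returning None, which is how `dateparser.parse` is documented to behave.

```python
from datetime import datetime
from typing import Callable, Optional


class DataChainError(Exception):
    """Local stand-in for datachain's error type so the sketch runs on its own."""


def _iso_parser(value: str) -> Optional[datetime]:
    """Stand-in for dateparser.parse: returns None when the input cannot be parsed."""
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None


def parse_start_time(
    start_time_str: Optional[str],
    parser: Callable[[str], Optional[datetime]] = _iso_parser,
) -> Optional[str]:
    if not start_time_str:
        return None
    parsed = parser(start_time_str)
    if parsed is None:
        # The only place where the (hypothetical) error message is built.
        raise DataChainError(f"Could not parse datetime string: {start_time_str!r}")
    return parsed.isoformat()
```

Dropping the broad `except Exception` means unexpected failures from the parser would surface as they are instead of being rewrapped with a duplicate message; whether that is acceptable for the CLI is a design decision for the follow-up.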

studio_run_description = "Run a job in Studio. \n"
studio_run_description += (
"When using --start-time or --cron,"
" the job is scheduled as a task and will not show logs immediately."
Member

job as a task - won't be clear

just say "job is scheduled to run but won't start immediately (can be seen in the Tasks tab in UI)"

file
```

## Description

This command runs a job in Studio using the specified query file. You can configure various aspects of the job including environment variables, Python version, dependencies, and more.
This command runs a job in Studio using the specified query file. You can configure various aspects of the job including environment variables, Python version, dependencies, and more. When using --start-time or --cron, the job is scheduled as a task and will not show logs immediately. The job will be executed according to the schedule.
Member

here also , please see below, improve the message a bit

Member

@shcheklein shcheklein left a comment

@amritghimire we need a followup and improve the messaging here

amritghimire added a commit that referenced this pull request Jul 17, 2025
amritghimire added a commit that referenced this pull request Jul 17, 2025
amritghimire added a commit that referenced this pull request Jul 17, 2025
amritghimire added a commit that referenced this pull request Jul 17, 2025
amritghimire added a commit that referenced this pull request Jul 17, 2025
3 participants