`Checkpoint`s are not serializable/deserializable (?) #18265

InAnYan · 2025-03-25T11:29:07Z

InAnYan
Mar 25, 2025

I tried to use workflow checkpoints for storing and restoring state of workflow in case of errors.

I used checkpoints and serialized them in a file using Pydantic's model_dump and Python's json.dump. However, when I deserialized the Checkpoint later using model_validate, it didn't work.

Buggy example:

import asyncio

from llama_index.core.workflow import Checkpoint, Event, StartEvent, StopEvent, Workflow, WorkflowCheckpointer, step


class MiddleEvent(Event):
    content: str


class TestWorkflow(Workflow):
    @step
    async def start(self, ev: StartEvent) -> MiddleEvent:
        return MiddleEvent(content="foo")

    @step
    async def finish(self, ev: MiddleEvent) -> StopEvent:
        return StopEvent(content=ev.content + " bar")


async def main():
    w = TestWorkflow(verbose=True)
    w = WorkflowCheckpointer(workflow=w)

    print("First run:")
    await w.run()

    checkpoints = list(w.checkpoints.items())[0][1]
    first_check = checkpoints[0]

    first_check = first_check.model_dump()
    first_check = Checkpoint.model_validate(first_check)

    print("From checkpoint:")
    await w.run_from(checkpoint=first_check)


asyncio.run(main())

I expected I could rerun workflow from deserialized Checkpoint, but I have this error:

(Sorry, it's big)

First run:
Running step start
Step start produced event MiddleEvent
Running step finish
Step finish produced event StopEvent
From checkpoint:
Exception in callback Dispatcher.span.<locals>.wrapper.<locals>.handle_future_result()() at [REDACTED]/.venv/lib/python3.13/site-packages/llama_index/core/instrumentation/dispatcher.py:274
handle: <Handle Dispatcher.span.<locals>.wrapper.<locals>.handle_future_result()() at [REDACTED]/.venv/lib/python3.13/site-packages/llama_index/core/instrumentation/dispatcher.py:274>
Traceback (most recent call last):
  File "/usr/lib/python3.13/asyncio/events.py", line 89, in _run
    self._context.run(self._callback, *self._args)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[REDACTED]/.venv/lib/python3.13/site-packages/llama_index/core/instrumentation/dispatcher.py", line 286, in handle_future_result
    raise exception
  File "[REDACTED]/.venv/lib/python3.13/site-packages/llama_index/core/workflow/workflow.py", line 401, in _run_workflow
    raise WorkflowTimeoutError(msg)
llama_index.core.workflow.errors.WorkflowTimeoutError: Operation timed out after 10.0 seconds
Exception in callback Dispatcher.span.<locals>.wrapper.<locals>.handle_future_result()() at [REDACTED]/.venv/lib/python3.13/site-packages/llama_index/core/instrumentation/dispatcher.py:274
handle: <Handle Dispatcher.span.<locals>.wrapper.<locals>.handle_future_result()() at [REDACTED]/.venv/lib/python3.13/site-packages/llama_index/core/instrumentation/dispatcher.py:274>
Traceback (most recent call last):
  File "/usr/lib/python3.13/asyncio/events.py", line 89, in _run
    self._context.run(self._callback, *self._args)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[REDACTED]/.venv/lib/python3.13/site-packages/llama_index/core/instrumentation/dispatcher.py", line 286, in handle_future_result
    raise exception
  File "/usr/lib/python3.13/asyncio/events.py", line 89, in _run
    self._context.run(self._callback, *self._args)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[REDACTED]/.venv/lib/python3.13/site-packages/llama_index/core/instrumentation/dispatcher.py", line 286, in handle_future_result
    raise exception
  File "[REDACTED]/.venv/lib/python3.13/site-packages/llama_index/core/workflow/workflow.py", line 401, in _run_workflow
    raise WorkflowTimeoutError(msg)
llama_index.core.workflow.errors.WorkflowTimeoutError: Operation timed out after 10.0 seconds
Traceback (most recent call last):
  File "[REDACTED]/workflows/test2.py", line 38, in <module>
    asyncio.run(main())
    ~~~~~~~~~~~^^^^^^^^
  File "/usr/lib/python3.13/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ~~~~~~~~~~^^^^^^
  File "/usr/lib/python3.13/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/usr/lib/python3.13/asyncio/base_events.py", line 725, in run_until_complete
    return future.result()
           ~~~~~~~~~~~~~^^
  File "[REDACTED]/workflows/test2.py", line 35, in main
    await w.run_from(checkpoint=first_check)
  File "[REDACTED]/.venv/lib/python3.13/site-packages/llama_index/core/workflow/workflow.py", line 401, in _run_workflow
    raise WorkflowTimeoutError(msg)
llama_index.core.workflow.errors.WorkflowTimeoutError: Operation timed out after 10.0 seconds

Then... why would anyone use checkpoints?

If I remove serialization and deserialization, code works.

I tried to search for similar issues and found this:

I guess one should serialize and deserialize Context, but not Checkpoints?

Answered by InAnYan

Mar 25, 2025

I'm terribly sorry.

Pydantic model_dump doesn't serialize as I expected.

More information here: pydantic/pydantic#8213.

Closing, as this problem is not relevant to LlamaIndex.

Correct way of serializing/deserializing Context:

from llama_index.core.workflow import JsonPickleSerializer, JsonSerializer

ctx_dict = ctx.to_dict(serializer=JsonSerializer())

restored_ctx = Context.from_dict(
    workflow, ctx_dict, serializer=JsonSerializer()
)

View full answer

@dosu · 2025-03-25T11:31:56Z

dosubot[bot]
bot Mar 25, 2025

Yes, you're correct. In the LlamaIndex framework, it's recommended to serialize and deserialize the Context rather than Checkpoint objects. The Checkpoint class allows arbitrary types in its fields, which can make it unsuitable for serialization using standard methods like JSON or Pydantic's model_dump and model_validate ^[1].

To properly manage workflow state, you should focus on the Context object. Here's how you can do it:

Create and Use a Context: Initialize a Context object and use it when running your workflow to maintain state.

from llama_index.core.workflow import Context

ctx = Context(workflow)
response = await workflow.run(user_msg="Hi, my name is Laurie!", ctx=ctx)

Serialize the Context: Use a serializer like JsonSerializer to convert the Context to a dictionary.

from llama_index.core.workflow import JsonSerializer

ctx_dict = ctx.to_dict(serializer=JsonSerializer())

Save the Serialized Context: Store the serialized context in a file or database.

Deserialize the Context: Restore the Context from the saved dictionary.

restored_ctx = Context.from_dict(workflow, ctx_dict, serializer=JsonSerializer())

Continue the Workflow: Use the restored Context to continue the workflow with the maintained state.
```
response3 = await workflow.run(user_msg="What's my name?", ctx=restored_ctx)
```

This approach ensures that the workflow state is correctly serialized and deserialized, allowing for seamless state restoration ^[2]^[3].

To continue talking to Dosu, mention @dosu.

^{Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Bug Report | Other}

1 reply

InAnYan Mar 25, 2025
Author

If what dosu said is true, I guess one can make a PR to improve documentation to specify these things:

Don't serialize/deserialize Checkpoints, but Context instead.
What are the uses of Checkpoints then

InAnYan · 2025-03-25T11:52:41Z

InAnYan
Mar 25, 2025
Author

IDK, I tried to serialize/deserialize Context, however it worked.. not how I expected:

import json
import asyncio
from pprint import pprint

from llama_index.core.workflow import Context, Event, StartEvent, StopEvent, Workflow, step


class EventA(Event):
    content: str


class EventB(Event):
    content: str


class TestWorkflow(Workflow):
    @step
    async def start(self, ev: StartEvent) -> EventA:
        return EventA(content="first")

    @step
    async def process(self, ev: EventA) -> EventB:
        return EventB(content="second: " + ev.content)

    @step
    async def finish(self, ctx: Context, ev: EventB) -> StopEvent:
        if await ctx.get('var'):
            return StopEvent(result="end: " + ev.content)
        else:
            raise ValueError()


async def main():
    w = TestWorkflow(verbose=True)
    ctx = Context(w)

    try:
        pprint(await w.run(ctx=ctx))
    except:
        pass

    ctx = json.dumps(ctx.to_dict(), ensure_ascii=False)
    ctx = Context.from_dict(w, json.loads(ctx))
    await ctx.set('var', '1')

    print('-' * 40)
    pprint(await w.run(ctx=ctx))
    print("OK")

asyncio.run(main())

At first run, step throws an error. Which produces this input:

Running step start
Step start produced event EventA
Running step process
Step process produced event EventB
Running step finish
Exception in callback...

Which is what I expected, it should fail on finish.

Then I serialize/deserialize the Context, but the second run produces this:

Running step finish
Step finish produced event StopEvent
Running step start
Step start produced event EventA
Running step process
Step process produced event EventB
'end: second: first'
OK

I expected, that step finish will be called. But then there are strange calls to start and process.

My question is still wide open: how one can store the state of Workflows? This is very important question, as real applications are not built solely in Python code and Python objects. One need to serialize objects to DB, and restore them for later use.

One can have a big workflow, and in case some step throws an exception (for various reasons, network error or bug), user should be able to restore the workflow, instead of rerunning it again and performing the same heavy computations

1 reply

dosubot[bot] bot Mar 25, 2025

To effectively store and restore the state of Workflows in LlamaIndex, you should focus on using the WorkflowCheckpointer and Context classes. Here's a concise guide on how to do this:

Use WorkflowCheckpointer for Checkpoints: The WorkflowCheckpointer is designed to manage checkpoints during a workflow run. It allows you to resume workflows from a specific point, which is useful for long-running workflows or when recovering from failures. However, Checkpoint objects themselves are not meant to be serialized directly for storage in a database or file system ^[1].
Serialize and Deserialize Context: The Context object holds the state of the workflow and can be serialized and deserialized using a serializer like JsonSerializer. This is the recommended approach for persisting the state of a workflow.
- Serialize the Context:
```
from llama_index.core.workflow import JsonSerializer

ctx_dict = ctx.to_dict(serializer=JsonSerializer())
```
- Deserialize the Context:
```
restored_ctx = Context.from_dict(workflow, ctx_dict, serializer=JsonSerializer())
```
Ensure Correct Workflow Execution: When you deserialize the Context, ensure that the workflow is resumed correctly. If you encounter unexpected behavior, such as steps being re-executed, verify that the Context is correctly capturing the state and that the workflow logic is designed to handle resumption appropriately.
Handling Exceptions: If a step throws an exception, you can use the WorkflowCheckpointer to resume from the last successful checkpoint. This avoids re-running the entire workflow and allows you to continue from where it left off.

By focusing on the Context for serialization and using WorkflowCheckpointer to manage checkpoints, you can effectively store and restore the state of workflows in LlamaIndex ^[2]^[3].

To continue talking to Dosu, mention @dosu.

^{Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Bug Report | Other}

InAnYan · 2025-03-25T12:00:04Z

InAnYan
Mar 25, 2025
Author

I'm terribly sorry.

Pydantic model_dump doesn't serialize as I expected.

More information here: pydantic/pydantic#8213.

Closing, as this problem is not relevant to LlamaIndex.

Correct way of serializing/deserializing Context:

from llama_index.core.workflow import JsonPickleSerializer, JsonSerializer

ctx_dict = ctx.to_dict(serializer=JsonSerializer())

restored_ctx = Context.from_dict(
    workflow, ctx_dict, serializer=JsonSerializer()
)

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`Checkpoint`s are not serializable/deserializable (?) #18265

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

Checkpoints are not serializable/deserializable (?) #18265

Uh oh!

InAnYan Mar 25, 2025

Replies: 3 comments · 2 replies

Uh oh!

dosubot[bot] bot Mar 25, 2025

Uh oh!

InAnYan Mar 25, 2025 Author

Uh oh!

InAnYan Mar 25, 2025 Author

Uh oh!

dosubot[bot] bot Mar 25, 2025

Uh oh!

Uh oh!

InAnYan Mar 25, 2025 Author

`Checkpoint`s are not serializable/deserializable (?) #18265

InAnYan
Mar 25, 2025

Replies: 3 comments 2 replies

dosubot[bot]
bot Mar 25, 2025

InAnYan Mar 25, 2025
Author

InAnYan
Mar 25, 2025
Author

InAnYan
Mar 25, 2025
Author