[Bug]: Job termination detaches a volume while it's in use by another job when blocks feature is used #3841

@un-def

Steps to reproduce

type: volume
name: demo-volume
backend: gcp  # or aws
region: <REGION>
availability_zone: <AZ>
size: 10GB

type: fleet
name: demo-fleet
nodes: 1
backends: [gcp]  # or [aws]
regions: [<REGION>]
availability_zones: [<AZ>]
resources:
  cpu: 4..
  memory: 1GB..
  disk: 1GB..
  gpu: 0
blocks: auto

type: dev-environment
volumes:
  - demo-volume:/volume
init:
  - echo $DSTACK_JOB_ID > /volume/job_id
resources:
  cpu: 1..
  memory: 1GB..
  gpu: 0..
  disk: 1GB..

  • Create a fleet and a volume
  • dstack apply --name devenv-1 --fleet demo-fleet
    
  • ssh devenv-1 cat /volume/job_id
    f5ecabff-61d7-4914-9ca4-bc4043069a66
    
  • dstack apply --name devenv-2 --fleet demo-fleet --reuse
    Error (Volume error)
    Failed to attach volume: unexpected error
    
  • ssh devenv-1 cat /volume/job_id
    cat: /volume/job_id: Input/output error
    
  • dstack stop devenv-1
    (see the "Server logs" section for the error produced by this step)
    

Actual behaviour

The second job fails in Compute.attach_volume() because the volume is already in use (attached to the same instance). Then, while terminating the failed job, the server calls Compute.detach_volume(), which successfully detaches the volume from the instance even though the first job is still using it.
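
To make the sequence concrete, here is a minimal toy model of the behaviour described above, together with one possible guard. It is only a sketch: `Instance`, `attach_volume`, `terminate_job_naive`, and `terminate_job_guarded` are hypothetical names made up for illustration and do not correspond to dstack's server code or to the `Compute` API. It also does not address whether the second attach should succeed in the first place.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Toy model of an instance shared by several jobs (hypothetical, not dstack's model)."""
    name: str
    # Volumes currently attached to the instance.
    attached: set[str] = field(default_factory=set)
    # Volume name -> names of jobs that currently use it.
    volume_users: dict[str, set[str]] = field(default_factory=dict)


def attach_volume(instance: Instance, volume: str, job: str) -> None:
    if volume in instance.attached:
        # Mirrors the reported failure: a second attach to the same instance is refused.
        raise RuntimeError("Failed to attach volume: unexpected error")
    instance.attached.add(volume)
    instance.volume_users.setdefault(volume, set()).add(job)


def terminate_job_naive(instance: Instance, volume: str, job: str) -> None:
    """Reported behaviour: detach unconditionally when a job terminates."""
    instance.volume_users.get(volume, set()).discard(job)
    instance.attached.discard(volume)


def terminate_job_guarded(instance: Instance, volume: str, job: str) -> None:
    """Possible guard: detach only if no other job still uses the volume."""
    users = instance.volume_users.get(volume, set())
    users.discard(job)
    if not users:
        instance.attached.discard(volume)


def run_scenario(terminate) -> Instance:
    instance = Instance("gcp-0")
    attach_volume(instance, "demo-volume", "devenv-1")
    try:
        attach_volume(instance, "demo-volume", "devenv-2")
    except RuntimeError:
        # The second job failed to attach and is now being terminated.
        terminate(instance, "demo-volume", "devenv-2")
    return instance
```

With `run_scenario(terminate_job_naive)` the volume ends up detached while devenv-1 is still registered as a user, which matches the Input/output error in step 5. With `run_scenario(terminate_job_guarded)` the volume stays attached.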

Expected behaviour

No response

dstack version

0.20.19

Server logs

ERROR    dstack._internal.server.background.pipeline_tasks.jobs_terminating:981 Got exception when detaching volume volume-gcp from instance gcp-0
Traceback (most recent call last):
  File "/home/def/dev/dstack/src/dstack/_internal/server/background/pipeline_tasks/jobs_terminating.py", line 940, in _detach_volume_from_job_instance
    await common.run_async(
    ...<4 lines>...
    )
  File "/home/def/dev/dstack/src/dstack/_internal/utils/common.py", line 50, in run_async
    return await asyncio.get_running_loop().run_in_executor(None, func_with_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/def/.local/share/uv/python/cpython-3.13.9-linux-x86_64-gnu/lib/python3.13/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/def/dev/dstack/src/dstack/_internal/core/backends/gcp/compute.py", line 857, in detach_volume
    attachment_data = get_or_error(volume.get_attachment_data_for_instance(instance_id))
  File "/home/def/dev/dstack/src/dstack/_internal/utils/common.py", line 292, in get_or_error
    raise ValueError("Optional value is None")
ValueError: Optional value is None
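
One plausible reading of this traceback is that, by the time devenv-1 is stopped, the attachment record for the instance is already gone (removed when the failed second job was terminated), so the lookup returns None. The sketch below illustrates that reading only. The body of `get_or_error` is inferred from the traceback rather than copied from dstack, and the `attachments` dict with its `device_name` value is a hypothetical stand-in for the volume's attachment data.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def get_or_error(value: Optional[T]) -> T:
    # Behaviour inferred from the traceback above; not copied from dstack.
    if value is None:
        raise ValueError("Optional value is None")
    return value


# Hypothetical stand-in for a volume's attachment records, keyed by instance ID.
attachments = {"gcp-0": {"device_name": "example-device"}}

# Terminating the failed second job detaches the volume and drops the record.
attachments.pop("gcp-0")

# Stopping the first job later tries to detach again and finds nothing.
try:
    get_or_error(attachments.get("gcp-0"))
    message = None
except ValueError as exc:
    message = str(exc)
```

If this reading is right, the ValueError is a downstream symptom of the earlier detach rather than a separate bug.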

Additional information

No response

Labels: bug, volumes
