
[Bug] FE Warn Log: cannot find task. type: STORAGE_MEDIUM_MIGRATE #47990

Open
3 tasks done
Ryan19929 opened this issue Feb 18, 2025 · 0 comments

Search before asking

  • I had searched in the issues and found no similar issues.

Version

doris-2.1.7

What's Wrong?

Around 00:00 every day, a large number of WARN logs appear in the FE (Frontend) log, as shown below. Checking the corresponding tasks on the BE (Backend), I found that the migrations had already completed successfully.

Investigating the relevant code, I found that the BE sends a report (type=tablet) to the master FE every minute. The master handles the report and migrates some tablets from SSD to HDD. However, during the handleMigration process, the tasks are submitted without being added to the AgentTaskQueue.

When the BE finishes executing a migration task, it sends a finishTask request. The master then tries to find the task in the AgentTaskQueue by signature, backendId, and taskType. Since the FE never added the task to the AgentTaskQueue when submitting it, the lookup fails and the WARN log is emitted.
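The mismatch described above can be modeled in isolation. The following is a minimal, self-contained sketch (the names `taskKey`, `queue`, and `finishTask` are illustrative, not Doris's real API) showing why a finishTask lookup fails when the submit path never registers the task, and succeeds when it does:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the AgentTaskQueue lookup in MasterImpl.finishTask.
// A task is keyed by (backendId, taskType, signature).
public class Main {
    static final Map<String, String> queue = new HashMap<>();

    static String taskKey(long backendId, String taskType, long signature) {
        return backendId + "/" + taskType + "/" + signature;
    }

    // Models MasterImpl.finishTask: look the task up; warn if absent.
    static String finishTask(long backendId, String taskType, long signature) {
        String key = taskKey(backendId, taskType, signature);
        String task = queue.get(key);
        if (task == null) {
            return "cannot find task. type: " + taskType
                    + ", backendId: " + backendId + ", signature: " + signature;
        }
        queue.remove(key);
        return "ok";
    }

    public static void main(String[] args) {
        // Bug path: the task was submitted to the BE but never registered,
        // so the BE's finishTask report cannot be matched.
        System.out.println(finishTask(10011, "STORAGE_MEDIUM_MIGRATE", 47671484L));

        // Fixed path: register the task before submitting, lookup succeeds.
        queue.put(taskKey(10011, "STORAGE_MEDIUM_MIGRATE", 47671484L), "migrate-task");
        System.out.println(finishTask(10011, "STORAGE_MEDIUM_MIGRATE", 47671484L));
    }
}
```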

Could you please let me know whether there is a reason the task should not be added to the AgentTaskQueue?

fe.log

2025-02-10 00:00:09,189 INFO (report-thread|118) [Env.getPartitionIdToStorageMediumMap():4118] partition[1054010-25546630-54629893] storage medium changed from SSD to HDD. cooldown time: 2025-02-10 00:00:00. current time: 2025-02-10 00:00:08
......
2025-02-10 00:00:16,080 WARN (thrift-server-pool-46|488) [MasterImpl.finishTask():122] cannot find task. type: STORAGE_MEDIUM_MIGRATE, backendId: 10011, signature: 47671484
2025-02-10 00:00:29,022 WARN (thrift-server-pool-46|488) [MasterImpl.finishTask():122] cannot find task. type: STORAGE_MEDIUM_MIGRATE, backendId: 10010, signature: 47671484

be.log

I20250210 00:00:16.031409 2788019 task_worker_pool.cpp:337] successfully submit task|type=STORAGE_MEDIUM_MIGRATE|signature=47671484
I20250210 00:00:16.031502 1424282 engine_storage_migration_task.cpp:200] begin to process tablet migrate. tablet_id=47671484, dest_store=/data04/storage
I20250210 00:00:16.031517 1424282 tablet_manager.cpp:1260] add tablet_id=47671484 to map, reason=disk migrate lock times=1 thread_id_in_map=139637274203904
I20250210 00:00:16.079579 1424282 tablet_manager.cpp:906] begin to load tablet from dir. tablet_id=47671484 schema_hash=33389863 path = /data04/storage/data/35/47671484/33389863 force = 0 restore = 0
......
I20250210 00:00:16.080051 1424282 task_worker_pool.cpp:1785] successfully migrate storage medium|signature=47671484|tablet_id=47671484

fe code

// fe/MasterImpl.java
public TMasterResult finishTask(TFinishTaskRequest request) {
    // ...
    AgentTask task = AgentTaskQueue.getTask(backendId, taskType, signature);
    if (task == null) {
        if (taskType != TTaskType.DROP && taskType != TTaskType.RELEASE_SNAPSHOT
                && taskType != TTaskType.CLEAR_TRANSACTION_TASK) {
            String errMsg = "cannot find task. type: " + taskType + ", backendId: " + backendId
                    + ", signature: " + signature;
            LOG.warn(errMsg);
            tStatus.setStatusCode(TStatusCode.CANCELLED);
            List<String> errorMsgs = new ArrayList<String>();
            errorMsgs.add(errMsg);
            tStatus.setErrorMsgs(errorMsgs);
        }
        return result;
    }
   // ...
}

// fe/ReportHandler.java: note that the tasks are never added to AgentTaskQueue.
private static void handleMigration(ListMultimap<TStorageMedium, Long> tabletMetaMigrationMap,
                                    long backendId) {
    TabletInvertedIndex invertedIndex = Env.getCurrentInvertedIndex();
    SystemInfoService infoService = Env.getCurrentSystemInfo();
    Backend be = infoService.getBackend(backendId);
    if (be == null) {
        return;
    }
    AgentBatchTask batchTask = new AgentBatchTask();
    for (TStorageMedium storageMedium : tabletMetaMigrationMap.keySet()) {
        List<Long> tabletIds = tabletMetaMigrationMap.get(storageMedium);
        if (!be.hasSpecifiedStorageMedium(storageMedium)) {
            LOG.warn("no specified storage medium {} on backend {}, skip storage migration."
                    + " sample tablet id: {}", storageMedium, backendId, tabletIds.isEmpty()
                    ? "-1" : tabletIds.get(0));
            continue;
        }
        List<TabletMeta> tabletMetaList = invertedIndex.getTabletMetaList(tabletIds);
        for (int i = 0; i < tabletMetaList.size(); i++) {
            long tabletId = tabletIds.get(i);
            TabletMeta tabletMeta = tabletMetaList.get(i);
            if (tabletMeta == TabletInvertedIndex.NOT_EXIST_TABLET_META) {
                continue;
            }
            // always get old schema hash(as effective one)
            int effectiveSchemaHash = tabletMeta.getOldSchemaHash();
            StorageMediaMigrationTask task = new StorageMediaMigrationTask(backendId, tabletId,
                    effectiveSchemaHash, storageMedium);
            batchTask.addTask(task);
        }
    }

    AgentTaskExecutor.submit(batchTask);
}
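If there is no reason to keep these tasks out of the queue, the fix could mirror other report handlers, which register the batch before submitting it. Assuming `AgentTaskQueue.addBatchTask(AgentBatchTask)` is usable here, an untested sketch of the change at the end of handleMigration:

```java
// fe/ReportHandler.java, end of handleMigration (sketch, not compiled):
// register the batch so MasterImpl.finishTask can find each task by
// (backendId, taskType, signature) when the BE reports completion.
AgentTaskQueue.addBatchTask(batchTask);
AgentTaskExecutor.submit(batchTask);
```

If tasks are registered this way, finishTask would also need to remove them from the queue on completion, as it does for other task types, so they do not accumulate.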

be code

// be/task_worker_pool.cpp
void storage_medium_migrate_callback(StorageEngine& engine, const TAgentTaskRequest& req) {
    const auto& storage_medium_migrate_req = req.storage_medium_migrate_req;

    // check request and get info
    TabletSharedPtr tablet;
    DataDir* dest_store = nullptr;

    auto status = check_migrate_request(engine, storage_medium_migrate_req, tablet, &dest_store);
    if (status.ok()) {
        EngineStorageMigrationTask engine_task(engine, tablet, dest_store);
        SCOPED_ATTACH_TASK(engine_task.mem_tracker());
        status = engine_task.execute();
    }
    // fe should ignore this err
    if (status.is<FILE_ALREADY_EXIST>()) {
        status = Status::OK();
    }
    if (!status.ok()) {
        LOG_WARNING("failed to migrate storage medium")
                .tag("signature", req.signature)
                .tag("tablet_id", storage_medium_migrate_req.tablet_id)
                .error(status);
    } else {
        LOG_INFO("successfully migrate storage medium")
                .tag("signature", req.signature)
                .tag("tablet_id", storage_medium_migrate_req.tablet_id);
    }

    TFinishTaskRequest finish_task_request;
    finish_task_request.__set_backend(BackendOptions::get_local_backend());
    finish_task_request.__set_task_type(req.task_type);
    finish_task_request.__set_signature(req.signature);
    finish_task_request.__set_task_status(status.to_thrift());

    finish_task(finish_task_request);
    remove_task_info(req.task_type, req.signature);
}

What You Expected?

No WARN logs in the FE.

How to Reproduce?

No response

Anything Else?

Is it feasible to add the migration task to the AgentTaskQueue within the handleMigration process?

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
