add admin guide for curvebs and curvefs #34

Merged on Nov 22, 2023 (1 commit)

@caoxianfei1 (Contributor) commented Oct 11, 2023

close #26

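@aspirer (Contributor) left a comment

  1. Add operation guidance for the client side.
  2. Pay special attention to the impact of server-side operations, especially on metaserver. They affect IO, so the risk warnings must be complete, including hidden risks (for example, restarting a single metaserver and restarting all of them certainly have different impacts; for the metaserver upgrade scenario, confirm whether IO is affected).
  3. The cluster health check step has a problem. If the cluster is already unhealthy when the operation starts, can the operation still be performed? If it can, the cluster will still be unhealthy afterwards, so "ok" cannot be used to judge the result. If it cannot, state the preconditions for these operations clearly, e.g. the cluster status must be ok before the operation.
  4. For other suggestions, see the individual review comments.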
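
Point 3 asks for an explicit precondition on cluster health. One way to make that precondition concrete is a small gate that reads the status output and refuses to continue unless the cluster reports ok. The following is a minimal sketch, not part of the PR: the function names `parse_health` and `precheck` are hypothetical, and the only output format assumed is the `Cluster health is: <state>` line quoted elsewhere in this review, with `warn` and `error` as the other states.

```shell
#!/usr/bin/env bash
# parse_health: reads status output on stdin and prints the health word.
# Assumes a line of the form "Cluster health is: <state>".
parse_health() {
  sed -n 's/.*Cluster health is:[[:space:]]*\([A-Za-z]*\).*/\1/p' | head -n 1
}

# precheck: gate for a planned (non-repair) operation.
# Succeeds only when the cluster reports ok; any other state
# is handed back to the operator for a decision.
precheck() {
  local state
  state="$(parse_health)"
  case "$state" in
    ok)
      echo "proceed"
      ;;
    warn|error)
      echo "investigate: cluster is $state"
      return 1
      ;;
    *)
      echo "unknown: could not read cluster health"
      return 1
      ;;
  esac
}
```

For example, `curve fs status cluster | precheck` could be run before a planned stop or upgrade. A repair operation on an already unhealthy cluster would skip this gate by design, which is the case the guide needs to document separately.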

10. Reference impact:

Time: none

Contributor

  • Time: none
  • Business side: none
  • Affected topic: none
  • Users: none

Please adjust the format of this part a little. Item 11 below is similar, and the other operations should be changed as well.

Contributor

Change it to an unordered list.


4. Use the tool to check whether the current cluster is healthy:
$ curve fs status cluster
The cluster is healthy if the output contains: Cluster health is: ok

Contributor

If one mds is stopped, can the cluster status still stay ok? Please confirm this.

@@ -0,0 +1,72 @@
## Curve restart mds

Contributor

Remove one # to make this a level-1 heading; otherwise rendering on the documentation site will fail.

Contributor

Change the other md documents the same way.

$ curveadm status

4. Check whether the cluster is healthy (Cluster health is: ok):
$ curve fs status cluster

Contributor

The output and the description of the check method here differ from the ones above. Keep them consistent.

```

10. Reference impact:

Contributor

Is there really no impact even when all mds are restarted?


11. Reference risk:

Data plane: IO may jitter briefly

Contributor

Describe the cause and the scope of the impact in more detail.


10. Reference impact:

Time: none

Contributor

Same as above.


11. Reference risk:

Data plane: IO may jitter briefly, or the cluster may become unhealthy, causing client IO to fail

Contributor

Same as above.

$ curveadm stop --host <hostip> --role metaserver

To stop all MetaServer services in the cluster, use the following command:
$ curveadm stop --role metaserver

Contributor

Stopping all metaservers has far too large an impact, doesn't it? Doesn't this need a special note? Also, the impact and risk sections below do not cover it.


4. Use the tool to check whether the current cluster is healthy:
$ curve fs status cluster
The cluster is healthy if the output contains: Cluster health is: ok

Contributor

The result of this check should obviously be that the cluster is unhealthy, right? It certainly should not be ok.

@aspirer changed the title from "commit fs at sheet" to "commit fs admin guide" on Oct 24, 2023
@aspirer changed the title from "commit fs admin guide" to "add admin guide for curvefs" on Oct 24, 2023
@caoxianfei1 force-pushed the main branch 2 times, most recently from 08da4f7 to 33b3aee, on October 29, 2023 11:00

$ curveadm upgrade --host 10.0.1.1 --role mds

Example 3: upgrade all mds services in the cluster
$ curveadm upgrade --role mds

Contributor

Is `curveadm reload --role mds` in the modify-configuration guide a rolling operation, or are all mds restarted at the same time?

Contributor Author

It is not rolling; they are all restarted at the same time.
A risk warning and an impact note have been added to the reload guide.

Comment on lines 47 to 64
10. Reference impact:

* Time: none

* Business side: none

* Users: none

11. Reference risk:

* Data plane: none

* Control plane: none

* Recoverability: no recovery needed

Contributor

The unmount operation is expected to make the file system unavailable, so the request must be initiated by the business side.

See how best to note this.

Contributor Author

fixed


* Recoverability: no recovery needed

12. Reference rollback strategy: none

Contributor

Remount?

Contributor Author

fixed.


* Recoverability: no recovery needed

12. Reference rollback strategy: none

Contributor

Is there a rollback operation that can be referenced here?

- host: ${machine3} # e.g. the failed machine
  config:
    log_dir: /mnt/curvefs/logs/${service_role} # path on the new disk
    data_dir: /mnt/curvefs/data/${service_role} # path on the new disk

Contributor

Does the directory change after the disk is replaced? For example, if the original data_dir was /data/metaserver2, what is it after the bad disk has been replaced?

Contributor Author

It does not change, so nothing needs to be added here. Updated.

$ curveadm stop --id <Id>

To stop all mds services on a given node, use the following command:
$ curveadm stop --host <hostip> --role mds

Contributor

Can the value here be either the host or the hostname from hosts.yaml?

Contributor Author

@caoxianfei1 Oct 31, 2023

It must be the host.

Contributor Author

fixed


* Recoverability: no recovery needed

* Case 2: stop all mds services at the same time (not normally done)

Contributor

chunkserver

Contributor Author

fixed.

The cluster is healthy if the output contains: Cluster health is: ok

Note: 1. If the cluster is healthy (ok), continue with the following steps.
2 . If the cluster is in the warn state, use the tool (command below) to determine whether this is caused by the current mds service being abnormal; if it is a chunkserver problem, do not perform the following steps.

Contributor

1) 2 . --> 2.
2) current mds --> current chunkserver
3) It is also possible that the restart is being attempted precisely because this chunkserver has a problem.

Contributor Author

all fixed.

```plaintext
1. Use the tool to check whether the cluster is healthy and whether the current chunkserver is abnormal
$ curve bs status cluster
The cluster is healthy if the output contains: Cluster health is: ok

Contributor

The restart may be needed precisely because some chunkserver is abnormal.

Contributor

Split this into two cases: normal restart and abnormal restart.

Contributor Author

fixed.

Note: 1. If the cluster is healthy (ok), continue with the following steps.
2 . If the cluster is in the warn state, use the tool (command below) to determine whether this is caused by the current mds service being abnormal; if it is a chunkserver problem, do not perform the following steps.
$ curve bs status chunkserver
3. If the cluster is in the error state, restarting the chunkserver may be pointless, so do not perform the following steps.

Contributor

Could it also be that the chunkserver is being restarted precisely to resolve the error?

Contributor Author

fixed.
Similar descriptions elsewhere have also been updated.


10. Reference impact:

* Case 1: restart some chunkserver services

Contributor

The failure domain also needs to be considered: if the restart covers 3 chunkservers across the 3 replica domains, IO is still affected (IO will hang).
In addition, restarting a single chunkserver, or the chunkservers of a single replica domain, also causes brief IO jitter.

Contributor Author

fixed.
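
The failure-domain point can be checked mechanically before a restart: if the chunkservers selected for restart fall into more than one replica domain, they should not be restarted in one step. A minimal sketch, assuming a hypothetical inventory of `<chunkserver-id> <replica-domain>` pairs prepared by the operator (curveadm is not assumed to produce this format); `single_domain` is an illustrative name.

```shell
#!/usr/bin/env bash
# single_domain: reads "<chunkserver-id> <replica-domain>" lines on stdin
# (hypothetical inventory format) and succeeds only if every listed
# chunkserver belongs to the same replica domain.
single_domain() {
  awk '
    NF >= 2 { seen[$2] = 1 }
    END {
      n = 0
      for (d in seen) n++
      exit (n == 1 ? 0 : 1)
    }
  '
}
```

A possible use is `printf 'cs1 zoneA\ncs2 zoneB\n' | single_domain || echo "targets span replica domains"`, which prints the warning because the two targets sit in different domains.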

@@ -0,0 +1,70 @@
# Curve start mds

Contributor

See the suggestions made on the bs start-mds document. The stop, upgrade and modify-configuration documents below should likewise follow the bs ones.

Contributor Author

fixed.
Revised accordingly.

```

10. Reference impact:

Contributor

The impact needs more detail. Stopping, restarting or upgrading a metaserver cannot be without impact; metadata operations are certainly affected. It is similar to chunkserver, so use that as a reference.


11. Reference risk:

* Data plane: IO may jitter briefly

Contributor

Make it clear that this is metadata IO?

@@ -0,0 +1,119 @@
# Curve migrate service

Contributor

See the suggestions on the bs version.

Contributor Author

fixed.

@@ -0,0 +1,122 @@
# Curve mount client

Contributor

Change this to "mount file system"?

Contributor Author

fixed.

10. Reference steps:

```plaintext
1. Check the cluster status and the mds status on the specified node:

Contributor

cluster status -> cluster service status

Contributor Author

fixed.

To start all mds, use the following command:
$ curveadm start --role mds

3. Check the cluster status again to see whether the specified mds service started successfully (Status is Up):

Contributor

Ditto, and check all occurrences. Distinguish between cluster status and cluster service status.

Contributor Author

fixed.


* Control plane: control plane services unavailable

* Recoverability: no recovery needed

Contributor

Doesn't the mds need to be brought back up?

$ curveadm reload --host server-host1 --role mds

Example 3: reload all mds services (confirmation required)
Reminder: this operation restarts all mds services on the machines, so running the commands below may cause brief IO jitter.

Contributor

What was the result of the confirmation? reload acts on all instances at the same time, right?

Contributor Author

This was only meant to indicate that you must enter yes to confirm. Updated.
reload restarts all instances together.
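
Since `curveadm reload --role mds` restarts all instances together, an operator who wants to limit the blast radius could drive the per-host form quoted in this thread one host at a time and check health in between. A minimal sketch under stated assumptions: `rolling_reload`, `CURVEADM` and `HEALTH_CMD` are hypothetical names introduced here; the host list is supplied by the operator; the health check defaults to `curve bs status cluster` (substitute `curve fs status cluster` for CurveFS); and because curveadm asks for a yes confirmation, each step stays interactive.

```shell
#!/usr/bin/env bash
# Commands are overridable so the loop can be dry-run (e.g. CURVEADM=echo).
CURVEADM="${CURVEADM:-curveadm}"
HEALTH_CMD="${HEALTH_CMD:-curve bs status cluster}"

# rolling_reload ROLE HOST...
# Reloads one host at a time and stops at the first host after which
# the cluster no longer reports "Cluster health is: ok".
rolling_reload() {
  local role="$1"
  shift
  local host
  for host in "$@"; do
    # Per-host form of reload, as quoted in the reviewed guide.
    $CURVEADM reload --host "$host" --role "$role" || return 1
    # Assumption: health is meaningful right after reload; a real
    # procedure would probably wait or retry before giving up.
    if ! $HEALTH_CMD | grep -q 'Cluster health is: ok'; then
      echo "stopping: cluster not ok after reloading $host" >&2
      return 1
    fi
  done
}
```

Running `CURVEADM=echo` first prints the commands that would be issued, which is a cheap way to review the host order before touching the cluster.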

2. If the cluster is abnormal (warn/error), upgrading is not recommended.

2. Back up the local topology file:
$ cp topology.yaml topology.yaml.bak

Contributor

Keep the backup file name consistent; above it is called topology-old.yaml.

Contributor Author

fixed.


* Users: none

* Case 2: restart all chunkservers in one replica domain

Contributor

Case 2 is just the second scenario described under case 1, and it has no impact. What can be added here is the case where the restart involves chunkservers in multiple replica domains.

Contributor Author

fixed.


* Recoverability: no recovery needed

* Case 2: restart all chunkservers in one replica domain

Contributor

Ditto

Contributor Author

fixed.


* Users: none

* Case 2: roll back all chunkservers in one replica domain

Contributor

Chunkservers involving multiple replica domains.

Contributor Author

fixed.


* Recoverability: no recovery needed

* Case 2: roll back all chunkservers in one replica domain

Contributor

Ditto

Contributor Author

fixed.

@SeanHai (Contributor) commented Nov 14, 2023

miss the file 13-*** under BS and FS


10. Reference impact:

* Case 1: stop some metaserver services

Contributor

This must not span replica domains. Stopping some metaservers within the same replica domain is fine; spanning domains may cause IO to hang.

Contributor Author

fixed.


11. Reference risk:

* Case 1: stop some metaserver services

Contributor

Ditto

Contributor Author

fixed.


* Users: none

* Case 2: restart all metaservers in one replica domain

Contributor

Ditto

Contributor Author

fixed.


* Recoverability: no recovery needed

* Case 2: restart all metaservers in one replica domain

Contributor

Ditto

Contributor Author

fixed.


10. Reference impact:

* Case 1: modify some metaserver services

Contributor

Ditto

Contributor Author

fixed.


11. Reference risk:

* Case 1: modify the configuration of some metaserver services

Contributor

Ditto (at the same time)

Contributor Author

fixed.


* Users: none

* Case 2: upgrade all metaservers in one replica domain

Contributor

Ditto

Contributor Author

fixed.
upgrade is a rolling upgrade, so with the curveadm command an upgrade never involves multiple replica domains at once.


* Recoverability: no recovery needed

* Case 2: upgrade all metaservers in one replica domain

Contributor

Ditto

Contributor Author

fixed.

When a disk fails, the corresponding metaserver exits and the cluster migrates automatically. Use the following command to find the ID of the failed metaserver (the metaserver whose Status is Exited):
$ curveadm status -v

2. Bring the current service back up:

Contributor

As with bs, the disk state and cache configuration also need to be checked, and the original data_dir needs to be remounted onto the new disk.

Contributor Author

fixed.

@SeanHai (Contributor) commented Nov 14, 2023

Maybe some content about CSI and iscsi target

@aspirer changed the title from "add admin guide for curvefs" to "add admin guide for curvebs and curvefs" on Nov 15, 2023

* Recoverability: no recovery needed

* Case 2: modify the configuration of all chunkserver services at the same time

Contributor

Then restarting by role is not recommended for this operation, since it will cause IO to hang. State this clearly.

* Case 2: modify the configuration of all chunkserver services at the same time

* Data plane: there may be brief IO jitter. If all chunkservers restart at the same moment, read/write IO errors occur at that moment

Contributor

This is IO hang as well, isn't it?

Contributor Author

fixed.


Note: by default the command above upgrades all chunkserver services in the cluster. To upgrade only specific services, add one of the following 3 options:
--id: upgrade the service with the specified id
--host: upgrade all services on the specified host

Contributor

The indentation here is wrong.

Contributor Author

fixed.

@caoxianfei1 (Contributor Author)

miss the file 13-*** under BS and FS

fixed.

$ curveadm upgrade --role snapshotclone

Note: by default the command above upgrades all snapshotclone services in the cluster. To upgrade only specific services, add one of the following 3 options:
--id: upgrade the service with the specified id

Member

This indentation looks wrong. Please check the other similar documents for the same problem.

Contributor Author

fixed.

$ curve bs status cluster
The cluster is healthy if the output contains: Cluster health is: ok

Note: if the cluster is unhealthy (error/warn state), do not roll back.

Member

This sentence does not seem appropriate: an unhealthy cluster may be exactly when a rollback is needed. Reword it, for example: if the cluster is unhealthy, first determine the cause, and if it is caused by the snapshotclone server, try rolling back.

Contributor Author

fixed.

After running the command above, the user must confirm at the upgrade prompt. Enter yes to start upgrading the current service:
Upgrade 1/3 service:
+ host=server-host1 role=mds image=quay.io/opencurve/curve/curvebs:latest
Do you want to continue? [yes/no]: (default=no)

Member

@xu-chaojie Nov 16, 2023

Move this explanation of yes up to `curveadm upgrade --role mds` above. That command is the regular upgrade step, while the latter two are only optional steps, so distinguish them. Example 3 can be dropped because it is a duplicate.

Contributor Author

fixed.

$ curveadm upgrade --role mds

Note: by default the command above upgrades all mds services in the cluster. To upgrade only specific services, add one of the following 3 options:
--id: upgrade the service with the specified id

Member

The indentation is wrong. Check the other places too.

Contributor Author

fixed.

@caoxianfei1 force-pushed the main branch 2 times, most recently from f942eed to 7cbb032, on November 20, 2023 09:14
Signed-off-by: caoxianfei <[email protected]>
@h0hmj merged commit c6454f9 into opencurve:main on Nov 22, 2023
1 check passed
Successfully merging this pull request may close these issues.

Please add the relevant documents under the CurveFS & CurveBS operations directory
6 participants