add admin guide for curvebs and curvefs #34
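- Add operational guidance for the client side.
- The impact of server-side operations, especially on metaserver, needs particular attention because they affect I/O. Risks must be called out clearly, especially hidden ones (for example, restarting a single ms is certainly different from restarting all ms; for ms upgrade scenarios, confirm whether I/O is affected).
- The cluster health check step is problematic. If the cluster is already unhealthy when the operation starts, can the operation still be performed? If it can, the cluster will still be unhealthy afterwards, so OK cannot be the success criterion. If it cannot, state the preconditions for the operation clearly, for example that the cluster status must be OK.
- For other suggestions, see the individual review comments.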
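One way to make the "cluster status must be OK" precondition explicit is a small gate around each operation. The sketch below is only an illustration: `cluster_is_healthy` and `run_if_healthy` are hypothetical helper names, `STATUS_CMD` is an assumed override for dry runs, and the only facts taken from the reviewed docs are the `curve fs status cluster` command and the `Cluster health is: ok` marker line. It covers only the branch where an unhealthy cluster blocks the operation.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the status text on stdin
# contains the healthy marker quoted in the reviewed docs.
cluster_is_healthy() {
    grep -q 'Cluster health is: ok'
}

# Hypothetical wrapper: run "$@" only when the precondition holds.
# STATUS_CMD (an assumption of this sketch) defaults to the command
# used in the docs and can be overridden for dry runs.
run_if_healthy() {
    if ${STATUS_CMD:-curve fs status cluster} | cluster_is_healthy; then
        "$@"
    else
        echo "precondition failed: cluster is not ok, skipping: $*" >&2
        return 1
    fi
}
```

For example, `run_if_healthy curveadm status` would run the status command only when the health line is present.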
    10. Reference impact:
    Time: none

- Time: none
- Business side: none
- Affected subject: none
- Users: none
Please adjust the format of this part a bit. Item 11 below is similar, and the other operations should be changed too.

Change it to an unordered list.
    4. Check whether the current cluster status is healthy:
    $ curve fs status cluster
    The cluster is healthy if the output contains: Cluster health is: ok

If one mds is stopped, can the cluster status still stay ok? Please confirm this.
    @@ -0,0 +1,72 @@
    ## Curve: Restart mds

Remove one # to make this a level-1 heading; otherwise rendering on the docs site will fail.

Make the same change in the other md documents.
    $ curveadm status
    4. Check whether the cluster is healthy (Cluster health is: ok):
    $ curve fs status cluster

The output and the description of the check here differ from the ones above; keep them consistent.
    ```
    10. Reference impact:

Is there really no impact even when all mds are restarted?
    11. Reference risks:
    Data plane: I/O may jitter briefly

Please describe the specific cause and scope of the impact in more detail.
    10. Reference impact:
    Time: none

Same as above.
    11. Reference risks:
    Data plane: I/O may jitter briefly, or the cluster may become unhealthy, causing client I/O to fail

Same as above.
    $ curveadm stop --host <hostip> --role metaserver
    To stop all MetaServer services in the cluster, use the following command:
    $ curveadm stop --role metaserver

Stopping all metaservers has a very large impact, doesn't it? Doesn't this need a special note? The impact and risk sections below don't cover it either.
    4. Use the tool to check whether the cluster is currently healthy:
    $ curve fs status cluster
    The cluster is healthy if the output contains: Cluster health is: ok

The check result here should clearly be that the cluster is unhealthy, right? It certainly should not be OK.
08da4f7 to 33b3aee
    $ curveadm upgrade --host 10.0.1.1 --role mds
    Example 3: upgrade all mds services in the cluster
    $ curveadm upgrade --role mds

In the modify-configuration doc, is curveadm reload --role mds a rolling operation, or are all mds restarted at the same time?

It is not rolling; they are all restarted at the same time.
I added a risk warning and impact notes to the reload doc.
    10. Reference impact:
    * Time: none
    * Business side: none
    * Users: none
    11. Reference risks:
    * Data plane: none
    * Control plane: none
    * Recoverability: no recovery needed

The unmount operation is expected to make the file system unavailable, so it has to be requested by the business side.
See how best to note that here.

fixed
    * Recoverability: no recovery needed
    12. Reference rollback strategy: none

Remount?

fixed.
    * Recoverability: no recovery needed
    12. Reference rollback strategy: none

Is there a rollback procedure that can be referenced here?
    - host: ${machine3} # e.g. the failed machine
      config:
        log_dir: /mnt/curvefs/logs/${service_role} # new disk path
        data_dir: /mnt/curvefs/data/${service_role} # new disk path

Does replacing the disk change the directory? For example, if data_dir was originally /data/metaserver2, what is it after the failed disk is replaced?

It does not change, so nothing needs to be added here. Updated.
    $ curveadm stop --id <Id>
    To stop all mds services on a given node, use the following command:
    $ curveadm stop --host <hostip> --role mds

Can the value here be either the host or the hostname from hosts.yaml?

It has to be the host.

fixed
    * Recoverability: no recovery needed
    * Case 2: stop all mds services at the same time (this is generally not done)

chunkserver

fixed.
    The cluster is healthy if the output contains: Cluster health is: ok
    Note: 1. If the cluster is healthy (ok), continue with the following steps.
    2 .If the cluster is currently in the warn state, use the tool (command below) to determine whether it is caused by the current mds service being abnormal; if it is a chunkserver problem, do not perform the following steps.

1) 2 . --> 2.
2) current mds --> current chunkserver
3) It may also be precisely because this chunkserver has a problem that a restart is being attempted.

all fixed.
    ```plaintext
    1. Use the tool to check whether the cluster is healthy and whether the current chunkserver is abnormal
    $ curve bs status cluster
    The cluster is healthy if the output contains: Cluster health is: ok

The restart may be needed precisely because some chunkserver is abnormal.

Split this into two cases: normal restart and abnormal restart.

fixed.
    Note: 1. If the cluster is healthy (ok), continue with the following steps.
    2 .If the cluster is currently in the warn state, use the tool (command below) to determine whether it is caused by the current mds service being abnormal; if it is a chunkserver problem, do not perform the following steps.
    $ curve bs status chunkserver
    3. If the cluster is in the error state, restarting the chunkserver may be pointless, so do not perform the following steps.

Isn't it also possible that the chunkserver is being restarted precisely to resolve the error?

fixed.
Other similar descriptions have been updated as well.
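The distinction discussed in this thread (a planned restart on a healthy cluster versus a restart meant to repair an abnormal one) could be sketched as a small decision helper. Everything here is an assumption for illustration: the function names, the `planned`/`repair` labels, and the advice strings are hypothetical and do not reflect how the revised docs actually word it. Only the `Cluster health is:` line and the ok/warn/error states come from the reviewed text.

```shell
#!/bin/sh
# Extract the health state (e.g. ok, warn, error) from status output on stdin.
# The "Cluster health is: <state>" line is the one quoted in the reviewed docs.
health_state() {
    sed -n 's/.*Cluster health is: \([a-z]*\).*/\1/p' | head -n 1
}

# Hypothetical policy: $1 = health state, $2 = "planned" or "repair".
# The advice strings are placeholders for this sketch, not Curve terminology.
restart_advice() {
    case "$1:$2" in
        ok:*)                       echo "proceed" ;;
        warn:repair|error:repair)   echo "proceed-with-caution" ;;
        warn:planned|error:planned) echo "investigate-first" ;;
        *)                          echo "unknown" ;;
    esac
}
```

A caller might combine the two as `restart_advice "$(curve bs status cluster | health_state)" planned`, assuming the status command prints the health line shown in the docs.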
    10. Reference impact:
    * Case 1: restart some of the chunkserver services

Failure domains also need to be considered: if the three chunkservers being restarted span the three replica domains, I/O is still affected (I/O hang).
In addition, restarting a single chunkserver, or the chunkservers of a single replica domain, also causes brief I/O jitter.

fixed.
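The failure-domain concern above can be checked mechanically before a batch restart. The sketch below assumes a hypothetical mapping file with one `<host> <replica-domain>` pair per line; the thread does not show how Curve exposes this mapping, so the input format and both function names are assumptions of this example.

```shell
#!/bin/sh
# Hypothetical input on stdin: one "<host> <replica-domain>" pair per line.
# Arguments: the hosts you plan to restart together.
# Prints the number of distinct replica domains those hosts touch.
count_domains() {
    awk -v targets="$*" '
        BEGIN { n = split(targets, t, " "); for (i = 1; i <= n; i++) want[t[i]] = 1 }
        ($1 in want) && !($2 in seen) { seen[$2] = 1; c++ }
        END { print c + 0 }
    '
}

# Succeeds only when the selected hosts stay within at most one replica domain,
# the case the review describes as brief jitter rather than an I/O hang.
safe_to_restart_together() {
    [ "$(count_domains "$@")" -le 1 ]
}
```

Even when the check passes, the review notes that restarting within a single replica domain still causes brief I/O jitter, so this is a guard against the worse case, not a guarantee of no impact.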
    @@ -0,0 +1,70 @@
    # Curve: Start mds

See the suggested changes for the bs start-mds doc. The stop, upgrade, and modify-configuration docs below should likewise follow the bs ones.

fixed.
Revised accordingly.
    ```
    10. Reference impact:

The impact needs more detail. Stopping, restarting, or upgrading ms cannot be without impact; metadata operations are definitely affected. It is similar to chunkserver, so use that as a reference.
    11. Reference risks:
    * Data plane: I/O may jitter briefly

State explicitly that this is metadata I/O?
    @@ -0,0 +1,119 @@
    # Curve: Migrate services

See the suggestions for bs.

fixed.
    @@ -0,0 +1,122 @@
    # Curve: Mount client

Change this to "mount file system"?

fixed.
    10. Reference steps:
    ```plaintext
    1. Check the cluster status and the mds status on the specified node:

cluster status -> cluster service status

fixed.
    To start all mds, use the following command:
    $ curveadm start --role mds
    3. Check the cluster status again to see whether the specified mds service started successfully (Status is Up):

Ditto, and check all occurrences. Distinguish between cluster status and cluster service status.

fixed.
    * Control plane: control-plane services unavailable
    * Recoverability: no recovery needed

Doesn't mds need to be brought back up?
    $ curveadm reload --host server-host1 --role mds
    Example 3: reload all mds services (confirmation required)
    Reminder: this operation restarts all mds services on the machines, so I/O may jitter briefly while the following operations run.

What was the result of the confirmation? reload restarts all instances at the same time, right?

This was only meant to prompt the user to enter yes to confirm; updated.
reload restarts all instances together.
    2. If the cluster is abnormal (warn/error), upgrading is not recommended.
    2. Back up the local topology file:
    $ cp topology.yaml topology.yaml.bak

Suggest keeping the backup file name consistent; above it is called topology-old.yaml.

fixed.
    * Users: none
    * Case 2: restart all chunkservers of one replica domain

Case 2 is just the second scenario already covered under case 1, and it has no impact. What can be added here is restarting chunkservers that span multiple replica domains.

fixed.
    * Recoverability: no recovery needed
    * Case 2: restart all chunkservers of one replica domain

Ditto

fixed.
    * Users: none
    * Case 2: roll back all chunkservers of one replica domain

Chunkservers spanning multiple replica domains.

fixed.
    * Recoverability: no recovery needed
    * Case 2: roll back all chunkservers of one replica domain

Ditto

fixed.
The file 13-*** is missing under BS and FS.
    10. Reference impact:
    * Case 1: stop some of the metaserver services

This must not span replica domains. Stopping some of the metaservers within the same replica domain is ok; spanning domains may cause I/O to hang.

fixed.
    11. Reference risks:
    * Case 1: stop some of the metaserver services

Ditto

fixed.
    * Users: none
    * Case 2: restart all metaservers of one replica domain

Ditto

fixed.
    * Recoverability: no recovery needed
    * Case 2: restart all metaservers of one replica domain

Ditto

fixed.
    10. Reference impact:
    * Case 1: modify some of the metaserver services

Ditto

fixed.
    11. Reference risks:
    * Case 1: modify the configuration of some of the metaserver services

Ditto (at the same time)

fixed.
    * Users: none
    * Case 2: upgrade all metaservers of one replica domain

Ditto

fixed.
upgrade is a rolling upgrade, so using the curveadm command will not result in an upgrade that spans multiple replica domains.
    * Recoverability: no recovery needed
    * Case 2: upgrade all metaservers of one replica domain

Ditto

fixed.
    When a disk fails, the corresponding metaserver exits and the cluster migrates automatically. Use the following command to find the ID of the failed metaserver (the metaserver whose Status is Exited)
    $ curveadm status -v
    2. Bring the current service back up:

As with bs, the disk state and cache configuration also need to be checked, and the original data_dir needs to be remounted onto the new disk.

fixed.
Maybe add some content about CSI and the iSCSI target.
    * Recoverability: no recovery needed
    * Case 2: modify the configuration of all chunkserver services at the same time

Then restarting by specifying the role is not recommended for this operation; otherwise it causes an I/O hang. State this clearly.
    * Case 2: modify the configuration of all chunkserver services at the same time
    * Data plane: there may be brief I/O jitter. If all chunkservers restart at the same moment, read/write I/O errors occur at that moment

This is also an I/O hang, right?

fixed.
    Note: by default the command above upgrades all chunkserver services in the cluster. To target specific services only, add one of the following 3 options:
    --id: upgrade the service with the specified id
    --host: upgrade all services on the specified host

The indentation here is wrong.

fixed.
fixed.
    $ curveadm upgrade --role snapshotclone
    Note: by default the command above upgrades all snapshotclone services in the cluster. To target specific services only, add one of the following 3 options:
    --id: upgrade the service with the specified id

This indentation looks wrong. Check the other similar documents for the same problem.

fixed.
    $ curve bs status cluster
    The cluster is healthy if the output contains: Cluster health is: ok
    Note: if the cluster is unhealthy (error/warn state), do not roll back.

This sentence doesn't seem right: an unhealthy cluster may be exactly why a rollback is needed. Reword it, for example: if the cluster is unhealthy, determine the cause first, and if it is caused by the snapshotclone server, try rolling back.

fixed.
    After running the command above, the user must confirm at the upgrade prompt. Enter yes to start upgrading the current service:
    Upgrade 1/3 service:
    + host=server-host1 role=mds image=quay.io/opencurve/curve/curvebs:latest
    Do you want to continue? [yes/no]: (default=no)

Suggest moving this explanation of yes up to curveadm upgrade --role mds. That command is the regular upgrade step, while the latter two are only optional steps, so distinguish them. Example 3 can be dropped; it is a duplicate.

fixed.
    $ curveadm upgrade --role mds
    Note: by default the command above upgrades all mds services in the cluster. To target specific services only, add one of the following 3 options:
    --id: upgrade the service with the specified id

The indentation is wrong; check the other places too.

fixed.
f942eed to 7cbb032
Signed-off-by: caoxianfei <[email protected]>
close #26