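- What is ChaosToolkit?
- Experiment preparation
- Install python3
- Install pipenv
- Install the chaos-toolkit k8s extension and reporting module
- Create a virtual environment
- Hands-on experiment
- chaos discover: explore experiments
- chaos init: generate an experiment
- chaos run: run the experiment
- Check the results
- Summary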
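Today we'll try out ChaosToolkit, an open-source chaos engineering tool. Its goal is to provide a free, open, community-driven toolkit and API.
Official source repository: https://github.com/chaostoolkit/chaostoolkit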
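To understand this tool, you first need to know the key points laid out in the Principles of Chaos Engineering.
Keep in mind the first of those points: build a steady-state hypothesis.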
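To make the idea of a steady-state hypothesis concrete, here is a minimal conceptual sketch in Python. It is not ChaosToolkit's implementation: `run_experiment`, the toy `replicas` system, and the status labels are all made up for illustration. It only shows the pattern of checking the hypothesis before and after injecting a fault.

```python
from typing import Callable


def run_experiment(
    probe: Callable[[], object],
    tolerance: object,
    action: Callable[[], None],
) -> str:
    """Check the steady state, inject a fault, then check the steady state again."""
    if probe() != tolerance:
        # The system was not in its normal state, so the fault is never injected.
        return "failed"
    action()
    if probe() != tolerance:
        # The fault pushed the system out of its steady state: a weakness was found.
        return "deviated"
    return "completed"


# Toy system (hypothetical): a service that stays up while at least one replica runs.
replicas = {"count": 3}


def service_is_up() -> bool:
    return replicas["count"] >= 1


def kill_one_replica() -> None:
    replicas["count"] -= 1


status = run_experiment(service_is_up, True, kill_one_replica)
print(status, replicas)
```

The hypothesis here is "the service is up", the tolerance is `True`, and the fault is the loss of one replica. An experiment is only informative when the hypothesis holds before the fault is injected, which is why the first check comes before the action.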
Before running the tool, let's look at its architecture.
Put simply, ChaosToolkit operates on your system under test through Drivers.
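One way to picture a driver is as a Python module whose functions are looked up by name and called with arguments taken from the experiment file. The sketch below illustrates that dispatch pattern only. `call_python_provider` is a hypothetical helper, not the toolkit's code, and the demo uses the standard library's `json.dumps` as a stand-in so that it runs without a cluster.

```python
import importlib
from typing import Any, Dict


def call_python_provider(provider: Dict[str, Any]) -> Any:
    """Import provider['module'], look up provider['func'], and call it with the arguments."""
    module = importlib.import_module(provider["module"])
    func = getattr(module, provider["func"])
    return func(**provider.get("arguments", {}))


# Stand-in provider: a real experiment would name a driver module such as
# chaosk8s.pod.actions here instead of the stdlib json module.
demo_provider = {
    "type": "python",
    "module": "json",
    "func": "dumps",
    "arguments": {"obj": {"ns": "chaosnamespace"}},
}

result = call_python_provider(demo_provider)
print(result)
```

Because the lookup is by module and function name, supporting another platform comes down to installing another package that exposes such functions.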
Its features include the following:
Now let's install the tool and try it out.
Environment:
- CentOS 7.8
- k8s 1.19.5
- A sample application
Install python3

sudo yum install python3 python3-venv

Install pipenv

pip3 install pipenv

Install the chaos-toolkit k8s extension and reporting module

pip3 install -U chaostoolkit
pip3 install -U chaostoolkit-kubernetes
pip3 install -U chaostoolkit-reporting
If you need to operate on other platforms, you can install the corresponding extensions as well.
Create a virtual environment

python3 -m venv .bundler
source .bundler/bin/activate

To avoid affecting other environments, we work inside a Python virtual environment here.
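If you are unsure whether the virtual environment is actually active, a quick check from Python can help. This is a small sketch that relies on the standard `sys.prefix` and `sys.base_prefix` attributes; `in_virtualenv` is a name chosen for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside an environment created by venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("inside a virtual environment:", in_virtualenv())
print("sys.prefix:", sys.prefix)
```

Run it with the same `python3` that will run `chaos`. If it reports `False`, packages installed with `pip3` go into the system-wide environment rather than `.bundler`.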
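Note: the installation above was performed on the k8s master machine. If you are installing somewhere other than a k8s node, you can configure the appropriate k8s context instead. For details, see https://chaostoolkit.org/drivers/kubernetes/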
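When working from a machine outside the cluster, it helps to know which kubeconfig file will be picked up. The sketch below follows the usual kubectl convention: the `KUBECONFIG` environment variable if set, otherwise `~/.kube/config`. That the Kubernetes driver resolves its configuration the same way is an assumption here, so the driver documentation linked above remains the authority. `kubeconfig_paths` is a hypothetical helper name.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def kubeconfig_paths(env: Optional[Mapping[str, str]] = None) -> List[Path]:
    """Return candidate kubeconfig files, following the kubectl convention."""
    env = os.environ if env is None else env
    raw = env.get("KUBECONFIG", "")
    if raw:
        # KUBECONFIG may hold several paths separated by the OS path separator.
        return [Path(p).expanduser() for p in raw.split(os.pathsep) if p]
    return [Path.home() / ".kube" / "config"]


for path in kubeconfig_paths():
    print(path, "(exists)" if path.exists() else "(missing)")
```

If the printed file is missing, copying the cluster's kubeconfig to that location or pointing `KUBECONFIG` at it is the usual remedy.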
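chaos discover: explore experiments

First, run the discover command. Based on the contents of ~/.kube/config, chaostoolkit generates a discovery.json file that lists every operation available against k8s. A successful run looks like this: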
(.bundler) [root@s5 chaostoolkit_scenarios]# chaos discover chaostoolkit-kubernetes
[2021-06-23 12:18:07 INFO] Attempting to download and install package 'chaostoolkit-kubernetes'
[2021-06-23 12:18:08 INFO] Package downloaded and installed in current environment
[2021-06-23 12:18:09 INFO] Discovering capabilities from chaostoolkit-kubernetes
[2021-06-23 12:18:09 INFO] Searching for actions in chaosk8s.actions
[2021-06-23 12:18:09 INFO] Searching for probes in chaosk8s.probes
[2021-06-23 12:18:09 INFO] Searching for actions in chaosk8s.deployment.actions
[2021-06-23 12:18:09 INFO] Searching for actions in chaosk8s.deployment.probes
[2021-06-23 12:18:09 INFO] Searching for actions in chaosk8s.node.actions
[2021-06-23 12:18:09 INFO] Searching for actions in chaosk8s.node.probes
[2021-06-23 12:18:09 INFO] Searching for actions in chaosk8s.pod.actions
[2021-06-23 12:18:09 INFO] Searching for probes in chaosk8s.pod.probes
[2021-06-23 12:18:09 INFO] Searching for actions in chaosk8s.replicaset.actions
[2021-06-23 12:18:09 INFO] Searching for actions in chaosk8s.service.actions
[2021-06-23 12:18:09 INFO] Searching for actions in chaosk8s.service.probes
[2021-06-23 12:18:09 INFO] Searching for actions in chaosk8s.statefulset.actions
[2021-06-23 12:18:09 INFO] Searching for probes in chaosk8s.statefulset.probes
[2021-06-23 12:18:09 INFO] Searching for actions in chaosk8s.crd.actions
[2021-06-23 12:18:09 INFO] Searching for probes in chaosk8s.crd.probes
[2021-06-23 12:18:09 INFO] Discovery outcome saved in ./discovery.json
(.bundler) [root@s5 chaostoolkit_scenarios]#
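To get a quick overview of what was discovered, the file can be summarized with a few lines of Python. The sketch assumes that discovery.json holds a top-level `activities` list whose entries carry `type`, `name`, and `mod` keys. That layout is an assumption, so inspect your own file before relying on it. The sample data is hand-written, not taken from a real discovery run.

```python
import json
from collections import defaultdict
from typing import Any, Dict, List


def activities_by_type(discovery: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group discovered activity names by their type (probe or action)."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for activity in discovery.get("activities", []):
        grouped[activity.get("type", "unknown")].append(activity.get("name", "?"))
    return dict(grouped)


# Hand-written sample in the assumed shape. For the real file, use:
#     with open("discovery.json") as f:
#         discovery = json.load(f)
sample = {
    "activities": [
        {"type": "probe", "name": "pods_in_phase", "mod": "chaosk8s.pod.probes"},
        {"type": "action", "name": "delete_pods", "mod": "chaosk8s.pod.actions"},
        {"type": "action", "name": "terminate_pods", "mod": "chaosk8s.pod.actions"},
    ]
}

grouped = activities_by_type(sample)
print(json.dumps(grouped, indent=2))
```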
chaos init: generate an experiment

Run the init command and follow the prompts to create a chaos experiment.
(.bundler) [root@s5 chaostoolkit_scenarios]# chaos init
You are about to create an experiment.
This wizard will walk you through each step so that you can build the best experiment for your needs.
An experiment is made up of three elements:
- a steady-state hypothesis [OPTIONAL]
- an experimental method
- a set of rollback activities [OPTIONAL]
only the method is required. Also your experiment will not run unless you define at least one activity (probe or action) within it
Experiment's title: E2    # set the experiment name here
A steady state hypothesis defines what 'normality' looks like in your system
The steady state hypothesis is a collection of conditions that are used, at the beginning of an experiment, to decide if the system is in a recognised 'normal' state.
The steady state conditions are then used again when your experiment is complete to detect where your system may have deviated in an interesting, weakness-detecting way
Initially you may not know what your steady state hypothesis is and so instead you might create an experiment without one
This is why the stead state hypothesis is optional.
Do you want to define a steady state hypothesis now? [y/N]: y    # create a steady-state hypothesis; this is a key concept in chaos engineering, yet most other chaos tools have no such step
Hypothesis's title: H2
You may now define probes that will determine the steady-state of your system.
Add an activity
1) all_microservices_healthy
2) deployment_is_fully_available
3) deployment_is_not_fully_available
4) microservice_available_and_healthy
5) microservice_is_not_available
6) read_microservices_logs
7) service_endpoint_is_initialized
8) count_pods
9) pod_is_not_available
10) pods_in_conditions
11) pods_in_phase
12) pods_not_in_phase
13) read_pod_logs
14) statefulset_fully_available
15) statefulset_not_fully_available
16) get_cluster_custom_object
17) get_custom_object
18) list_cluster_custom_objects
19) list_custom_objects
Activity (0 to escape): 1    # choose the check for the steady-state hypothesis; put simply, this defines the expected result
!!!DEPRECATED!!!
Do you want to use this probe? [y/N]: y    # confirm whether to use the probe selected above
A steady-state probe requires a tolerance value, within which your system is in a reognised `normal` state.
What is the tolerance for this probe?: normal
You now need to fill the arguments for this activity. Default values will be shown between brackets. You may simply press return to use it or not set any value.
Argument's value for 'ns' [default]: chaosnamespace    # enter the k8s namespace to operate on
Do you want to select another activity? [y/N]: y    # whether to select another activity
Add an activity
1) all_microservices_healthy
2) deployment_is_fully_available
3) deployment_is_not_fully_available
4) microservice_available_and_healthy
5) microservice_is_not_available
6) read_microservices_logs
7) service_endpoint_is_initialized
8) count_pods
9) pod_is_not_available
10) pods_in_conditions
11) pods_in_phase
12) pods_not_in_phase
13) read_pod_logs
14) statefulset_fully_available
15) statefulset_not_fully_available
16) get_cluster_custom_object
17) get_custom_object
18) list_cluster_custom_objects
19) list_custom_objects
Activity (0 to escape): 1    # select the specific activity
!!!DEPRECATED!!!
Do you want to use this probe? [y/N]: y    # confirm the activity selected above
You now need to fill the arguments for this activity. Default values will be shown between brackets. You may simply press return to use it or not set any value.
Argument's value for 'ns' [default]:
Do you want to select another activity? [y/N]: N    # whether to add another activity; I won't add any more here
An experiment's method contains actions and probes.
Actions vary real-world events in your system to determine if your steady-state hypothesis is maintained when those events occur.
An experimental method can also contain probes to gather additional information about your system as your method is executed.
Do you want to define an experimental method? [y/N]: y    # define the experiment's method
Add an activity
1) kill_microservice
2) remove_service_endpoint
3) scale_microservice
4) start_microservice
5) all_microservices_healthy
6) deployment_is_fully_available
7) deployment_is_not_fully_available
8) microservice_available_and_healthy
9) microservice_is_not_available
10) read_microservices_logs
11) service_endpoint_is_initialized
12) create_deployment
13) delete_deployment
14) scale_deployment
15) deployment_available_and_healthy
16) deployment_fully_available
17) deployment_not_fully_available
18) cordon_node
19) create_node
20) delete_nodes
21) drain_nodes
22) uncordon_node
23) get_nodes
24) delete_pods
25) exec_in_pods
26) terminate_pods
27) count_pods
28) pod_is_not_available
29) pods_in_conditions
30) pods_in_phase
31) pods_not_in_phase
32) read_pod_logs
33) delete_replica_set
34) create_service_endpoint
35) delete_service
36) service_is_initialized
37) create_statefulset
38) remove_statefulset
39) scale_statefulset
40) statefulset_fully_available
41) statefulset_not_fully_available
42) create_cluster_custom_object
43) create_custom_object
44) delete_cluster_custom_object
45) delete_custom_object
46) patch_cluster_custom_object
47) patch_custom_object
48) replace_cluster_custom_object
49) replace_custom_object
50) get_cluster_custom_object
51) get_custom_object
52) list_cluster_custom_objects
53) list_custom_objects
Activity (0 to escape): 24    # here I choose activity 24: delete pods
!!!DEPRECATED!!!
Do you want to use this action? [y/N]: y    # confirm the selection
You now need to fill the arguments for this activity. Default values will be shown between brackets. You may simply press return to use it or not set any value.
Argument's value for 'name': DeleteRedisPOD    # give this activity a name
Argument's value for 'ns' [default]: chaosnamespace    # the k8s namespace to operate on
Argument's value for 'label_selector' [name in ({name})]: app=redis    # the label of the target objects, so that they can be located
Do you want to select another activity? [y/N]: N    # whether to add another activity; I won't add any more here
An experiment may optionally define a set of remedial actions that are used to rollback the system to a given state.
Do you want to add some rollbacks now? [y/N]: N    # whether to add rollback actions; I am deleting the redis pods, and since k8s brings them back automatically, I don't need a rollback
Experiment created and saved in './experiment.json'    # the experiment file has been generated
(.bundler) [root@s5 chaostoolkit_scenarios]#

chaos run: run the experiment
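Before running it, it is worth opening experiment.json to see what the wizard wrote. The Python dictionary below is a reconstruction from the answers given above and the run log, not a copy of the actual generated file. The `version` and `description` values and the exact module paths are assumptions, so the real file may differ in detail.

```python
import json

# Reconstructed from the wizard answers; not the actual generated file.
experiment = {
    "version": "1.0.0",        # assumed value
    "title": "E2",
    "description": "N/A",      # assumed placeholder; the wizard did not ask for one
    "steady-state-hypothesis": {
        "title": "H2",
        "probes": [
            {
                "type": "probe",
                "name": "all_microservices_healthy",
                "tolerance": "normal",   # exactly what was typed at the prompt
                "provider": {
                    "type": "python",
                    "module": "chaosk8s.probes",
                    "func": "all_microservices_healthy",
                    "arguments": {"ns": "chaosnamespace"},
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "delete_pods",
            "provider": {
                "type": "python",
                "module": "chaosk8s.pod.actions",
                "func": "delete_pods",
                "arguments": {
                    "name": "DeleteRedisPOD",
                    "ns": "chaosnamespace",
                    "label_selector": "app=redis",
                },
            },
        }
    ],
    "rollbacks": [],   # none were declared in the wizard
}

print(json.dumps(experiment, indent=2))
```

The three top-level parts map directly onto the three elements the wizard describes: the steady-state hypothesis, the method, and the rollbacks.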
(.bundler) [root@s5 chaostoolkit_scenarios]# chaos run experiment.json
[2021-06-28 23:03:23 INFO] Validating the experiment's syntax
[2021-06-28 23:03:24 INFO] Experiment looks valid
[2021-06-28 23:03:24 INFO] Running experiment: E2
[2021-06-28 23:03:24 INFO] Steady-state strategy: default
[2021-06-28 23:03:24 INFO] Rollbacks strategy: default
[2021-06-28 23:03:24 INFO] Steady state hypothesis: H2
[2021-06-28 23:03:24 INFO] Probe: all_microservices_healthy
[2021-06-28 23:03:24 WARNING] all_microservices_healthy function is DEPRECATED and will be removed in the next releases, please use all_pods_healthy instead
[2021-06-28 23:03:24 INFO] Steady state hypothesis is met!
[2021-06-28 23:03:24 INFO] Playing your experiment's method now...
[2021-06-28 23:03:24 INFO] Action: delete_pods
[2021-06-28 23:03:24 INFO] Steady state hypothesis: H2
[2021-06-28 23:03:24 INFO] Probe: all_microservices_healthy
[2021-06-28 23:03:24 WARNING] all_microservices_healthy function is DEPRECATED and will be removed in the next releases, please use all_pods_healthy instead
[2021-06-28 23:03:24 INFO] Steady state hypothesis is met!
[2021-06-28 23:03:24 INFO] Let's rollback...
[2021-06-28 23:03:24 INFO] No declared rollbacks, let's move on.
[2021-06-28 23:03:24 INFO] Experiment ended with status: completed
(.bundler) [root@s5 chaostoolkit_scenarios]#

Check the results
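Besides reading the console log, the outcome of a run can be checked programmatically. The sketch below assumes that `chaos run` leaves a journal.json file next to the experiment and that it contains `status`, `deviated`, and `steady_states` keys. Both the file name and the key names are assumptions to verify against your own run. The sample journal is hand-written to mirror the log above, where the hypothesis was met both times and the status was `completed`.

```python
import json
from typing import Any, Dict


def summarize_journal(journal: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the overall status and both steady-state verdicts out of a run journal."""
    steady = journal.get("steady_states") or {}
    return {
        "status": journal.get("status"),
        "deviated": journal.get("deviated"),
        "met_before": (steady.get("before") or {}).get("steady_state_met"),
        "met_after": (steady.get("after") or {}).get("steady_state_met"),
    }


# Hand-written sample in the assumed shape. For a real run, use:
#     with open("journal.json") as f:
#         journal = json.load(f)
sample_journal = {
    "status": "completed",
    "deviated": False,
    "steady_states": {
        "before": {"steady_state_met": True},
        "after": {"steady_state_met": True},
    },
}

summary = summarize_journal(sample_journal)
print(json.dumps(summary, indent=2))
```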
Before the experiment:
[root@s5 ~]# kubectl get pods -n chaosnamespace -o wide
NAME                           READY   STATUS    RESTARTS   AGE    IP              NODE   NOMINATED NODE   READINESS GATES
...........................
redis-master-b96c9795b-nqzmr   1/1     Running   0          3d9h   10.100.220.84   s6
redis-slave-6b8d456947-6r42k   1/1     Running   0          3d9h   10.100.220.86   s6
redis-slave-6b8d456947-z55m5   1/1     Running   0          3d9h   10.100.53.206   s7

After the experiment:
[root@s5 ~]# kubectl get pods -n chaosnamespace -o wide
NAME                           READY   STATUS              RESTARTS   AGE    IP              NODE   NOMINATED NODE   READINESS GATES
...........................
redis-master-b96c9795b-92rc6   0/1     ContainerCreating   0          3s                     s6
redis-master-b96c9795b-nqzmr   0/1     Terminating         0          3d9h   10.100.220.84   s6
redis-slave-6b8d456947-5m2xt   0/1     ContainerCreating   0          2s                     s6
redis-slave-6b8d456947-6r42k   1/1     Terminating         0          3d9h   10.100.220.86   s6
redis-slave-6b8d456947-fj4xc   0/1     ContainerCreating   0          3s                     s7
redis-slave-6b8d456947-z55m5   1/1     Terminating         0          3d9h   10.100.53.206   s7

After the pods have fully started:
[root@s5 ~]# kubectl get pods -n chaosnamespace -o wide
NAME                           READY   STATUS    RESTARTS   AGE     IP              NODE   NOMINATED NODE   READINESS GATES
...........................
redis-master-b96c9795b-92rc6   1/1     Running   0          5m43s   10.100.220.89   s6
redis-slave-6b8d456947-5m2xt   1/1     Running   0          5m42s   10.100.220.90   s6
redis-slave-6b8d456947-fj4xc   1/1     Running   0          5m43s   10.100.53.211   s7
[root@s5 ~]#
The results above show that the experiment ran successfully: every redis pod was killed and then brought back up by k8s.
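The same conclusion can be checked mechanically by comparing pod names from the first and last listings. This is a small sketch: the names are copied from the kubectl output above, while `all_pods_replaced` and `owner` are helper names made up for illustration.

```python
from collections import Counter
from typing import Set

before = {
    "redis-master-b96c9795b-nqzmr",
    "redis-slave-6b8d456947-6r42k",
    "redis-slave-6b8d456947-z55m5",
}
after = {
    "redis-master-b96c9795b-92rc6",
    "redis-slave-6b8d456947-5m2xt",
    "redis-slave-6b8d456947-fj4xc",
}


def owner(pod_name: str) -> str:
    """Strip the random pod suffix, leaving the owning ReplicaSet's name."""
    return pod_name.rsplit("-", 1)[0]


def all_pods_replaced(old: Set[str], new: Set[str]) -> bool:
    """True when no old pod survived and each owner has as many pods as before."""
    same_owners = Counter(map(owner, old)) == Counter(map(owner, new))
    return old.isdisjoint(new) and same_owners


print(all_pods_replaced(before, after))
```

No pod name survives from the first listing to the last, and each ReplicaSet ends up with the same number of pods, which is what "killed and brought back up" means in terms of the output.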
We'll write just this one experiment today. You can follow the same steps to generate others.