tags:
– Kubernetes
– K8s Series
– DevOps
– Scheduling Policy
– Pod Scheduling
– Taints and Tolerations

K8s Series | Day 15: Taints and Tolerations: Controlling Pod Scheduling

Day 15 of 30

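Introduction

Pod scheduling is a core concern in any Kubernetes cluster. Earlier in the series we covered NodeSelector and NodeAffinity, which attract Pods to particular nodes. In practice you often need the opposite: keeping certain Pods away from particular nodes. That is the job of taints and tolerations.

Picture a cluster where GPU nodes are reserved for AI training jobs, and ordinary web services should not land on them by accident. Or a node has to be taken offline temporarily for maintenance: every Pod on it must be evicted, and no new Pod should be scheduled there in the meantime. Taints and tolerations are the core mechanism for both cases.

This article explains how taints and tolerations work and walks through hands-on examples of using them in production.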

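Core Concepts

What Is a Taint?

A taint is a property set on a node. It tells the scheduler: do not place a Pod on this node unless the Pod explicitly tolerates the taint.

A taint has three fields:

Field	Description	Example
`key`	The taint key	`gpu`
`value`	The taint value (optional)	`nvidia`
`effect`	The taint effect	`NoSchedule`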

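The Three Effects

A taint's effect determines what happens to Pods that do not tolerate it:

Effect	Behavior
`NoSchedule`	Pods that do not tolerate the taint are not scheduled onto the node (Pods already running are unaffected)
`PreferNoSchedule`	The scheduler tries to avoid placing non-tolerating Pods on the node (soft constraint)
`NoExecute`	Pods that do not tolerate the taint are not scheduled, and those already running on the node are evicted

NoExecute is the strongest effect: it blocks new scheduling and also evicts every running Pod that does not tolerate the taint.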
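To make the difference between the effects concrete, here is a minimal Python sketch of what adding a taint means for Pods already running on the node. It is an illustrative model, not Kubernetes code; the function name evicted_pods and the Pod names in the usage note are made up for this example.

```python
from typing import Dict, List

EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}


def evicted_pods(effect: str, tolerates_taint: Dict[str, bool]) -> List[str]:
    """Names of running Pods that a newly added taint would evict.

    tolerates_taint maps a Pod name to whether that Pod tolerates the taint.
    """
    if effect not in EFFECTS:
        raise ValueError(f"unknown effect: {effect}")
    # NoSchedule and PreferNoSchedule only influence future scheduling.
    if effect != "NoExecute":
        return []
    # NoExecute also removes running Pods that do not tolerate the taint.
    return sorted(name for name, ok in tolerates_taint.items() if not ok)
```

Calling evicted_pods("NoSchedule", {"web-1": False, "trainer-1": True}) gives an empty list, while the same call with "NoExecute" names only web-1.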

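What Is a Toleration?

A toleration is a property of a Pod stating that the Pod is willing to tolerate a particular taint on a node. A Pod can be scheduled onto a tainted node (or allowed to keep running there) only if its tolerations match the node's taints.

Each toleration is written against one taint: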

tolerations:
- key: "gpu"
  operator: "Equal"
  value: "nvidia"
  effect: "NoSchedule"

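Toleration Matching Rules

The operator field controls how a toleration is matched against a taint:

Equal (default): key, value, and effect must all match
Exists: only key and effect must match; value is ignored

Special cases:
– operator: "Exists" with no key: matches every taint (tolerate everything)
– operator: "Exists" with a key but no effect: matches taints with that key under any effect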
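The matching rules above can be written down as a small predicate. The following Python sketch is a simplified model of the comparison, assuming only the Equal and Exists operators; the Taint and Toleration classes and the tolerates function are names chosen for this example, not part of any Kubernetes client library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key (with Exists) matches every key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint."""
    # A non-empty effect must be identical to the taint's effect.
    if toleration.effect and toleration.effect != taint.effect:
        return False
    # A non-empty key must be identical to the taint's key.
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True  # the value is ignored
    if toleration.operator == "Equal":
        return toleration.value == taint.value
    return False  # other operators are not modeled in this sketch
```

For example, Toleration("gpu", "Exists", effect="NoSchedule") matches Taint("gpu", "NoSchedule", "nvidia") whatever the value is, while Toleration("gpu", "Equal", "amd", "NoSchedule") does not.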

Hands-On Steps

1. Add Taints to Nodes

Use the kubectl taint command to add a taint to a node:

# Add a NoSchedule taint to a node
kubectl taint nodes node1 gpu=nvidia:NoSchedule

# Add a PreferNoSchedule taint (soft constraint)
kubectl taint nodes node2 disk=hdd:PreferNoSchedule

# Add a NoExecute taint (evicts Pods already running)
kubectl taint nodes node3 maintenance=true:NoExecute

Check the taints on a node:

kubectl describe node node1 | grep Taints

Example output:

Taints:             gpu=nvidia:NoSchedule
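The key=value:effect notation used by kubectl taint and echoed in the Taints line maps directly onto the three taint fields. As a rough illustration, the Python sketch below splits such a string into its parts. The parse_taint helper and the Taint tuple are hypothetical names for this example, and the parser covers only the key[=value]:effect form.

```python
from typing import NamedTuple

EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}


class Taint(NamedTuple):
    key: str
    value: str
    effect: str


def parse_taint(spec: str) -> Taint:
    """Split a 'key[=value]:effect' string into its three fields."""
    # The effect is everything after the last colon.
    key_value, sep, effect = spec.strip().rpartition(":")
    if not sep or effect not in EFFECTS:
        raise ValueError(f"expected key[=value]:effect, got {spec!r}")
    # The value is optional, so '=' may be absent.
    key, _, value = key_value.partition("=")
    if not key:
        raise ValueError(f"taint key is empty in {spec!r}")
    return Taint(key, value, effect)
```

parse_taint("gpu=nvidia:NoSchedule") yields the key gpu, the value nvidia, and the effect NoSchedule; a string without an effect, such as "gpu=nvidia", is rejected with a ValueError.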

2. Configure Tolerations on a Pod

The following Pod definition carries a toleration, so it can be scheduled onto nodes tainted with gpu=nvidia:NoSchedule:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
spec:
  containers:
  - name: trainer
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "nvidia"
    effect: "NoSchedule"

3. Use the Exists Operator

When only the presence of the taint matters and its value does not, Exists is more concise:

tolerations:
- key: "gpu"
  operator: "Exists"
  effect: "NoSchedule"

With this configuration, the Pod tolerates every taint whose key is gpu and whose effect is NoSchedule, whatever the value.

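4. tolerationSeconds for NoExecute Tolerations

A toleration for the NoExecute effect accepts an extra field, tolerationSeconds, which sets how long (in seconds) the Pod may keep running on the node after the taint is added before it is evicted: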

tolerations:
- key: "maintenance"
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
  tolerationSeconds: 300

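With this configuration, if the node receives the maintenance=true:NoExecute taint, the Pod keeps running there for 300 seconds (5 minutes) and is then evicted. This is useful for graceful node shutdown.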
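The timing can be worked through with a short calculation. The Python sketch below models when a Pod would be evicted after a NoExecute taint appears, under three assumptions: a Pod without a matching toleration is evicted at once, a matching toleration without tolerationSeconds is never evicted because of the taint, and zero or negative values count as zero. The function name eviction_deadline is made up for this example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def eviction_deadline(
    taint_added_at: datetime,
    tolerated: bool,
    toleration_seconds: Optional[int] = None,
) -> Optional[datetime]:
    """Time at which the Pod is evicted, or None if the taint never evicts it."""
    if not tolerated:
        # No matching toleration: eviction starts as soon as the taint is added.
        return taint_added_at
    if toleration_seconds is None:
        # Matching toleration without a time limit: the Pod stays.
        return None
    # Matching toleration with a limit: the Pod stays for that many seconds.
    return taint_added_at + timedelta(seconds=max(toleration_seconds, 0))


# Usage with the 300-second toleration from the YAML above.
added = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
deadline = eviction_deadline(added, tolerated=True, toleration_seconds=300)
```

With the sample timestamp, deadline falls exactly five minutes after the taint was added.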

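5. Full Example: A Dedicated GPU Node Pool

Suppose the cluster has two kinds of nodes: GPU nodes for the AI team and general-purpose nodes for the web team. A complete setup looks like this:

Step 1: Mark the GPU node

# Taint the node so ordinary Pods are kept off it
kubectl taint nodes gpu-node-1 gpu=dedicated:NoSchedule

# Label the node so NodeAffinity can select it later
kubectl label nodes gpu-node-1 node-type=gpu

Step 2: Configure the AI training Deployment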

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-training
  namespace: ai-team
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ai-training
  template:
    metadata:
      labels:
        app: ai-training
    spec:
      containers:
      - name: trainer
        image: pytorch/pytorch:latest
        resources:
          limits:
            nvidia.com/gpu: 1
      tolerations:
      - key: "gpu"
        operator: "Exists"
        effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values:
                - gpu

Step 3: Verify the scheduling result

# See which node each Pod was scheduled onto
kubectl get pods -n ai-team -o wide

# List the Pods actually running on the node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=gpu-node-1

6. Manage Taints

# Remove a taint from a node (append a minus sign to the taint)
kubectl taint nodes node1 gpu=nvidia:NoSchedule-

# Show the taints on every node
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Clear all taints from a node (via patch)
kubectl patch node node1 -p '{"spec":{"taints":[]}}'

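7. Taints Added Automatically on Node Problems

The Kubernetes control plane automatically taints nodes that are in an abnormal state. The not-ready and unreachable taints carry the NoExecute effect and therefore trigger eviction; the others only block scheduling:

Taint key	Trigger	Effect
`node.kubernetes.io/not-ready`	Node is not ready	NoExecute
`node.kubernetes.io/unreachable`	Node is unreachable	NoExecute
`node.kubernetes.io/out-of-disk`	Node is out of disk space (older releases; no longer listed in current Kubernetes documentation)	NoSchedule
`node.kubernetes.io/memory-pressure`	Node is under memory pressure	NoSchedule
`node.kubernetes.io/disk-pressure`	Node is under disk pressure	NoSchedule
`node.kubernetes.io/network-unavailable`	Node network is unavailable	NoSchedule
`node.kubernetes.io/unschedulable`	Node is marked unschedulable	NoSchedule
`node.cloudprovider.kubernetes.io/uninitialized`	Cloud provider is still initializing the node	NoSchedule

For the not-ready and unreachable taints, Pods usually carry tolerations with a default tolerationSeconds of 300, so eviction starts about 5 minutes after the node becomes unhealthy.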

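FAQ

Q1: How do NodeSelector / NodeAffinity differ from Taint / Toleration?

Aspect	NodeAffinity	Taint / Toleration
Role	Attracts Pods to nodes	Repels Pods from nodes
Defined on	Pod spec	Node + Pod spec
Approach	Positive selection	Negative filtering
Combinable	✅ Can be used together	✅ Can be used together

Best practice: use taints to exclude and NodeAffinity to select. The two work best in combination.

Q2: Does a toleration guarantee that a Pod lands on a particular node?

No. A toleration only means the Pod can tolerate the node's taint; it does not force the Pod onto that node. To pin a Pod to specific nodes, combine the toleration with NodeAffinity or NodeName.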
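A toy model shows why the toleration alone is not enough. The Python sketch below filters a hypothetical two-node cluster, first by taints only and then by taints plus a required node label. The node web-node-1, its label value, and the feasible_nodes helper are assumptions made for illustration; taints are reduced to NoSchedule keys to keep the example short.

```python
from typing import Dict, List, Optional, Set

# Hypothetical cluster: one tainted GPU node and one untainted web node.
NODES = {
    "gpu-node-1": {"taints": {"gpu"}, "labels": {"node-type": "gpu"}},
    "web-node-1": {"taints": set(), "labels": {"node-type": "web"}},
}


def feasible_nodes(
    nodes: Dict[str, dict],
    tolerated_keys: Set[str],
    required_labels: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Nodes that pass the taint check and the (optional) label requirement."""
    required = required_labels or {}
    result = []
    for name, node in nodes.items():
        # Every NoSchedule taint key on the node must be tolerated.
        if not node["taints"] <= tolerated_keys:
            continue
        # Required labels model a hard NodeAffinity rule.
        if any(node["labels"].get(k) != v for k, v in required.items()):
            continue
        result.append(name)
    return sorted(result)


toleration_only = feasible_nodes(NODES, {"gpu"})
with_affinity = feasible_nodes(NODES, {"gpu"}, {"node-type": "gpu"})
```

With the toleration alone, both nodes remain candidates, so the Pod may still end up on web-node-1. Adding the node-type=gpu requirement narrows the candidates to gpu-node-1.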

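Q3: How are multiple taints matched?

Is tolerating just one of the node's taints enough? No. Every NoSchedule or NoExecute taint on the node must be matched by one of the Pod's tolerations before the Pod can be scheduled there. If a node carries three NoSchedule taints with different keys, the Pod typically needs three matching tolerations, although a single broad toleration (for example, operator Exists with no key) can cover several taints. An untolerated PreferNoSchedule taint only makes the node less preferred.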
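The "every taint must be tolerated" rule can be sketched as a filter over the node's taints. The Python model below is a simplification that handles only the Equal and Exists operators; the verdict function, its return strings, and the team=ai taint in the usage note are invented for this example.

```python
from typing import Dict, List


def tolerates(toleration: Dict[str, str], taint: Dict[str, str]) -> bool:
    """Simplified match: empty key or effect on the toleration acts as a wildcard."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def verdict(taints: List[Dict[str, str]], tolerations: List[Dict[str, str]]) -> str:
    """Classify a node for a Pod as 'rejected', 'discouraged', or 'allowed'."""
    # Collect the effects of all taints that no toleration matches.
    untolerated = {
        taint["effect"]
        for taint in taints
        if not any(tolerates(tol, taint) for tol in tolerations)
    }
    if untolerated & {"NoSchedule", "NoExecute"}:
        return "rejected"      # a hard taint is left untolerated
    if "PreferNoSchedule" in untolerated:
        return "discouraged"   # only a soft taint is left untolerated
    return "allowed"
```

For a node tainted gpu=dedicated:NoSchedule and team=ai:NoSchedule, a Pod that tolerates only the gpu taint is rejected, while a Pod with tolerations for both, or with a single key-less Exists toleration, is allowed.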

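Q4: Does NoExecute eviction take effect immediately?

By default, yes: Pods that do not tolerate the taint are evicted right away. A toleration with tolerationSeconds adds a grace period. In addition, the kube-controller-manager flag --node-eviction-rate limits the eviction rate, so that a large-scale node outage does not trigger an avalanche of Pod re-creation.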
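To get a feel for what a rate limit means in practice, the following Python sketch does the arithmetic for a rate expressed in nodes per second. It is a back-of-the-envelope model that treats the limit as a steady rate; the value 0.1 is only an illustrative input, and both function names are made up for this example.

```python
def seconds_per_node(rate_per_second: float) -> float:
    """Average spacing between node evictions at the given rate."""
    if rate_per_second <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / rate_per_second


def seconds_to_process(failed_nodes: int, rate_per_second: float) -> float:
    """Rough time needed to work through a number of failed nodes."""
    return failed_nodes * seconds_per_node(rate_per_second)


# Illustrative input: 30 failed nodes at 0.1 nodes per second.
spacing = seconds_per_node(0.1)
total = seconds_to_process(30, 0.1)
```

At 0.1 nodes per second, nodes are handled roughly ten seconds apart, so 30 failed nodes take on the order of five minutes rather than being evicted all at once.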

Q5: Which tolerations does a Pod have by default?

Inspect any running Pod:

kubectl describe pod <pod-name> | grep -A10 Tolerations

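Most Pods are automatically given the following tolerations (injected when the Pod is created by the DefaultTolerationSeconds admission controller in kube-apiserver):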

Tolerations:   node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
               node.kubernetes.io/unreachable:NoExecute op=Exists for 300s

These default tolerations keep Pods from being evicted the moment a node has a brief failure, leaving time for the node to recover.
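The injection step can be pictured as follows. This Python sketch is a simplified model of adding the two default tolerations to a Pod's toleration list, assuming a default is skipped when the Pod already has a toleration covering that key with the NoExecute effect (or with an empty key or effect). The function add_default_tolerations is a hypothetical name, not an actual Kubernetes API.

```python
from typing import Dict, List

DEFAULT_KEYS = (
    "node.kubernetes.io/not-ready",
    "node.kubernetes.io/unreachable",
)


def add_default_tolerations(tolerations: List[Dict], seconds: int = 300) -> List[Dict]:
    """Return the toleration list with the two defaults appended where missing."""
    result = list(tolerations)
    for key in DEFAULT_KEYS:
        # A Pod-defined toleration for this key (or a wildcard) takes precedence.
        covered = any(
            t.get("key", "") in ("", key) and t.get("effect", "") in ("", "NoExecute")
            for t in tolerations
        )
        if not covered:
            result.append({
                "key": key,
                "operator": "Exists",
                "effect": "NoExecute",
                "tolerationSeconds": seconds,
            })
    return result
```

A Pod with no tolerations ends up with both defaults at 300 seconds, while a Pod that already sets its own not-ready toleration, for example with tolerationSeconds: 30, keeps that value and only receives the unreachable default.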

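Production Best Practices

1. Isolating Dedicated Nodes

Scenario	Taint	Toleration
GPU nodes	`gpu=dedicated:NoSchedule`	Set only on AI workloads
High-IO nodes	`ssd=high-io:NoSchedule`	Set only on database workloads
Ops nodes	`ops=true:NoSchedule`	Set only on monitoring and logging components
Node maintenance	`maintenance=true:NoExecute`	Set together with `tolerationSeconds`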

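2. Graceful Node Maintenance Workflow

# 1. Add a NoSchedule taint first to keep new Pods off the node
kubectl taint nodes node-to-maintain maintenance=true:NoSchedule

# 2. Wait for existing Pods to finish in-flight requests (assumes the workloads define a preStop hook)
# 3. Then evict the Pods
kubectl drain node-to-maintain --ignore-daemonsets

# 4. Perform the maintenance
# 5. Make the node schedulable again
kubectl uncordon node-to-maintain
kubectl taint nodes node-to-maintain maintenance=true:NoSchedule-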

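3. Monitor Taints and Scheduling Events

# Search the scheduler logs to troubleshoot scheduling failures
# (-E enables the "|" alternation; -l selects every scheduler Pod; --tail=-1 returns full logs)
kubectl logs -n kube-system -l component=kube-scheduler --tail=-1 | grep -Ei "taint|toleration"

# Check the scheduling events of a Pod
kubectl describe pod <unscheduled-pod> | grep -A20 Events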

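Summary

This article covered the taint and toleration mechanism in the Kubernetes scheduling system:

A taint is a node-level repel marker; its three effects (NoSchedule / PreferNoSchedule / NoExecute) give scheduling constraints of different strength
A toleration is a Pod-level "pass" that matches node taints through the Equal / Exists operators
NoExecute combined with tolerationSeconds gives graceful Pod eviction
Used together with NodeAffinity, taints and tolerations cover both positive selection and negative filtering
In production they are used for GPU node isolation, protecting ops nodes, smooth node decommissioning, and similar scenarios

With taints and tolerations you have the "exclude" side of Kubernetes scheduling. Combined with NodeAffinity, the "attract" side covered earlier, you can control Pod placement in fine detail.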

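Coming Next

Day 16: Node Affinity and Pod Affinity. We will dig into affinity-based scheduling, including hard and soft Node Affinity constraints, inter-Pod affinity (Pod Affinity) and anti-affinity (Pod Anti-Affinity), and how to build highly available deployments across multiple AZs.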

Copyright belongs to the author. Do not reproduce without permission.
