My k8s Notes: Kubernetes Deployment Troubleshooting

Author's note: this article is for reference only; read with care.

This article collects problems encountered while deploying a k8s cluster. Environments differ and my experience is limited, so treat it purely as a reference.
Note: this article is updated from time to time.

Repository and key problems

Using the domestic USTC mirror:

cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
deb http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial main
EOF

Update the package index:

apt-get update

But it errors out:

Ign:7 http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial/main amd64 Packages
Get:7 http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial/main amd64 Packages [31.3 kB]
Err:7 http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial/main amd64 Packages
Hash Sum mismatch
Fetched 38.9 kB in 1s (20.2 kB/s)
Reading package lists... Done
E: Failed to fetch http://mirrors.ustc.edu.cn/kubernetes/apt/dists/kubernetes-xenial/main/binary-amd64/Packages.gz Hash Sum mismatch
E: Some index files failed to download. They have been ignored, or old ones used instead.

Cause and fix:
Add the key:

gpg --keyserver keyserver.ubuntu.com --recv-keys 6A030B21BA07F4FB
gpg --export --armor 6A030B21BA07F4FB | sudo apt-key add -

Result: failure.

Using the official (overseas) Kubernetes repository:

cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF

Running apt-get update then hangs and fails.

Using the Alibaba Cloud mirror:

cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
deb https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main
EOF

Add the key:

curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

If that fails, first download https://packages.cloud.google.com/apt/doc/apt-key.gpg by some other means and save it to the current directory.
Then run:

cat apt-key.gpg | sudo apt-key add -

Then apt-get update succeeds.

Without the key, updating against the Alibaba Cloud mirror fails:

W: GPG error: https://mirrors.aliyun.com/kubernetes/apt kubernetes-xenial InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 6A030B21BA07F4FB
W: The repository 'https://mirrors.aliyun.com/kubernetes/apt kubernetes-xenial InRelease' is not signed.
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
N: See apt-secure(8) manpage for repository creation and user configuration details.

Querying the matching k8s configuration:

W1214 08:46:14.303158    8461 version.go:101] could not fetch a Kubernetes version from the internet: unable to get URL "https://dl.k8s.io/release/stable-1.txt": Get https://dl.k8s.io/release/stable-1.txt: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
W1214 08:46:14.303772 8461 version.go:102] falling back to the local client version: v1.17.0
W1214 08:46:14.304223 8461 validation.go:28] Cannot validate kube-proxy config - no validator is available
W1214 08:46:14.304609 8461 validation.go:28] Cannot validate kubelet config - no validator is available

Cause and fix:
The external site is unreachable. This can be ignored: kubeadm falls back to the default version matching the local client.
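If you prefer not to depend on the fallback, the lookup can be avoided entirely by pinning the version. A minimal sketch (--kubernetes-version is a standard kubeadm flag; v1.17.0 is taken from the log above):

```shell
# pin the version so kubeadm init does not query dl.k8s.io for stable-1.txt;
# the actual invocation would be: kubeadm init --kubernetes-version=v1.17.0 ...
KUBEADM_FLAGS="--kubernetes-version=v1.17.0"
echo "kubeadm init $KUBEADM_FLAGS"
```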

Script execution

pullk8s.sh: 3: pullk8s.sh: Syntax error: "(" unexpected

Cause and fix:
The script must start with #!/bin/bash; if it does not, run it as bash pullk8s.sh.
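The `"(" unexpected` error comes from bash-only syntax (such as arrays) being parsed by /bin/sh, which is dash on Ubuntu. A minimal reproduction (demo.sh is a made-up stand-in for pullk8s.sh):

```shell
# bash array syntax; dash rejects the "(" with exactly the error shown above
cat > /tmp/demo.sh <<'EOF'
images=(kube-apiserver kube-proxy)
echo "${images[0]}"
EOF
bash /tmp/demo.sh    # works
# sh /tmp/demo.sh    # on Ubuntu (dash): Syntax error: "(" unexpected
```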

Environment initialization: kubeadm init

Message:

[ERROR Swap]: running with swap on is not supported. Please disable swap

Cause and fix:
Swap is not supported and must be disabled.
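Disabling swap means both turning it off now and keeping it off after a reboot. A sketch: the real commands are shown as comments (they need root on the node), and the sed rule is demonstrated on a sample file:

```shell
# on the node, as root:
#   swapoff -a                                        # turn swap off now
#   sed -i '/ swap / s/^\([^#]\)/#\1/' /etc/fstab     # keep it off after reboot
# demo of the sed rule on a sample fstab:
printf '%s\n' 'UUID=abcd / ext4 defaults 0 1' '/swapfile none swap sw 0 0' > /tmp/fstab.demo
sed -i '/ swap / s/^\([^#]\)/#\1/' /tmp/fstab.demo
cat /tmp/fstab.demo    # the swap line is now commented out
```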

Message:

[ERROR Port-10250]: Port 10250 is in use

Stop the running kubelet: systemctl stop kubelet
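To confirm what is holding the port before stopping anything, bash's built-in /dev/tcp gives a quick probe with no extra tools (a sketch; on a fresh node the listener is usually a kubelet left over from a previous init):

```shell
# probe port 10250 locally; an open port usually means an old kubelet is running
if (exec 3<>/dev/tcp/127.0.0.1/10250) 2>/dev/null; then
  port_state="in use"    # then: systemctl stop kubelet
else
  port_state="free"      # kubeadm init can proceed
fi
echo "port 10250: $port_state"
```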

Message: WARNING IsDockerSystemdCheck

[init] Using Kubernetes version: v1.17.0
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR NumCPU]: the number of available CPUs 1 is less than the required 2
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

Cause and fix:
Docker is using the cgroupfs driver, which does not match Kubernetes. Check first:

# docker info | grep -i cgroup
Cgroup Driver: cgroupfs // !!! here it is cgroupfs
WARNING: No swap limit support

It needs to be changed. First stop Docker:

systemctl stop docker

Edit /etc/docker/daemon.json and add:

"exec-opts": ["native.cgroupdriver=systemd"]
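daemon.json must stay valid JSON as a whole, which is easy to break when pasting a single line into an existing file. For reference, a minimal complete file containing only this option (written to /tmp here; the real path is /etc/docker/daemon.json):

```shell
# minimal complete daemon.json with only the cgroup driver option
cat <<'EOF' > /tmp/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
# if your daemon.json already has other keys, merge this key in instead,
# and make sure there are no trailing commas
cat /tmp/daemon.json
```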

Restart Docker:

systemctl start docker

Check the cgroup driver again:

# docker info | grep -i cgroup
Cgroup Driver: systemd

Changed.
(!!!!!!
Note:
An alternative is to modify the kubeadm config file:

vim /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

After the existing Environment entries, add:

Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs"

Or a variant that also specifies the pod-infra image source:

Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs --pod-infra-container-image=registry.cn-hangzhou.aliyuncs.com/google_containers/pause-amd64:3.1"

Restart:

systemctl daemon-reload
systemctl restart kubelet

This method did not work in my tests.
!!!!!!)

Message: ERROR NumCPU

error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR NumCPU]: the number of available CPUs 1 is less than the required 2
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...

Cause and fix: at least two CPU cores are required; give the VM 2 or more cores.
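Before rerunning kubeadm init, you can confirm how many cores the guest actually sees:

```shell
# kubeadm's preflight check requires at least 2 CPUs on the master
nproc    # should print 2 or more
# to proceed anyway on a 1-CPU test box (not recommended for real clusters):
#   kubeadm init --ignore-preflight-errors=NumCPU ...
```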

Runtime

Check status:

kubectl get pods -n kube-system

Error:

The connection to the server localhost:8080 was refused - did you specify the right host or port?

Cause and fix:
The following had not been run:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

After running it: success.
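For the root user there is an equivalent shortcut, also printed by kubeadm init: point kubectl straight at the admin kubeconfig instead of copying it:

```shell
# per-shell alternative to copying admin.conf (root only)
export KUBECONFIG=/etc/kubernetes/admin.conf
echo "$KUBECONFIG"
```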

The connection to the server 192.168.0.102:6443 was refused - did you specify the right host or port?
# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-6955765f44-j7lvd 0/1 CrashLoopBackOff 14 51m
coredns-6955765f44-kmhfc 0/1 CrashLoopBackOff 14 51m
etcd-ubuntu 1/1 Running 0 52m
kube-apiserver-ubuntu 1/1 Running 0 52m
kube-controller-manager-ubuntu 1/1 Running 0 52m
kube-proxy-qlhfs 1/1 Running 0 51m
kube-scheduler-ubuntu 1/1 Running 0 52m

You can also use kubectl get pod --all-namespaces to list every namespace.

Before a network add-on is set up, coredns stays in the Pending state.
Deploy flannel:

kubectl apply -f kube-flannel.yml

Message:

error: unable to recognize "kube-flannel-aliyun-0.11.0.yml": no matches for kind "DaemonSet" in version "extensions/v1beta1"

Switching to calico gives the same error, so it is presumably an apiVersion naming problem.
Fix: use the file that corresponds to the master branch:
https://github.com/coreos/flannel/blob/master/Documentation/kube-flannel.yml

kube-flannel-aliyun.yml uses "extensions/v1beta1" on master and in the tagged releases. kube-flannel.yml's tagged releases use it too, but master has since reverted it.

Before flannel was deployed:


[FATAL] plugin/loop: Loop (127.0.0.1:60825 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 7805087528265218508.4857814245207702505."

After deploying flannel:

1. coredns shows CrashLoopBackOff, and kube-flannel shows Init:ImagePullBackOff:

# kubectl logs kube-flannel-ds-amd64-n55rf -n kube-system
Error from server (BadRequest): container "kube-flannel" in pod "kube-flannel-ds-amd64-n55rf" is waiting to start: PodInitializing

Inspect with kubectl describe pod:

# kubectl describe pod kube-flannel-ds-amd64-n55rf -n kube-system
...
Normal Scheduled 13m default-scheduler Successfully assigned kube-system/kube-flannel-ds-amd64-n55rf to ubuntu
Normal Pulling 4m21s (x4 over 13m) kubelet, ubuntu Pulling image "quay.io/coreos/flannel:v0.11.0-amd64"
Warning Failed 3m6s (x4 over 10m) kubelet, ubuntu Failed to pull image "quay.io/coreos/flannel:v0.11.0-amd64": rpc error: code = Unknown desc = context canceled
Warning Failed 3m6s (x4 over 10m) kubelet, ubuntu Error: ErrImagePull
Normal BackOff 2m38s (x7 over 10m) kubelet, ubuntu Back-off pulling image "quay.io/coreos/flannel:v0.11.0-amd64"
Warning Failed 2m27s (x8 over 10m) kubelet, ubuntu Error: ImagePullBackOff

Cause: flannel:v0.11.0-amd64 cannot be pulled; download it by other means. Note that the name quay.io/coreos/flannel:v0.11.0-amd64 must match exactly.
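One way to "download it by other means" is to pull the image from any registry you can reach and retag it to the exact name the manifest expects. A sketch as a helper function; the mirror host is a placeholder you must supply yourself:

```shell
# pull flannel from a reachable mirror, then retag to the name kube-flannel.yml uses
sideload_flannel() {
  local mirror="$1"    # placeholder: any registry host that mirrors coreos/flannel
  docker pull "${mirror}/coreos/flannel:v0.11.0-amd64" &&
  docker tag  "${mirror}/coreos/flannel:v0.11.0-amd64" quay.io/coreos/flannel:v0.11.0-amd64
}
# usage: sideload_flannel <mirror-host>
```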

Once the flannel image is downloaded, there are two possible outcomes:
2. coredns goes into the ContainerCreating state:

# kubectl logs coredns-6955765f44-4csvn -n kube-system
Error from server (BadRequest): container "coredns" in pod "coredns-6955765f44-r96qk" is waiting to start: ContainerCreating

3. coredns goes into the CrashLoopBackOff state:

# kubectl logs coredns-6955765f44-4csvn -n kube-system
.:53
[INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7
CoreDNS-1.6.5
linux/amd64, go1.13.4, c2fd1b2
[FATAL] plugin/loop: Loop (127.0.0.1:41252 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 1746539958269975925.3391392736060997773."

View the details:

# kubectl describe pod coredns-6955765f44-4csvn -n kube-system 
Name: coredns-6955765f44-r96qk
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: ubuntu/192.168.0.102
Start Time: Sun, 15 Dec 2019 22:45:15 +0800
Labels: k8s-app=kube-dns
pod-template-hash=6955765f44
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/coredns-6955765f44
Containers:
coredns:
Container ID:
Image: k8s.gcr.io/coredns:1.6.5
Image ID:
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-qq7qf (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-qq7qf:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-qq7qf
Optional: false
QoS Class: Burstable
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 7m21s (x3 over 8m32s) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Normal Scheduled 6m55s default-scheduler Successfully assigned kube-system/coredns-6955765f44-r96qk to ubuntu
Warning FailedCreatePodSandBox 6m52s kubelet, ubuntu Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9a2d45536097d22cc6b10f338b47f1789869f45f4b12f8a202aa898295dc80a4" network for pod "coredns-6955765f44-r96qk": networkPlugin cni failed to set up pod "coredns-6955765f44-r96qk_kube-system" network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.0.1/24

After installing flannel, delete the problematic pod:

kubectl delete pod coredns-6955765f44-4csvn -n kube-system

A new pod is started automatically, but the problem remains. ifconfig shows a cni0 interface.
A fix found online:

# run on nodes other than the master
kubeadm reset
systemctl stop kubelet
systemctl stop docker
rm -rf /var/lib/cni/
rm -rf /var/lib/kubelet/*
rm -rf /etc/cni/
ifconfig cni0 down
ifconfig flannel.1 down
ifconfig docker0 down
ip link delete cni0
ip link delete flannel.1
## restart kubelet
systemctl restart kubelet
## restart docker
systemctl restart docker

Tried it: failed!

Messages from another deployment attempt:

Warning  FailedScheduling        77s (x5 over 5m53s)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Normal Scheduled 76s default-scheduler Successfully assigned kube-system/coredns-9d85f5447-4jwf2 to ubuntu
Warning FailedCreatePodSandBox 73s kubelet, ubuntu Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "5c109baa51b8d97e75c6b35edf108ca4f2f56680b629140c8b477b9a8a03d97c" network for pod "coredns-9d85f5447-4jwf2": networkPlugin cni failed to set up pod "coredns-9d85f5447-4jwf2_kube-system" network: open /run/flannel/subnet.env: no such file or directory
Warning FailedCreatePodSandBox 71s kubelet, ubuntu Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "3f8c5b704fb1dc4584a2903b2ecff329e717e5c2558c9f761501fab909d32133" network for pod "coredns-9d85f5447-4jwf2": networkPlugin cni failed to set up pod "coredns-9d85f5447-4jwf2_kube-system" network: open /run/flannel/subnet.env: no such file or directory
Normal SandboxChanged 70s (x2 over 72s) kubelet, ubuntu Pod sandbox changed, it will be killed and re-created.
Normal Pulled 29s (x4 over 69s) kubelet, ubuntu Container image "registry.aliyuncs.com/google_containers/coredns:1.6.5" already present on machine
Normal Created 29s (x4 over 69s) kubelet, ubuntu Created container coredns
Normal Started 29s (x4 over 69s) kubelet, ubuntu Started container coredns
Warning BackOff 10s (x9 over 67s) kubelet, ubuntu Back-off restarting failed container

Cause and fix:
Online sources say to add --pod-network-cidr=10.244.0.0/16 at init time, but it was already set. Note: the file appears on its own after a short wait.
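/run/flannel/subnet.env is written by the flannel pod once it is running; when present it typically looks like the fragment below (values derived from the --pod-network-cidr; shown for reference, your subnet may differ):

```
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
```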

Warning  FailedScheduling        56m (x5 over 60m)    default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Normal Scheduled 56m default-scheduler Successfully assigned kube-system/coredns-9d85f5447-4jwf2 to ubuntu
Warning FailedCreatePodSandBox 56m kubelet, ubuntu Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "5c109baa51b8d97e75c6b35edf108ca4f2f56680b629140c8b477b9a8a03d97c" network for pod "coredns-9d85f5447-4jwf2": networkPlugin cni failed to set up pod "coredns-9d85f5447-4jwf2_kube-system" network: open /run/flannel/subnet.env: no such file or directory
Warning FailedCreatePodSandBox 55m kubelet, ubuntu Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "3f8c5b704fb1dc4584a2903b2ecff329e717e5c2558c9f761501fab909d32133" network for pod "coredns-9d85f5447-4jwf2": networkPlugin cni failed to set up pod "coredns-9d85f5447-4jwf2_kube-system" network: open /run/flannel/subnet.env: no such file or directory
Normal SandboxChanged 55m (x2 over 55m) kubelet, ubuntu Pod sandbox changed, it will be killed and re-created.
Normal Pulled 55m (x4 over 55m) kubelet, ubuntu Container image "registry.aliyuncs.com/google_containers/coredns:1.6.5" already present on machine
Normal Created 55m (x4 over 55m) kubelet, ubuntu Created container coredns
Normal Started 55m (x4 over 55m) kubelet, ubuntu Started container coredns
Warning BackOff 59s (x270 over 55m) kubelet, ubuntu Back-off restarting failed container

Log output:

.:53
[INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7
CoreDNS-1.6.5
linux/amd64, go1.13.4, c2fd1b2
[FATAL] plugin/loop: Loop (127.0.0.1:48100 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 639535139534040434.6569166625322327450."

Cause and fix:
The ConfigMap forwards to /etc/resolv.conf, where the DNS server is 127.0.1.1; this causes the loop.
Run:

kubectl edit cm coredns -n kube-system

Delete the loop line, save, and quit (vim editor).
Then delete all the problematic coredns pods:

kubectl delete pod coredns-9d85f5447-4jwf2 -n kube-system

The coredns ConfigMap contents are as follows:

apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2019-12-21T09:50:31Z"
  name: coredns
  namespace: kube-system
  resourceVersion: "171"
  selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
  uid: 62485b55-3de6-4dee-b24a-8440052bdb66

Note: in theory, changing the DNS in /etc/resolv.conf to 8.8.8.8 should also fix this, but manual edits to that file revert to the 127.x address after a reboot, so that route fails. Deleting the loop line does solve the problem.
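The kubectl edit step amounts to deleting the single `loop` line, so it can also be scripted. A sketch: the live-cluster pipeline is shown as a comment, and the sed rule is demonstrated on a local sample Corefile:

```shell
# against the cluster (equivalent to the manual vim edit):
#   kubectl get cm coredns -n kube-system -o yaml \
#     | sed '/^[[:space:]]*loop$/d' | kubectl apply -f -
# demo of the sed rule on a sample Corefile:
printf '%s\n' '.:53 {' '    forward . /etc/resolv.conf' '    loop' '    reload' '}' > /tmp/Corefile.demo
sed -i '/^[[:space:]]*loop$/d' /tmp/Corefile.demo
cat /tmp/Corefile.demo    # the loop line is gone
```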

Joining the cluster fails

[preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.17" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.

Cause and fix:
My guess is that the node's hostname was identical to the master's, but this is unverified.

TLS timeout

When running kubectl apply -f xxx:

Unable to connect to the server: net/http: TLS handshake timeout

Possible cause: the master has too little memory; increase it. (Even after raising it to 4 GB the error persisted; a reboot cleared it.)

Collected from the web

WARNING FileExisting-socat
socat is a networking tool that k8s uses for pod data exchange; when this warning appears, just install socat:

apt-get install socat

Worker node fails to join
Running kubeadm join on a worker node returns a timeout error:

root@worker2:~# kubeadm join 192.168.56.11:6443 --token wbryr0.am1n476fgjsno6wa --discovery-token-ca-cert-hash sha256:7640582747efefe7c2d537655e428faa6275dbaff631de37822eb8fd4c054807
[preflight] Running pre-flight checks
error execution phase preflight: couldn't validate the identity of the API Server: abort connecting to API servers after timeout of 5m0s

On the master node, run kubeadm token create --print-join-command to regenerate the join command, then rerun the newly printed command on the worker node.

The master's token expires after 24 hours; list the current tokens with:

kubeadm token list

Create a token that never expires:

kubeadm token create --ttl 0

Run this on the master node to obtain the discovery-token-ca-cert-hash value:

openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'

Rejoin the node:

kubeadm join 192.168.124.195:6443 --token 8xwg8u.lkj382k9ox58qkw9 \
--discovery-token-ca-cert-hash sha256:86291bed442dd1dcd6c26f2213208e10cab0f87763f44e0edf01fa670cd9e8b