My k8s Notes: Kubernetes Deployment Troubleshooting

Author's note: this article is for reference only; read with care.

This article collects problems encountered while deploying a k8s cluster. Environments differ and my experience is limited, so treat it purely as a reference.
Note: this article is updated from time to time.

Repository and key problems

Using the domestic USTC mirror:

cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
deb http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial main
EOF

Update the package index:

apt-get update

But it errors out:

Ign:7 http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial/main amd64 Packages
Get:7 http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial/main amd64 Packages [31.3 kB]
Err:7 http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial/main amd64 Packages
Hash Sum mismatch
Fetched 38.9 kB in 1s (20.2 kB/s)
Reading package lists... Done
E: Failed to fetch http://mirrors.ustc.edu.cn/kubernetes/apt/dists/kubernetes-xenial/main/binary-amd64/Packages.gz Hash Sum mismatch
E: Some index files failed to download. They have been ignored, or old ones used instead.

Cause and fix:
Add the key:

gpg --keyserver keyserver.ubuntu.com --recv-keys 6A030B21BA07F4FB
gpg --export --armor 6A030B21BA07F4FB | sudo apt-key add -

Result: failure.

Using the official (overseas) Kubernetes repository:

cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF

Running apt-get update then hangs and fails.

Using the Alibaba Cloud mirror:

cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
deb https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main
EOF

Add the key:

curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

If that fails, first download https://packages.cloud.google.com/apt/doc/apt-key.gpg by some other means and save it to the current directory.
Then run:

cat apt-key.gpg | sudo apt-key add -

Then apt-get update succeeds.

Without the key, updating against the Alibaba Cloud mirror fails:

W: GPG error: https://mirrors.aliyun.com/kubernetes/apt kubernetes-xenial InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 6A030B21BA07F4FB
W: The repository 'https://mirrors.aliyun.com/kubernetes/apt kubernetes-xenial InRelease' is not signed.
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
N: See apt-secure(8) manpage for repository creation and user configuration details.

Querying the matching k8s configuration:

W1214 08:46:14.303158    8461 version.go:101] could not fetch a Kubernetes version from the internet: unable to get URL "https://dl.k8s.io/release/stable-1.txt": Get https://dl.k8s.io/release/stable-1.txt: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
W1214 08:46:14.303772 8461 version.go:102] falling back to the local client version: v1.17.0
W1214 08:46:14.304223 8461 validation.go:28] Cannot validate kube-proxy config - no validator is available
W1214 08:46:14.304609 8461 validation.go:28] Cannot validate kubelet config - no validator is available

Cause and fix:
The external site is unreachable. This can be ignored: kubeadm falls back to the default version matching the local client.
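If you prefer not to depend on the fallback, the lookup can be avoided entirely by pinning the version. A minimal sketch (--kubernetes-version is a standard kubeadm flag; v1.17.0 is taken from the log above):

```shell
# pin the version so kubeadm init does not query dl.k8s.io for stable-1.txt;
# the actual invocation would be: kubeadm init --kubernetes-version=v1.17.0 ...
KUBEADM_FLAGS="--kubernetes-version=v1.17.0"
echo "kubeadm init $KUBEADM_FLAGS"
```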

Script execution

pullk8s.sh: 3: pullk8s.sh: Syntax error: "(" unexpected

Cause and fix:
The script must start with #!/bin/bash; if it does not, run it as bash pullk8s.sh.
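The `"(" unexpected` error comes from bash-only syntax (such as arrays) being parsed by /bin/sh, which is dash on Ubuntu. A minimal reproduction (demo.sh is a made-up stand-in for pullk8s.sh):

```shell
# bash array syntax; dash rejects the "(" with exactly the error shown above
cat > /tmp/demo.sh <<'EOF'
images=(kube-apiserver kube-proxy)
echo "${images[0]}"
EOF
bash /tmp/demo.sh    # works
# sh /tmp/demo.sh    # on Ubuntu (dash): Syntax error: "(" unexpected
```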

Environment initialization: kubeadm init

Message:

[ERROR Swap]: running with swap on is not supported. Please disable swap

Cause and fix:
Swap is not supported and must be disabled.
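Disabling swap means both turning it off now and keeping it off after a reboot. A sketch: the real commands are shown as comments (they need root on the node), and the sed rule is demonstrated on a sample file:

```shell
# on the node, as root:
#   swapoff -a                                        # turn swap off now
#   sed -i '/ swap / s/^\([^#]\)/#\1/' /etc/fstab     # keep it off after reboot
# demo of the sed rule on a sample fstab:
printf '%s\n' 'UUID=abcd / ext4 defaults 0 1' '/swapfile none swap sw 0 0' > /tmp/fstab.demo
sed -i '/ swap / s/^\([^#]\)/#\1/' /tmp/fstab.demo
cat /tmp/fstab.demo    # the swap line is now commented out
```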

Message:

[ERROR Port-10250]: Port 10250 is in use

Stop the running kubelet: systemctl stop kubelet
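To confirm what is holding the port before stopping anything, bash's built-in /dev/tcp gives a quick probe with no extra tools (a sketch; on a fresh node the listener is usually a kubelet left over from a previous init):

```shell
# probe port 10250 locally; an open port usually means an old kubelet is running
if (exec 3<>/dev/tcp/127.0.0.1/10250) 2>/dev/null; then
  port_state="in use"    # then: systemctl stop kubelet
else
  port_state="free"      # kubeadm init can proceed
fi
echo "port 10250: $port_state"
```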

Message: WARNING IsDockerSystemdCheck

[init] Using Kubernetes version: v1.17.0
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR NumCPU]: the number of available CPUs 1 is less than the required 2
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

Cause and fix:
Docker is using the cgroupfs driver, which does not match Kubernetes. Check first:

# docker info | grep -i cgroup
Cgroup Driver: cgroupfs // !!! here it is cgroupfs
WARNING: No swap limit support

It needs to be changed. First stop Docker:

systemctl stop docker

Edit /etc/docker/daemon.json and add:

"exec-opts": ["native.cgroupdriver=systemd"]
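daemon.json must stay valid JSON as a whole, which is easy to break when pasting a single line into an existing file. For reference, a minimal complete file containing only this option (written to /tmp here; the real path is /etc/docker/daemon.json):

```shell
# minimal complete daemon.json with only the cgroup driver option
cat <<'EOF' > /tmp/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
# if your daemon.json already has other keys, merge this key in instead,
# and make sure there are no trailing commas
cat /tmp/daemon.json
```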

Restart Docker:

systemctl start docker

Check the cgroup driver again:

# docker info | grep -i cgroup
Cgroup Driver: systemd

Changed.
(!!!!!!
Note:
An alternative is to modify the kubeadm config file:

vim /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

After the existing Environment entries, add:

Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs"

Or a variant that also specifies the pod-infra image source:

Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs --pod-infra-container-image=registry.cn-hangzhou.aliyuncs.com/google_containers/pause-amd64:3.1"

Restart:

systemctl daemon-reload
systemctl restart kubelet

This method did not work in my tests.
!!!!!!)

Message: ERROR NumCPU

error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR NumCPU]: the number of available CPUs 1 is less than the required 2
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...

Cause and fix: at least two CPU cores are required; give the VM 2 or more cores.
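Before rerunning kubeadm init, you can confirm how many cores the guest actually sees:

```shell
# kubeadm's preflight check requires at least 2 CPUs on the master
nproc    # should print 2 or more
# to proceed anyway on a 1-CPU test box (not recommended for real clusters):
#   kubeadm init --ignore-preflight-errors=NumCPU ...
```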

Runtime

Check status:

kubectl get pods -n kube-system

Error:

The connection to the server localhost:8080 was refused - did you specify the right host or port?

Cause and fix:
The following had not been run:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

After running it: success.
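For the root user there is an equivalent shortcut, also printed by kubeadm init: point kubectl straight at the admin kubeconfig instead of copying it:

```shell
# per-shell alternative to copying admin.conf (root only)
export KUBECONFIG=/etc/kubernetes/admin.conf
echo "$KUBECONFIG"
```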

The connection to the server 192.168.0.102:6443 was refused - did you specify the right host or port?
# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-6955765f44-j7lvd 0/1 CrashLoopBackOff 14 51m
coredns-6955765f44-kmhfc 0/1 CrashLoopBackOff 14 51m
etcd-ubuntu 1/1 Running 0 52m
kube-apiserver-ubuntu 1/1 Running 0 52m
kube-controller-manager-ubuntu 1/1 Running 0 52m
kube-proxy-qlhfs 1/1 Running 0 51m
kube-scheduler-ubuntu 1/1 Running 0 52m

You can also use kubectl get pod --all-namespaces to list every namespace.

Before a network add-on is set up, coredns stays in the Pending state.
Deploy flannel:

kubectl apply -f kube-flannel.yml

Message:

error: unable to recognize "kube-flannel-aliyun-0.11.0.yml": no matches for kind "DaemonSet" in version "extensions/v1beta1"

Switching to calico gives the same error, so it is presumably an apiVersion naming problem.
Fix: use the file that corresponds to the master branch:
https://github.com/coreos/flannel/blob/master/Documentation/kube-flannel.yml

kube-flannel-aliyun.yml uses "extensions/v1beta1" on master and in the tagged releases. kube-flannel.yml's tagged releases use it too, but master has since reverted it.

Before flannel was deployed:


[FATAL] plugin/loop: Loop (127.0.0.1:60825 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 7805087528265218508.4857814245207702505."

After deploying flannel:

1. coredns shows CrashLoopBackOff, and kube-flannel shows Init:ImagePullBackOff:

# kubectl logs kube-flannel-ds-amd64-n55rf -n kube-system
Error from server (BadRequest): container "kube-flannel" in pod "kube-flannel-ds-amd64-n55rf" is waiting to start: PodInitializing

Inspect with kubectl describe pod:

# kubectl describe pod kube-flannel-ds-amd64-n55rf -n kube-system
...
Normal Scheduled 13m default-scheduler Successfully assigned kube-system/kube-flannel-ds-amd64-n55rf to ubuntu
Normal Pulling 4m21s (x4 over 13m) kubelet, ubuntu Pulling image "quay.io/coreos/flannel:v0.11.0-amd64"
Warning Failed 3m6s (x4 over 10m) kubelet, ubuntu Failed to pull image "quay.io/coreos/flannel:v0.11.0-amd64": rpc error: code = Unknown desc = context canceled
Warning Failed 3m6s (x4 over 10m) kubelet, ubuntu Error: ErrImagePull
Normal BackOff 2m38s (x7 over 10m) kubelet, ubuntu Back-off pulling image "quay.io/coreos/flannel:v0.11.0-amd64"
Warning Failed 2m27s (x8 over 10m) kubelet, ubuntu Error: ImagePullBackOff

Cause: flannel:v0.11.0-amd64 cannot be pulled; download it by other means. Note that the name quay.io/coreos/flannel:v0.11.0-amd64 must match exactly.
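One way to "download it by other means" is to pull the image from any registry you can reach and retag it to the exact name the manifest expects. A sketch as a helper function; the mirror host is a placeholder you must supply yourself:

```shell
# pull flannel from a reachable mirror, then retag to the name kube-flannel.yml uses
sideload_flannel() {
  local mirror="$1"    # placeholder: any registry host that mirrors coreos/flannel
  docker pull "${mirror}/coreos/flannel:v0.11.0-amd64" &&
  docker tag  "${mirror}/coreos/flannel:v0.11.0-amd64" quay.io/coreos/flannel:v0.11.0-amd64
}
# usage: sideload_flannel <mirror-host>
```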

Once the flannel image is downloaded, there are two possible outcomes:
2. coredns goes into the ContainerCreating state:

# kubectl logs coredns-6955765f44-4csvn -n kube-system
Error from server (BadRequest): container "coredns" in pod "coredns-6955765f44-r96qk" is waiting to start: ContainerCreating

3. coredns goes into the CrashLoopBackOff state:

# kubectl logs coredns-6955765f44-4csvn -n kube-system
.:53
[INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7
CoreDNS-1.6.5
linux/amd64, go1.13.4, c2fd1b2
[FATAL] plugin/loop: Loop (127.0.0.1:41252 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 1746539958269975925.3391392736060997773."

View the details:

# kubectl describe pod coredns-6955765f44-4csvn -n kube-system 
Name: coredns-6955765f44-r96qk
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: ubuntu/192.168.0.102
Start Time: Sun, 15 Dec 2019 22:45:15 +0800
Labels: k8s-app=kube-dns
pod-template-hash=6955765f44
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/coredns-6955765f44
Containers:
coredns:
Container ID:
Image: k8s.gcr.io/coredns:1.6.5
Image ID:
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-qq7qf (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-qq7qf:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-qq7qf
Optional: false
QoS Class: Burstable
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 7m21s (x3 over 8m32s) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Normal Scheduled 6m55s default-scheduler Successfully assigned kube-system/coredns-6955765f44-r96qk to ubuntu
Warning FailedCreatePodSandBox 6m52s kubelet, ubuntu Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9a2d45536097d22cc6b10f338b47f1789869f45f4b12f8a202aa898295dc80a4" network for pod "coredns-6955765f44-r96qk": networkPlugin cni failed to set up pod "coredns-6955765f44-r96qk_kube-system" network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.0.1/24

After installing flannel, delete the problematic pod:

kubectl delete pod coredns-6955765f44-4csvn -n kube-system

A new pod is started automatically, but the problem remains. ifconfig shows a cni0 interface.
A fix found online:

# run on nodes other than the master
kubeadm reset
systemctl stop kubelet
systemctl stop docker
rm -rf /var/lib/cni/
rm -rf /var/lib/kubelet/*
rm -rf /etc/cni/
ifconfig cni0 down
ifconfig flannel.1 down
ifconfig docker0 down
ip link delete cni0
ip link delete flannel.1
## restart kubelet
systemctl restart kubelet
## restart docker
systemctl restart docker

Tried it: failed!

Messages from another deployment attempt:

Warning  FailedScheduling        77s (x5 over 5m53s)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Normal Scheduled 76s default-scheduler Successfully assigned kube-system/coredns-9d85f5447-4jwf2 to ubuntu
Warning FailedCreatePodSandBox 73s kubelet, ubuntu Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "5c109baa51b8d97e75c6b35edf108ca4f2f56680b629140c8b477b9a8a03d97c" network for pod "coredns-9d85f5447-4jwf2": networkPlugin cni failed to set up pod "coredns-9d85f5447-4jwf2_kube-system" network: open /run/flannel/subnet.env: no such file or directory
Warning FailedCreatePodSandBox 71s kubelet, ubuntu Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "3f8c5b704fb1dc4584a2903b2ecff329e717e5c2558c9f761501fab909d32133" network for pod "coredns-9d85f5447-4jwf2": networkPlugin cni failed to set up pod "coredns-9d85f5447-4jwf2_kube-system" network: open /run/flannel/subnet.env: no such file or directory
Normal SandboxChanged 70s (x2 over 72s) kubelet, ubuntu Pod sandbox changed, it will be killed and re-created.
Normal Pulled 29s (x4 over 69s) kubelet, ubuntu Container image "registry.aliyuncs.com/google_containers/coredns:1.6.5" already present on machine
Normal Created 29s (x4 over 69s) kubelet, ubuntu Created container coredns
Normal Started 29s (x4 over 69s) kubelet, ubuntu Started container coredns
Warning BackOff 10s (x9 over 67s) kubelet, ubuntu Back-off restarting failed container

Cause and fix:
Online sources say to add --pod-network-cidr=10.244.0.0/16 at init time, but it was already set. Note: the file appears on its own after a short wait.
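/run/flannel/subnet.env is written by the flannel pod once it is running; when present it typically looks like the fragment below (values derived from the --pod-network-cidr; shown for reference, your subnet may differ):

```
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
```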

Warning  FailedScheduling        56m (x5 over 60m)    default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Normal Scheduled 56m default-scheduler Successfully assigned kube-system/coredns-9d85f5447-4jwf2 to ubuntu
Warning FailedCreatePodSandBox 56m kubelet, ubuntu Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "5c109baa51b8d97e75c6b35edf108ca4f2f56680b629140c8b477b9a8a03d97c" network for pod "coredns-9d85f5447-4jwf2": networkPlugin cni failed to set up pod "coredns-9d85f5447-4jwf2_kube-system" network: open /run/flannel/subnet.env: no such file or directory
Warning FailedCreatePodSandBox 55m kubelet, ubuntu Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "3f8c5b704fb1dc4584a2903b2ecff329e717e5c2558c9f761501fab909d32133" network for pod "coredns-9d85f5447-4jwf2": networkPlugin cni failed to set up pod "coredns-9d85f5447-4jwf2_kube-system" network: open /run/flannel/subnet.env: no such file or directory
Normal SandboxChanged 55m (x2 over 55m) kubelet, ubuntu Pod sandbox changed, it will be killed and re-created.
Normal Pulled 55m (x4 over 55m) kubelet, ubuntu Container image "registry.aliyuncs.com/google_containers/coredns:1.6.5" already present on machine
Normal Created 55m (x4 over 55m) kubelet, ubuntu Created container coredns
Normal Started 55m (x4 over 55m) kubelet, ubuntu Started container coredns
Warning BackOff 59s (x270 over 55m) kubelet, ubuntu Back-off restarting failed container

Log output:

.:53
[INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7
CoreDNS-1.6.5
linux/amd64, go1.13.4, c2fd1b2
[FATAL] plugin/loop: Loop (127.0.0.1:48100 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 639535139534040434.6569166625322327450."

Cause and fix:
The ConfigMap forwards to /etc/resolv.conf, where the DNS server is 127.0.1.1; this causes the loop.
Run:

kubectl edit cm coredns -n kube-system

Delete the loop line, save, and quit (vim editor).
Then delete all the problematic coredns pods:

kubectl delete pod coredns-9d85f5447-4jwf2 -n kube-system

The coredns ConfigMap contents are as follows:

apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2019-12-21T09:50:31Z"
  name: coredns
  namespace: kube-system
  resourceVersion: "171"
  selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
  uid: 62485b55-3de6-4dee-b24a-8440052bdb66

Note: in theory, changing the DNS in /etc/resolv.conf to 8.8.8.8 should also fix this, but manual edits to that file revert to the 127.x address after a reboot, so that route fails. Deleting the loop line does solve the problem.
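The kubectl edit step amounts to deleting the single `loop` line, so it can also be scripted. A sketch: the live-cluster pipeline is shown as a comment, and the sed rule is demonstrated on a local sample Corefile:

```shell
# against the cluster (equivalent to the manual vim edit):
#   kubectl get cm coredns -n kube-system -o yaml \
#     | sed '/^[[:space:]]*loop$/d' | kubectl apply -f -
# demo of the sed rule on a sample Corefile:
printf '%s\n' '.:53 {' '    forward . /etc/resolv.conf' '    loop' '    reload' '}' > /tmp/Corefile.demo
sed -i '/^[[:space:]]*loop$/d' /tmp/Corefile.demo
cat /tmp/Corefile.demo    # the loop line is gone
```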

Joining the cluster fails

[preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.17" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.

Cause and fix:
My guess is that the node's hostname was identical to the master's, but this is unverified.

TLS timeout

When running kubectl apply -f xxx:

Unable to connect to the server: net/http: TLS handshake timeout

Possible cause: the master has too little memory; increase it. (Even after raising it to 4 GB the error persisted; a reboot cleared it.)

Collected from the web

WARNING FileExisting-socat
socat is a networking tool that k8s uses for pod data exchange; when this warning appears, just install socat:

apt-get install socat

Worker node fails to join
Running kubeadm join on a worker node returns a timeout error:

root@worker2:~# kubeadm join 192.168.56.11:6443 --token wbryr0.am1n476fgjsno6wa --discovery-token-ca-cert-hash sha256:7640582747efefe7c2d537655e428faa6275dbaff631de37822eb8fd4c054807
[preflight] Running pre-flight checks
error execution phase preflight: couldn't validate the identity of the API Server: abort connecting to API servers after timeout of 5m0s

On the master node, run kubeadm token create --print-join-command to regenerate the join command, then rerun the newly printed command on the worker node.

The master's token expires after 24 hours; list the current tokens with:

kubeadm token list

Create a token that never expires:

kubeadm token create --ttl 0

Run this on the master node to obtain the discovery-token-ca-cert-hash value:

openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'

Rejoin the node:

kubeadm join 192.168.124.195:6443 --token 8xwg8u.lkj382k9ox58qkw9 \
--discovery-token-ca-cert-hash sha256:86291bed442dd1dcd6c26f2213208e10cab0f87763f44e0edf01fa670cd9e8b