V2EX  ›  Kubernetes

kubeadm multi-master deployment question: does high availability require at least 2 nodes online?

    caicaiwoshishui · 65 days ago · 995 views

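    I've been fighting with this for a whole day. I have three master nodes, with keepalived providing a virtual IP and lvsf enabled. If I shut down any one of them, the other two keep working fine, but as soon as I shut down two, the service becomes unavailable.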

    • The error is as follows
    [[email protected] ~]# kubectl get nodes
    
    The connection to the server 192.168.0.8:6443 was refused - did you specify the right host or port?
    [[email protected] ~]# netstat -ntlp |grep 6443
    
    

    Detailed logs

    • kube-apiserver
    [[email protected] ~]# docker ps -a |grep kube-api|grep -v pause
    0c1c0042b8c2   53224b502ea4                                        "kube-apiserver --ad…"   About a minute ago   Exited (1) 54 seconds ago                 k8s_kube-apiserver_kube-apiserver-master-1.host.com_kube-system_464df844856c9d5461cb184edc4974c9_45
    [[email protected] ~]# docker logs -f 0c1c0042b8c2
    I1120 14:25:26.120729       1 server.go:553] external host was not specified, using 192.168.0.11
    I1120 14:25:26.122152       1 server.go:161] Version: v1.22.3
    I1120 14:25:26.836619       1 shared_informer.go:240] Waiting for caches to sync for node_authorizer
    I1120 14:25:26.838689       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
    I1120 14:25:26.838721       1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
    I1120 14:25:26.840979       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
    I1120 14:25:26.841003       1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
    Error: context deadline exceeded
    
    • etcd: the error is RAFT NO LEADER
    [[email protected] ~]# docker ps -a |grep etcd
    dfd6026ae3fd   004811815584                                        "etcd --advertise-cl…"   3 minutes ago    Up 3 minutes                          k8s_etcd_etcd-master-1.host.com_kube-system_a23c864b52d59788909994fe31a97f5e_8
    13c6e65046d6   004811815584                                        "etcd --advertise-cl…"   7 minutes ago    Exited (2) 3 minutes ago              k8s_etcd_etcd-master-1.host.com_kube-system_a23c864b52d59788909994fe31a97f5e_7
    5ca2f134f743   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 22 minutes ago   Up 22 minutes                         k8s_POD_etcd-master-1.host.com_kube-system_a23c864b52d59788909994fe31a97f5e_1
    [[email protected] ~]# docker logs -n 10 13c6e65046d6
    {"level":"warn","ts":"2021-11-20T14:24:39.911Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"ad7fc708963cf6f3","rtt":"0s","error":"dial tcp 192.168.0.9:2380: i/o timeout"}
    {"level":"warn","ts":"2021-11-20T14:24:39.915Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"c68a49f4a0c3cea9","rtt":"0s","error":"dial tcp 192.168.0.10:2380: connect: no route to host"}
    {"level":"warn","ts":"2021-11-20T14:24:39.915Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"c68a49f4a0c3cea9","rtt":"0s","error":"dial tcp 192.168.0.10:2380: connect: no route to host"}
    {"level":"info","ts":"2021-11-20T14:24:40.658Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"cb18584c4f4dbfc is starting a new election at term 7"}
    {"level":"info","ts":"2021-11-20T14:24:40.658Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"cb18584c4f4dbfc became pre-candidate at term 7"}
    {"level":"info","ts":"2021-11-20T14:24:40.658Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"cb18584c4f4dbfc received MsgPreVoteResp from cb18584c4f4dbfc at term 7"}
    {"level":"info","ts":"2021-11-20T14:24:40.658Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"cb18584c4f4dbfc [logterm: 7, index: 3988] sent MsgPreVote request to ad7fc708963cf6f3 at term 7"}
    {"level":"info","ts":"2021-11-20T14:24:40.658Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"cb18584c4f4dbfc [logterm: 7, index: 3988] sent MsgPreVote request to c68a49f4a0c3cea9 at term 7"}
    {"level":"warn","ts":"2021-11-20T14:24:41.729Z","caller":"etcdhttp/metrics.go:166","msg":"serving /health false; no leader"}
    {"level":"warn","ts":"2021-11-20T14:24:41.729Z","caller":"etcdhttp/metrics.go:78","msg":"/health error","output":"{\"health\":\"false\",\"reason\":\"RAFT NO LEADER\"}","status-code":503}
    

    Conclusion

    Has etcd failed to elect a leader? Can't a single etcd member work on its own? Any advice would be appreciated.

    11 replies    2021-12-13 21:53:34 +08:00
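    suifengdang666 · #1 · 65 days ago · ❤️ 1
    To avoid split-brain, etcd uses the Raft algorithm, which only serves requests while more than half of the members are online. In other words, ⌊N/2⌋+1 members must be online before a leader can be elected.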
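
    To make the majority rule concrete, here is a small shell sketch. The function names `quorum` and `tolerated_failures` are made up for illustration and are not part of etcd. For a cluster of N members it computes how many must be online and how many can fail:

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names, not etcd tooling).

# quorum N: smallest majority of N members, i.e. floor(N/2) + 1.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# tolerated_failures N: members that may be offline while a majority remains.
tolerated_failures() {
    echo $(( $1 - $1 / 2 - 1 ))
}

for n in 1 2 3 4 5; do
    echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

    For the three masters in this thread that gives a quorum of 2 and one tolerated failure, which matches what the OP saw: one node down is fine, two down is not. It also shows that going from 3 to 4 members does not buy any extra tolerance.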
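    cs419 · #2 · 65 days ago · ❤️ 1
    That is simply how a highly available cluster is designed.
    While all nodes are alive, they take requests in turn and share the load.
    Once more than half of the nodes are down, the cluster refuses service.

    The reason is simple: the high-availability mechanism has been broken.
    Because it refuses service at that point, the cluster can work normally again once you repair the nodes.

    If it kept serving instead, and the incoming requests then overwhelmed the remaining nodes,
    the data could no longer be fully repaired.

    If you want a single node to be usable, start with a single node from the beginning and don't create a cluster.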
    limao693 · #3 · 65 days ago via iPhone · ❤️ 1
    Raft works normally as long as more than half of the members are up.
    chih758 · #4 · 65 days ago via Android
    In a test environment, use etcdctl member remove to delete two members from the cluster; it can then run as a single node.
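    caicaiwoshishui · OP · #5 · 64 days ago
    @cs419 Thanks a lot. A follow-up question: if more than half of the nodes go down and a restart doesn't bring them back, can new machines be added to the cluster? The problem is that kubectl no longer works and kubeadm can't reach the master nodes either, so how would that be done?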
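    caicaiwoshishui · OP · #6 · 64 days ago
    @chih758 I just tested this: I shut down 2 machines, leaving one running, and got a shell inside the container with docker exec -it.

    Configure the etcdctl certificates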
    sh-5.0# export ETCDCTL_API=3
    sh-5.0# alias etcdctl='etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key'
    sh-5.0# etcdctl member list

    Run
    sh-5.0# etcdctl member list

    {"level":"warn","ts":"2021-11-21T02:11:18.722Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003a4700/#initially=[https://127.0.0.1:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
    Error: context deadline exceeded

    It times out. In other words, etcd on the one remaining machine times out, and the docker container exits.
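
    On the timeout above: membership changes such as member remove are themselves committed through Raft, so they also need a majority, which would explain why etcdctl hangs once two of three members are gone. One commonly described way out for a lone surviving member is etcd's --force-new-cluster flag, which restarts it as a one-member cluster from its existing data. With kubeadm, etcd normally runs as a static pod, so the flag would go into the etcd manifest (usually /etc/kubernetes/manifests/etcd.yaml; treat that path as an assumption about your setup). The sketch below only shows the manifest edit. The helper name add_force_new_cluster and the manifest fragment are made up for illustration, and the function prints the result instead of changing any file:

```shell
#!/bin/sh
# Sketch only: assumes a kubeadm-style static pod manifest in which the etcd
# command list contains a line that is exactly "- etcd" plus indentation.

# add_force_new_cluster FILE: print FILE with "- --force-new-cluster"
# inserted right after the "- etcd" line. The input file is not modified.
add_force_new_cluster() {
    awk '
        { print }
        !done && /^[ \t]*- etcd$/ {
            match($0, /^[ \t]*/)    # measure the indentation of this line
            printf "%s- --force-new-cluster\n", substr($0, 1, RLENGTH)
            done = 1
        }
    ' "$1"
}

# Example with a trimmed, made-up manifest fragment.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
spec:
  containers:
  - command:
    - etcd
    - --data-dir=/var/lib/etcd
EOF
add_force_new_cluster "$manifest"
rm -f "$manifest"
```

    If you try this for real, back up the etcd data directory first, and remove the flag again once etcd is up, otherwise it will be applied on every restart. The other masters would then have to be re-joined as new members.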
    caicaiwoshishui · OP · #7 · 64 days ago via iPhone
    @suifengdang666 A question about production: for a k8s cluster created with kubeadm, is etcd split out and run separately, or do people use the etcd that kubeadm sets up?
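    suifengdang666 · #8 · 64 days ago
    @caicaiwoshishui The one kubeadm creates is fine. If you're worried that high load on the masters could destabilize etcd, you can build a separate etcd cluster on a few dedicated VMs.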
    pmispig · #9 · 64 days ago
    Deploy etcd and kube-apiserver separately, on different servers.
    0x208 · #10 · 42 days ago
    OP, are you looking for a job? Feel free to take a look at my hiring post.
    caicaiwoshishui · OP · #11 · 42 days ago
    @0x208 Is remote possible? I'm not in Beijing.