Jobs Running at a Snail's Pace, and the Cause Turned Out to Be...
A walkthrough of troubleshooting extremely slow job execution in a k8s environment.

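Background
One day, colleagues on the business side reported that Job tasks created by a core service in the production k8s cluster were running extremely slowly...
The service logs showed that, in its early phase, the job mainly calls the data exchange service, obtains the URL of a file in OSS object storage, and downloads that file. Only after the download completes does it move on to its other tasks.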
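Because the slow phase is the download, one way to narrow it down is to repeat a single download by hand and see which stage takes the time. The sketch below is only an illustration and not part of the original investigation: `download_timing` and `CURL_TIMING_FORMAT` are hypothetical names, and the URL has to be replaced with a real file URL returned by the data exchange service.

```shell
#!/bin/sh
# Hypothetical helper: time each phase of one download with curl.

# curl --write-out template: DNS lookup, TCP connect, time to first byte,
# total time (all in seconds) and average download speed (bytes per second).
CURL_TIMING_FORMAT='dns=%{time_namelookup}s connect=%{time_connect}s first_byte=%{time_starttransfer}s total=%{time_total}s speed=%{speed_download}B/s\n'

download_timing() {
  # -s hides the progress bar, -o /dev/null discards the downloaded body.
  curl -s -o /dev/null -w "$CURL_TIMING_FORMAT" "$1"
}

# Example (placeholder URL, not run here):
# download_timing "https://<oss-endpoint>/<object>"
```

A large `dns=` value points at name resolution, a large gap between `connect=` and `first_byte=` points at the server side, and a low `speed=` with small values elsewhere points at bandwidth.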
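Analysis and Retrospective
"The service was running fine, so how did it get slow just from being used?" xx, sitting next to me, opened with the soul-searching question.
Since this looked like a fault-type problem, the first instinct was to follow the rule of "if something is failing, read the logs" and check the logs of every service involved.
None of the logs contained errors, yet the symptom was plain: log output was slow and kept stalling.
So the first step was to determine whether the network between the services was at fault.
After some thought, focusing on network-related factors and working by elimination, the possible causes were narrowed down to the following: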
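 - Node and pod network checks
 - Pod NIC
 - NIC comparison between pods scheduled to different nodes
 - NIC inbound and outbound bandwidth in different scenarios
 - DNS resolution
 - Overall comparison of node resources
 - OSS server-side rate limiting and similar policies
 - Whether the service's own code had changed, and so on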
 
The candidate causes were then checked one by one. Some of the specific checks are listed below; the rest are omitted.
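Network bandwidth test
For bandwidth testing, tools such as ethtool and iperf make it easy to inspect NIC information and to measure outbound and inbound bandwidth. A packet capture tool can be run alongside them.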
# ethtool
Settings for eth0:
	Supported ports: [ ]
	Supported link modes:   Not reported
	Supported pause frame use: No
	Supports auto-negotiation: No
	Advertised link modes:  Not reported
	Advertised pause frame use: No
	Advertised auto-negotiation: No
	Speed: 10000Mb/s
	Duplex: Full
	Port: Twisted Pair
	PHYAD: 0
	Transceiver: internal
	Auto-negotiation: off
	MDI-X: Unknown
Cannot get wake-on-lan settings: Operation not permitted
	Link detected: yes
# iperf
Server listening on TCP port 5001
TCP window size: 12.0 MByte (default)
------------------------------------------------------------
[  4] local 10.244.155.34 port 5001 connected with 10.244.0.196 port 42148
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-2.0000 sec  1.62 GBytes  6.97 Gbits/sec
[  4] 2.0000-4.0000 sec  1.15 GBytes  4.93 Gbits/sec
[  4] 4.0000-6.0000 sec  1.15 GBytes  4.93 Gbits/sec
[  4] 6.0000-8.0000 sec  1.14 GBytes  4.91 Gbits/sec
[  4] 8.0000-10.0000 sec  1.14 GBytes  4.91 Gbits/sec
[  4] 10.0000-12.0000 sec  1.14 GBytes  4.92 Gbits/sec
[  4] 12.0000-14.0000 sec  1.14 GBytes  4.89 Gbits/sec
[  4] 14.0000-16.0000 sec  1.14 GBytes  4.90 Gbits/sec
[  4] 16.0000-18.0000 sec  1.14 GBytes  4.88 Gbits/sec
[  4] 18.0000-20.0000 sec  1.14 GBytes  4.88 Gbits/sec
[  4] 20.0000-22.0000 sec  1.14 GBytes  4.89 Gbits/sec
[  4] 22.0000-24.0000 sec  1.14 GBytes  4.89 Gbits/sec
[  4] 24.0000-26.0000 sec  1.13 GBytes  4.87 Gbits/sec
[  4] 26.0000-28.0000 sec  1.14 GBytes  4.88 Gbits/sec
[  4] 28.0000-30.0000 sec  1.14 GBytes  4.91 Gbits/sec
[  4] 30.0000-32.0000 sec  1.14 GBytes  4.88 Gbits/sec
[  4] 32.0000-34.0000 sec  1.14 GBytes  4.89 Gbits/sec
[  4] 34.0000-36.0000 sec  1.14 GBytes  4.91 Gbits/sec
[  4] 36.0000-38.0000 sec  1.14 GBytes  4.88 Gbits/sec
[  4] 38.0000-40.0000 sec  1.14 GBytes  4.91 Gbits/sec
[  4] 40.0000-42.0000 sec  1.14 GBytes  4.90 Gbits/sec
[  4] 42.0000-44.0000 sec  1.14 GBytes  4.90 Gbits/sec
[  4] 44.0000-46.0000 sec  1.14 GBytes  4.90 Gbits/sec
[  4] 46.0000-48.0000 sec  1.14 GBytes  4.90 Gbits/sec
[  4] 48.0000-50.0000 sec  1.15 GBytes  4.93 Gbits/sec
[  4] 50.0000-52.0000 sec  1.14 GBytes  4.91 Gbits/sec
[  4] 52.0000-54.0000 sec  1.14 GBytes  4.92 Gbits/sec
[  4] 54.0000-56.0000 sec  1.14 GBytes  4.90 Gbits/sec
[  4] 56.0000-58.0000 sec  1.14 GBytes  4.88 Gbits/sec
[  4] 58.0000-60.0000 sec  1.14 GBytes  4.89 Gbits/sec
[  4] 60.0000-60.0201 sec  13.6 MBytes  5.69 Gbits/sec
[  4] 0.0000-60.0201 sec  34.7 GBytes  4.97 Gbits/sec
Result: nothing abnormal found
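A 60-second report like the one above is easier to compare across pods and nodes once it is reduced to a few numbers. The following is a minimal sketch, not the commands used in this investigation: `summarize_iperf` is a hypothetical helper, and the iperf invocations in the comments are assumed from the shape of the output (TCP port 5001, 2-second intervals, 60 seconds).

```shell
#!/bin/sh
# Assumed invocations (iperf 2), shown for context only:
#   server pod:  iperf -s -i 2
#   client pod:  iperf -c <server-pod-ip> -t 60 -i 2

# Hypothetical helper: read an iperf report on stdin and print the number of
# interval samples plus min / max / average bandwidth in Gbits/sec.
summarize_iperf() {
  awk '
    $NF == "Gbits/sec" {
      split($(NF - 5), t, "-")            # interval field, e.g. "2.0000-4.0000"
      len = t[2] - t[1]
      if (n == 0) step = len              # first sample defines the interval length
      d = len - step
      if (d < -0.001 || d > 0.001) next   # skip the partial last interval and the total line
      bw = $(NF - 1) + 0
      if (n == 0 || bw < min) min = bw
      if (n == 0 || bw > max) max = bw
      sum += bw
      n++
    }
    END {
      if (n > 0)
        printf "samples=%d min=%.2f max=%.2f avg=%.2f Gbits/sec\n", n, min, max, sum / n
    }
  '
}

# Example (not run here): save the server output to a file, then
# summarize_iperf < iperf-server.log
```

Only lines reported in Gbits/sec are counted, which is enough for a 10000Mb/s link like the one shown by ethtool. A report in Mbits/sec would need a unit conversion first.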
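DNS resolution test
For DNS resolution, dig and nslookup were used to test a public domain name, an internal domain name, and an in-cluster domain name, and the results were compared. For example: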
www.baidu.com
data.ssgeek.com
data-download.default.svc.cluster.local
Result: nothing abnormal found
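The comparison between the three kinds of names can be repeated with a small loop around dig. This is a sketch under assumptions rather than the exact procedure used here: `query_time_ms` and `dns_latency` are hypothetical helper names, and the default of 5 rounds is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: extract the latency from the ";; Query time: N msec"
# line that dig prints in its statistics footer.
query_time_ms() {
  awk '/^;; Query time:/ { print $4 }'
}

# Hypothetical helper: resolve every name ROUNDS times (default 5, arbitrary)
# and print one "name latency" line per query.
dns_latency() (
  rounds="${ROUNDS:-5}"
  for name in "$@"; do
    i=1
    while [ "$i" -le "$rounds" ]; do
      printf '%s %s ms\n' "$name" "$(dig "$name" | query_time_ms)"
      i=$((i + 1))
    done
  done
)

# Example (run inside the affected pod, not run here):
# dns_latency www.baidu.com data.ssgeek.com data-download.default.svc.cluster.local
```

Running the loop several times per name separates a consistently slow resolver from an occasional timeout, which a single dig call can hide.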