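Background
Our k8s version is 1.25.6 and our services are being containerized on k8s. After processes moved from virtual machines into containers, operations engineers were baffled when they ran commands such as free -m and top to troubleshoot: plenty of memory appeared to be free, yet the pod's container was OOM-killed; or many idle CPU cores were shown, yet the process ran slowly. The resource view we see is that of the physical machine, not the limits we set on the containers in the pod, which gets in the way of troubleshooting for both developers and operations.
So why is the resource view that operations sees still the physical machine's?
Containers use cgroups to limit resources such as CPU, memory, and swap, but a container is not fully isolated: it shares the kernel with the host and can therefore see some host information. On Linux, the /proc directory holds many virtual files that expose kernel and runtime information. /proc/meminfo, for example, reports memory usage and state: total memory, available memory, used memory, and so on. When free -m runs inside a container, the /proc/meminfo it reads is not cgroup-aware and still reports the host's values, so it shows the physical machine's memory.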
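To make the mismatch concrete, here is a minimal sketch that compares what /proc/meminfo reports with the limit the cgroup enforces. The cgroup file locations (memory.max for cgroup v2, memory/memory.limit_in_bytes for cgroup v1) are the usual defaults and are an assumption about your nodes; the helper names are made up for this illustration.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cgroup_memory_limit_bytes(root="/sys/fs/cgroup"):
    """Return the cgroup memory limit in bytes, or None if unlimited or unknown."""
    candidates = [
        Path(root) / "memory.max",                       # cgroup v2 (assumed path)
        Path(root) / "memory" / "memory.limit_in_bytes",  # cgroup v1 (assumed path)
    ]
    for path in candidates:
        try:
            raw = path.read_text().strip()
        except OSError:
            continue
        if raw == "max":
            return None
        value = int(raw)
        # cgroup v1 reports a huge number when no limit is set (heuristic cutoff)
        return None if value >= 2 ** 60 else value
    return None


if __name__ == "__main__":
    try:
        total_kb = parse_meminfo(Path("/proc/meminfo").read_text()).get("MemTotal")
    except OSError:
        total_kb = None
    print("MemTotal seen through /proc/meminfo (kB):", total_kb)
    print("cgroup memory limit (bytes):", cgroup_memory_limit_bytes())
```

Run inside a container without LXCFS, the first number is the host's total memory while the second is the limit that actually triggers the OOM killer.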
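Now that we know why the container resource view is not isolated, note that beyond the confusion it also causes real pain points in practice:
- 1. nginx, for example, sets its number of workers automatically from the CPU core count, which here is the host's.
- 2. A JVM program sizes its heap automatically from the system memory, so the process fails to start or is frequently OOM-killed while running.
- 3. Leaking too much host information may endanger the security of the physical machine.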
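The first pain point can be illustrated with a little arithmetic. The sketch below derives a CPU count from a cgroup v2 cpu.max value ("quota period", or "max period" when unlimited) and compares it with the host core count. Rounding the quota up to whole cores is one plausible convention chosen for this example, not necessarily what any particular tool does, and the function names are hypothetical.

```python
import math


def parse_cpu_max(text):
    """Parse a cgroup v2 cpu.max value such as '50000 100000' or 'max 100000'."""
    quota, period = text.split()
    return (None if quota == "max" else int(quota)), int(period)


def effective_cpus(cpu_max_text, host_cpus):
    """CPU count a container can really use, given cpu.max and the host core count."""
    quota, period = parse_cpu_max(cpu_max_text)
    if quota is None or period <= 0:
        return host_cpus  # no limit set: the host value is the right answer
    return max(1, min(host_cpus, math.ceil(quota / period)))


if __name__ == "__main__":
    # A pod limited to 500m CPU on a 64-core node
    print(effective_cpus("50000 100000", 64))
```

A program that sizes itself from the host's 64 cores would start far more workers than this limit can serve.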
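How do we isolate the container resource view? The Linux Containers (LXC) community recognized this problem long ago and developed LXCFS (Linux Containers File System) to solve it.
Let's look at how LXCFS works.
How LXCFS works
LXCFS is a small virtual file system implemented with FUSE (Filesystem in Userspace), intended to make a Linux container feel more like a virtual machine. It started as a side project of LXC but can be used by any runtime.
LXCFS makes sure the information provided by key files in procfs is specific to the container, for example: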
/proc/cpuinfo
/proc/diskstats
/proc/meminfo
/proc/stat
/proc/swaps
/proc/uptime
/proc/slabinfo
/sys/devices/system/cpu/online
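LXCFS adapts this information to the container, so that the displayed values (for example /proc/uptime) truly reflect how long the container has been running rather than the host.
In other words, LXCFS exposes a virtual file system on the host (under /var/lib/lxcfs), and its key files are mounted over the corresponding paths under /proc and /sys inside the container. Processes in the container then read these files through LXCFS, which applies its own logic and calculations to present a virtual view, so the container sees the resources it is actually limited to rather than those of the host. The flow is as follows:
- 1. free -m is run in the container and reads /proc/meminfo.
- 2. Because /proc/meminfo is a mount, the read actually goes to /var/lib/lxcfs/proc/meminfo (see below), which triggers LXCFS.
- 3. The read goes through the glibc system call into the VFS layer, which hands it to the FUSE kernel module.
- 4. FUSE calls back into the user-space LXCFS file system implementation, which obtains the container's cgroup information.
- 5. Based on the container ID, LXCFS reads the container's cgroup and computes its actual limited memory, CPU, and other values. The result returned to the user is the cgroup-limited resource view.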
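The last step can be pictured with a simplified sketch. LXCFS itself is written in C and handles many more fields and edge cases; the function below is only an illustration, under the assumption that total, free, and available memory are derived from the cgroup limit and usage while the remaining lines are passed through unchanged.

```python
def container_meminfo(host_text, limit_bytes, usage_bytes):
    """Rewrite a host meminfo text so the totals reflect the cgroup limit."""
    limit_kb = limit_bytes // 1024
    used_kb = min(usage_bytes // 1024, limit_kb)
    free_kb = limit_kb - used_kb
    overrides = {
        "MemTotal": limit_kb,
        "MemFree": free_kb,
        "MemAvailable": free_kb,  # simplification: ignores reclaimable cache
    }
    out = []
    for line in host_text.splitlines():
        name = line.split(":", 1)[0]
        if name in overrides:
            out.append(f"{name + ':':<16}{overrides[name]:>8} kB")
        else:
            out.append(line)  # fields we do not virtualize are passed through
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    host = "MemTotal:       65536000 kB\nMemFree:        40000000 kB\nCached:           123456 kB\n"
    # 256 MiB limit, 2 MiB currently in use
    print(container_meminfo(host, 256 * 1024 * 1024, 2 * 1024 * 1024), end="")
```

With a 256 MiB limit the rewritten MemTotal is 262144 kB, which is the figure a tool like free would then display.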
Deploying LXCFS on a machine
a. Install lxcfs
yum install meson fuse-devel fuse cmake help2man fuse3 fuse3-devel -y
git clone https://github.com/lxc/lxcfs
cd lxcfs
meson setup -Dinit-script=systemd --prefix=/usr build/
meson compile -C build/
meson install -C build/
b. Start lxcfs
mkdir -p /var/lib/lxcfs
lxcfs /var/lib/lxcfs
c. Run a test container
docker run -it -m 256m --memory-swap 256m --cpus=1 \
-v /var/lib/lxcfs/proc/cpuinfo:/proc/cpuinfo:rw \
-v /var/lib/lxcfs/proc/diskstats:/proc/diskstats:rw \
-v /var/lib/lxcfs/proc/meminfo:/proc/meminfo:rw \
-v /var/lib/lxcfs/proc/stat:/proc/stat:rw \
-v /var/lib/lxcfs/proc/swaps:/proc/swaps:rw \
-v /var/lib/lxcfs/proc/uptime:/proc/uptime:rw \
-v /var/lib/lxcfs/proc/slabinfo:/proc/slabinfo:rw \
-v /var/lib/lxcfs/sys/devices/system/cpu:/sys/devices/system/cpu:rw \
ubuntu:18.04 /bin/bash
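Once the container is running, run the following commands to confirm it works
1. uptime #container uptime
2. free -m #memory
3. lscpu #number of online CPU cores, or cat /proc/cpuinfo
How do we add resource view isolation to pods in a k8s environment? Let's take a look.
Running LXCFS in a k8s environment
Steps:
- 1. First, the lxcfs process must run on every node; we use a DaemonSet for this.
- 2. Next, mount the node's /sys/fs/cgroup, /usr/lib64, and /usr/local into the lxcfs container, and expose the virtual file system at /var/lib/lxcfs/ in the lxcfs container to the physical machine through a hostPath mount.
- 3. Finally, write the pod YAML so that /var/lib/lxcfs/ on the node is mounted into the pod's containers via hostPath. That completes resource view isolation for k8s containers with lxcfs.
a. Build the lxcfs image
a.1 Directory layout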
tree .
.
├── Dockerfile
├── build.sh
└── lxcfs-lxcfs-5.0.4.tar.gz
a.2 Dockerfile
# or specify your own base image
FROM centos:7.9
# install
RUN yum install meson fuse-devel fuse cmake help2man fuse3 fuse3-devel git -y
RUN git clone https://github.com/lxc/lxcfs /lxcfs
# each RUN starts a fresh shell, so set the build directory with WORKDIR instead of cd
WORKDIR /lxcfs
RUN meson setup -Dinit-script=systemd --prefix=/usr build/
RUN meson compile -C build/
RUN meson install -C build/
# run
RUN mkdir -p /var/lib/lxcfs
CMD ["sh", "-c", "lxcfs /var/lib/lxcfs"]
a.3 build.sh: build the image
#!/bin/bash
source /etc/profile
docker build -t yourharbor.domain.com/centos/7.9/lxcfs/5.0.4/lxcfs .
docker push yourharbor.domain.com/centos/7.9/lxcfs/5.0.4/lxcfs
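The lxcfs image is now built. Next, let's see how to use it.
b. Run the lxcfs DaemonSet YAML
Use the lxcfs image built above, mount the node's files into the pod, and mount /var/lib/lxcfs/ back onto the node, as in the YAML below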
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
  labels:
    app: lxcfs
  name: lxcfs
  namespace: default
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: lxcfs
  template:
    metadata:
      labels:
        app: lxcfs
    spec:
      containers:
      - image: yourharbor.domain.com/centos/7.9/lxcfs/5.0.4/lxcfs
        imagePullPolicy: Always
        name: lxcfs
        resources: {}
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /sys/fs/cgroup
          name: cgroup
        - mountPath: /var/lib/lxcfs
          mountPropagation: Bidirectional
          name: lxcfs
        - mountPath: /usr/local
          name: usr-local
        - mountPath: /usr/lib64
          name: usr-lib64
      hostPID: true
      imagePullSecrets:
      - name: your-docker-token
      restartPolicy: Always
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: your-taint-key
        operator: Exists
      volumes:
      - hostPath:
          path: /sys/fs/cgroup
          type: ""
        name: cgroup
      - hostPath:
          path: /usr/local
          type: ""
        name: usr-local
      - hostPath:
          path: /usr/lib64
          type: ""
        name: usr-lib64
      - hostPath:
          path: /var/lib/lxcfs
          type: DirectoryOrCreate
        name: lxcfs
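After applying the YAML above, the lxcfs DaemonSet pod on some nodes may fail to start with the following error
Error: failed to generate container "974c6c0465adae1a244e3416b3e053ba2dccb0cbd123c2d02317c9301e3f83d0" spec: failed to apply OCI options: failed to stat "/var/lib/lxcfs": stat /var/lib/lxcfs: transport endpoint is not connected
Fix (run on the affected node to clear the stale FUSE mount left behind by an earlier lxcfs process)
umount /var/lib/lxcfs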
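If this happens on more than a handful of nodes, the check can be scripted. The sketch below is an assumption about how one might detect the condition: a stat on a dead FUSE mount point fails with ENOTCONN ("transport endpoint is not connected"), and only in that case is umount invoked. The helper name is made up for this example.

```python
import errno
import os
import subprocess


def is_stale_fuse_mount(path):
    """True if stat() on path fails with ENOTCONN, the symptom of a dead FUSE mount."""
    try:
        os.stat(path)
    except OSError as exc:
        return exc.errno == errno.ENOTCONN
    return False


if __name__ == "__main__":
    mount_point = "/var/lib/lxcfs"
    if is_stale_fuse_mount(mount_point):
        # Same command as the manual fix above; requires root on the node
        subprocess.run(["umount", mount_point], check=True)
        print("unmounted stale mount:", mount_point)
    else:
        print("no stale mount detected at", mount_point)
```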
c. Verify: Deployment pod YAML definition
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      volumes:
      - hostPath:
          path: /var/lib/lxcfs/proc/cpuinfo
          type: ""
        name: lxcfs-proc-cpuinfo
      - hostPath:
          path: /var/lib/lxcfs/proc/diskstats
          type: ""
        name: lxcfs-proc-diskstats
      - hostPath:
          path: /var/lib/lxcfs/proc/meminfo
          type: ""
        name: lxcfs-proc-meminfo
      - hostPath:
          path: /var/lib/lxcfs/proc/stat
          type: ""
        name: lxcfs-proc-stat
      - hostPath:
          path: /var/lib/lxcfs/proc/swaps
          type: ""
        name: lxcfs-proc-swaps
      - hostPath:
          path: /var/lib/lxcfs/proc/uptime
          type: ""
        name: lxcfs-proc-uptime
      - hostPath:
          path: /var/lib/lxcfs/proc/loadavg
          type: ""
        name: lxcfs-proc-loadavg
      - hostPath:
          path: /var/lib/lxcfs/sys/devices/system/cpu/online
          type: ""
        name: lxcfs-sys-devices-system-cpu-online
      containers:
      - name: web
        image: httpd:2.4.32
        imagePullPolicy: Always
        resources:
          requests:
            memory: "256Mi"
            cpu: "500m"
          limits:
            memory: "256Mi"
            cpu: "500m"
        volumeMounts:
        - mountPath: /proc/cpuinfo
          name: lxcfs-proc-cpuinfo
          readOnly: true
        - mountPath: /proc/meminfo
          name: lxcfs-proc-meminfo
          readOnly: true
        - mountPath: /proc/diskstats
          name: lxcfs-proc-diskstats
          readOnly: true
        - mountPath: /proc/stat
          name: lxcfs-proc-stat
          readOnly: true
        - mountPath: /proc/swaps
          name: lxcfs-proc-swaps
          readOnly: true
        - mountPath: /proc/uptime
          name: lxcfs-proc-uptime
          readOnly: true
        - mountPath: /proc/loadavg
          name: lxcfs-proc-loadavg
          readOnly: true
        - mountPath: /sys/devices/system/cpu/online
          name: lxcfs-sys-devices-system-cpu-online
          readOnly: true
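The pod now has container resource view isolation through lxcfs.
There is one problem, though: copying and pasting these settings is tolerable for one or two containers, but with thousands of containers this kind of repetitive work is something you, as a follower of the KISS principle, surely cannot stand.
Is there a way around it? We can implement an admission webhook (Admission Control), which further validates a request or adds default parameters after it has been authorized. Those who came before us have already built what we had in mind, so there is no need to reinvent the wheel. See lxcfs-admission-webhook.
lxcfs-admission-webhook: automatic mounting of /proc and /sys/ into containers by injection
lxcfs-admission-webhook implements a dynamic admission webhook, more precisely a mutating webhook: it watches pod creation and patches the pod, injecting the mappings between lxcfs and the paths in the container into the pod's YAML at creation time, so the mounts are added automatically.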
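The real project is written in Go; the Python sketch below only illustrates the idea of the patch, under the assumption that the webhook answers with a JSONPatch that appends the same hostPath volumes and read-only volumeMounts written by hand earlier. The function names and the file list are chosen for this example and are not taken from the project's source.

```python
import base64
import json

# Files served by lxcfs on the node under /var/lib/lxcfs (same set as the manual YAML)
LXCFS_FILES = [
    "/proc/cpuinfo", "/proc/diskstats", "/proc/meminfo", "/proc/stat",
    "/proc/swaps", "/proc/uptime", "/proc/loadavg",
    "/sys/devices/system/cpu/online",
]


def volume_name(path):
    """'/proc/cpuinfo' -> 'lxcfs-proc-cpuinfo'."""
    return "lxcfs-" + path.strip("/").replace("/", "-")


def build_patch(pod):
    """Build JSONPatch operations that add lxcfs volumes and mounts to a pod dict."""
    spec = pod.get("spec", {})
    volumes = [{"name": volume_name(p), "hostPath": {"path": "/var/lib/lxcfs" + p}}
               for p in LXCFS_FILES]
    patch = []
    if "volumes" not in spec:
        patch.append({"op": "add", "path": "/spec/volumes", "value": volumes})
    else:
        patch.extend({"op": "add", "path": "/spec/volumes/-", "value": v} for v in volumes)
    for i, container in enumerate(spec.get("containers", [])):
        mounts = [{"name": volume_name(p), "mountPath": p, "readOnly": True}
                  for p in LXCFS_FILES]
        base = f"/spec/containers/{i}/volumeMounts"
        if "volumeMounts" not in container:
            patch.append({"op": "add", "path": base, "value": mounts})
        else:
            patch.extend({"op": "add", "path": base + "/-", "value": m} for m in mounts)
    return patch


def admission_response(uid, patch):
    """Wrap the patch the way an AdmissionReview response carries it (base64 JSON)."""
    encoded = base64.b64encode(json.dumps(patch).encode()).decode()
    return {"uid": uid, "allowed": True, "patchType": "JSONPatch", "patch": encoded}


if __name__ == "__main__":
    pod = {"spec": {"containers": [{"name": "web", "image": "httpd:2.4.32"}]}}
    print(json.dumps(build_patch(pod), indent=2))
```

The API server applies the returned patch before the pod is persisted, which is why the workload's own YAML no longer needs the volume stanzas.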
Using it is also quite KISS: only a single label on the namespace is needed.
Let's see how to use it.
1. Prepare the lxcfs-admission-webhook image
Build the binary with go build
git clone git@github.com:denverdino/lxcfs-admission-webhook.git
cd lxcfs-admission-webhook
# build lxcfs-admission-webhook; this is an old Go project, so convert it to Go modules first
export GOPROXY=https://goproxy.cn,direct
go mod init v1
go mod tidy
CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o lxcfs-admission-webhook
chmod +x lxcfs-admission-webhook
Dockerfile
FROM alpine:latest
ADD lxcfs-admission-webhook /lxcfs-admission-webhook
ENTRYPOINT ["./lxcfs-admission-webhook"]
Build the image
docker build -t yourharbor.domain.com/alpine/lxcfs-admission-webhook:v1 .
docker push yourharbor.domain.com/alpine/lxcfs-admission-webhook:v1
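2. Run the lxcfs-admission-webhook pod
Every cluster has its own CA certificate, so when deploying lxcfs-admission-webhook to a different cluster, do the following before applying the YAML
2.1 Directory layout
tree .
.
├── dp.yaml #lxcfs-admission-webhook deployment
├── mutatingwebhook.yaml #MutatingWebhookConfiguration
├── svc.yaml #webhook svc
└── webhook-create-signed-cert.sh #creates the certificate that `lxcfs-admission-webhook` depends on
2.2 Modify webhook-create-signed-cert.sh
Note: our k8s version is fairly new and lxcfs-admission-webhook has not been updated in recent years, so we modified the certificate generation script webhook-create-signed-cert.sh from GitHub to work with newer k8s versions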
#!/bin/bash
set -e
usage() {
cat <<EOF
Generate certificate suitable for use with a sidecar-injector webhook service.
This script uses k8s' CertificateSigningRequest API to generate a
certificate signed by k8s CA suitable for use with sidecar-injector webhook
services. This requires permissions to create and approve CSR. See
https://kubernetes.io/docs/tasks/tls/managing-tls-in-a-cluster for
detailed explanation and additional instructions.
The server key/cert k8s CA cert are stored in a k8s secret.
usage: ${0} [OPTIONS]
The following flags are required.
--service Service name of webhook.
--namespace Namespace where webhook service and secret reside.
--secret Secret name for CA certificate and server certificate/key pair.
EOF
exit 1
}
while [[ $# -gt 0 ]]; do
case ${1} in
--service)
service="$2"
shift
;;
--secret)
secret="$2"
shift
;;
--namespace)
namespace="$2"
shift
;;
*)
usage
;;
esac
shift
done
[ -z ${service} ] && service=lxcfs-admission-webhook-svc
[ -z ${secret} ] && secret=lxcfs-admission-webhook-certs
[ -z ${namespace} ] && namespace=default
if [ ! -x "$(command -v openssl)" ]; then
echo "openssl not found"
exit 1
fi
csrName=${service}.${namespace}
tmpdir=$(mktemp -d)
echo "creating certs in tmpdir ${tmpdir} "
cat <<EOF >> ${tmpdir}/csr.conf
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = ${service}
DNS.2 = ${service}.${namespace}
DNS.3 = ${service}.${namespace}.svc
EOF
openssl genrsa -out ${tmpdir}/server-key.pem 2048
#openssl req -new -key ${tmpdir}/server-key.pem -subj "/CN=${service}.${namespace}.svc" -out ${tmpdir}/server.csr -config ${tmpdir}/csr.conf
openssl req -new -key ${tmpdir}/server-key.pem -subj "/CN=system:node:${service}.${namespace}.svc;/O=system:nodes" -out ${tmpdir}/server.csr -config ${tmpdir}/csr.conf
# clean-up any previously created CSR for our service. Ignore errors if not present.
kubectl delete csr ${csrName} -n ${namespace} 2>/dev/null || true
# create server cert/key CSR and send to k8s API
cat <<EOF | kubectl -n ${namespace} create -f -
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
name: ${csrName}
spec:
groups:
- system:authenticated
signerName: kubernetes.io/kubelet-serving
request: $(cat ${tmpdir}/server.csr | base64 | tr -d '\n')
usages:
- digital signature
- key encipherment
- server auth
EOF
# verify CSR has been created
while true; do
kubectl get csr ${csrName}
if [ "$?" -eq 0 ]; then
break
fi
done
# approve and fetch the signed certificate
kubectl certificate approve ${csrName}
# verify certificate has been signed
for x in $(seq 10); do
serverCert=$(kubectl get csr ${csrName} -o jsonpath='{.status.certificate}')
if [[ ${serverCert} != '' ]]; then
break
fi
sleep 1
done
if [[ ${serverCert} == '' ]]; then
echo "ERROR: After approving csr ${csrName}, the signed certificate did not appear on the resource. Giving up after 10 attempts." >&2
exit 1
fi
echo ${serverCert} | openssl base64 -d -A -out ${tmpdir}/server-cert.pem
# create the secret with CA cert and server cert/key
kubectl create secret generic ${secret} \
--from-file=key.pem=${tmpdir}/server-key.pem \
--from-file=cert.pem=${tmpdir}/server-cert.pem \
--dry-run=client -o yaml |
kubectl -n ${namespace} apply -f -
We changed the subject in the certificate request command to /CN=system:node:${service}.${namespace}.svc;/O=system:nodes and fixed the --namespace bug
Then run on a k8s master node: kubectl create ns lxcfs ; sh webhook-create-signed-cert.sh --namespace lxcfs
2.2 Get the cluster CA certificate
kubectl config view --raw --flatten --minify -o jsonpath='{.clusters[].cluster.certificate-authority-data}'
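Before pasting the output into caBundle, it can be worth checking that the value survived copy and paste. The sketch below is a hypothetical helper, not part of the project: it verifies that the string is valid base64 and decodes to a PEM certificate block, without validating the certificate itself.

```python
import base64
import binascii


def is_pem_ca_bundle(value):
    """True if value is valid base64 that decodes to a PEM certificate block."""
    try:
        decoded = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False
    text = decoded.decode("ascii", errors="replace").strip()
    return (text.startswith("-----BEGIN CERTIFICATE-----")
            and text.endswith("-----END CERTIFICATE-----"))


if __name__ == "__main__":
    # Dummy PEM wrapper used only to exercise the check
    sample = base64.b64encode(
        b"-----BEGIN CERTIFICATE-----\nMIIB\n-----END CERTIFICATE-----\n"
    ).decode()
    print(is_pem_ca_bundle(sample))
    print(is_pem_ca_bundle("not base64!!"))
```

A truncated or mistyped caBundle typically shows up later as TLS errors when the API server calls the webhook, so a cheap check up front saves debugging time.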
2.3 Put the CA certificate into the caBundle field of mutatingwebhook.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: mutating-lxcfs-admission-webhook-cfg
  labels:
    app: lxcfs-admission-webhook
webhooks:
  - name: mutating.lxcfs-admission-webhook.aliyun.com
    # admissionregistration.k8s.io/v1beta1 was removed in k8s 1.22; v1 requires the two fields below.
    # ["v1beta1"] assumes the webhook binary speaks the v1beta1 AdmissionReview format.
    sideEffects: None
    admissionReviewVersions: ["v1beta1"]
    clientConfig:
      service:
        name: lxcfs-admission-webhook-svc
        namespace: lxcfs
        path: "/mutate"
      caBundle: LS0tLS1CRUdJxiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvakNDQWVhZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJek1EY3hOekEwTXpNek5Gb1hEVE16TURjeE5EQTBNek16TkZvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTlVZCjd4SThpcXZtbEtNN0FDTUFDY0huRWxxTXgyakR1b3JkWk81cUNGYTBNalROOXNqZHhUbHNNTlMrUHpuOUxPSkMKZ2d5TW90MGNPaW0zQTd2bllRYzFCY2I3UHFLOGpjS0U2a0E5MWVyNlpNSHU0c3ZXRXEybjVyMlIvcnY5NUR2eQpIRzlzTUJnenQrWUFJNlR6OGJNazhnMzJZR1BJejEvTTJmalBCa292bVJ3U0c1UkVIYWVFNW1TdDBRMnJheGJQCmtEU0pDSEErVlV3QThuekpFRVpwdkIxbUZ6MytXKzhrOUpIYlFtSW40TzhNaCtYYXlGc2Vab2g5SC9kVERkSXUKN0JXVG5pcmg5YkNWZzJhSDJidG03ZVpSY2s1V3IrM0QxcmUrc1FxWnpVdlhFSzBQYTk4MENGd3BYTVhsenlFdQpqNkhQRjZzOUhmV0gxOVdJMUdrQ0F3RUFBYU5aTUZjd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZBQVVicWVyaklyUDRmOFV0ZjErUzRERzVSWStNQlVHQTFVZEVRUU8KTUF5Q0NtdDFZbVZ5Ym1WMFpYTXdEUVlKS29aSWh2Y05BUUVMQlFBRGdnRUJBTGx0OHBELzVtMnhVclJSdUJIdQpaODFKbnpDSzB6Y2ZhbHRROXFiWkFQb2syT1R6eTQrclh6SHQ4VzVHN01YVmN6TXVoZnh0OXFSeWVLekM3bmtICnpJSnIxcmxPbkkwaXdNcHJFeDlNQkpBTnBNdWNwN3ljaE82RGlOQ01ocFAwMXdDbWVENTBsVUladlIrMHhUbHEKaGVZdTFZS3Eza3Q0dzNuWVUxUGszUGU1Q3NweFNqd0NKNVF0RHpyUFY4bE5JaHNMZjRHV2U2bDN0N2J5ck9wWApsUWJiMXovazNRTDRTU3pqcEdkQVRmUnVmRmsrbk1RVkFCSmJwVWp5aHNFMlg1TjRvLzlKWFVpZVhLNlYxOHNiCnVtVUlLYlkySGIyTHNISXEveTBHeHpITnpGTndEeEdGNnNSWFF5SkFYVS9tekNWRWczbEhaWUlpUU9wdkc2VdfsZXFVPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0twx==
    rules:
      - operations: [ "CREATE" ]
        apiGroups: ["core", ""]
        apiVersions: ["v1"]
        resources: ["pods"]
    namespaceSelector:
      matchLabels:
        lxcfs-admission-webhook: enabled
2.4 dp.yaml for lxcfs-admission-webhook
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lxcfs-admission-webhook-deployment
  labels:
    app: lxcfs-admission-webhook
  namespace: lxcfs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: lxcfs-admission-webhook
  template:
    metadata:
      labels:
        app: lxcfs-admission-webhook
    spec:
      imagePullSecrets:
        - name: your-docker-token
      containers:
        - name: lxcfs-admission-webhook
          image: yourharbor.domain.com/alpine/lxcfs-admission-webhook:v1
          imagePullPolicy: IfNotPresent
          args:
            - -tlsCertFile=/etc/webhook/certs/cert.pem
            - -tlsKeyFile=/etc/webhook/certs/key.pem
            - -alsologtostderr
            - -v=4
            - 2>&1
          volumeMounts:
            - name: webhook-certs
              mountPath: /etc/webhook/certs
              readOnly: true
      volumes:
        - name: webhook-certs
          secret:
            secretName: lxcfs-admission-webhook-certs
2.5 svc.yaml
apiVersion: v1
kind: Service
metadata:
  namespace: lxcfs
  name: lxcfs-admission-webhook-svc
  labels:
    app: lxcfs-admission-webhook
spec:
  ports:
    - port: 443
      targetPort: 443
  selector:
    app: lxcfs-admission-webhook
3. Verify: enable injection
Enable lxcfs for the default namespace
kubectl label namespace default lxcfs-admission-webhook=enabled
Deploy the Deployment
cd lxcfs-admission-webhook
kubectl apply -f deployment/web.yaml
Log in to the container and run free
$ kubectl get pod
NAME                                                 READY   STATUS    RESTARTS   AGE
lxcfs-admission-webhook-deployment-f4bdd6f66-5wrlg   1/1     Running   0          8m29s
lxcfs-pqs2d                                          1/1     Running   0          55m
lxcfs-zfh99                                          1/1     Running   0          55m
web-7c5464f6b9-6zxdf                                 1/1     Running   0          8m10s
web-7c5464f6b9-nktff                                 1/1     Running   0          8m10s
$ kubectl exec -ti web-7c5464f6b9-6zxdf -- sh
# free
             total       used       free     shared    buffers     cached
Mem:        262144       2744     259400          0          0        312
-/+ buffers/cache:       2432     259712
Swap:            0          0          0
#
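The total of 262144 in the free output is the 256Mi memory limit expressed in KiB, which is what confirms the view is now the container's. A minimal sketch of the conversion, assuming only binary suffixes (Ki, Mi, Gi, Ti) or a plain byte count are used; the helper is illustrative and does not cover every Kubernetes quantity format:

```python
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4}


def quantity_to_bytes(quantity):
    """Convert a quantity such as '256Mi' to bytes (binary suffixes only)."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)  # no suffix: already a byte count


if __name__ == "__main__":
    # free prints KiB by default, so divide the byte count by 1024
    print(quantity_to_bytes("256Mi") // 1024)  # 262144
```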
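Summary
To be clear, what we have implemented is isolation between the container's resource view and the physical machine's resource view, not the pod's.
With the container resource view isolated, things look much more comfortable, and it helps a lot with troubleshooting, service startup, and security. Go try it out. Follow DevOpSec for weekly practical content, and let's improve together.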