
· 4 min read

Each Kubernetes node exposes statistics describing the current state of its image filesystem (imageFs), including available, capacity, and used bytes:

# kubectl get --raw "/api/v1/nodes/kind-worker/proxy/stats/summary" | grep imageFs -A 5
"imageFs": {
"time": "2024-02-26T14:40:12Z",
"availableBytes": 21507072000,
"capacityBytes": 31025332224,
"usedBytes": 541495296,
"inodesFree": 3668005,

From this output, imageFs currently reports:

  1. availableBytes: 21507072000
  2. capacityBytes: 31025332224
  3. usedBytes: 541495296
  4. inodesFree: 3668005

Kubelet does not record or compute these numbers itself; it asks the underlying container runtime through the CRI: https://github.com/kubernetes/cri-api/blob/c75ef5b/pkg/apis/runtime/v1/api.proto#L120-L136

service ImageService {
    // ListImages lists existing images.
    rpc ListImages(ListImagesRequest) returns (ListImagesResponse) {}
    // ImageStatus returns the status of the image. If the image is not
    // present, returns a response with ImageStatusResponse.Image set to
    // nil.
    rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse) {}
    // PullImage pulls an image with authentication config.
    rpc PullImage(PullImageRequest) returns (PullImageResponse) {}
    // RemoveImage removes the image.
    // This call is idempotent, and must not return an error if the image has
    // already been removed.
    rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse) {}
    // ImageFSInfo returns information of the filesystem that is used to store images.
    rpc ImageFsInfo(ImageFsInfoRequest) returns (ImageFsInfoResponse) {}
}

Since the CRI exposes this, we can use crictl to dig further, and indeed there is an imagefsinfo subcommand:

# crictl  imagefsinfo
{
  "status": {
    "timestamp": "1708958572632331985",
    "fsId": {
      "mountpoint": "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
    },
    "usedBytes": {
      "value": "541495296"
    },
    "inodesUsed": {
      "value": "18150"
    }
  }
}

The command reports 541495296 bytes used, which matches the Kubernetes number, but it does not explain how available and capacity are calculated. The output also contains an fsId (FilesystemIdentifier).

Digging into the kubelet source code:

https://github.com/kubernetes/kubernetes/blob/cc5362ebc17e1376fa79b510f7f354dbffe7f92e/pkg/kubelet/stats/cri_stats_provider.go#L388-L425

...
imageFsInfo, err := p.getFsInfo(fs.GetFsId())
if err != nil {
    return nil, nil, fmt.Errorf("get filesystem info: %w", err)
}
if imageFsInfo != nil {
    // The image filesystem id is unknown to the local node or there's
    // an error on retrieving the stats. In these cases, we omit those
    // stats and return the best-effort partial result. See
    // https://github.com/kubernetes/heapster/issues/1793.
    imageFsRet.AvailableBytes = &imageFsInfo.Available
    imageFsRet.CapacityBytes = &imageFsInfo.Capacity
    imageFsRet.InodesFree = imageFsInfo.InodesFree
    imageFsRet.Inodes = imageFsInfo.Inodes
}
...

So the filesystem information is looked up using the fsId returned by GetFsId(). Following the call into the getFsInfo function:

https://github.com/kubernetes/kubernetes/blob/cc5362ebc17e1376fa79b510f7f354dbffe7f92e/pkg/kubelet/stats/cri_stats_provider.go#L449

func (p *criStatsProvider) getFsInfo(fsID *runtimeapi.FilesystemIdentifier) (*cadvisorapiv2.FsInfo, error) {
    if fsID == nil {
        klog.V(2).InfoS("Failed to get filesystem info: fsID is nil")
        return nil, nil
    }
    mountpoint := fsID.GetMountpoint()
    fsInfo, err := p.cadvisor.GetDirFsInfo(mountpoint)
    if err != nil {
        msg := "Failed to get the info of the filesystem with mountpoint"
        if errors.Is(err, cadvisorfs.ErrNoSuchDevice) ||
            errors.Is(err, cadvisorfs.ErrDeviceNotInPartitionsMap) ||
            errors.Is(err, cadvisormemory.ErrDataNotFound) {
            klog.V(2).InfoS(msg, "mountpoint", mountpoint, "err", err)
        } else {
            klog.ErrorS(err, msg, "mountpoint", mountpoint)
            return nil, fmt.Errorf("%s: %w", msg, err)
        }
        return nil, nil
    }
    return &fsInfo, nil
}

getFsInfo calls fsID.GetMountpoint() to obtain the corresponding mount point. https://github.com/kubernetes/cri-api/blob/v0.25.16/pkg/apis/runtime/v1alpha2/api.pb.go#L7364

func (m *FilesystemIdentifier) GetMountpoint() string {
    if m != nil {
        return m.Mountpoint
    }
    return ""
}

Since that mount point is '/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs', we can compare it against the output of 'df' on my node:

# df -BKB
Filesystem 1kB-blocks Used Available Use% Mounted on
overlay 31025333kB 9502487kB 21506069kB 31% /
tmpfs 67109kB 0kB 67109kB 0% /dev
shm 67109kB 0kB 67109kB 0% /dev/shm
/dev/root 31025333kB 9502487kB 21506069kB 31% /var
tmpfs 16794874kB 9552kB 16785322kB 1% /run

The numbers for /var above match the earlier values almost exactly, so kubelet simply resolves the mount point from that path and reads the filesystem's capacity and available space from it.

"availableBytes": 21507072000, "capacityBytes": 31025332224,

· One min read

If Nginx forwards traffic with proxy_pass to a target referenced by a DNS name, and nothing special is done, the name is resolved only once. That means that if the DNS record changes later, Nginx keeps pointing at the old address.

The fix is to add a resolver directive and reference the target through a variable in proxy_pass, which forces Nginx to re-resolve the name.

A typical use case is headless Services inside Kubernetes.
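A minimal sketch of such a config (the resolver IP and the service name below are placeholders, not from the original note; 10.96.0.10 is a common kube-dns ClusterIP):

location / {
    # cluster DNS (kube-dns / CoreDNS) ClusterIP; adjust to your cluster
    resolver 10.96.0.10 valid=10s;
    # referencing a variable makes nginx re-resolve the name at request time
    set $backend http://my-headless-svc.default.svc.cluster.local:8080;
    proxy_pass $backend;
}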

Reference: https://rajrajhans.com/2022/06/force-dns-resolution-nginx-proxy/

A longer post walking through the relevant source code will follow later.

· 3 min read

With Multus, a Pod invokes multiple CNIs and ends up with multiple network interfaces, which makes NetworkPolicy somewhat risky: if enough CNIs are installed and each of them supports NetworkPolicy, their controllers can all end up doing the work. More importantly, most Multus setups use CNIs such as SR-IOV, Bridge, or Macvlan, which never implemented NetworkPolicy in the first place, so things get awkward when policy enforcement is needed.

There are related projects on the Multus side to solve this. The following project defines the interface: https://github.com/k8snetworkplumbingwg/multi-networkpolicy

It is used in OpenShift environments; the iptables-based implementation is here: https://github.com/openshift/multus-networkpolicy

The controller dynamically enters the target Pod and installs iptables rules to control inbound and outbound traffic.

The project's deploy.yaml can be applied directly, but the following parameters need to be adjusted:

  1. Adjust the args:
    - "--host-prefix=/host"
    # uncomment this if runtime is docker
    # - "--container-runtime=docker"
    - "--network-plugins=bridge"
    - "--v=9"
    - "--container-runtime-endpoint=/run/containerd/containerd.sock"
  2. If they are not needed, remove the custom iptables related volumes.

For (1), pay special attention to --network-plugins=bridge and --container-runtime-endpoint: the former must match the CNI plugin that Multus chains to, otherwise it will not work.

Next, deploy a dedicated MultiNetworkPolicy object; its usage is the same as a traditional NetworkPolicy:

apiVersion: k8s.cni.cncf.io/v1beta1
kind: MultiNetworkPolicy
metadata:
  name: test-network-policy
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/policy-for: bridge-network
spec:
  podSelector:
    matchLabels:
      app: debug
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - ipBlock:
            cidr: 10.10.0.5/24
  egress:
    - to:
        - ipBlock:
            cidr: 10.10.0.7/32

Once that is configured, containers matched by the policy should show rules like the following:

[142:11928] -A INPUT -i net1 -j MULTI-INGRESS
[478:40152] -A OUTPUT -o net1 -j MULTI-EGRESS
[0:0] -A MULTI-0-EGRESS -j MARK --set-xmark 0x0/0x30000
[0:0] -A MULTI-0-EGRESS -j MULTI-0-EGRESS-0-PORTS
[0:0] -A MULTI-0-EGRESS -j MULTI-0-EGRESS-0-TO
[0:0] -A MULTI-0-EGRESS -m mark --mark 0x30000/0x30000 -j RETURN
[0:0] -A MULTI-0-EGRESS -j DROP
[0:0] -A MULTI-0-EGRESS-0-PORTS -m comment --comment "no egress ports, skipped" -j MARK --set-xmark 0x10000/0x10000
[0:0] -A MULTI-0-EGRESS-0-TO -d 10.10.0.7/32 -o net1 -j MARK --set-xmark 0x20000/0x20000
[0:0] -A MULTI-0-INGRESS -j MARK --set-xmark 0x0/0x30000
[0:0] -A MULTI-0-INGRESS -j MULTI-0-INGRESS-0-PORTS
[0:0] -A MULTI-0-INGRESS -j MULTI-0-INGRESS-0-FROM
[0:0] -A MULTI-0-INGRESS -m mark --mark 0x30000/0x30000 -j RETURN
[0:0] -A MULTI-0-INGRESS -j DROP
[0:0] -A MULTI-0-INGRESS-0-FROM -s 10.10.0.0/24 -i net1 -j MARK --set-xmark 0x20000/0x20000
[0:0] -A MULTI-0-INGRESS-0-PORTS -m comment --comment "no ingress ports, skipped" -j MARK --set-xmark 0x10000/0x10000
[0:0] -A MULTI-EGRESS -o net1 -m comment --comment "policy:test-network-policy net-attach-def:default/bridge-network" -j MULTI-0-EGRESS
[0:0] -A MULTI-INGRESS -i net1 -m comment --comment "policy:test-network-policy net-attach-def:default/bridge-network" -j MULTI-0-INGRESS
COMMIT

Marks are used to decide whether a packet should be dropped: the PORTS chain sets 0x10000 and the FROM/TO chain sets 0x20000, so only packets carrying both bits (0x30000) RETURN early, while everything else falls through to DROP. Matching on both IP and port is supported.
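To inspect these rules yourself, a simple option (assuming the target Pod image ships the iptables binaries) is to dump them from inside the Pod:

kubectl exec -it <pod-name> -- iptables-save | grep MULTI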

· 3 min read

Setting the MTU of a Linux bridge is not as simple as on an ordinary NIC: by default the bridge adjusts its MTU automatically, replacing it with the smallest MTU among all of its slave interfaces. Looking at the code below, whenever a slave interface is added to the bridge:

int br_add_if(struct net_bridge *br, struct net_device *dev,
              struct netlink_ext_ack *extack)
{
    struct net_bridge_port *p;
    int err = 0;
    unsigned br_hr, dev_hr;
    bool changed_addr, fdb_synced = false;

    /* Don't allow bridging non-ethernet like devices. */
    if ((dev->flags & IFF_LOOPBACK) ||
        dev->type != ARPHRD_ETHER || dev->addr_len != ETH_ALEN ||
        !is_valid_ether_addr(dev->dev_addr))
        return -EINVAL;

    /* No bridging of bridges */
    if (dev->netdev_ops->ndo_start_xmit == br_dev_xmit) {
        NL_SET_ERR_MSG(extack,
                       "Can not enslave a bridge to a bridge");
        return -ELOOP;
    }

    /* Device has master upper dev */
    if (netdev_master_upper_dev_get(dev))
        return -EBUSY;

    /* No bridging devices that dislike that (e.g. wireless) */
    if (dev->priv_flags & IFF_DONT_BRIDGE) {
        NL_SET_ERR_MSG(extack,
                       "Device does not allow enslaving to a bridge");
        return -EOPNOTSUPP;
    }

    p = new_nbp(br, dev);
    if (IS_ERR(p))
        return PTR_ERR(p);

    call_netdevice_notifiers(NETDEV_JOIN, dev);

    err = dev_set_allmulti(dev, 1);
    if (err) {
        br_multicast_del_port(p);
        netdev_put(dev, &p->dev_tracker);
        kfree(p); /* kobject not yet init'd, manually free */
        goto err1;
    }

    err = kobject_init_and_add(&p->kobj, &brport_ktype, &(dev->dev.kobj),
                               SYSFS_BRIDGE_PORT_ATTR);
    if (err)
        goto err2;

    err = br_sysfs_addif(p);
    if (err)
        goto err2;

    err = br_netpoll_enable(p);
    if (err)
        goto err3;

    err = netdev_rx_handler_register(dev, br_get_rx_handler(dev), p);
    if (err)
        goto err4;

    dev->priv_flags |= IFF_BRIDGE_PORT;

    err = netdev_master_upper_dev_link(dev, br->dev, NULL, NULL, extack);
    if (err)
        goto err5;

    dev_disable_lro(dev);

    list_add_rcu(&p->list, &br->port_list);

    nbp_update_port_count(br);
    if (!br_promisc_port(p) && (p->dev->priv_flags & IFF_UNICAST_FLT)) {
        /* When updating the port count we also update all ports'
         * promiscuous mode.
         * A port leaving promiscuous mode normally gets the bridge's
         * fdb synced to the unicast filter (if supported), however,
         * `br_port_clear_promisc` does not distinguish between
         * non-promiscuous ports and *new* ports, so we need to
         * sync explicitly here.
         */
        fdb_synced = br_fdb_sync_static(br, p) == 0;
        if (!fdb_synced)
            netdev_err(dev, "failed to sync bridge static fdb addresses to this port\n");
    }

    netdev_update_features(br->dev);

    br_hr = br->dev->needed_headroom;
    dev_hr = netdev_get_fwd_headroom(dev);
    if (br_hr < dev_hr)
        update_headroom(br, dev_hr);
    else
        netdev_set_rx_headroom(dev, br_hr);

    if (br_fdb_add_local(br, p, dev->dev_addr, 0))
        netdev_err(dev, "failed insert local address bridge forwarding table\n");

    if (br->dev->addr_assign_type != NET_ADDR_SET) {
        /* Ask for permission to use this MAC address now, even if we
         * don't end up choosing it below.
         */
        err = dev_pre_changeaddr_notify(br->dev, dev->dev_addr, extack);
        if (err)
            goto err6;
    }

    err = nbp_vlan_init(p, extack);
    if (err) {
        netdev_err(dev, "failed to initialize vlan filtering on this port\n");
        goto err6;
    }

    spin_lock_bh(&br->lock);
    changed_addr = br_stp_recalculate_bridge_id(br);

    if (netif_running(dev) && netif_oper_up(dev) &&
        (br->dev->flags & IFF_UP))
        br_stp_enable_port(p);
    spin_unlock_bh(&br->lock);

    br_ifinfo_notify(RTM_NEWLINK, NULL, p);

    if (changed_addr)
        call_netdevice_notifiers(NETDEV_CHANGEADDR, br->dev);

    br_mtu_auto_adjust(br);
    br_set_gso_limits(br);

    kobject_uevent(&p->kobj, KOBJ_ADD);

    return 0;

err6:
    if (fdb_synced)
        br_fdb_unsync_static(br, p);
    list_del_rcu(&p->list);
    br_fdb_delete_by_port(br, p, 0, 1);
    nbp_update_port_count(br);
    netdev_upper_dev_unlink(dev, br->dev);
err5:
    dev->priv_flags &= ~IFF_BRIDGE_PORT;
    netdev_rx_handler_unregister(dev);
err4:
    br_netpoll_disable(p);
err3:
    sysfs_remove_link(br->ifobj, p->dev->name);
err2:
    br_multicast_del_port(p);
    netdev_put(dev, &p->dev_tracker);
    kobject_put(&p->kobj);
    dev_set_allmulti(dev, -1);
err1:
    return err;
}

The key call above is br_mtu_auto_adjust, shown below; it essentially finds the minimum MTU among the ports and applies it:

void br_mtu_auto_adjust(struct net_bridge *br)
{
    ASSERT_RTNL();

    /* if the bridge MTU was manually configured don't mess with it */
    if (br_opt_get(br, BROPT_MTU_SET_BY_USER))
        return;

    /* change to the minimum MTU and clear the flag which was set by
     * the bridge ndo_change_mtu callback
     */
    dev_set_mtu(br->dev, br_mtu_min(br));
    br_opt_toggle(br, BROPT_MTU_SET_BY_USER, false);
}
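The behaviour is easy to reproduce by hand with iproute2; a quick sketch (interface names are arbitrary):

# create a bridge and two veth pairs with different MTUs
ip link add br0 type bridge
ip link add veth0 type veth peer name veth0-p
ip link add veth1 type veth peer name veth1-p
ip link set veth0 mtu 1500
ip link set veth1 mtu 1400

# enslaving the ports goes through br_add_if -> br_mtu_auto_adjust,
# so the bridge MTU follows the smallest slave MTU (1400 here)
ip link set veth0 master br0
ip link set veth1 master br0
ip link show br0

# once the MTU is configured manually, BROPT_MTU_SET_BY_USER is set
# and the automatic adjustment stops on later port changes
ip link set br0 mtu 1500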

· 8 min read

This post documents how to set up a simple KubeVirt environment on Linux (Ubuntu 22.04).

Environment Setup

KVM

Install the client tools used to check the QEMU/KVM status:

sudo apt install libvirt-clients

Use virt-host-validate to validate the host:

$ virt-host-validate qemu
QEMU: Checking for hardware virtualization : PASS
QEMU: Checking if device /dev/kvm exists : PASS
QEMU: Checking if device /dev/kvm is accessible : FAIL (Check /dev/kvm is world writable or you are in a group that is allowed to access it)
QEMU: Checking if device /dev/vhost-net exists : PASS
QEMU: Checking if device /dev/net/tun exists : PASS
QEMU: Checking for cgroup 'cpu' controller support : PASS
QEMU: Checking for cgroup 'cpuacct' controller support : PASS
QEMU: Checking for cgroup 'cpuset' controller support : PASS
QEMU: Checking for cgroup 'memory' controller support : PASS
QEMU: Checking for cgroup 'devices' controller support : WARN (Enable 'devices' in kernel Kconfig file or mount/enable cgroup controller in your system)
QEMU: Checking for cgroup 'blkio' controller support : PASS
QEMU: Checking for device assignment IOMMU support : WARN (No ACPI DMAR table found, IOMMU either disabled in BIOS or not supported by this hardware platform)
QEMU: Checking for secure guest support : WARN (Unknown if this platform has Secure Guest support)

There is one failure in the middle: /dev/kvm is not accessible. Fix it by installing qemu-kvm and adjusting the permissions, as shown below.
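Concretely (the re-login reminder is an addition, not from the original note):

sudo apt install qemu-kvm
sudo usermod -aG kvm $USER
# the new group membership only applies to new sessions;
# log out and back in (or run `newgrp kvm`) before re-checking

Re-running the validation afterwards shows the check passing: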

$ virt-host-validate qemu
QEMU: Checking for hardware virtualization : PASS
QEMU: Checking if device /dev/kvm exists : PASS
QEMU: Checking if device /dev/kvm is accessible : PASS
QEMU: Checking if device /dev/vhost-net exists : PASS
QEMU: Checking if device /dev/net/tun exists : PASS
QEMU: Checking for cgroup 'cpu' controller support : PASS
QEMU: Checking for cgroup 'cpuacct' controller support : PASS
QEMU: Checking for cgroup 'cpuset' controller support : PASS
QEMU: Checking for cgroup 'memory' controller support : PASS
QEMU: Checking for cgroup 'devices' controller support : WARN (Enable 'devices' in kernel Kconfig file or mount/enable cgroup controller in your system)
QEMU: Checking for cgroup 'blkio' controller support : PASS
QEMU: Checking for device assignment IOMMU support : WARN (No ACPI DMAR table found, IOMMU either disabled in BIOS or not supported by this hardware platform)
QEMU: Checking for secure guest support : WARN (Unknown if this platform has Secure Guest support)

Kubernetes

Bring up a Kubernetes cluster with minikube (using the docker driver to avoid a second layer of virtualization):

$ minikube start --cni=flannel

Once the cluster is ready, install the kubevirt-operator:

$ export VERSION=$(curl -s https://api.github.com/repos/kubevirt/kubevirt/releases | grep tag_name | grep -v -- '-rc' | sort -r | head -1 | awk -F': ' '{print $2}' | sed 's/,//' | xargs)
$ echo $VERSION
$ kubectl create -f https://github.com/kubevirt/kubevirt/releases/download/${VERSION}/kubevirt-operator.yaml
namespace/kubevirt created
customresourcedefinition.apiextensions.k8s.io/kubevirts.kubevirt.io created
priorityclass.scheduling.k8s.io/kubevirt-cluster-critical created
clusterrole.rbac.authorization.k8s.io/kubevirt.io:operator created
serviceaccount/kubevirt-operator created
role.rbac.authorization.k8s.io/kubevirt-operator created
rolebinding.rbac.authorization.k8s.io/kubevirt-operator-rolebinding created
clusterrole.rbac.authorization.k8s.io/kubevirt-operator created
clusterrolebinding.rbac.authorization.k8s.io/kubevirt-operator created
deployment.apps/virt-operator created

The version at the time of this experiment was v1.1.0-alpha.0. After installation, check the resources in the kubevirt namespace:

$ kubectl -n kubevirt get pods
NAME READY STATUS RESTARTS AGE
virt-operator-57f9fb965d-5lnqf 1/1 Running 0 46m
virt-operator-57f9fb965d-f5zg4 1/1 Running 0 46m

Next, create the KubeVirt custom resource:

$ kubectl create -f https://github.com/kubevirt/kubevirt/releases/download/${VERSION}/kubevirt-cr.yaml

Once applied, an object named kubevirt is created (the CRD is kubevirt, short name kv), and the operator reacts to it by creating the KubeVirt service Pods:

$ kubectl -n kubevirt get kv kubevirt -o yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  annotations:
    kubevirt.io/latest-observed-api-version: v1
    kubevirt.io/storage-observed-api-version: v1
  creationTimestamp: "2023-10-10T14:35:55Z"
  finalizers:
  - foregroundDeleteKubeVirt
  generation: 2
  name: kubevirt
  namespace: kubevirt
  resourceVersion: "1490"
  uid: bc621d93-4910-4b1f-b3c8-f8f1f4e27a38
spec:
  certificateRotateStrategy: {}
  configuration:
    developerConfiguration: {}
  customizeComponents: {}
  imagePullPolicy: IfNotPresent
  workloadUpdateStrategy: {}
$ kubectl -n kubevirt get pods
NAME READY STATUS RESTARTS AGE
virt-api-77f8d679fc-hntws 1/1 Running 0 49m
virt-controller-6689488456-4jtv8 1/1 Running 0 48m
virt-controller-6689488456-68hnz 1/1 Running 0 48m
virt-handler-psc4w 1/1 Running 0 48m

This is essentially the default configuration; the corresponding API, Controller, and Handler components have all been created to handle subsequent operations.

virtctl

Fetch the matching version of virtctl using the official commands:

VERSION=$(kubectl get kubevirt.kubevirt.io/kubevirt -n kubevirt -o=jsonpath="{.status.observedKubeVirtVersion}")
ARCH=$(uname -s | tr A-Z a-z)-$(uname -m | sed 's/x86_64/amd64/') || windows-amd64.exe
echo ${ARCH}
curl -L -o virtctl https://github.com/kubevirt/kubevirt/releases/download/${VERSION}/virtctl-${VERSION}-${ARCH}
chmod +x virtctl
sudo install virtctl /usr/local/bin
-> % virtctl version
Client Version: version.Info{GitVersion:"v1.1.0-alpha.0", GitCommit:"67902ed9de43d7a0b94aa72b8fd7f48f31ca4285", GitTreeState:"clean", BuildDate:"2023-09-18T10:45:14Z", GoVersion:"go1.19.9", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{GitVersion:"v1.1.0-alpha.0", GitCommit:"67902ed9de43d7a0b94aa72b8fd7f48f31ca4285", GitTreeState:"clean", BuildDate:"2023-09-18T12:03:45Z", GoVersion:"go1.19.9", Compiler:"gc", Platform:"linux/arm64"}

Info: the official docs mention that virtctl can also be installed through the kubectl krew plugin manager (kubectl krew install virt), but it currently does not support darwin-arm64 (macOS M1/M2).

Installing a VM

Deploy the first VM using the official example manifest:

$ kubectl apply -f https://kubevirt.io/labs/manifests/vm.yaml
virtualmachine.kubevirt.io/testvm created
$ kubectl get vm
NAME AGE STATUS READY
testvm 7s Stopped False
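For reference, that manifest defines a VirtualMachine roughly like the following (an abridged sketch, not the verbatim upstream file; note spec.running: false, which is why the VM shows up as Stopped):

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: testvm
spec:
  running: false
  template:
    spec:
      domain:
        devices:
          disks:
            - name: containerdisk
              disk:
                bus: virtio
          interfaces:
            - name: default
              masquerade: {}
        resources:
          requests:
            memory: 64Mi
      networks:
        - name: default
          pod: {}
      volumes:
        - name: containerdisk
          containerDisk:
            image: quay.io/kubevirt/cirros-container-disk-demo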

By default, creating the VM object does not mean the VM is running. Use virtctl to start it:

$ virtctl start testvm
VM testvm was scheduled to start

Once the VM starts, a corresponding Pod is deployed into the cluster:

$ kubectl get pods -o wide

Now let's look at how this Pod is structured.

First, log in with virtctl console testvm and check the VM's IP:

$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast qlen 1000
link/ether 52:54:00:0c:00:55 brd ff:ff:ff:ff:ff:ff
inet 10.0.2.2/24 brd 10.0.2.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe0c:55/64 scope link
valid_lft forever preferred_lft forever
$ ip r
default via 10.0.2.1 dev eth0
10.0.2.0/24 dev eth0 src 10.0.2.2

The IP is 10.0.2.2 and the gateway is 10.0.2.1. Now exec into the corresponding Pod and take a look:

$ kubectl exec -it virt-launcher-testvm-pnn4j -- bash
bash-5.1$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: eth0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 12:37:77:cf:6d:63 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.244.0.10/24 brd 10.244.0.255 scope global eth0
valid_lft forever preferred_lft forever
3: k6t-eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
link/ether 02:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
inet 10.0.2.1/24 brd 10.0.2.255 scope global k6t-eth0
valid_lft forever preferred_lft forever
4: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel master k6t-eth0 state UP group default qlen 1000
link/ether d6:0e:c5:6f:41:f1 brd ff:ff:ff:ff:ff:ff
bash-5.1$

Here the Pod's k6t-eth0 interface holds the IP 10.0.2.1, and below it there is a tap0 interface configured with master k6t-eth0, so we can infer that k6t-eth0 is a Linux bridge and tap0 is a port under that bridge. The following commands confirm this:


bash-5.1$ ls /sys/class/net/k6t-eth0/brif
tap0
bash-5.1$ ls /sys/class/net/k6t-eth0/bridge/
ageing_time group_fwd_mask multicast_last_member_count multicast_query_response_interval nf_call_arptables root_port vlan_protocol
bridge_id hash_elasticity multicast_last_member_interval multicast_query_use_ifaddr nf_call_ip6tables stp_state vlan_stats_enabled
default_pvid hash_max multicast_membership_interval multicast_router nf_call_iptables tcn_timer vlan_stats_per_port
flush hello_time multicast_mld_version multicast_snooping no_linklocal_learn topology_change
forward_delay hello_timer multicast_querier multicast_startup_query_count priority topology_change_detected
gc_timer max_age multicast_querier_interval multicast_startup_query_interval root_id topology_change_timer
group_addr multicast_igmp_version multicast_query_interval multicast_stats_enabled root_path_cost vlan_filtering
bash-5.1$

k6t-eth0 indeed carries the various bridge settings, and brif lists tap0. In practice, tap0 is the device KVM attaches to the VM when it is created, so it is hooked up to the VM's eth0: think of it as one big pipe, with packets going in one end and coming out the other. The finer details still require reading the KubeVirt interfaces and networks documentation, which seems to provide several network modes for different purposes; comparing them and studying the implementation details is left for another time.

· One min read

Delete a specific line:

sed '/^keywords:/d' input > output

Delete every line from the first match to the end of the file:

sed '/^keywords/,$d' input > output

Combine with find to modify many files at once.

Delete the matching line from every file:

find . -type f -exec sed -i '' '/^authors:/d' {} +

Append a new line after a match; the newline needs special handling:

find . -type f -exec sed -i '' '/^title/a\
authors: hwchiu\
' {} +

Bulk-replace https://hackmd.io/_uploads with ./assets/:

find *.md -type f -exec sed -i '' -e 's/https:\/\/hackmd\.io\/_uploads\//\.\/assets\//g' {} +

When a large number of files need to be renamed, the rename tool does it quickly. The example below globs for matching file names and then renames every read-notes to reading-notes with a regular expression:

rename 's/read-notes/reading-notes/' *read-notes-*

· One min read

After removing the landing page per the documentation and using the blog as the homepage, the navbar link was always highlighted. On closer inspection, the link was being given a navbar__link--active class. It turns out there is an upstream issue for this; per that issue, adding activeBaseRegex: '^/$' to the items entry fixes it.

The final result:

      {
        to: '/',
        label: '短篇筆記',
        position: 'left',
        activeBaseRegex: '^/$',
      },

· One min read

When there are multiple GCP accounts to log in with, for example one personal account and one service account, gcloud provides commands to list and switch between them:

$ gcloud config configurations list

NAME IS_ACTIVE ACCOUNT PROJECT COMPUTE_DEFAULT_ZONE COMPUTE_DEFAULT_REGION
default False [email protected] first-project
name2 False [email protected]
name3 True [email protected]

To switch:

$ gcloud config configurations activate name2

After switching, subsequent gcloud commands use the new configuration. If you use GKE, you also need to run gcloud container clusters get-credentials again to refresh the KUBECONFIG entry.
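For example (cluster name, region, and project are placeholders):

$ gcloud container clusters get-credentials my-cluster --region us-central1 --project my-project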

· 3 min read

How to install Redis-Cluster using Bitnami's Helm chart.

Using the Kustomize + Helm approach (ArgoCD):

$ cat kustomization.yaml
helmCharts:
  - name: redis-cluster
    includeCRDs: false
    valuesFile: redis.yaml
    releaseName: redis-cluster
    namespace: production
    version: 9.0.5
    repo: https://charts.bitnami.com/bitnami

$ cat redis.yaml

global:
  storageClass: standard-rwo
existingSecret: redis-cluster
existingSecretPasswordKey: password
redis:
  resources:
    limits:
      cpu: 2
      memory: 2Gi
    requests:
      cpu: 0.1
      memory: 0.5Gi
  podAnnotations:
    'xxxxx': 'yyyyy'

Deployed this way, password authentication is enabled, and the password comes from the key password of the Secret object redis-cluster in the same namespace.
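To see the password that clients will need, read it back from that Secret (assuming it already exists in the production namespace used above):

$ kubectl -n production get secret redis-cluster -o jsonpath='{.data.password}' | base64 -d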

If you want to run without password authentication, set:

usePassword: false

Note: if password authentication was already in place, changing this field has no effect on a running Redis Cluster; extra steps are required, and deleting the PVCs and redeploying is the simple brute-force way.

This chart deploys a Redis Cluster: by default the StatefulSet runs three master/replica pairs, and two Services are created:

redis-cluster            ClusterIP   10.2.5.248   <none>   6379/TCP             213d
redis-cluster-headless   ClusterIP   None         <none>   6379/TCP,16379/TCP   213d

Also note that kubectl port-forward is not a workable way to connect to redis-cluster: during the cluster handshake, Redis returns the IPs of the other Pods, all listening on the same port, so unless you can create local virtual interfaces and run a separate kubectl port-forward for every Pod, access will break.

For example:

$ kubectl port-forward --address 0.0.0.0 pod/redis-cluster-0 6379:6379
Forwarding from 0.0.0.0:6379 -> 6379
$ redis-benchmark -n 1 --cluster
Cluster has 3 master nodes:

Master 0: 1020525d25c7ad16e786a98e1eb7419d609b8847 10.4.0.119:6379
Could not connect to Redis at 10.4.0.119:6379: Connection refused

As shown, after connecting through the port-forward, the follow-up connections are redirected to other Pods and fail. For simple usage it is smoother to deploy a Pod containing the redis CLI tools into the same namespace and operate through kubectl exec:

$ kubectl run --image=hwchiu/netutils test
$ kubectl exec -it test -- bash
root@test:/# redis-benchmark -h redis-cluster -q -c 20 -n 30000
PING_INLINE: 44977.51 requests per second
PING_BULK: 48154.09 requests per second
SET: 45317.22 requests per second
GET: 47169.81 requests per second
INCR: 50251.26 requests per second
LPUSH: 48465.27 requests per second
LPOP: 41265.48 requests per second
SADD: 37878.79 requests per second
SPOP: 49833.89 requests per second
LPUSH (needed to benchmark LRANGE): 51724.14 requests per second
LRANGE_100 (first 100 elements): 43041.61 requests per second
LRANGE_300 (first 300 elements): 35842.29 requests per second
LRANGE_500 (first 450 elements): 36014.41 requests per second
LRANGE_600 (first 600 elements): 33259.42 requests per second
MSET (10 keys): 42796.01 requests per second

· 2 min read

In a Helm chart, objects can be wrapped in if/else template logic that decides whether the object is rendered and installed into the target cluster at all; most of the time this is driven by enable/disable style parameters in Values, as in the sketch below.
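A minimal sketch of that common pattern (the serviceMonitor.enabled value name is only an illustration):

{{- if .Values.serviceMonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
...
{{- end }}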

But what if whether to install the object depends on the Kubernetes version, especially when an API has been removed in newer releases? How do you write one Helm chart that stays compatible with both? Take the recently removed PSP (PodSecurityPolicy) object as an example.

  1. Maintain two versions of the Helm chart: ship a new chart version for newer Kubernetes that drops the PSP object and sets a minimum kubeVersion, leaving old clusters unsupported there.
  2. Use Helm's built-in .Capabilities.APIVersions.Has to check whether the target cluster's API resources include the required version.

Take kube-prometheus-stack as an example: its psp-clusterrole.yaml begins with the following:

{{- if and .Values.prometheus.enabled .Values.global.rbac.create .Values.global.rbac.pspEnabled }}
{{- if .Capabilities.APIVersions.Has "policy/v1beta1/PodSecurityPolicy" }}
kind: ClusterRole
...
{{- end }}
{{- end }}

透過 .Capabilities.APIVersions.Has 語法去判斷該物件是否支援,若支援則安裝否則跳掉,這機制帶來的好處就是可以打造出一個兼容更多版本 K8s 叢集的 Helm Chart,但是實務上真的需要這樣控管?還是應該要用不同版本的來管理會更好應該就見仁見智。