Project author: JK-97

Project description: A pod-level Kubernetes GPU exporter
Language: Go
Repository: git://github.com/JK-97/k8s-gpu-exporter.git
Created: 2020-07-12T14:10:33Z
Project community: https://github.com/JK-97/k8s-gpu-exporter

License: Apache License 2.0


k8s-gpu-exporter

Command flags

  • address : Address to listen on for web interface and telemetry.
  • kubeconfig : Absolute path to the kubeconfig file; by default the config is read from the ServiceAccount bound to the pod.
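
Since the project is written in Go, these flags map naturally onto the standard flag package plus client-go's config loading. The sketch below is illustrative only and is not the project's actual code: the default listen address :9445 is inferred from the example log output later in this document, and the helper name newClientset is made up.

    package main

    import (
        "flag"

        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "k8s.io/client-go/tools/clientcmd"
    )

    var (
        address    = flag.String("address", ":9445", "Address to listen on for web interface and telemetry.")
        kubeconfig = flag.String("kubeconfig", "", "Absolute path to the kubeconfig file; empty means use the in-cluster ServiceAccount.")
    )

    // newClientset builds a Kubernetes client from --kubeconfig when given,
    // otherwise from the ServiceAccount bound to the pod (in-cluster config).
    func newClientset() (*kubernetes.Clientset, error) {
        var (
            cfg *rest.Config
            err error
        )
        if *kubeconfig != "" {
            cfg, err = clientcmd.BuildConfigFromFlags("", *kubeconfig)
        } else {
            cfg, err = rest.InClusterConfig()
        }
        if err != nil {
            return nil, err
        }
        return kubernetes.NewForConfig(cfg)
    }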

Docker Build

Tips:
By default, after an nvidia-docker container starts, it already contains a symbolic link /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 whose target is a libnvidia-ml.so of the correct version; in that case you can jump straight to step 4.

If you do not have a correct version of libnvidia-ml.so in the container, fix it with the following steps.

  1. Find the libnvidia-ml.so on the host where you want to run k8s-gpu-exporter:

     find /usr/ -name libnvidia-ml.so

  2. Copy that libnvidia-ml.so into the project-dir/lib directory.

  3. Add the line COPY lib/libnvidia-ml.so /usr/lib/x86_64-linux-gnu/libnvidia-ml.so to docker/dockerfile, for example:

     ...
     COPY --from=build-env /build/k8s-gpu-exporter /app/k8s-gpu-exporter
     COPY lib/libnvidia-ml.so /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
     ...

  4. Run the Makefile:

     VERSION={YOUR_VERSION} make docker
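
The reason the image must contain a working libnvidia-ml.so is that the exporter talks to the driver through NVML. The snippet below is only a sketch of that interaction, using the NVIDIA go-nvml bindings (github.com/NVIDIA/go-nvml), which are not necessarily the bindings this project uses; it reproduces the kind of information seen in the exporter's logs (driver version, device count, running compute processes).

    package main

    import (
        "fmt"
        "log"

        "github.com/NVIDIA/go-nvml/pkg/nvml"
    )

    func main() {
        // Init loads libnvidia-ml.so.1 at runtime, which is why the library
        // must be present in (or copied into) the container image.
        if ret := nvml.Init(); ret != nvml.SUCCESS {
            log.Fatalf("NVML init failed: %v", nvml.ErrorString(ret))
        }
        defer nvml.Shutdown()

        version, _ := nvml.SystemGetDriverVersion()
        fmt.Println("SystemGetDriverVersion:", version)

        count, _ := nvml.DeviceGetCount()
        fmt.Printf("We have %d cards\n", count)

        for i := 0; i < count; i++ {
            dev, _ := nvml.DeviceGetHandleByIndex(i)
            procs, _ := dev.GetComputeRunningProcesses()
            fmt.Printf("GPU-%d DeviceGetComputeRunningProcesses: %d\n", i, len(procs))
            for _, p := range procs {
                fmt.Printf("pid: %d, usedMemory: %d\n", p.Pid, p.UsedGpuMemory)
            }
        }
    }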

Best Practices

If you already have Arena, use it to submit a training task.

  # Preparation
  # Label the GPU node
  $ kubectl label node {YOUR_NODE} k8s-node/nvidia_count={GPU_NUM}

  # First
  $ kubectl apply -f k8s-gpu-exporter.yaml

  # Second
  # Submit a deep learning job
  $ arena submit tf --name=style-transfer \
      --gpus=1 \
      --workers=1 \
      --workerImage=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/neural-style:gpu \
      --workingDir=/neural-style \
      --ps=1 \
      --psImage=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/style-transfer:ps \
      "python neural_style.py --styles /neural-style/examples/1-style.jpg --iterations 1000"

  # Third
  $ curl {HOST_IP}:{PORT}/metrics
  ...Omit...
  # HELP nvidia_gpu_used_memory Graphics used memory
  # TYPE nvidia_gpu_used_memory gauge
  nvidia_gpu_used_memory{gpu_node="dev-ms-7c22",gpu_pod_name="",minor_number="0",name="GeForce GTX 1660 SUPER",namepace_name="",uuid="GPU-a1460327-d919-1478-a68f-ef4cbb8515ac"} 3.0769152e+08
  nvidia_gpu_used_memory{gpu_node="dev-ms-7c22",gpu_pod_name="style-transfer-worker-0",minor_number="0",name="GeForce GTX 1660 SUPER",namepace_name="default",uuid="GPU-a1460327-d919-1478-a68f-ef4cbb8515ac"} 8.912896e+07
  ...Omit...

  # Fourth
  $ kubectl logs {YOUR_K8S_GPU_EXPORTER_POD}
  SystemGetDriverVersion: 450.36.06
  Not specify a config ,use default svc
  :9445
  We have 1 cards
  GPU-0 DeviceGetComputeRunningProcesses: 1
  pid: 3598, usedMemory: 89128960
  node: dev-ms-7c22 pod: style-transfer-worker-0, pid: 3598 usedMemory: 89128960
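
The nvidia_gpu_used_memory samples above are ordinary Prometheus gauges with per-pod labels. As a rough sketch only (using prometheus/client_golang, with the label set copied verbatim from the output above, including its namepace_name spelling, and the NVML update loop omitted), the metric could be defined and exposed roughly like this:

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // Gauge vector matching the metric name, help text, and labels shown in
    // the curl output above.
    var usedMemory = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "nvidia_gpu_used_memory",
            Help: "Graphics used memory",
        },
        []string{"gpu_node", "gpu_pod_name", "minor_number", "name", "namepace_name", "uuid"},
    )

    func main() {
        prometheus.MustRegister(usedMemory)

        // A real exporter would refresh these values from NVML and the pod
        // list on every scrape; one hard-coded sample is set here only to
        // show the shape of the metric.
        usedMemory.WithLabelValues(
            "dev-ms-7c22", "style-transfer-worker-0", "0",
            "GeForce GTX 1660 SUPER", "default",
            "GPU-a1460327-d919-1478-a68f-ef4cbb8515ac",
        ).Set(8.912896e+07)

        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":9445", nil))
    }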

Prometheus

Add the annotation prometheus.io/scrape: 'true' to the k8s-gpu-exporter pod (under metadata.annotations in k8s-gpu-exporter.yaml) so that Prometheus can automatically discover the metrics service.

You can then use the PromQL query nvidia_gpu_used_memory / nvidia_gpu_total_memory to see GPU memory usage.

[Screenshot: gpu_memory_usage]