Project author: nbuchwitz

Project description:
Icinga check command for Proxmox VE via API
Language: Python
Repository: git://github.com/nbuchwitz/check_pve.git
Created: 2018-01-09T16:45:57Z
Project community: https://github.com/nbuchwitz/check_pve

License: GNU General Public License v2.0


check_pve

Icinga check command for Proxmox VE via API


Setup

Requirements

This check command depends on Python 3 and the following modules:

  • requests
  • argparse
  • packaging

Installation on Debian / Ubuntu

  1. apt install python3 python3-requests python3-packaging

Installation on Rocky / Alma Linux 9

  1. yum install python3 python3-requests python3-packaging

Installation on FreeBSD

  1. pkg install python3 py39-requests py39-packaging

Installation from requirements file

  1. pip3 install -r requirements.txt

Installation as Docker container

  1. docker build -t check_pve .

After this, you can start the container like so:

  1. docker run -d --name check_pve --rm check_pve

The container keeps running on its own, so none of the requirements listed above need to be installed on the host (useful for environments where that is not possible).
Running a check is as simple as:

  1. docker exec check_pve python check_pve.py ....rest of the default arguments listed below....

Create an API user in Proxmox VE

Create a role named Monitoring and assign the necessary privileges:

  1. pveum roleadd Monitoring
  2. pveum rolemod Monitoring --privs VM.Monitor,Sys.Audit,Sys.Modify,Datastore.Audit,VM.Audit

Create a user named monitoring and set password:

  1. pveum useradd monitoring@pve --comment "The ICINGA 2 monitoring user"

Create an API token named monitoring for the user monitoring with backend pve:

  1. pveum user token add monitoring@pve monitoring

Please save the token secret now, as there is no way to retrieve it later.

Assign role Monitoring to token monitoring and the user monitoring@pve:

  1. pveum acl modify / --roles Monitoring --user 'monitoring@pve'
  2. pveum acl modify / --roles Monitoring --tokens 'monitoring@pve!monitoring'

You can now use the check command like this: ./check_pve.py -u monitoring@pve -t monitoring=abcdef12-3456-7890-abcd-deadbeef1234 ...
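For illustration, here is a minimal sketch (not part of check_pve, which handles this internally) of how a Proxmox VE API token in the TOKEN_ID=TOKEN_SECRET format is sent as an Authorization header. The user, token name, and secret are the placeholder values from the example above; the endpoint URL is a made-up example.

```python
# Sketch with placeholder values: how a PVE API token is sent to the API.
user = "monitoring@pve"
token = "monitoring=abcdef12-3456-7890-abcd-deadbeef1234"  # TOKEN_ID=TOKEN_SECRET

auth_header = f"PVEAPIToken={user}!{token}"
print(auth_header)

# With the requests library this would be used roughly as:
# import requests
# requests.get("https://pve.example.com:8006/api2/json/version",
#              headers={"Authorization": auth_header})
```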

Use password based authorization

Set password for the user monitoring:

  1. pveum passwd monitoring@pve

Assign the Monitoring role to the user monitoring:

  1. pveum acl modify / --users monitoring@pve --roles Monitoring

For more information about the Proxmox VE privilege system, see the Proxmox VE documentation.

Usage

The icinga2 folder contains the command definition and service examples for use with Icinga2.

  1. usage: check_pve.py [-h] [--version] [-e API_ENDPOINT] [--api-port API_PORT] [-u API_USER] [-p API_PASSWORD | -t API_TOKEN] [-k]
  2. [-m {cluster,version,cpu,memory,swap,storage,io_wait,io-wait,updates,services,subscription,vm,vm_status,vm-status,replication,disk-health,ceph-health,zfs-health,zfs-fragmentation,backup}]
  3. [-n NODE] [--name NAME] [--vmid VMID] [--expected-vm-status {running,stopped,paused}] [--ignore-vmid VMID] [--ignore-vm-status] [--ignore-service NAME] [--ignore-disk NAME]
  4. [--ignore-pools NAME] [-w THRESHOLD_WARNING] [-c THRESHOLD_CRITICAL] [-M] [-V MIN_VERSION] [--unit {GB,MB,KB,GiB,MiB,KiB,B}]
  5. Check command for PVE hosts via API
  6. options:
  7. -h, --help show this help message and exit
  8. --version Show version of check command
  9. API Options:
  10. -e API_ENDPOINT, -H API_ENDPOINT, --api-endpoint API_ENDPOINT
  11. PVE api endpoint hostname or ip address (no additional data like paths)
  12. --api-port API_PORT PVE api endpoint port
  13. -u API_USER, --username API_USER
  14. PVE api user (e.g. icinga2@pve or icinga2@pam, depending on which backend you have chosen in proxmox)
  15. -p API_PASSWORD, --password API_PASSWORD
  16. PVE API user password
  17. -t API_TOKEN, --api-token API_TOKEN
  18. PVE API token (format: TOKEN_ID=TOKEN_SECRET)
  19. -k, --insecure Don't verify HTTPS certificate
  20. Check Options:
  21. -m {cluster,version,cpu,memory,swap,storage,io_wait,io-wait,updates,services,subscription,vm,vm_status,vm-status,replication,disk-health,ceph-health,zfs-health,zfs-fragmentation,backup}, --mode {cluster,version,cpu,memory,swap,storage,io_wait,io-wait,updates,services,subscription,vm,vm_status,vm-status,replication,disk-health,ceph-health,zfs-health,zfs-fragmentation,backup}
  22. Mode to use.
  23. -n NODE, --node NODE Node to check (necessary for all modes except cluster, version and backup)
  24. --name NAME Name of storage, vm, or container
  25. --vmid VMID ID of virtual machine or container
  26. --expected-vm-status {running,stopped,paused}
  27. Expected VM status
  28. --ignore-vmid VMID Ignore VM with vmid in checks
  29. --ignore-vm-status Ignore VM status in checks
  30. --ignore-service NAME
  31. Ignore service NAME in checks
  32. --ignore-disk NAME Ignore disk NAME in health check
  33. --ignore-pools NAME Ignore vms and containers in pool(s) NAME in checks
  34. -w THRESHOLD_WARNING, --warning THRESHOLD_WARNING
  35. Warning threshold for check value. Multiple thresholds with name:value,name:value
  36. -c THRESHOLD_CRITICAL, --critical THRESHOLD_CRITICAL
  37. Critical threshold for check value. Multiple thresholds with name:value,name:value
  38. -M Values are shown in the unit which is set with --unit (if available). Thresholds are also treated in this unit
  39. -V MIN_VERSION, --min-version MIN_VERSION
  40. The minimal pve version to check for. Any version lower than this will return CRITICAL.
  41. --unit {GB,MB,KB,GiB,MiB,KiB,B}
  42. Unit which is used for performance data and other values

Check examples

Check cluster health

  1. ./check_pve.py -u <API_USER> -t <API_TOKEN> -e <API_ENDPOINT> -m cluster
  2. OK - Cluster 'proxmox1' is healthy

Check PVE version

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m version -V 5.0.0
  2. OK - Your pve instance version '5.2' (0fcd7879) is up to date

Check CPU load

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m cpu -n node1
  2. OK - CPU usage is 2.4%|usage=2.4%;;

Check memory usage

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m memory -n node1
  2. OK - Memory usage is 37.44%|usage=37.44%;; used=96544.72MB;;;257867.91
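The output follows the standard Icinga/Nagios plugin format: status text, a pipe character, then performance data entries of the form label=value[unit];warn;crit;min;max. As a sketch (using the memory check output above as sample input), such a line could be split like this:

```python
# Sketch: splitting a plugin output line into status text and perfdata.
# Sample line taken from the memory check example above.
import re

line = "OK - Memory usage is 37.44%|usage=37.44%;; used=96544.72MB;;;257867.91"
text, _, perfdata = line.partition("|")

metrics = {}
for item in perfdata.split():
    label, _, rest = item.partition("=")
    value_field = rest.split(";")[0]                   # value plus optional unit
    metrics[label] = float(re.match(r"[\d.]+", value_field).group())

print(text.strip())  # OK - Memory usage is 37.44%
print(metrics)       # {'usage': 37.44, 'used': 96544.72}
```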

Check disk-health

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m disk-health -n node1
  2. OK - All disks are healthy|wearout_sdb=96%;; wearout_sdc=96%;; wearout_sdd=96%;; wearout_sde=96%;;

Check storage usage

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m storage -n node1 --name local
  2. OK - Storage usage is 54.23%|usage=54.23%;; used=128513.11MB;;;236980.36
  3. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m storage -n node1 --name vms-disx
  4. CRITICAL - Storage 'vms-disx' doesn't exist on node 'node01'

Check subscription status

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m subscription -n node1 -w 50 -c 10
  2. OK - Subscription of level 'Community' is valid until 2019-01-09

Check VM status

Without specifying a node name:

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m vm --name test-vm
  2. OK - VM 'test-vm' is running on 'node1'|cpu=1.85%;; memory=8.33%;;

You can also pass a container name for the VM check:

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m vm --name test-lxc
  2. OK - LXC 'test-lxc' on node 'node1' is running|cpu=0.11%;; memory=13.99%;;

With memory thresholds:

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m vm --name test-vm -w 50 -c 80
  2. OK - VM 'test-vm' is running on 'node1'|cpu=1.85%;; memory=40.33%;50.0;80.0

If a node name is specified, the check also verifies that the VM is running on that node.

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m vm -n node1 --name test-vm
  2. OK - VM 'test-vm' is running on node 'node1'|cpu=1.85%;; memory=8.33%;;
  3. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m vm -n node1 --name test-vm
  4. WARNING - VM 'test-vm' is running on node 'node2' instead of 'node1'|cpu=1.85%;; memory=8.33%;;

If you only want to gather metrics and don’t care about the VM status, add the --ignore-vm-status flag:

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m vm --name test-vm --ignore-vm-status
  2. OK - VM 'test-vm' is not running

Specify the expected VM status:

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m vm --name test-vm --expected-vm-status stopped
  2. OK - VM 'test-vm' is not running

For hostalive checks without gathering performance data, use vm_status instead of vm. The parameters are the same as with vm.

Check swap usage

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m swap -n pve
  2. OK - Swap usage is 0.0 %|usage=0.0%;; used=0.0MB;;;8192.0

Check storage replication status

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m replication -n node1
  2. OK - No failed replication jobs on node1

Check ceph cluster health

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m ceph-health
  2. WARNING - Ceph Cluster is in warning state

Check ZFS pool health

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m zfs-health -n pve
  2. OK - All ZFS pools are healthy

Check for specific pool:

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m zfs-health -n pve --name rpool
  2. OK - ZFS pool 'rpool' is healthy

Check ZFS pool fragmentation

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m zfs-fragmentation -n pve -w 40 -c 60
  2. CRITICAL - 2 of 2 ZFS pools are above fragmentation thresholds:
  3. - rpool (71 %) is CRITICAL
  4. - diskpool (50 %) is WARNING
  5. |fragmentation_diskpool=50%;40.0;60.0 fragmentation_rpool=71%;40.0;60.0

Check for specific pool:

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m zfs-fragmentation -n pve --name diskpool -w 40 -c 60
  2. WARNING - Fragmentation of ZFS pool 'diskpool' is above thresholds: 50 %|fragmentation=50%;40.0;60.0

Check VZDump Backups

Check task history on all nodes:

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m backup
  2. CRITICAL - 8 backup tasks successful, 3 backup tasks failed

Check for specific node and time frame:

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m backup -n pve -c 86400
  2. OK - 2 backup tasks successful, 0 backup tasks failed within the last 86400.0s
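In backup mode the threshold is a look-back window given in seconds (86400 above is one day). A quick sketch of common window sizes:

```python
# Sketch: converting common look-back windows to the seconds value
# passed to -w/-c in backup mode.
from datetime import timedelta

one_day = int(timedelta(days=1).total_seconds())
twelve_hours = int(timedelta(hours=12).total_seconds())
print(one_day)       # 86400
print(twelve_hours)  # 43200
```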

Exclude a VM by its ID from the backup check:

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m backup --ignore-vmid 123

FAQ

Individual thresholds per metric

You can either specify a single warning or critical threshold that applies to all metrics, or define individual thresholds per metric (name:value,name:value,...):

  1. ./check_pve.py -u <API_USER> -p <API_PASSWORD> -e <API_ENDPOINT> -m vm --name test-vm -w memory:50 -c cpu:50,memory:80
  2. OK - VM 'test-vm' is running on 'node1'|cpu=1.85%;50.0; memory=40.33%;50.0;80.0
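As a sketch of how such a threshold string could be interpreted (check_pve's actual parsing may differ in details): a bare number applies to all metrics, while name:value pairs apply per metric.

```python
# Sketch: parsing the name:value,name:value threshold syntax shown above.

def parse_thresholds(spec: str):
    """Return (default_threshold, per_metric_thresholds) for a -w/-c argument."""
    default, per_metric = None, {}
    for part in spec.split(","):
        if ":" in part:
            name, _, value = part.partition(":")
            per_metric[name] = float(value)
        else:
            default = float(part)
    return default, per_metric

print(parse_thresholds("cpu:50,memory:80"))  # (None, {'cpu': 50.0, 'memory': 80.0})
print(parse_thresholds("80"))                # (80.0, {})
```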

Could not connect to PVE API: Failed to resolve hostname

Verify that your DNS server is working and can resolve the hostname. If DNS is fine, check for proxy environment variables (HTTP_PROXY, HTTPS_PROXY), which may block communication to port 8006.
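A quick way to spot interfering proxy settings is to inspect the environment before running the check; this is a small diagnostic sketch, not part of check_pve:

```python
# Sketch: detect proxy environment variables that may interfere with
# reaching the PVE API on port 8006.
import os

proxy_vars = ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy")
found = {var: os.environ[var] for var in proxy_vars if var in os.environ}
for var, value in found.items():
    print(f"warning: {var} is set to {value!r}")

# The requests library honors these variables by default; a session can
# be told to ignore them:
# import requests
# session = requests.Session()
# session.trust_env = False  # ignore HTTP(S)_PROXY for this session
```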

Contributors

Thank you to everyone who has contributed to check_pve: https://github.com/nbuchwitz/check_pve/graphs/contributors.