Project author: seatgeek

Project description:
Detect Nomad allocation crash-loops by consuming the allocation stream from nomad-firehose
Language: Go
Repository: git://github.com/seatgeek/nomad-crashloop-detector.git
Created: 2017-07-12T14:30:29Z
Project page: https://github.com/seatgeek/nomad-crashloop-detector

License: BSD 3-Clause "New" or "Revised" License


nomad-crashloop-detector

nomad-crashloop-detector is a tool meant to detect allocation crash-loops, by consuming the allocation stream from nomad-firehose in RabbitMQ.

Running

The project provides build artifacts for linux, darwin and windows in the GitHub releases tab.

A docker container is also provided at seatgeek/nomad-crashloop-detector.

Requirements

  • Go 1.8

Building

To build a binary, run the following:

    # get this repo
    go get github.com/seatgeek/nomad-crashloop-detector
    # go to the repo directory
    cd $GOPATH/src/github.com/seatgeek/nomad-crashloop-detector
    # build the `nomad-crashloop-detector` binary
    make build

This will create a nomad-crashloop-detector binary in your $GOPATH/bin directory.

Configuration

Any NOMAD_* environment variable that the native nomad CLI tool supports is also supported by this tool.

  • $AMQP_CONNECTION takes the same format as $SINK_AMQP_CONNECTION, but is the connection used to consume the stream published by nomad-firehose.
  • $AMQP_QUEUE is the RabbitMQ queue from which the nomad-firehose stream is consumed.
  • $RESTART_COUNT is how many restarts to allow within $RESTART_INTERVAL (example: 5).
  • $RESTART_INTERVAL is the time frame within which $RESTART_COUNT allocation restarts must happen to trigger a notification (example: 5m).
  • $NOTIFICATION_INTERVAL is how often a notification is sent for a crash-looping allocation (example: 5m).

Sinks

The sink type is configured using the $SINK_TYPE environment variable. Valid values are: stdout, kinesis and amqp.

The amqp sink is configured using the $SINK_AMQP_CONNECTION (example: amqp://guest:guest@127.0.0.1:5672/), $SINK_AMQP_EXCHANGE and $SINK_AMQP_ROUTING_KEY environment variables.

The kinesis sink is configured using $SINK_KINESIS_STREAM_NAME and $SINK_KINESIS_PARTITION_KEY environment variables.

The stdout sink does not have any configuration; it simply writes the JSON to stdout for debugging.
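
Selecting a sink from $SINK_TYPE can be pictured roughly as below. This is a sketch only: the Sink interface, stdoutSink and newSink are hypothetical names rather than the project's API, only the stdout sink is implemented, and falling back to stdout when $SINK_TYPE is unset is an assumption made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// Sink is a hypothetical interface for anything that accepts a payload.
type Sink interface {
	Put(payload interface{}) error
}

// stdoutSink writes each payload as one line of JSON.
type stdoutSink struct{ w io.Writer }

func (s stdoutSink) Put(payload interface{}) error {
	return json.NewEncoder(s.w).Encode(payload)
}

// newSink maps a $SINK_TYPE value to a Sink. The amqp and kinesis sinks
// are recognised as valid values but left out of this sketch.
func newSink(sinkType string, w io.Writer) (Sink, error) {
	switch sinkType {
	case "stdout":
		return stdoutSink{w: w}, nil
	case "amqp", "kinesis":
		return nil, fmt.Errorf("sink type %q is valid but not implemented in this sketch", sinkType)
	default:
		return nil, fmt.Errorf("invalid SINK_TYPE %q: valid values are stdout, kinesis and amqp", sinkType)
	}
}

func main() {
	sinkType := os.Getenv("SINK_TYPE")
	if sinkType == "" {
		sinkType = "stdout" // assumption: default used only by this sketch
	}
	sink, err := newSink(sinkType, os.Stdout)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := sink.Put(map[string]string{"JobID": "job"}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```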

Example

Assuming the following setup:

  • nomad exchange (type=topic)
  • nomad.crash-loop-in queue which is bound to nomad exchange with routing key allocations
  • nomad.crash-loop-out queue which is bound to nomad exchange with routing key crash-loop

Running nomad-firehose:

    SINK_TYPE=amqp \
    SINK_AMQP_CONNECTION="amqp://guest:guest@127.0.0.1:5672/" \
    SINK_AMQP_EXCHANGE=nomad \
    SINK_AMQP_ROUTING_KEY=allocations \
    nomad-firehose allocations

Running nomad-crashloop-detector:

    RESTART_COUNT=2 \
    RESTART_INTERVAL=5m \
    NOTIFICATION_INTERVAL=5m \
    SINK_TYPE=amqp \
    SINK_AMQP_CONNECTION="amqp://guest:guest@127.0.0.1:5672/" \
    SINK_AMQP_EXCHANGE=nomad \
    SINK_AMQP_ROUTING_KEY=crash-loop \
    AMQP_CONNECTION=$SINK_AMQP_CONNECTION \
    AMQP_QUEUE=nomad.crash-loop-in \
    nomad-crashloop-detector

With this setup, nomad-firehose sends all nomad allocation changes to the nomad exchange, which forwards the messages to the nomad.crash-loop-in queue.
nomad-crashloop-detector consumes the messages in nomad.crash-loop-in and, when the restart threshold is reached, publishes an AMQP message with routing key crash-loop to the nomad exchange, which routes it to the nomad.crash-loop-out queue.
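
The bindings listed in the setup can be modelled as a small lookup table to check where each message ends up. This sketch does not talk to RabbitMQ: the bindings map and the route function are illustrative names, and only exact routing-key matches are modelled (the topic wildcards * and # are left out).

```go
package main

import "fmt"

// bindings models the example setup: routing key -> queue bound to the
// "nomad" topic exchange.
var bindings = map[string]string{
	"allocations": "nomad.crash-loop-in",  // published by nomad-firehose
	"crash-loop":  "nomad.crash-loop-out", // published by nomad-crashloop-detector
}

// route returns the queue that receives a message with the given routing key.
func route(routingKey string) (string, bool) {
	queue, ok := bindings[routingKey]
	return queue, ok
}

func main() {
	for _, key := range []string{"allocations", "crash-loop"} {
		queue, _ := route(key)
		fmt.Printf("%s -> %s\n", key, queue)
	}
}
```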

Example crash-loop payload

    {
      "LastEvent": {
        "Name": "job.task[0]",
        "AllocationID": "fd4deb1f-405b-93a6-3eb4-a84e0670049d",
        "DesiredStatus": "run",
        "DesiredDescription": "",
        "ClientStatus": "running",
        "ClientDescription": "",
        "JobID": "job",
        "GroupName": "group",
        "TaskName": "task",
        "EvalID": "db0064ab-a44d-e450-4f66-2cabbec536bb",
        "TaskState": "pending",
        "TaskFailed": false,
        "TaskStartedAt": "2017-07-12T13:56:30.932498912Z",
        "TaskFinishedAt": "0001-01-01T00:00:00Z",
        "TaskEvent": {
          "Type": "Restarting",
          "Time": 1499867806677609000,
          "FailsTask": false,
          "RestartReason": "Restart within policy",
          "SetupError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "Signal": 0,
          "Message": "",
          "KillReason": "",
          "KillTimeout": 0,
          "KillError": "",
          "StartDelay": 17425840945,
          "DownloadError": "",
          "ValidationError": "",
          "DiskLimit": 0,
          "DiskSize": 0,
          "FailedSibling": "",
          "VaultError": "",
          "TaskSignalReason": "",
          "TaskSignal": ""
        }
      },
      "EventLog": [
        "2017-07-12T15:56:15.401013209+02:00",
        "2017-07-12T15:56:46.677608921+02:00"
      ]
    }