Project author: samkos

Project description:
A fault-tolerant SLURM extension
Primary language: Python
Project URL: git://github.com/samkos/decimate.git
Created: 2017-11-28T07:53:23Z
Project community: https://github.com/samkos/decimate

License: BSD 2-Clause "Simplified" License


NAME

  decimate - a fault-tolerant SLURM scheduler extension

SYNOPSIS

  dbatch [ Slurm options ] [ --check <user_script> ]
         [ --max-retry=<number of restarts> ]
         script [args...]

DESCRIPTION

  Developed by the KAUST Supercomputing Laboratory (KSL),
  decimate is a SLURM extension written in Python, designed to handle
  dependent jobs more easily and efficiently.

  Decimate transparently adds parameters to the SLURM sbatch command
  to check the correctness of jobs, and automatically
  reschedules jobs found faulty.

  Using Decimate on Shaheen II, one can submit, run, monitor, or
  terminate a workflow composed of dependent jobs. If asked,
  decimate informs the user by mail of the progress of their
  workflow on the system, using standardized or customized messages.

  In case of failure of one part of the workflow, decimate
  automatically detects the failure, signals it to the user, and
  relaunches the failing part after fixing the job
  dependencies. By default, if the same failure happens three
  consecutive times, decimate cancels the whole workflow, removing
  all the dependent jobs from the schedule. In a future version,
  decimate will allow the workflow to be restarted automatically
  once the problem causing its failure has been fixed.

  decimate also allows users to define their own mail alerts,
  which can be sent at any point of the workflow through a call to
  a Python method. This feature will also be available from bash
  in a future version.

  Users can also design customized checking functions
  whose purpose is to validate whether a step of the workflow
  was successful. This could involve checking for the
  presence of some result files, grepping for error or success
  messages in them, computing a ratio or a checksum, and so on. These
  intermediate results can easily be transmitted to decimate
  to validate or invalidate any step. They can also be
  forwarded by mail to the user while the workflow is
  executing.
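
  The kind of customized check described above can be sketched as
  follows. This is only an illustration, not decimate's documented
  interface: the file name result.out, the COMPLETED and ERROR markers,
  and the convention that a zero exit status means "step succeeded" are
  all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical names: adapt them to what your own job actually writes.
RESULT_FILE = Path("result.out")
SUCCESS_MARKER = "COMPLETED"
ERROR_MARKER = "ERROR"


def check_step(result_file=RESULT_FILE):
    """Return (ok, message) describing whether one workflow step succeeded."""
    result_file = Path(result_file)
    # 1. The result file must exist.
    if not result_file.is_file():
        return False, f"missing result file: {result_file}"
    text = result_file.read_text(errors="replace")
    # 2. No error message may appear in it.
    if ERROR_MARKER in text:
        return False, f"error marker found in {result_file}"
    # 3. The success message must appear in it.
    if SUCCESS_MARKER not in text:
        return False, f"success marker not found in {result_file}"
    return True, f"{result_file} looks complete"


def main():
    """Run the check and return an exit status (assumed: 0 means success)."""
    ok, message = check_step()
    print(message)
    return 0 if ok else 1

# A real check script would end with: raise SystemExit(main())
```

  The message returned by check_step is the sort of intermediate result
  that could be forwarded to the user; how decimate actually collects it
  is not specified here.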

USE

  At this moment, jobs only need to be submitted through the
  dbatch command, which accepts exactly the same parameters as the
  original SLURM sbatch command, plus these new parameters:

  --check=SCRIPT_FILE
      Python or shell script that checks whether the results are OK.

  --max-retry=MAX_RETRY
      Number of times a step can fail and be restarted automatically
      before the whole workflow fails (3 by default).

  sslog tails the decimate log file attached to the
  current directory, tracking all the jobs that were launched
  with dbatch from this directory.

  sstatus gives the current status of the workflow executing
  in the current directory.

  Decimate is still in a beta phase and under test with some of
  our KSL users. More documentation will be provided once the
  stabilized and fully tested version is made available by the
  end of June 2018.

  If interested in testing decimate or contributing, please send
  a mail to help@hpc.kaust.edu.sa
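
  As a rough illustration of the dbatch invocation described in this
  section, the sketch below assembles a command line from the two new
  parameters plus ordinary sbatch options. The helper
  build_dbatch_command and the file names job.sh and check_step.py are
  hypothetical; only dbatch, --check, and --max-retry come from the
  text above, and --ntasks and --time are standard sbatch options.

```python
import shlex


def build_dbatch_command(script, script_args=(), slurm_options=(),
                         check=None, max_retry=None):
    """Build a dbatch argument list: Slurm options first, script last."""
    cmd = ["dbatch", *slurm_options]
    if check is not None:
        # Script that validates the results of the step.
        cmd.append(f"--check={check}")
    if max_retry is not None:
        # Number of automatic restarts before the workflow fails.
        cmd.append(f"--max-retry={max_retry}")
    cmd.append(script)
    cmd.extend(script_args)
    return cmd


cmd = build_dbatch_command(
    "job.sh",
    slurm_options=["--ntasks=4", "--time=00:10:00"],
    check="check_step.py",
    max_retry=5,
)
print(shlex.join(cmd))
# dbatch --ntasks=4 --time=00:10:00 --check=check_step.py --max-retry=5 job.sh
```

  The resulting list can be passed to subprocess.run on a system where
  dbatch is installed; omitting max_retry leaves the default of 3 in
  effect.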

AUTHOR

  Written by Samuel Kortas (samuel.kortas (at) kaust.edu.sa)

REPORTING BUGS

  Report decimate bugs to help@hpc.kaust.edu.sa

COPYRIGHT
Copyright (c) 2018, KAUST Supercomputing Laboratory
All rights reserved.

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions are met:

  * Redistributions of source code must retain the above copyright notice, this
    list of conditions and the following disclaimer.

  * Redistributions in binary form must reproduce the above copyright notice,
    this list of conditions and the following disclaimer in the documentation
    and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
  AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
  DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
  SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
  OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

SEE ALSO

  decimate official documentation pages:
  <http://decimate.readthedocs.io>
  KAUST Supercomputing Laboratory: <http://hpc.kaust.edu.sa>