Project author: holgerbrandl

Project description:
An HPC-task manager
Language: Scala
Repository: git://github.com/holgerbrandl/joblist.git
Created: 2015-11-16T12:58:52Z
Project community: https://github.com/holgerbrandl/joblist



JobList

Join the chat at https://gitter.im/holgerbrandl/joblist

A task list manager for HPC clusters. Among other things, it supports monitoring, automatic resubmission, profiling, and reporting of job lists.

JobList (jl) can submit, monitor, and wait until an entire list of cluster jobs has finished. It reports average runtime statistics and predicts the remaining runtime of a joblist based on cluster load and job complexities. jl can recover crashed jobs and resubmit them using a customizable set of resubmission strategies.

Conceptually, jl just manages lists of job-ids as reported by the underlying queuing system. Currently LSF and Slurm are supported, but job lists can also be processed on any computer by means of a bundled local multi-threading scheduler.
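
Because jl only tracks job-ids, you can keep submitting with your scheduler's native command and simply pipe its output into jl add, which extracts the reported job-ids from stdin (see the command overview below). A minimal sketch, assuming an LSF setup where bsub prints the submitted job-id:

  ## submit with the native LSF command and let jl track the reported job-id
  bsub "sleep 10" | jl add
  ## then monitor the captured jobs as usual
  jl wait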

Installation

  cd ~/bin
  wget https://github.com/holgerbrandl/joblist/releases/download/v0.7.1/joblist_installer_v0.7.1.tar.gz
  tar -zxvf joblist_installer_v0.7.1.tar.gz
  # You may also want to update your bash profile to include jl in your PATH by default
  echo '
  export PATH='$(pwd)/joblist_v0.7.1':$PATH
  ' >> ~/.bash_profile
  source ~/.bash_profile

Java 8 is required to run JobList. To create the (optional but recommended) HTML reports, R (v3.2) and pandoc (a static build) are needed.
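
To verify that these prerequisites are available on your system, you can simply query their versions (standard version flags of the respective tools):

  java -version      ## should report a 1.8 runtime
  R --version        ## only needed for HTML report rendering
  pandoc --version   ## only needed for HTML report rendering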

Basic Usage

  > jl --help
  Usage: jl <command> [options] [<joblist_file>]

  Supported commands are
    submit   Submits a job to the underlying queuing system and adds it to the list
    add      Extracts job-ids from stdin and adds them to the list
    wait     Wait for a list of jobs to finish
    resub    Resubmit non-complete jobs with escalated scheduler parameters
    status   Prints various statistics and allows to create an html report for the list
    cancel   Removes all jobs of this list from the scheduler queue
    up       Moves a list of jobs to the top of a queue (if supported by the underlying scheduler)
    reset    Removes all information related to this joblist.

  If no <joblist_file> is provided, jl will use '.jobs' as default, but to save typing it will remember
  the last used joblist instance per directory.

All sub-commands provide more specific usage information, e.g. jl submit --help.
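
As a quick illustration of the <joblist_file> argument mentioned above, the sketch below keeps two independent job lists in the same directory. It assumes the positional <joblist_file> argument works exactly as printed in the usage line; check jl submit --help for the authoritative syntax:

  ## track quality-control jobs and alignment jobs in separate lists (hypothetical file names)
  jl submit "sleep 10" .qc_jobs
  jl submit "sleep 20" .align_jobs
  ## without an explicit file, jl falls back to '.jobs' or the last list used in this directory
  jl wait .qc_jobs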

The basic workflow is as follows:

1) Submit some jobs

  jl submit "sleep 10"   ## add a job
  jl submit "sleep 1000" ## add another which won't finish in our default queue

2) Wait for them to finish

  jl wait
  > 2 jobs in total; 0.0% complete; Remaining time <NA>; 0 done; 0 running; 2 pending; 0 killed; 0 failed
  > 2 jobs in total; 0.0% complete; Remaining time <NA>; 0 done; 2 running; 0 pending; 0 killed; 0 failed
  > 2 jobs in total; 50.0% complete; Remaining time ~10S; 1 done; 1 running; 0 pending; 0 killed; 0 failed
  > 2 jobs in total; 50.0% complete; Remaining time ~10S; 1 done; 0 running; 0 pending; 1 killed; 0 failed

3) Report the status, render an HTML report, and export log information with

  jl status
  > 2 jobs in total; 50.0% complete; Remaining time ~10S; 1 done; 0 running; 0 pending; 1 killed; 0 failed

  jl status --report
  > .jobs: Exported statistics into .jobs.{runinfo|jc}.log
  > .jobs: Rendering HTML report... done

4) Resubmit non-complete jobs by escalating their scheduler configuration

  ## to a different queue
  jl resub --queue "long"
  ## or with a 10h wall-time limit
  jl resub --time "10:00"

By using jl, workflows are decoupled from the underlying queuing system, i.e. jl-ified workflows will run on a Slurm system, an LSF cluster, or simply locally on any desktop machine.

API Usage

In addition to the provided shell utilities, joblist can also be used programmatically in Java, Scala, Kotlin, and other JVM languages. To get started, simply add it as a dependency (published via BinTray):

  <dependency>
    <groupId>de.mpicbg.scicomp</groupId>
    <artifactId>joblist</artifactId>
    <version>0.7.1</version>
    <type>pom</type>
  </dependency>
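
If your build uses sbt instead of Maven, the same coordinates translate to a one-line dependency. This is only a sketch: the jcenter resolver URL below is an assumption based on the artifact being published via BinTray:

  // build.sbt -- resolver URL is an assumption (artifact published via BinTray/jcenter)
  resolvers += "jcenter" at "https://jcenter.bintray.com"
  libraryDependencies += "de.mpicbg.scicomp" % "joblist" % "0.7.1"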

Shown below is a Scala example that auto-detects the scheduler in use (Slurm, LSF, or simple multi-threading as a fallback), submits some jobs, waits for all of them to finish, and resubmits failed ones to another queue:

  import joblist._

  val jl = JobList()

  jl.run(JobConfiguration("echo foo"))
  jl.run(JobConfiguration("echo bar"))

  // block execution until all jobs are done
  jl.waitUntilDone()

  // optionally we could investigate jobs that were killed by the queuing system
  val killedInfo: List[RunInfo] = jl.killed.map(_.info)

  // resubmit to another queue
  jl.resubmit(new OtherQueue("long"))

To use joblist from Kotlin, we suggest using a small Kotlin support API, which is available as the artifact

  de.mpicbg.scicomp.joblist:joblist-kotlin:1.1

See joblist_test.kts for an example made with kscript.

Support & Documentation

Feel welcome to submit pull requests or tickets, or simply get in touch via gitter (see the link at the top).

Related tools and projects:

  • para is a parasol-like wrapper around LSF for efficiently handling batches of jobs on a compute cluster
  • Snakemake is a workflow management system
  • lsf_utils is a collection of bash functions to manage lists of LSF-cluster jobs
  • Queue is a command-line scripting framework for defining multi-stage genomic analysis pipelines combined with an execution manager
  • DRMAA is a high-level API specification for the submission and control of jobs to a distributed resource management (DRM) system
  • sbatch_run is a script that takes a job name and your command in quotes, creates the submission script, and runs it (Slurm only)
  • Redisson allows tasks to be scheduled through the standard JDK ExecutorService and ScheduledExecutorService APIs, with submitted tasks being executed on Redisson nodes.