Deploying

Overview

This file explains the deployment of DAPHNE, on HPC with Slurm or manually through SSH, and highlights the excerpts from descriptions of functionalities in the deploy/ directory (mostly deploy-distributed-on-slurm.sh):

compilation of the Singularity image
compilation of DAPHNE (and the DAPHNE DistributedWorker) within the Singularity image
packaging compiled DAPHNE
packaging compiled DAPHNE with user payload as a payload package
uploading the payload package to an HPC platform
starting and managing DAPHNE workers on HPC platforms using Slurm
executing DAPHNE on HPC using Slurm
collection of logs from DAPHNE execution
cleanup of worker environments and payload deployment

Background

DAPHNE's distributed runtime consists of a single coordinator and multiple DistributedWorkers (see the documentation of the distributed runtime for more information). For now, in order to execute DAPHNE in a distributed fashion, we need to deploy DistributedWorkers manually. The coordinator gets the worker's addresses through an environment variable.

deployDistributed.sh manually connects to machines with SSH and starts up DistributedWorker processes. On the other hand, the script deploy-distributed-on-slurm.sh packages and starts DAPHNE on a target HPC platform, and is tailored to the communication required with Slurm and the target HPC platform.

Deploying without Slurm Support

deployDistributed.sh can be used to manually connect to a list of machines and remotely start up workers, get status of running workers or terminate distributed worker processes. This script depends only on an SSH client/server and does not require any use of a resource management tool (e.g. Slurm). With this script you can:

build and deploy DistributedWorkers to remote machines
start workers
check status of running workers
kill workers

Workers' own IPs and ports to listen to, can be specified inside the script, or with --peers [IP[:PORT]],[IP[:PORT]],.... The default port for all workers is 50000, but this can also be specified inside the script or with -p,--port PORT. If running on the same machine (e.g., localhost), different ports must be specified.

With --deploy the script builds the DistributedWorker executable (./build.sh --target DistributedWorker), compresses the build, lib and bin folders and uses scp and ssh to send to and decompress at remote machines, inside the directory specified by --pathToBuild (default ~/DaphneDistributedWorker/). If running workers on localhost, PATH_TO_BUILD can be set /path/to/daphne and provided DistributedWorker is built, --deploy is not nessecary.

The SSH username must be specified inside the script. For now the script assumes all remote machines can be accessed with the same username, id_rsa key and SSH port (default 22).

Usage example:

# deploy distributed
$ ./deployDistributed.sh --help
$ ./deployDistributed.sh --deploy --pathToBuild /path/to/dir --peers localhost:5000,localhost:5001
$ ./deployDistributed.sh -r # (Uses default peers and path/to/build/ to start workers)

Deploying with Slurm Support

Building DAPHNE (to be later deployed on distributed nodes) can be done with a Singularity container. The Singularity container can be built on the utilized HPC. deployDistributed.sh sends executables to each node, assuming there are different storages for each node. This might cause unnecessary overwrites if the workers use the same mounted user storage (e.g. HPC environments with distributed storages). Instead, deploy-distributed-on-slurm.sh should be used for such cases. The latter also automatically generates the environment variable PEERS from Slurm.

How to Use `deploy-distributed-on-slurm.sh` for DAPHNE Packaging, Distributed Deployment, and Management Using Slurm

This section explains how to set up the DistributedWorkers on a HPC platform and briefly comments on what to do afterwards (how to run, manage, stop, and clean it). Commands, with their parameters and arguments, are hence described below for deployment with deploy-distributed-on-slurm.sh.

Usage: deploy-distributed-on-slurm.sh <options> <command>

Start the DAPHNE distributed deployment on remote machines using Slurm.

These are the options (short and long formats available):
  -h, --help              Print this help message and exit.
  -i SSH_IDENTITY_FILE    Specify OpenSSH identity file (default: private key in ~/.ssh/id_rsa.pub).
  -u, --user SSH_USERNAME Specify OpenSSH username (default: $USER).
  -l, --login SSH_LOGIN_NODE_HOSTNAME     Specify OpenSSH login name hostname (default: localhost).
  -d, --pathToBuild       A path to deploy or where the build is already deployed (default ~/DaphneDistributedWorker can be specified in the script).
  -n, --numcores          Specify number of workers (cores) to use to deploy DAPHNE workers (default: 128).
  -p, --port              Specify DAPHNE deployed port range begin (default: 50000).
  --args ARGS_CS          Specify arguments of a DaphneDSL SCRIPT in a comma-separated format.
  -S, --ssh-arg=S         Specify additional arguments S for ssh client (default command: $SSH_COMMAND).
  -C, --scp-arg=C         Specify additional arguments C for scp client (default command: $SCP_COMMAND).
  -R, --srun-arg=R        Specify additional arguments R for srun client.
  -G, --singularity-arg=G Specify additional arguments G for singularity client.

These are the commands that can be executed:
  singularity             Compile the Singularity SIF image for DAPHNE (and transfer it to the target platform).
  build                   Compile DAPHNE codes (daphne, DistributedWorker) using the Singularity image for DAPHNE.
                          It should only be invoked from the code base root directory.
                          It could also be invoked on a target platform after a transfer.
  package                 Create the package image with *.daphne scripts and a compressed build/ directory.
  transfer                Transfers (uploads) a package to the target platform.
  start                   Run workers on remote machines through login node (deploys this script and runs workers).
  workers                 Run workers on current login node through Slurm.
  status                  Get distributed workers' status.
  wait                    Waits until all workers are up.
  stop                    Stops all distributed workers.
  run [SCRIPT [ARGs]]     Run one request on the deployed platform by processing one DaphneDSL SCRIPT file (default: /dev/stdin)
                          using optional arguments (ARGs in script format).
  clean                   Cleans (deletes) the package on the target platform.
  deploy                  Deploys everything in one sweep: singularity=>build=>package=>transfer=>start=>wait=>run=>clean.


The default connection to the target platform (HPC) login node is through OpenSSH, configured by default in ~/.ssh (see: man ssh_config).

The default ports for worker peers begin at 50000 (PORTRANGE_BEGIN) and the list of PEERS is generated as:
PEERS = ( WORKER1_IP:PORTRANGE_BEGIN, WORKER1_IP:PORTRANGE_BEGIN+1, ..., WORKER2_IP:PORTRANGE_BEGIN, WORKER2_IP:PORTRANGE_BEGIN+1, ... )

Logs can be found at [pathToBuild]/logs.

Short Examples

The following list presents a few examples about how to use the deploy-distributed-on-slurm.sh command.

These comprise more hands-on documentation about deployment, including tutorial-like explanation examples about how to package, distributively deploy, manage, and execute workloads using DAPHNE.

Builds the Singularity image and uses it to compile the build directory codes, then packages it.

./deploy-distributed-on-slurm.sh singularity && ./deploy-distributed-on-slurm.sh build && ./deploy-distributed-on-slurm.sh package

Transfers a package to the target platform through OpenSSH, using login node HPC, user hpc, and identify key hpc.pub.
```
./deploy-distributed-on-slurm.sh --login HPC --user hpc -i ~/.ssh/hpc.pub transfer
```
Using login node HPC, accesses the target platform and starts workers on remote machines.
```
./deploy-distributed-on-slurm.sh -l HPC start
```
Runs one request (script called example-time.daphne) on the deployment using 1024 cores, login node HPC, and the default OpenSSH configuration.
```
./deploy-distributed-on-slurm.sh -l HPC -n 1024 run example-time.daphne
```
Executes one request (DaphneDSL script input from standard input) at a running deployed platform, using default singularity/srun configurations.
```
./deploy-distributed-on-slurm.sh run
```
Deploys once at the target platform through OpenSSH using the default login node (localhost), then cleans.
```
./deploy-distributed-on-slurm.sh deploy -n 10
```

Starts workers at a running deployed platform using custom srun arguments (2 hours dual-core with 10G memory).

./deploy-distributed-on-slurm.sh workers -R="-t 120 --mem-per-cpu=10G --cpu-bind=cores --cpus-per-task=2"

Executes a request with custom srun arguments (30 minutes single-core).

./deploy-distributed-on-slurm.sh run -R="--time=30 --cpu-bind=cores --nodes=1 --ntasks-per-node=1 --cpus-per-task=1"

Example request job from a pipe.

cat ../scripts/examples/hello-world.daph | ./deploy-distributed-on-slurm.sh run

Scenario Usage Example

Here is a scenario usage as a longer example demo.

Fetch the code from the latest GitHub code repository.

function compile() {
  git clone --recursive git@github.com:daphne-eu/daphne.git 2>&1 | tee daphne-$(date +%F-%T).log
  cd daphne/deploy
  ./deploy-distributed-on-slurm.sh singularity # creates the Singularity container image
  ./deploy-distributed-on-slurm.sh build       # Builds the daphne codes using the container
}
compile

Package the built targets (binaries) to packet file daphne-package.tgz.
```
./deploy-distributed-on-slurm.sh package
```
Transfer the packet file daphne-package.tgz to HPC (Slurm) with OpenSSH key ~/.ssh/hpc.pub and unpack it.
```
./deploy-distributed-on-slurm.sh --login HPC --user $USER -i ~/.ssh/hpc.pub transfer 
```
E.g., for EuroHPC Vega, use the instance, if your username matches the one at Vega and the key is ~/.ssh/hpc.pub:
```
./deploy-distributed-on-slurm.sh --login login.vega.izum.si --user $USER -i ~/.ssh/hpc.pub transfer
```

Start the workers from the local computer by logging into the HPC login node:

./deploy-distributed-on-slurm.sh --login login.vega.izum.si --user $USER -i ~/.ssh/hpc.pub start

Starting a main target on the HPC (Slurm) and connecting it with the started workers, to execute payload from the stream.

cat ../scripts/examples/hello-world.daph | ./deploy-distributed-on-slurm.sh --login login.vega.izum.si --user $USER -i ~/.ssh/hpc.pub run

Starting a main target on the HPC (Slurm) and connecting it with the started workers, to execute payload from a file.

./deploy-distributed-on-slurm.sh --login login.vega.izum.si --user $USER -i ~/.ssh/hpc.pub run example-time.daphne

Stopping all workers on the HPC (Slurm).

./deploy-distributed-on-slurm.sh --login login.vega.izum.si --user $USER -i ~/.ssh/hpc.pub stop

Cleaning the uploaded targets from the HPC login node.

./deploy-distributed-on-slurm.sh --login login.vega.izum.si --user $USER -i ~/.ssh/hpc.pub clean