1. Overview
This guide explains what is required to run Swift on various system types, with details for specific installations on which Swift is currently used. For a given system type, most instructions should apply to any system of that type. However, details such as queue names or file system locations will have to be customized by the user.
2. Prerequisites
This guide assumes that you have already downloaded and installed Swift. It assumes that Swift is in your PATH and that you have a working version of Sun Java 1.5+. For more information on downloading and installing Swift, see the Swift Quickstart Guide.
3. Beagle (Cray XE6)
Beagle is a Cray XE6 supercomputer at UChicago. It employs a batch-oriented computational model wherein a PBS scheduler accepts users' jobs and queues them in the queueing system for execution.
This computational model requires a user to manually prepare submit files, track job submissions, handle checkpointing, manage input/output data, and deal with exceptional conditions. Running Swift on Beagle accomplishes these tasks with minimal manual intervention and maximal opportunistic computation time on Beagle queues. The following sections discuss the specifics of running Swift on Beagle. More detailed information about Swift and its workings can be found on the Swift documentation page here:
http://swiftlang.org/docs/index.php
More information on Beagle can be found on UChicago Beagle website here:
http://beagle.ci.uchicago.edu
3.1. Requesting Access
If you do not already have a Computation Institute (CI) account, you can request one at https://www.ci.uchicago.edu/accounts/. This page will give you a list of resources you can request access to.
If you already have an existing CI account, but do not have access to Beagle, send an email to support@ci.uchicago.edu to request access.
3.2. Connecting to a login node
Once you have an account, you should be able to access a Beagle login node with the following command:
ssh yourusername@login.beagle.ci.uchicago.edu
3.3. Getting Started with Swift
Follow the steps outlined below to get started with Swift on Beagle:
step 1. Load the Swift and sun-java modules on Beagle as follows: module load swift sun-java
step 2. Create and change to a directory where your Swift-related work will live (say, mkdir swift-lab, followed by cd swift-lab).
step 3. To get started with a simple example that runs /bin/cat to read an input file data.txt and write it to an output file f.nnn.out, start by writing a simple Swift source script as follows:
type file;

/* App definition */
app (file o) cat (file i)
{
  cat @i stdout=@o;
}

file out[]<simple_mapper; location="outdir", prefix="f.", suffix=".out">;
file data<"data.txt">;

/* App invocation: n times */
foreach j in [1:@toint(@arg("n","1"))] {
  out[j] = cat(data);
}
step 4. Next, create a sites file. An example sites file (sites.xml) is shown below:
<config> <pool handle="pbs"> <execution provider="coaster" jobmanager="local:pbs"/> <!-- replace with your project --> <profile namespace="globus" key="project">CI-CCR000013</profile> <profile namespace="globus" key="providerAttributes"> pbs.aprun;pbs.mpp;depth=24</profile> <profile namespace="globus" key="jobsPerNode">24</profile> <profile namespace="globus" key="maxTime">1000</profile> <profile namespace="globus" key="slots">1</profile> <profile namespace="globus" key="nodeGranularity">1</profile> <profile namespace="globus" key="maxNodes">1</profile> <profile namespace="karajan" key="jobThrottle">.63</profile> <profile namespace="karajan" key="initialScore">10000</profile> <filesystem provider="local"/> <!-- replace this with your home on lustre --> <workdirectory >/lustre/beagle/ketan/swift.workdir</workdirectory> </pool> </config>
step 5. Next, create the Swift config and tc files. The config file (cf) is as follows:
wrapperlog.always.transfer=true
sitedir.keep=true
execution.retries=1
lazy.errors=true
use.provider.staging=true
provider.staging.pin.swiftfiles=false
foreach.max.threads=100
provenance.log=false
The tc file (tc) is as follows:
pbs cat /bin/cat null null null
More about config and tc file options can be found in the Swift documentation.
step 6. Run the example using the following command line:
swift -config cf -tc.file tc -sites.file sites.xml catsn.swift -n=1
You can change the value of -n to run that many concurrent cat invocations.
step 7. Swift will show a status message of "done" after the job has completed its run in the queue. Check the output in the generated outdir directory (ls outdir).
Swift 0.93RC5 swift-r5285 cog-r3322

RunID: 20111218-0246-6ai8g7f0
Progress:  time: Sun, 18 Dec 2011 02:46:33 +0000
Progress:  time: Sun, 18 Dec 2011 02:46:42 +0000  Active:1
Final status:  time: Sun, 18 Dec 2011 02:46:43 +0000  Finished successfully:1
3.4. Larger Runs on Beagle
A key factor in scaling up Swift runs on Beagle is setting the sites.xml parameters appropriately. The following sites.xml parameters must be set to match the scale intended for a large run:
- maxTime: The expected walltime for completion of your run, in seconds.
- slots: The maximum number of PBS jobs/blocks that the coaster scheduler will have running at any given time. On Beagle, this number determines how many qsubs Swift will submit for your run. Typical values range between 40 and 60 for large runs.
- nodeGranularity: The number of nodes per job. It restricts the number of nodes in a job to a multiple of this value. The total number of workers will then be a multiple of jobsPerNode * nodeGranularity. For Beagle, jobsPerNode is 24, corresponding to its 24 cores per node.
- maxNodes: The maximum number of nodes a job will pack into its qsub. This parameter determines the largest single job that your run will submit.
- jobThrottle: A factor that determines the number of tasks dispatched simultaneously. The intended number of simultaneous tasks should match the number of cores targeted. The number of tasks is calculated from the jobThrottle factor as follows (a worked example follows the sites.xml below):

Number of Tasks = (jobThrottle x 100) + 1
The following is an example sites.xml for a 50-slot run with each slot occupying 4 nodes (thus, a 200-node run):
<config> <pool handle="pbs"> <execution provider="coaster" jobmanager="local:pbs"/> <profile namespace="globus" key="project">CI-CCR000013</profile> <profile namespace="globus" key="ppn">24:cray:pack</profile> <!-- For swift 0.93 <profile namespace="globus" key="ppn">pbs.aprun;pbs.mpp;depth=24</profile> --> <profile namespace="globus" key="jobsPerNode">24</profile> <profile namespace="globus" key="maxTime">50000</profile> <profile namespace="globus" key="slots">50</profile> <profile namespace="globus" key="nodeGranularity">4</profile> <profile namespace="globus" key="maxNodes">4</profile> <profile namespace="karajan" key="jobThrottle">48.00</profile> <profile namespace="karajan" key="initialScore">10000</profile> <filesystem provider="local"/> <workdirectory >/lustre/beagle/ketan/swift.workdir</workdirectory> </pool> </config>
3.5. Troubleshooting
In this section we discuss some common issues encountered when using Swift on Beagle, along with their remedies. These issues can originate from Swift, from Beagle's configuration and state, or from user configuration, among other factors.
- Command not found: Swift is installed on Beagle as a module. If you see the following error message:
If 'swift' is not a typo you can run the following command to lookup the package that contains the binary:
    command-not-found swift
-bash: swift: command not found
The most likely cause is that the module is not loaded. Load the Swift module as follows:
$ module load swift sun-java
Swift version swift-0.93 loaded
sun-java version jdk1.7.0_02 loaded
- Failed to transfer wrapperlog for job cat-nmobtbkk and/or Job failed with an exit code of 254: Check the <workdirectory> element in the sites.xml file.
<workdirectory>/home/ketan/swift.workdir</workdirectory>
It is likely set to a path where the compute nodes cannot write, e.g., your /home directory. The remedy is to set your workdirectory to a path under /lustre, where Swift can write from the compute nodes.
<workdirectory>/lustre/beagle/ketan/swift.workdir</workdirectory>
4. Amazon EC2 Cloud
Please refer to the Swift Tutorial for Cloud and Ad hoc Resources.
5. Fusion (x86 cluster)
Fusion is a 320-node computing cluster operated by Argonne's Laboratory Computing Resource Center (LCRC). The primary goal of the LCRC is to facilitate mid-range computing in all of the scientific programs of Argonne and the University of Chicago.
This section will walk you through running a simple Swift script on Fusion.
5.1. Requesting Access
If you do not already have a Fusion account, you can request one at https://accounts.lcrc.anl.gov/request.php. Email support@lcrc.anl.gov for additional help.
5.2. Projects
In order to run a job on a Fusion compute node, you must first be associated with a project.
Each project has one or more Primary Investigators, or PIs. These PIs are responsible for adding and removing users to a project. Contact the PI of your project to be added.
More information on this process can be found at http://www.lcrc.anl.gov/info/Projects.
5.3. SSH Keys
Before accessing Fusion, be sure to have your SSH keys configured correctly. SSH keys are required to access fusion. You should see information about this when you request your account. Check ssh FAQ or email support@lcrc.anl.gov for additional help.
5.4. Connecting to a login node
Once your keys are configured, you should be able to access a Fusion login node with the following command:
ssh <yourusername>@fusion.lcrc.anl.gov
5.5. Creating sites.xml
Swift uses various configuration files to determine how to run an application. This section will provide a working configuration file which you can copy and paste to get running quickly. The sites.xml file tells Swift how to submit jobs, where working directories are located, and various other configuration information. More information on sites.xml can be found in the Swift documentation.
The first step is to paste the text below into a file named sites.xml.
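The sites.xml text referenced above is not reproduced in this copy of the guide. The following is a minimal sketch modeled on the coaster examples elsewhere in this guide; it assumes Fusion is driven through the local:pbs coaster job manager and that a Fusion node provides 8 cores (adjust jobsPerNode, queue, and walltime for your runs). The _WORK_ placeholder is customized in the next step.

<config>
  <pool handle="fusion">
    <execution provider="coaster" jobmanager="local:pbs"/>
    <profile namespace="globus" key="queue">shared</profile>
    <!-- Assumed core count per Fusion node; adjust if needed -->
    <profile namespace="globus" key="jobsPerNode">8</profile>
    <profile namespace="globus" key="maxTime">3600</profile>
    <profile namespace="globus" key="nodeGranularity">1</profile>
    <profile namespace="globus" key="maxNodes">1</profile>
    <profile namespace="karajan" key="jobThrottle">.07</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <filesystem provider="local"/>
    <workdirectory>_WORK_</workdirectory>
  </pool>
</config>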
This file requires one customization. Create a directory called swiftwork, then modify _WORK_ in sites.xml to point to this new directory. For example:
<workdirectory>/tmp/swiftwork</workdirectory>
5.6. Creating tc
The tc configuration file gives information about the applications that will be called by Swift. More information about the format of tc can be found in the Swift user guide.
Paste the following example into a file named tc
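The tc text referenced above is also not reproduced in this copy of the guide. A minimal one-line sketch, following the format used in the Beagle section and assuming the pool handle "fusion" from the sites.xml sketch above, would be:

fusion cat /bin/cat null null null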
5.7. Copy a Swift Script
Within the Swift directory is an examples directory which contains several introductory Swift scripts. The example we will use in this section is called catsn.swift. The script copies an input file's contents to another file using the Unix cat utility. Copy this script and its input file into the same directory where your sites.xml and tc files are located.
$ cp ~/swift/examples/misc/catsn.swift .
$ cp ~/swift/examples/misc/data.txt .
The location of your swift directory may vary depending on how you installed it. Change this to the examples/misc directory of your installation as needed.
5.8. Run Swift
Finally, run the script
$ swift -sites.file sites.xml -tc.file tc catsn.swift
You should see 10 new text files get created, named catsn*.out. If you see these files, then you have successfully run Swift on Fusion!
5.9. Queues
Fusion has two queues: shared and batch. The shared queue has a maximum walltime of 1 hour and is limited to 4 nodes. The batch queue is for all other jobs.
Edit the queue option in your sites.xml file to control which queue Swift submits to. For example:
<profile namespace="globus" key="queue">batch</profile>
More information on Fusion queues can be found at http://www.lcrc.anl.gov/info/BatchJobs.
5.10. More Help
The best place for additional help is the Swift user mailing list. You can subscribe to this list at swift-user. When submitting information, send your sites.xml file, your tc.data, and any Swift log files that were created during your attempt.
6. Futuregrid Cloud
Please refer to the Swift Tutorial for Cloud and Ad hoc Resources.
7. Geyser and Caldera (x86 clusters)
The Geyser and Caldera clusters are specialized data analysis and visualization resources at NCAR (National Center for Atmospheric Research).
More information about these clusters can be found at http://www2.cisl.ucar.edu/resources/geyser_caldera.
7.1. Example sites.xml
The following sites.xml has the basic definitions for how to run a job using the LSF scheduler.
<config> <pool handle="geyser"> <filesystem provider="local"/> <execution provider="coaster" jobmanager="local:lsf"/> <profile namespace="globus" key="jobsPerNode">1</profile> <profile namespace="globus" key="maxTime">3600</profile> <profile namespace="globus" key="maxwalltime">00:05</profile> <profile namespace="globus" key="lowOverallocation">100</profile> <profile namespace="globus" key="highOverallocation">100</profile> <profile namespace="globus" key="nodeGranularity">1</profile> <profile namespace="globus" key="maxNodes">1</profile> <profile namespace="globus" key="project">P93300606</profile> <profile namespace="globus" key="queue">small</profile> <profile namespace="karajan" key="jobThrottle">4</profile> <profile namespace="karajan" key="initialScore">10000</profile> <workdirectory>/glade/scratch/davkelly</workdirectory> </pool> </config>
The values for workdirectory, queue, and project will likely need to be adjusted based on the requirements of your project.
8. Grids: Open Science Grid and TeraGrid
Please refer to the Swift Tutorial for OSGConnect.
9. Intrepid (Blue Gene/P)
Intrepid is an IBM Blue Gene/P (BG/P) supercomputer located at the Argonne Leadership Computing Facility. More information on Intrepid can be found at http://www.alcf.anl.gov. Surveyor and Challenger are similar, smaller machines.
9.1. Requesting Access
9.2. SSH Keys
Accessing Intrepid via SSH can be done with any SSH software package. Before logging in, you will need to generate an SSH public key and send it to support@alcf.anl.gov for verification and installation.
9.3. Cryptocard
This security token uses one-time passwords for controlled access to the BG/P login systems.
9.4. Connecting to a login node
When you gain access to Intrepid, you should receive a cryptocard and a temporary PIN. You must have a working cryptocard, know your PIN, and have your SSH key in place before you may login.
You can connect to Intrepid with the following command:
ssh yourusername@intrepid.alcf.anl.gov
You will be presented with a password prompt. The first part of your password is your PIN. Enter your PIN, press the Cryptocard button, and then enter the password your cryptocard generates. If this is the first time you are logging in, you will be prompted to change your PIN.
9.5. Downloading and building Swift
The most recent versions of Swift can be found at http://swiftlang.org/downloads/index.php. Follow the instructions provided on that site to download and build Swift.
9.6. Adding Swift to your PATH
Once you have installed Swift, add the Swift binary to your PATH so you can easily run it from any directory.
In your home directory, edit the file ".bashrc".
If you have installed Swift via a source repository, add the following line at the bottom of .bashrc.
export PATH=$PATH:$HOME/cog/modules/swift/dist/swift-svn/bin
If you have installed Swift via a binary package, add this line:
export PATH=$PATH:$HOME/swift-<version>/bin
Replace <version> with the actual name of the swift directory in the example above.
9.7. What You Need To Know Before Running Swift
Note that on Intrepid, the compute nodes cannot create or write to the /home filesystem. Consequently, for Swift to work correctly between the login node and the compute nodes, Swift must write all internal and intermediate files to /intrepid-fs0, which is writable by the compute nodes. To accomplish this, export the environment variable SWIFT_USERHOME as follows:
export SWIFT_USERHOME=/intrepid-fs0/users/`whoami`/scratch
Before you can create a Swift configuration file, there are some things you will need to know.
9.7.1. Swift Work Directory
The Swift work directory is a directory which Swift uses for processing work. This directory needs to be writable. Common options for this are:
/home/username/swiftwork
/home/username/work
/tmp/swift.work
9.7.2. Which project(s) are you a member of?
Intrepid requires that you are a member of a project. You can determine this by running the following command:
$ projects
HTCScienceApps
If you are not a member of a project, you must first request access to a project. More information on this process can be found at https://wiki.alcf.anl.gov/index.php/Discretionary_Allocations
9.7.3. Determine your Queue
Intrepid has several different queues you can submit jobs to depending on the type of work you will be doing. The command "qstat -q" will print the most up to date list of this information.
User Queue | Queue           | Nodes       | Time (hours) | User Maxrun | Project Maxrun
-----------|-----------------|-------------|--------------|-------------|---------------
prod-devel | prod-devel      | 64-512      | 0-1          | 5           | 20
prod       | prod-short      | 512-4096    | 0-6          | 5           | 20
prod       | prod-long       | 512-4096    | 6-12         | 5           | 20
prod       | prod-capability | 4097-32768  | 0-12         | 2           | 20
prod       | prod-24k        | 16385-24576 | 0-12         | 2           | 20
prod       | prod-bigrun     | 32769-40960 | 0-12         | 2           | 20
prod       | backfill        | 512-8192    | 0-6          | 5           | 10
9.8. Generating Configuration Files
Now that you know what queue to use, your project, and your work directory, it is time to set up Swift. Swift uses a configuration file called sites.xml to determine how it should run. There are two methods you can use for creating this file. You can manually edit the configuration file, or generate it with a utility called gensites.
9.8.1. Manually Editing sites.xml
Below is the template that is used by Swift’s test suite for running on Intrepid.
Copy and paste this template, replace the values, and call it sites.xml.
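The test-suite template itself is not reproduced in this copy of the guide. The sketch below only illustrates its general shape; it assumes the coaster provider drives Intrepid's Cobalt scheduler (local:cobalt), uses profile keys that appear elsewhere in this guide plus internalHostname for the callback address, and picks BG/P-flavored values (4 cores per node, 64-node granularity). Consult the Swift distribution for the authoritative template.

<config>
  <pool handle="intrepid">
    <execution provider="coaster" jobmanager="local:cobalt"/>
    <profile namespace="globus" key="internalHostname">_HOST_</profile>
    <profile namespace="globus" key="project">_PROJECT_</profile>
    <profile namespace="globus" key="queue">_QUEUE_</profile>
    <profile namespace="globus" key="jobsPerNode">4</profile>
    <profile namespace="globus" key="maxTime">3600</profile>
    <profile namespace="globus" key="nodeGranularity">64</profile>
    <profile namespace="globus" key="maxNodes">64</profile>
    <profile namespace="karajan" key="jobThrottle">2.55</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <filesystem provider="local"/>
    <workdirectory>_WORK_</workdirectory>
  </pool>
</config>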
The values to note here are the ones that are listed between underscores. In the example above, they are _HOST_, _PROJECT_, _QUEUE_, and _WORK_.
- HOST: The IP address on which Swift runs and to which workers must connect. To obtain this, run ifconfig and select the IP address that starts with 172.
- PROJECT: The project to use.
- QUEUE: The queue to use.
- WORK: The Swift work directory.
9.9. Manually Editing tc.data
Below is the tc.data file used by Swift's test suite. Copy this text and save it as tc.data.
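The test-suite tc.data is likewise not reproduced here. A minimal sketch, following the one-line format used in the Beagle section and assuming the pool handle "intrepid" from the sites.xml sketch above:

intrepid cat /bin/cat null null null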
9.10. Catsn.swift
The swift script we will run is called catsn.swift. It simply cats a file and saves the result. This is a nice simple test to ensure jobs are running correctly. Create a file called data.txt which contains some simple input - a "hello world" will do the trick.
type file;

app (file o) cat (file i)
{
  cat @i stdout=@o;
}

string t = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
string char[] = @strsplit(t, "");

file out[]<simple_mapper; location=".", prefix="catsn.", suffix=".out">;

foreach j in [1:@toInt(@arg("n","10"))] {
  file data<"data.txt">;
  out[j] = cat(data);
}
9.11. Running Swift
Now that everything is in place, run Swift with the following command:
swift -sites.file sites.xml -tc.file tc.data catsn.swift -n=10
You should see several new files being created, called catsn.0001.out, catsn.0002.out, etc. Each of these files should contain the contents of what you placed into data.txt. If this happens, your job has run successfully!
9.12. More Help
The best place for additional help is the Swift user mailing list. You can subscribe to this list. When submitting information, send your sites.xml file, your tc.data, and any Swift log files that were created during your attempt.
10. MCS Compute Servers (x86 workstations)
This section describes how to use the general-use compute servers of the MCS division of Argonne National Laboratory. Swift is available as a soft package on MCS machines. Add +swift to your .soft file and run the resoft command:
$ resoft
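A minimal sketch of the whole sequence is shown below; the placement of the +swift line within .soft can matter on some setups, so treat this as an illustration rather than the official MCS procedure.

# Add the Swift soft package to your softenv configuration
echo "+swift" >> $HOME/.soft
# Re-read .soft so the change takes effect in the current shell
resoft
# Verify that swift is now on your PATH
which swift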
10.1. Create a coaster-service.conf
To begin, copy the text below and paste it into your Swift distribution’s etc directory. Name the file coaster-service.conf.
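The coaster-service.conf text referenced above is not reproduced in this copy of the guide. The sketch below is modeled on the SSH-worker example in the Persistent Coasters section later in this guide; the hostnames, username, paths, and worker counts are placeholders to replace with real MCS machine names and your own directories.

export WORKER_MODE=ssh
export WORKER_USERNAME=yourusername
# Placeholder host names: replace with the MCS machines you want workers on
export WORKER_HOSTS="machine1.mcs.anl.gov machine2.mcs.anl.gov"
export WORKER_LOCATION=/tmp
export WORKER_LOG_DIR=/tmp
# Address of the machine running the coaster service
export IPADDR=your.ip.address.here
export JOBSPERNODE=8
export JOBTHROTTLE=0.0799
export WORK=$HOME/swiftwork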
10.2. Starting the Coaster Service
Change directories to the location you would like to run a Swift script and start the coaster service with this command:
start-coaster-service
This will create a configuration file that Swift needs called sites.xml.
Any existing sites.xml files in this directory will be overwritten. Be sure to make a copy of any custom configuration files you may have.
10.3. Run Swift
Next, run Swift. If you do not have a particular script in mind, you can test Swift by using a Swift script in the examples/ directory.
Run the following command to run the script:
swift -sites.file sites.xml -tc.file tc.data yourscript.swift
10.4. Stopping the Coaster Service
The coaster service will run indefinitely. The stop-coaster-service script will terminate the coaster service.
$ stop-coaster-service
This will kill the coaster service and kill the worker scripts on remote systems.
11. Midway (x86 cluster)
Midway is a cluster maintained by the Research Computing Center at the University of Chicago. Midway uses Slurm to handle job scheduling. For more details about Midway, and to request an account, visit http://rcc.uchicago.edu.
11.1. Connecting to a login node
Once you have access to Midway, connect to a Midway login node.
$ ssh userid@midway.rcc.uchicago.edu
11.2. Loading Swift
Swift is available on Midway as a module. To load Swift, run:
$ module load swift
11.3. Example sites.xml
Below is an example that uses two of the queues available on Midway, sandyb and westmere. Be sure to adjust walltime, work directory, and other options as needed.
<config> <pool handle="midway-sandyb"> <execution provider="coaster" jobmanager="local:slurm"/> <profile namespace="globus" key="jobsPerNode">16</profile> <profile namespace="globus" key="maxWalltime">00:05:00</profile> <profile namespace="globus" key="highOverAllocation">100</profile> <profile namespace="globus" key="lowOverAllocation">100</profile> <profile namespace="globus" key="queue">sandyb</profile> <profile namespace="karajan" key="initialScore">10000</profile> <filesystem provider="local"/> <workdirectory>/scratch/midway/{env.USER}/work</workdirectory> </pool> <pool handle="midway-westmere"> <execution provider="coaster" jobmanager="local:slurm"/> <profile namespace="globus" key="jobsPerNode">12</profile> <profile namespace="globus" key="maxWalltime">00:05:00</profile> <profile namespace="globus" key="highOverAllocation">100</profile> <profile namespace="globus" key="lowOverAllocation">100</profile> <profile namespace="globus" key="queue">westmere</profile> <profile namespace="karajan" key="initialScore">10000</profile> <filesystem provider="local"/> <workdirectory>/scratch/midway/{env.USER}/work</workdirectory> </pool> </config>
11.4. Example sites.xml for use with MPI
Below is an example sites.xml that is suitable for running with MPI. Jobtype must be set to single. The value you set for ppn will determine the number of cores/slots your application uses per node. The value of count will set the number of nodes to request. The example below requests 2 nodes with 12 slots per node.
<config> <pool handle="midway-westmere"> <execution provider="coaster" jobmanager="local:slurm"/> <profile namespace="globus" key="jobsPerNode">1</profile> <profile namespace="globus" key="ppn">12</profile> <profile namespace="globus" key="maxWalltime">_WALLTIME_</profile> <profile namespace="globus" key="highOverAllocation">100</profile> <profile namespace="globus" key="lowOverAllocation">100</profile> <profile namespace="globus" key="queue">westmere</profile> <profile namespace="karajan" key="initialScore">10000</profile> <profile namespace="globus" key="jobtype">single</profile> <profile namespace="globus" key="count">2</profile> <filesystem provider="local"/> <workdirectory>/scratch/midway/{env.USER}/work</workdirectory> </pool> </config>
11.5. Defining non-standard Slurm options
A Slurm submit script has many settings and options. Swift knows about many of the basic Slurm settings, like how to define a project or a queue, but it does not know about every setting. Swift therefore provides a simple way to pass your own settings through to the Slurm submit script.
The general way to do this is:
<profile namespace="globus" key="slurm.setting">value</profile>
Here is one specific example. Slurm can notify users via email when a job is done. To enable this, the Slurm submit script that Swift generates needs a line containing "--mail-type=END". The following profile entry adds that line.
<profile namespace="globus" key="slurm.mail-type">END</profile>
Any valid Slurm setting can be set in a similar way (see the sbatch man page for a list of all settings).
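For instance, to also direct those notification emails to a specific address (the address below is a placeholder), the same pass-through mechanism gives:

<profile namespace="globus" key="slurm.mail-user">you@example.edu</profile>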
11.6. Various tips for running MPI jobs
- You'll need to load an MPI module. Run "module load openmpi" to add it to your PATH.
- The app that Swift runs should be a wrapper script that invokes your MPI application by running "mpiexec /path/to/yourMPIApp", as in the sketch below.
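For example, a minimal wrapper script might look like the following (the module name and application path are placeholders):

#!/bin/bash
# Load the MPI environment on the compute node
module load openmpi
# Launch the actual MPI application, passing through any arguments Swift supplies
mpiexec /path/to/yourMPIApp "$@"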
12. Persistent Coasters
Coasters is a protocol that Swift uses for scheduling jobs and transferring data. In most configurations, coasters are used automatically when you run Swift. With persistent coasters, the coaster server runs outside of Swift.
This section describes a utility called start-coaster-service that allows you to configure persistent coasters.
12.1. Example 1: Starting workers locally
Below is the simplest example, where the coaster service is started, and workers are launched locally on the same machine.
First, create a file called coaster-service.conf with the configuration below.
export WORKER_MODE=local
export IPADDR=127.0.0.1
export JOBSPERNODE=1
export JOBTHROTTLE=0.0099
export WORK=$HOME/swiftwork
To start the coaster service and worker, run the command "start-coaster-service". Then run Swift with the newly generated sites.xml file.
$ start-coaster-service
Start-coaster-service...
Configuration: coaster-service.conf
Service address: 127.0.0.1
Starting coaster-service
Service port: 51099
Local port: 41764
Generating sites.xml
Starting worker on local machine

$ swift -sites.file sites.xml -tc.file tc.data hostsnsleep.swift
Swift trunk swift-r7153 cog-r3810
RunID: 20131014-1807-q6h89eq3
Progress:  time: Mon, 14 Oct 2013 18:07:13 +0000
Passive queue processor initialized. Callback URI is http://128.135.112.73:41764
Progress:  time: Mon, 14 Oct 2013 18:07:14 +0000  Active:1
Final status: Mon, 14 Oct 2013 18:07:15 +0000  Finished successfully:1
You can then run swift multiple times using the same coaster service. When you are finished and would like to shut down the coaster, run stop-coaster-service.
$ stop-coaster-service
Stop-coaster-service...
Configuration: coaster-service.conf
Ending coaster processes..
Killing process 8579
Done
When you define your apps/tc file, use the site name "persistent-coasters".
12.2. Example 2: Starting workers remotely via SSH
The start-coaster-service script can start workers on multiple remote machines. To do this, there are two main settings you need to define in your coaster-service.conf. The first is to set WORKER_MODE=ssh, and the second is to set WORKER_HOSTS to the list of machines where workers should be started.
export WORKER_MODE=ssh
export WORKER_USERNAME=yourusername
export WORKER_HOSTS="host1.example.edu host2.example.edu"
export WORKER_LOCATION="/homes/davidk/logs"
export IPADDR=swift.rcc.uchicago.edu
export JOBSPERNODE=1
export JOBTHROTTLE=0.0099
export WORK=/homes/davidk/swiftwork
If there is no shared filesystem available between the remote machines and the local machine, you will need to enable coaster provider staging to transport files for you. Below is an example Swift configuration file to enable it:
wrapperlog.always.transfer=false
sitedir.keep=false
execution.retries=0
lazy.errors=false
status.mode=provider
use.provider.staging=true
provider.staging.pin.swiftfiles=false
use.wrapper.staging=false
Run start-coaster-service to start the coaster service and the workers. When you run Swift, reference the cf file to enable provider staging.
$ start-coaster-service
Start-coaster-service...
Configuration: coaster-service.conf
Service address: swift.rcc.uchicago.edu
Starting coaster-service
Service port: 41714
Local port: 41685
Generating sites.xml
Starting worker on host1.example.edu
Starting worker on host2.example.edu

$ swift -sites.file sites.xml -tc.file tc.data -config cf hostsnsleep.swift
Swift trunk swift-r7153 cog-r3810
RunID: 20131014-1844-7flhik67
Progress:  time: Mon, 14 Oct 2013 18:44:43 +0000
Passive queue processor initialized. Callback URI is http://128.135.112.73:41685
Progress:  time: Mon, 14 Oct 2013 18:44:44 +0000  Selecting site:4  Finished successfully:4
Final status: Mon, 14 Oct 2013 18:44:45 +0000  Finished successfully:10
This requires that you are able to connect to the remote systems without being prompted for a password/passphrase. This is usually done with SSH keys. Please refer to SSH documentation for more information.
12.3. Example 3: Starting workers remotely via SSH, with multihop
This example is for a situation where you want to start a worker on nodes that you can’t connect to directly. If you have to first connect to a login/gateway machine before you can ssh to your worker machine, this configuration is for you.
The coaster-service.conf and cf files are the same as in Example 2.
Assume that node.host.edu is the machine where you want to start your worker, and that gateway.host.edu is the machine where you must log into first. Add the following to your $HOME/.ssh/config file:
Host node.host.edu
    Hostname node.host.edu
    ProxyCommand ssh -A username@gateway.host.edu nc %h %p 2> /dev/null
    ForwardAgent yes
    User username
This will allow you to SSH directly to node.host.edu. You can now add node.host.edu to WORKER_HOSTS.
12.4. Example 4: Starting workers remotely via SSH, with tunneling
The coaster workers need to be able to make a connection back to the coaster service. If you are running Swift on a machine behind a firewall where the workers cannot connect, you can use SSH reverse tunneling to allow this connection to happen.
To enable this, add the following line to your coaster-service.conf:
export SSH_TUNNELING=yes
12.5. Example 5: Starting workers remotely via SSH, hostnames in a file
The variable WORKER_HOSTS defines the list of hostnames where workers will be started. To set this to be the contents of a file, you can set WORKER_HOSTS as follows:
export WORKER_HOSTS=$( cat /path/to/hostlist.txt )
12.6. Example 6: Starting workers via a scheduler
To start workers via some other script, such as a scheduler submit script, export WORKER_MODE=scheduler. Once the coaster service has been initialized, start-coaster-service will run whatever user-defined command is specified in $SCHEDULER_COMMAND.
The contents of SCHEDULER_COMMAND will vary greatly based on your needs and the system you are running on. However, all SCHEDULER_COMMANDs will need to run the same command exactly once on each worker node:
$WORKER $WORKERURL logname $WORKER_LOG_DIR
Here is an example coaster-service.conf that runs on a campus cluster using the Slurm scheduler:
export WORKER_MODE=scheduler
export WORKER_LOG_DIR=/scratch/midway/$USER
export IPADDR=10.50.181.1
export JOBSPERNODE=1
export JOBTHROTTLE=0.0099
export WORK=$HOME/swiftwork
export SCHEDULER_COMMAND="sbatch start-workers.submit"
The SCHEDULER_COMMAND in this case submits a Slurm job script and starts the workers via the following commands:
#!/bin/bash

#SBATCH --job-name=start-workers
#SBATCH --output=start-workers.stdout
#SBATCH --error=start-workers.stderr
#SBATCH --nodes=1
#SBATCH --partition=westmere
#SBATCH --time=00:10:00
#SBATCH --ntasks-per-node=12
#SBATCH --exclusive

$WORKER $WORKERURL logname $WORKER_LOG_DIR
12.7. List of all coaster-service.conf settings
Below is a list of all settings that start-coaster-service knows about, along with a brief description of what it does.
The settings are defined in terms of bash variables. Below is an example of the format used:
export WORKER_LOGGING_LEVEL=DEBUG
Below is a list of the options that coaster-service.conf recognizes and what they do.
12.7.1. IPADDR
Defines the IP address where the coaster service is running. Workers need to know this address in order to connect back to the service. Example: export IPADDR=192.168.2.12
12.7.2. LOCAL_PORT
Defines a static local port number. If undefined, this is generated randomly. Example: export LOCAL_PORT=50100
12.7.3. LOG
LOG sets the name of the local log file to be generated. This log file captures the standard output and standard error of the coaster-service and the other commands that start-coaster-service runs. This file can get large at times. To disable it, set "export LOG=/dev/null". Default value: start-coaster-service.log
12.7.4. SCHEDULER_COMMAND
In scheduler mode, this defines the command that start-coaster-service will run to start workers via the scheduler. Example: export SCHEDULER_COMMAND="qsub start-workers.submit".
12.7.5. SERVICE_PORT
Sets the coaster service port number. If undefined, this is generated randomly. Example: export SERVICE_PORT=50200
12.7.6. SSH_TUNNELING
When the machine you are running Swift on is behind a firewall that is blocking workers from connecting back to it, add "export SSH_TUNNELING=yes". This will set up a reverse tunnel to allow incoming connections. Default value: no.
12.7.7. WORKER_HOSTS
WORKER_HOSTS should contain the list of hostnames that start-coaster-service will connect to in order to start workers. This is only used when WORKER_MODE is ssh. Example: export WORKER_HOSTS="host1 host2 host3".
12.7.8. WORKER_LOCATION
In ssh mode, defines the directory on remote systems where the worker script will be copied to. Example: export WORKER_LOCATION=/tmp
12.7.9. WORKER_LOG_DIR
In ssh mode, defines the directory on the remote systems where worker logs will go. Example: export WORKER_LOG_DIR=/home/john/logs
12.7.10. WORKER_LOGGING_LEVEL
Defines the logging level of the worker script. Values can be "NONE", "TRACE", "DEBUG", "INFO", "WARN", or "ERROR". Example: export WORKER_LOGGING_LEVEL=NONE.
12.7.11. WORKER_USERNAME
In ssh mode, defines the username to use when connecting to each host defined in WORKER_HOSTS.
13. SSH
This section describes how to use the SSH provider to connect to remote sites and to handle data transfer.
13.1. Generate a unique SSH key
It is recommended that you create a new SSH key exclusively for this purpose. In order to avoid being prompted for passwords/passphrases, your SSH passphrase will be stored in a read-protected file. Run this command on the machine where you will be running Swift:
ssh-keygen -t dsa -f $HOME/.ssh/id_dsa-swift
You will be prompted to create a passphrase. This will create two files: $HOME/.ssh/id_dsa-swift and $HOME/.ssh/id_dsa-swift.pub.
13.2. Add your public key to the remote host
On the remote host where you will be running, edit or create the file $HOME/.ssh/authorized_keys. Paste the contents of the newly created $HOME/.ssh/id_dsa-swift.pub from the previous step at the end of that file.
13.3. Verify your new key works
From the host where you will be running Swift, run the following command to verify your keys are working:
$ ssh -o IdentitiesOnly=true -i $HOME/.ssh/id_dsa-swift user@login.remotehost.edu
You should be prompted for your new passphrase and be able to connect.
13.4. Create auth.defaults
Create a file called $HOME/.ssh/auth.defaults on the host where you are running Swift. Use the following commands to create this file:
$ touch $HOME/.ssh/auth.defaults
$ chmod 600 $HOME/.ssh/auth.defaults
Next, edit $HOME/.ssh/auth.defaults and add the following lines:
login.remotehost.edu.type=key
login.remotehost.edu.username=your_remote_username
login.remotehost.edu.key=/your/home/.ssh/id_dsa-swift
login.remotehost.edu.passphrase=your_passphrase
Replace login.remotehost.edu with the hostname you want to use, replace the values of "your_remote_username" and "your_passphrase", and set the correct path to the private key you generated.
13.5. Create a sites.xml file
Here is an example sites.xml file that will allow you to connect and transfer data to a remote host:
<config> <pool handle="remotehost"> <execution provider="coaster" jobmanager="ssh:local" url="login.remotehost.edu"/> <filesystem provider="ssh" url="login.remotehost.edu"/> <profile namespace="karajan" key="jobThrottle">0</profile> <profile namespace="karajan" key="initialScore">10000</profile> <workdirectory>/path/to/remote/workdirectory</workdirectory> </pool> </config>
This example will run work directly on login.remotehost.edu. In many cases you will not want to do this. You'll likely want to modify your sites.xml to use a remote scheduler, by setting jobmanager to ssh:pbs or ssh:slurm, for example. This usually also requires setting things like queues and walltimes. This example is provided for simplicity and testing.
13.6. Setting your properties
Since you want to transfer data via SSH, verify that you are not using any other file transfer mechanism. Make sure you have the following Swift properties defined in your configuration file:
use.provider.staging=false
use.wrapper.staging=false
14. SSH-CL
This section describes how to use the SSH command line provider (ssh-cl) to connect to remote sites.
14.1. Verify you can connect to the remote site
The first step of this process is to verify that you can connect to a remote site without being prompted for a password or passphrase.
$ ssh my.site.com
Typically to make this work you will need to add your SSH public key to the $HOME/.ssh/authorized_keys file on the remote system.
This SSH connection must work without specifying a username on the command line. If your username differs on your local machine and the remote machine, you will need to add an entry like this to your local $HOME/.ssh/config:
Host my.site.com
    Hostname my.site.com
    User myusername
14.2. Create a sites.xml file
Once you have verified that you can connect without prompt to the remote machine, you will need to create a sites.xml file that contains the host information. The example below will assume there is no scheduler on the remote system - it simply connects to the remote machine and runs work there.
<config> <pool handle="mysite"> <execution provider="coaster" jobmanager="ssh-cl:local" url="my.site.com"/> <profile namespace="globus" key="jobsPerNode">1</profile> <profile namespace="globus" key="lowOverAllocation">100</profile> <profile namespace="globus" key="highOverAllocation">100</profile> <profile namespace="karajan" key="jobThrottle">1</profile> <profile namespace="karajan" key="initialScore">10000</profile> <workdirectory>/home/username/work</workdirectory> </pool> </config>
This requires that the remote site can connect back to the machine where you are running Swift. If a firewall is configured to block incoming connections, this will not work correctly.
14.3. Enabling coaster provider staging
If there is a shared filesystem between the two machines, you can set your work directory to a location on that filesystem and skip this step. Otherwise, you will need to enable coaster provider staging.
To do this, add the following line to your "cf" file:
use.provider.staging=true
Then, to run Swift:
swift -sites.file sites.xml -tc.file tc.data -config cf script.swift
15. Stampede (x86 cluster)
Stampede is a cluster managed by the Texas Advanced Computing Center (TACC). It is a part of the XSEDE project. For information about how to request an account or project, how to log in, and how to manage SSH keys, see the Stampede User Guide.
15.1. Downloading and building Swift
The most recent versions of Swift can be found at the Swift downloads page. Follow the instructions provided on that site to download and build Swift.
15.2. Overview of How to Run
You will need to do the following steps in order to run:
- Connect to a system that has the Globus myproxy-init command. This will be the system where Swift runs and from where Swift submits jobs to Stampede.
- Obtain a grid certificate.
- Run Swift with configuration files that define how to start remote jobs on Stampede via GRAM.
15.3. Verify System Requirements and Environment
The system where you run Swift needs to have the myproxy-init tool installed. Ideally it should also have globus-job-run for testing purposes.
Swift uses two environment variables in order for remote job execution to work. They are $GLOBUS_HOSTNAME and $GLOBUS_TCP_PORT_RANGE.
GLOBUS_HOSTNAME should contain the full hostname of the system where you are running. It may also contain the IP address of the machine.
GLOBUS_TCP_PORT_RANGE defines a range of ports to which a remote system may connect. You will likely need this defined if you are behind a firewall with somewhat restricted access.
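For example (the hostname and port range below are placeholders; choose ports that your firewall actually allows):

export GLOBUS_HOSTNAME=swift.example.edu
export GLOBUS_TCP_PORT_RANGE=50000,51000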
15.4. Obtain a Grid Certificate
Once you have verified that you have everything you need on the submit host, you can obtain an XSEDE grid certificate with the following command:
$ myproxy-logon -l username -s myproxy.teragrid.org
15.5. Create sites.xml file
You may use the following example as a guide to run on Stampede. You will likely need to make a few modifications, as described below.
<config> <pool handle="stampede"> <execution provider="coaster" jobmanager="gt2:gt2:slurm" url="login5.stampede.tacc.utexas.edu:2119/jobmanager-slurm"/> <filesystem provider="gsiftp" url="gsiftp://gridftp.stampede.tacc.utexas.edu:2811"/> <profile namespace="globus" key="jobsPerNode">16</profile> <profile namespace="globus" key="ppn">16</profile> <profile namespace="globus" key="maxTime">3600</profile> <profile namespace="globus" key="maxwalltime">00:05:00</profile> <profile namespace="globus" key="lowOverallocation">100</profile> <profile namespace="globus" key="highOverallocation">100</profile> <profile namespace="globus" key="queue">normal</profile> <profile namespace="globus" key="nodeGranularity">1</profile> <profile namespace="globus" key="maxNodes">1</profile> <profile namespace="globus" key="project">yourproject</profile> <profile namespace="karajan" key="jobThrottle">.3199</profile> <profile namespace="karajan" key="initialScore">10000</profile> <workdirectory>/scratch/01503/yourusername</workdirectory> </pool> </config>
You will need to change the XSEDE project name to match the project that has been allocated to you. In most cases you'll want to set the work directory to your Stampede scratch directory; on Stampede this is defined in the environment variable $SCRATCH.
15.6. Running Swift
You may now run your Swift script exactly as you have before.
$ swift -sites.file sites.xml -tc.file tc -config cf myscript.swift
15.7. Debugging
If you are having problems getting this working correctly, there are a few places where you may look to help you debug. Since this configuration is slightly more complicated, there are several log files produced.
- The standard Swift log, created in your current working directory on the machine from which you are running. This will be named something along the lines of myscript-<datestring>.log.
- The bootstrap logs. These are located on Stampede in your home directory and have the name coaster-bootstrap-<datestring>.log.
- The coaster log. This is located on Stampede in $HOME/.globus/coasters/coasters.log.
- The GRAM log. This is located on Stampede in $HOME/gram-<datestring>.log.
For help in getting this configuration working, feel free to contact the Swift support team at swift-support@ci.uchicago.edu.
16. UC3 / OSGConnect (x86 cluster)
Please refer to the Swift Tutorial for OSGConnect.