Introduction: Why Parallel Scripting?

Swift is a simple scripting language for executing many instances of ordinary application programs on distributed parallel resources. Swift scripts run many copies of ordinary programs concurrently, using statements like this:

file out[];
foreach param,i in paramList {
  out[i] = simulate(param);
}

Swift acts like a structured "shell" language. It runs programs concurrently as soon as their inputs are available, reducing the need for complex parallel programming. Swift expresses your workflow in a portable fashion: The same script runs on multicore computers, clusters, clouds, grids, and supercomputers.

In this tutorial, you’ll be able to first try a few Swift examples on your local machine, to get a sense of the language.

Swift installation

  1. Download the file from http://swiftlang.org/packages/swift-0.95-RC6.tar.gz

  2. Extract by running "tar xfz swift-0.95-RC6.tar.gz"

  3. Add to PATH by running "export PATH=$PATH:/path/to/swift-0.95-RC6/bin"

Swift tutorial installation

Run the following commands to extract these tutorial scripts.

$ cd $HOME
$ wget http://swiftlang.org/tutorials/localhost/swift-localhost-tutorial.tar.gz
$ tar xvfz swift-localhost-tutorial.tar.gz
$ cd swift-localhost-tutorial

Run the tutorial setup script

$ source setup.sh   # You must run this with "source" !

To check out the tutorial scripts from SVN

If you later want to get the most recent version of this tutorial from the Swift Subversion repository, do:

$ svn co https://svn.ci.uchicago.edu/svn/vdl2/SwiftTutorials/swift-localhost-tutorial  <TODO: Fix link>

This will create a directory called "swift-localhost-tutorial" which contains all of the files used in this tutorial.

Simple "science applications" for the workflow tutorial

This tutorial is based on two intentionally trivial example programs, simulation.sh and stats.sh, (implemented as bash shell scripts) that serve as easy-to-understand proxies for real science applications. These "programs" behave as follows.

simulate.sh

The simulation.sh script serves as a trivial proxy for any more complex scientific simulation application. It generates and prints a set of one or more random integers in the range [0-2^62) as controlled by its command line arguments, which are:

$ ./app/simulate.sh --help
./app/simulate.sh: usage:
    -b|--bias       offset bias: add this integer to all results [0]
    -B|--biasfile   file of integer biases to add to results [none]
    -l|--log        generate a log in stderr if not null [y]
    -n|--nvalues    print this many values per simulation [1]
    -r|--range      range (limit) of generated results [100]
    -s|--seed       use this integer [0..32767] as a seed [none]
    -S|--seedfile   use this file (containing integer seeds [0..32767]) one per line [none]
    -t|--timesteps  number of simulated "timesteps" in seconds (determines runtime) [1]
    -x|--scale      scale the results by this integer [1]
    -h|-?|?|--help  print this help
$

All of thess arguments are optional, with default values indicated above as [n].

With no arguments, simulate.sh prints 1 number in the range of 1-100. Otherwise it generates n numbers of the form (R*scale)+bias where R is a random integer. By default it logs information about its execution environment to stderr. Here’s some examples of its usage:

$ simulate.sh 2>log
       5
$ head -4 log

Called as: /home/wilde/swift/tut/CIC_2013-08-09/app/simulate.sh:
Start time: Thu Aug 22 12:40:24 CDT 2013
Running on node: login01.osgconnect.net

$ simulate.sh -n 4 -r 1000000 2>log
  239454
  386702
   13849
  873526

$ simulate.sh -n 3 -r 1000000 -x 100 2>log
 6643700
62182300
 5230600

$ simulate.sh -n 2 -r 1000 -x 1000 2>log
  565000
  636000

$ time simulate.sh -n 2 -r 1000 -x 1000 -t 3 2>log
  336000
  320000
real    0m3.012s
user    0m0.005s
sys     0m0.006s

stats.sh

The stats.sh script serves as a trivial model of an "analysis" program. It reads N files each containing M integers and simply prints the\ average of all those numbers to stdout. Similarly to simulate.sh it logs environmental information to the stderr.

$ ls f*
f1  f2  f3  f4

$ cat f*
25
60
40
75

$ stats.sh f* 2>log
50

Basics of the Swift language with local execution

A Summary of Swift in a nutshell

  • Swift scripts are text files ending in .swift The swift command runs on any host, and executes these scripts. swift is a Java application, which you can install almost anywhere. On Linux, just unpack the distribution tar file and add its bin/ directory to your PATH.

  • Swift scripts run ordinary applications, just like shell scripts do. Swift makes it easy to run these applications on parallel and remote computers (from laptops to supercomputers). If you can ssh to the system, Swift can likely run applications there.

  • The details of where to run applications and how to get files back and forth are described in configuration files separate from your program. Swift speaks ssh, PBS, Condor, SLURM, LSF, SGE, Cobalt, and Globus to run applications, and scp, http, ftp, and GridFTP to move data.

  • The Swift language has 5 main data types: boolean, int, string, float, and file. Collections of these are dynamic, sparse arrays of arbitrary dimension and structures of scalars and/or arrays defined by the type declaration.

  • Swift file variables are "mapped" to external files. Swift sends files to and from remote systems for you automatically.

  • Swift variables are "single assignment": once you set them you can’t change them (in a given block of code). This makes Swift a natural, "parallel data flow" language. This programming model keeps your workflow scripts simple and easy to write and understand.

  • Swift lets you define functions to "wrap" application programs, and to cleanly structure more complex scripts. Swift app functions take files and parameters as inputs and return files as outputs.

  • A compact set of built-in functions for string and file manipulation, type conversions, high level IO, etc. is provided. Swift’s equivalent of printf() is tracef(), with limited and slightly different format codes.

  • Swift’s foreach {} statement is the main parallel workhorse of the language, and executes all iterations of the loop concurrently. The actual number of parallel tasks executed is based on available resources and settable "throttles".

  • In fact, Swift conceptually executes all the statements, expressions and function calls in your program in parallel, based on data flow. These are similarly throttled based on available resources and settings.

  • Swift also has if and switch statements for conditional execution. These are seldom needed in simple workflows but they enable very dynamic workflow patterns to be specified.

We’ll see many of these points in action in the examples below. Lets get started!

Part 1: Run a single application under Swift

The first swift script, p1.swift, runs simulate.sh to generate a single random number. It writes the number to a file.

p1 workflow
p1.swift
type file;

app (file o) simulation ()
{
  simulate stdout=filename(o);
}

file f <"sim.out">;
f = simulation();

To run this script, run the following command:

$ cd part01
$ swift p1.swift
Swift 0.95 Branch SVN swift-r7903 cog-r3908
RunID: run001
Progress: Tue, 03 Jun 2014 15:01:28-0500
Final status:Tue, 03 Jun 2014 15:01:29-0500  Finished successfully:1
$ cat sim.out
      84
$ swift p1.swift
$ cat sim.out
      36

To cleanup the directory and remove all outputs (including the log files and directories that Swift generates), run the cleanup script which is located in the tutorial PATH:

$ cleanup
Note You’ll also find a Swift configuration files in each partNN directory of this tutorial. These specify the environment-specific details of where to find application programs and where to run them . These files will be explained in more detail in parts 4-6, and can be ignored for now.

Part 2: Running an ensemble of many apps in parallel with a "foreach" loop

The p2.swift script introduces the foreach parallel iteration construct to run many concurrent simulations.

part02.png
p2.swift
type file;

app (file o) simulation ()
{
  simulate stdout=filename(o);
}

foreach i in [0:9] {
  file f <single_file_mapper; file=strcat("output/sim_",i,".out")>;
  f = simulation();
}

The script also shows an example of naming the output files of an ensemble run. In this case, the output files will be named output/sim_N.out.

In part 2, we update the swift.properties file.

Instead of using shell script (simulate is a link to simulate.sh), we use the equivalent python version (simulate.py). Swift does not need to know anything about the language an application is written in. The application can be written in Perl, Python, Java, Fortran, or any other language.

The swift.properties file also defines properties that control the parallelism involved in the swift execution. By setting the tasksPerWorker to the desired number of parallel application invocations it is possible to take advantage of multiple cores in the processor. The taskThrottle variable is used to limit number of tasks sent to workers for execution. Both tasksPerWorker and taskThrottle are set to 2 in the tutorial and could be set to the number of cores available on your machine.

site=localhost
use.provider.staging=true
execution.retries=2

site.localhost {
   jobmanager=local
   filesystem=local
   initialscore=10000
   taskWalltime=00:02:00
   workdir=/tmp/swiftwork
   tasksPerWorker=2
   taskThrottle=2
}

app.localhost.simulate=simulate.py
app.localhost.stats=stats.py

To run the script and view the output:

$ cd ../part02
$ swift p2.swift
$ ls output
sim_0.out  sim_1.out  sim_2.out  sim_3.out  sim_4.out  sim_5.out  sim_6.out  sim_7.out  sim_8.out  sim_9.out
$ more output/*
::::::::::::::
output/sim_0.out
::::::::::::::
      44
::::::::::::::
output/sim_1.out
::::::::::::::
      55
...
::::::::::::::
output/sim_9.out
::::::::::::::
      82

Part 3: Analyzing results of a parallel ensemble

After all the parallel simulations in an ensemble run have completed, its typically necessary to gather and analyze their results with some kind of post-processing analysis program or script. p3.swift introduces such a postprocessing step. In this case, the files created by all of the parallel runs of simulation.sh will be averaged by by the trivial "analysis application" stats.sh:

part03.png
p3.swift
type file;

app (file o) simulation (int sim_steps, int sim_range, int sim_values)
{
  simulate "--timesteps" sim_steps "--range" sim_range "--nvalues" sim_values stdout=filename(o);
}

app (file o) analyze (file s[])
{
  stats filenames(s) stdout=filename(o);
}

int nsim   = toInt(arg("nsim","10"));
int steps  = toInt(arg("steps","1"));
int range  = toInt(arg("range","100"));
int values = toInt(arg("values","5"));

file sims[];

foreach i in [0:nsim-1] {
  file simout <single_file_mapper; file=strcat("output/sim_",i,".out")>;
  simout = simulation(steps,range,values);
  sims[i] = simout;
}

file stats<"output/average.out">;
stats = analyze(sims);

To run:

$ cd part03
$ swift p3.swift

Note that in p3.swift we expose more of the capabilities of the simulate.sh application to the simulation() app function:

app (file o) simulation (int sim_steps, int sim_range, int sim_values)
{
  simulate "--timesteps" sim_steps "--range" sim_range "--nvalues" sim_values stdout=filename(o);
}

p3.swift also shows how to fetch application-specific values from the swift command line in a Swift script using arg() which accepts a keyword-style argument and its default value:

int nsim   = toInt(arg("nsim","10"));
int steps  = toInt(arg("steps","1"));
int range  = toInt(arg("range","100"));
int values = toInt(arg("values","5"));

Now we can specify that more runs should be performed and that each should run for more timesteps, and produce more that one value each, within a specified range, using command line arguments placed after the Swift script name in the form -parameterName=value:

$ swift p3.swift -nsim=3 -steps=10 -values=4 -range=1000000
Swift 0.95 Branch SVN swift-r7903 cog-r3908
RunID: run001
Progress: Tue, 03 Jun 2014 17:06:58-0500
Progress: Tue, 03 Jun 2014 17:06:59-0500  Submitted:2  Active:1
Progress: Tue, 03 Jun 2014 17:07:00-0500  Submitted:1  Active:2
Progress: Tue, 03 Jun 2014 17:07:01-0500  Active:3
Progress: Tue, 03 Jun 2014 17:07:09-0500  Active:2  Finished successfully:1
Progress: Tue, 03 Jun 2014 17:07:10-0500  Active:1  Finished successfully:2
Final status:Tue, 03 Jun 2014 17:07:11-0500  Finished successfully:4

$ ls output/
average.out  sim_0.out  sim_1.out  sim_2.out
$ more output/*
::::::::::::::
output/average.out
::::::::::::::
651368
::::::::::::::
output/sim_0.out
::::::::::::::
  735700
  886206
  997391
  982970
::::::::::::::
output/sim_1.out
::::::::::::::
  260071
  264195
  869198
  933537
::::::::::::::
output/sim_2.out
::::::::::::::
  201806
  213540
  527576
  944233

Now try running (-nsim=) 100 simulations of (-steps=) 1 second each:

$ swift p3.swift -nsim=100 -steps=1
Swift 0.95 Branch SVN swift-r7903 cog-r3908
RunID: run002
Progress: Tue, 03 Jun 2014 17:08:05-0500
Progress: Tue, 03 Jun 2014 17:08:06-0500  Selecting site:79  Submitted:20  Active:1
Progress: Tue, 03 Jun 2014 17:08:07-0500  Selecting site:78  Submitted:18  Active:3  Finished successfully:1
Progress: Tue, 03 Jun 2014 17:08:08-0500  Selecting site:75  Submitted:17  Active:4  Finished successfully:4
Progress: Tue, 03 Jun 2014 17:08:09-0500  Selecting site:71  Submitted:16  Active:5  Finished successfully:8
Progress: Tue, 03 Jun 2014 17:08:11-0500  Selecting site:66  Submitted:15  Active:6  Finished successfully:13
Progress: Tue, 03 Jun 2014 17:08:12-0500  Selecting site:60  Submitted:14  Active:7  Finished successfully:19
Progress: Tue, 03 Jun 2014 17:08:13-0500  Selecting site:53  Submitted:13  Active:8  Finished successfully:26
Progress: Tue, 03 Jun 2014 17:08:14-0500  Selecting site:45  Submitted:12  Active:9  Finished successfully:34
Progress: Tue, 03 Jun 2014 17:08:15-0500  Selecting site:36  Submitted:11  Active:10  Finished successfully:43
Progress: Tue, 03 Jun 2014 17:08:16-0500  Selecting site:26  Submitted:10  Active:11  Finished successfully:53
Progress: Tue, 03 Jun 2014 17:08:17-0500  Selecting site:15  Active:21  Finished successfully:64
Progress: Tue, 03 Jun 2014 17:08:18-0500  Active:16  Finished successfully:84
Progress: Tue, 03 Jun 2014 17:08:19-0500  Active:11  Finished successfully:89
Final status:Tue, 03 Jun 2014 17:08:19-0500  Finished successfully:101

Part 4: Specifying more complex workflow patterns

p4.swift expands the workflow pattern of p3.swift to add additional stages to the workflow. Here, we generate a dynamic seed value that will be used by all of the simulations, and for each simulation, we run an pre-processing application to generate a unique "bias file". This pattern is shown below, followed by the Swift script.

part06.png
p4.swift
type file;

# app() functions for application programs to be called:

app (file out) genseed (int nseeds, file seed_script)
{
  bash "simulate.sh" "-r" 2000000 "-n" nseeds stdout=@out;
}

app (file out) genbias (int bias_range, int nvalues, file bias_script)
{
  bash "simulate.sh" "-r" bias_range "-n" nvalues stdout=@out;
}

app (file out, file log) simulation (int timesteps, int sim_range,
                                     file bias_file, int scale, int sim_count,
                                     file sim_script, file seed_file)
{
  bash "simulate.sh" "-t" timesteps "-r" sim_range "-B" @bias_file "-x" scale
           "-n" sim_count "-S" @seed_file stdout=@out stderr=@log;
}

app (file out, file log) analyze (file s[], file stat_script)
{
  bash "stats.sh" filenames(s) stdout=@out stderr=@log;
}

# Command line arguments

int  nsim  = toInt(arg("nsim",   "10"));  # number of simulation programs to run
int  steps = toInt(arg("steps",  "1"));   # number of timesteps (seconds) per simulation
int  range = toInt(arg("range",  "100")); # range of the generated random numbers
int  values = toInt(arg("values", "10"));  # number of values generated per simulation

# Main script and data

file simulate_script <"simulate.sh">;
file stats_script <"stats.sh">;
file seedfile <"output/seed.dat">;        # Dynamically generated bias for simulation ensemble

tracef("\n*** Script parameters: nsim=%i range=%i num values=%i\n\n", nsim, range, values);
seedfile = genseed(1,simulate_script);

file sims[];                      # Array of files to hold each simulation output

foreach i in [0:nsim-1] {
  file biasfile <single_file_mapper; file=strcat("output/bias_",i,".dat")>;
  file simout   <single_file_mapper; file=strcat("output/sim_",i,".out")>;
  file simlog   <single_file_mapper; file=strcat("output/sim_",i,".log")>;
  biasfile = genbias(1000, 20, simulate_script);
  (simout,simlog) = simulation(steps, range, biasfile, 1000000, values, simulate_script, seedfile);
  sims[i] = simout;
}

file stats_out<"output/average.out">;
file stats_log<"output/average.log">;
(stats_out,stats_log) = analyze(sims, stats_script);

Note that the workflow is based on data flow dependencies: each simulation depends on the seed value, calculated in this statement:

seedfile = genseed(1);

and on the bias file, computed and then consumed in these two dependent statements:

  biasfile = genbias(1000, 20, simulate_script);
  (simout,simlog) = simulation(steps, range, biasfile, 1000000, values, simulate_script, seedfile);

To run:

$ cd ../part04
$ swift p4.swift

The default parameters result in the following execution log:

$ swift p4.swift
Swift 0.95 Branch SVN swift-r7903 cog-r3908
RunID: run001
Progress: Thu, 05 Jun 2014 06:01:36-0500

*** Script parameters: nsim=10 range=100 num values=10

Progress: Thu, 05 Jun 2014 06:01:37-0500  Submitted:8  Active:2  Finished successfully:11
Progress: Thu, 05 Jun 2014 06:01:38-0500  Submitted:6  Active:2  Finished successfully:13
Progress: Thu, 05 Jun 2014 06:01:39-0500  Submitted:4  Active:2  Finished successfully:15
Progress: Thu, 05 Jun 2014 06:01:40-0500  Submitted:2  Active:2  Finished successfully:17
Progress: Thu, 05 Jun 2014 06:01:41-0500  Active:2  Finished successfully:19
Final status:Thu, 05 Jun 2014 06:01:41-0500  Finished successfully:22

which produces the following output:

$ ls -lrt output
total 264
-rw-r--r-- 1 p01532 61532     9 Aug 27 19:17 seed.dat
-rw-r--r-- 1 p01532 61532   180 Aug 27 19:17 bias_9.dat
-rw-r--r-- 1 p01532 61532   180 Aug 27 19:17 bias_8.dat
-rw-r--r-- 1 p01532 61532   180 Aug 27 19:17 bias_7.dat
-rw-r--r-- 1 p01532 61532   180 Aug 27 19:17 bias_6.dat
-rw-r--r-- 1 p01532 61532   180 Aug 27 19:17 bias_5.dat
-rw-r--r-- 1 p01532 61532   180 Aug 27 19:17 bias_4.dat
-rw-r--r-- 1 p01532 61532   180 Aug 27 19:17 bias_3.dat
-rw-r--r-- 1 p01532 61532   180 Aug 27 19:17 bias_2.dat
-rw-r--r-- 1 p01532 61532   180 Aug 27 19:17 bias_1.dat
-rw-r--r-- 1 p01532 61532   180 Aug 27 19:17 bias_0.dat
-rw-r--r-- 1 p01532 61532    90 Aug 27 19:17 sim_9.out
-rw-r--r-- 1 p01532 61532 14897 Aug 27 19:17 sim_9.log
-rw-r--r-- 1 p01532 61532 14897 Aug 27 19:17 sim_8.log
-rw-r--r-- 1 p01532 61532    90 Aug 27 19:17 sim_7.out
-rw-r--r-- 1 p01532 61532    90 Aug 27 19:17 sim_6.out
-rw-r--r-- 1 p01532 61532 14897 Aug 27 19:17 sim_6.log
-rw-r--r-- 1 p01532 61532    90 Aug 27 19:17 sim_5.out
-rw-r--r-- 1 p01532 61532 14897 Aug 27 19:17 sim_5.log
-rw-r--r-- 1 p01532 61532    90 Aug 27 19:17 sim_4.out
-rw-r--r-- 1 p01532 61532 14897 Aug 27 19:17 sim_4.log
-rw-r--r-- 1 p01532 61532 14897 Aug 27 19:17 sim_1.log
-rw-r--r-- 1 p01532 61532    90 Aug 27 19:18 sim_8.out
-rw-r--r-- 1 p01532 61532 14897 Aug 27 19:18 sim_7.log
-rw-r--r-- 1 p01532 61532    90 Aug 27 19:18 sim_3.out
-rw-r--r-- 1 p01532 61532 14897 Aug 27 19:18 sim_3.log
-rw-r--r-- 1 p01532 61532    90 Aug 27 19:18 sim_2.out
-rw-r--r-- 1 p01532 61532 14898 Aug 27 19:18 sim_2.log
-rw-r--r-- 1 p01532 61532    90 Aug 27 19:18 sim_1.out
-rw-r--r-- 1 p01532 61532    90 Aug 27 19:18 sim_0.out
-rw-r--r-- 1 p01532 61532 14897 Aug 27 19:18 sim_0.log
-rw-r--r-- 1 p01532 61532     9 Aug 27 19:18 average.out
-rw-r--r-- 1 p01532 61532 14675 Aug 27 19:18 average.log

Each sim_N.out file is the sum of its bias file plus newly "simulated" random output scaled by 1,000,000:

$ cat output/bias_0.dat
     302
     489
      81
     582
     664
     290
     839
     258
     506
     310
     293
     508
      88
     261
     453
     187
      26
     198
     402
     555

$ cat output/sim_0.out
64000302
38000489
32000081
12000582
46000664
36000290
35000839
22000258
49000506
75000310

We produce 20 values in each bias file. Simulations of less than that number of values ignore the unneeded number, while simualtions of more than 20 will use the last bias number for all remoaining values past 20. As an exercise, adjust the code to produce the same number of bias values as is needed for each simulation. As a further exercise, modify the script to generate a unique seed value for each simulation, which is a common practice in ensemble computations.

Performing larger Swift runs

To test with larger runs, provide suitable command line arguments. The example below will run 1000 simulations with each simulation taking 5 seconds.

$ swift p4.swift -steps=5 -nsim=1000