Contents Different modes of Distributed TLC How Distributed TLC in Ad Hoc Mode Works Running the Master Running the Slave Computers Limitations of Distributed Mode Getting and Giving Help
TLC keeps fingerprints of all states that it has found, which it uses to determine if a newly computed state has already been examined. It writes the set of fingerprints to disk if it becomes too big to fit in memory. Having to read fingerprints from disk slows TLC down considerably. As explained below, you can have distributed TLC distribute the set of fingerprints to multiple slave computers (instead of keeping them on the master computer).
Unless you're just trying it out, you're running TLC in distributed mode because your model is quite large. For any large model, it's a good idea to give TLC as much of the computer's memory as is not needed by the operating system and other programs that will be running at the same time. This applies to TLC being run on the slaves and on the master if slave fingerprint servers are not used.
TLC is run on a model. To specify that it is to be run in distributed mode, select "ad hoc" in the Run in distributed mode drop down list in the How to run? section of the Model Overview Page. That section also allows you to adjust the amount of memory allocated to the master.
Your master computer is likely to have multiple network interface cards
and thus several different IP addresses. The Master's network address
drop down list shows all IP addresses the Toolbox could identify. Choose the one
to which the workers will be able to connect. The Toolbox tries guess the best
match and selects it by default.
It is up to you to configure your master's firewall to allow the workers incoming
connections.
You start a TLC run in distributed mode as usual by clicking on the button, by selecting Run model on the TLC Model Checker menu, or by typing F11. This should cause the Current status field of the Model Checking Results Page to change to Master is waiting for (remote) workers and then to indicate that one or more workers have registered with the master.
If you want TLC to store the fingerprint set among slave computers, you must run a (single) fingerprint server on each of them. Your model must tell TLC how many fingerprint servers it will use. If you want fingerprints stored on 5 fingerprint servers (5 slave computers), you increase the Number of distributed fingerprint sets spinner to 5. Setting it to 0, means that the master will store the fingerprints. The master uses more memory in this mode.
For distributed mode, you may also want to add special arguments to the command that launches the JVM that runs the master. As explained below, one such argument may be needed for the master and the workers to communicate. The arguments are put in the JVM arguments field of the model's Advanced Options page.
If your model will run for a long time (days or weeks), you may not want to
keep the Toolbox open for the entire run. Instead, you can run TLC from
a command line.
The easiest way to do this is to create the model
in the Toolbox, and validate it.
You can then run distributed TLC with the following command, where tool-path
is the complete pathname of the directory containing the file tla2tools.jar
and model-path
is is the complete pathname of the directory
Spec.toolbox/ModelName
, with
Spec
the specification's name and ModelName
the model's name
(see How TLC is Run):
java -cp tool-path tlc2.tool.distributed.TLCServer model-path/MCLike the JVM argument
-cp tool-path
, other JVM arguments can come between java
and tlc2.tool.distributed.TLCServer
in this command. TLC options follow
model-path/MC
.
The following are the basic steps that must be accomplished to start the slaves. The details of how these steps are performed will vary according to the operating system and network configuration that you are using. If you have trouble getting distributed TLC to work on your system, try finding help on the TLA+ Google group.
To run the slaves, you must have a network of computers that can communicate with one another and with the machine running the master and the Toolbox. We assume that there is some some remote management/administration system installed on the computers that run the workers. Examples of such a system are Remote Desktop (Windows), ssh, rsh, and telnet. A Java Runtime Environment (JRE), version 1.5 or later, must also be installed on the machines.
There are two basic ways to run a slave computer computer. In the following instructions,
master-computer
is the name (or IP address) of the master computer.
http://master-computer:10996This will display a Web page containing buttons to launch worker threads, a fingerprint server, or both on the slave.
You can launch the slaves with additional system properties and VM arguments
by adding them to the URL. (When you launch a slave from a command line,
system properties [which begin with -D
] and VM arguments are specified separately.) For example, the URL
http://localhost:10996/worker.jnlp?sysprops=-Diron.bar=5%20-Dfrob=2%20foo=&vmargs=-Xmx=512m%20-XX:MaxDirectMemorysize=512mcauses the worker to be launched in a VM with VM arguments
-XX:MaxDirectMemorysize=512m
and
-Xmx=512m
and system properties
-Diron.bar=5
and
-Dfrob=2
.
wget http://master-computer:10996/files/tla2tools.jarThen execute one of the following commands in that slave's shell:
java -cp tla2tools.jar tlc2.tool.distributed.TLCWorker master-computer
java -cp tla2tools.jar tlc2.tool.distributed.fp.DistributedFPSet master-computer
java -cp tla2tools.jar tlc2.tool.distributed.fp.TLCWorkerAndFPSet master-computer
The wget
command downloads the file
tla2tools.jar
into the current directory on the worker
machine, the java
command actually starts the worker. The
wget
command therefore just has to be executed the first
time you run a worker, and then whenever you install a new version of
the Toolbox. The wget
command is not part of
Windows, but can be installed on Windows as part of Cygwin.
You can add JVM arguments to the java
command, such as the argument
-Xmx7G
that
gives the slave 7 gigabytes of memory.
By default, each slave that runs workers runs as many worker threads as it has processors. Running that many threads on the slave might cause problems with some operating system. The following JVM argument causes each slave to run 2 threads:
-Dtlc2.tool.distributed.TLCWorker.threadCount=2(This JVM argument is only applicable for slaves and has no effect if passed to the master process).
jones-home-laptop
or
tla.msr-inria.inria.fr
. However, some computers have
multiple network interfaces, with different IP addresses--for example,
192.168.1.10
and 10.10.0.1
. If the master
computer is such a computer, then it's possible that using this name
as the master-computer name in the instructions above will cause the
worker not to be able to communicate with the master because they are
using different IP addresses. To solve this problem, choose one of
those IP addresses--say 192.168.1.10
. Use that address
as the master-computer name in the instructions above for running the
workers, add the following JVM argument
to the model:
-Djava.rmi.server.hostname=192.168.1.10
tla2tools.jar
file and
starting the workers. How this is done will depend on the
operating system and network configuration.
If you need help, try going to
the TLA+ Google group.
If you have successfully run TLC in distributed mode, please use that
Forum to tell others how you did it on your system.