Run 5 learners and 10 actors in a cluster¶
Setting up and running the complete system involves several steps. If anything is unclear, please report an issue in the issue tracker.
Install dependencies for the code¶
The complete system requires several dependencies:
- docker
- python libraries
- the learning code for neonrace, which is modified from openai/universe-starter-agent

An installation script is provided in universe-starter-agent/install/install.sh.
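Assuming the repository has already been cloned, the script can be run from the repository root (inspect install.sh first if your environment differs):

    sh universe-starter-agent/install/install.sh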
Modify the cluster configuration¶
The multiple-learner component is implemented with distributed TensorFlow. The learner configuration is hard-coded in universe-starter-agent/ccvl_cluster_spec.py; modify this file according to your cluster spec.
In the rest of this document, the parameter server runs on ccvl2, and the five machines ccvl1-5 run the learners (so ccvl2 hosts both the parameter server and a learner). The parameter server is responsible for coordinating weights between the learners.
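The exact contents of ccvl_cluster_spec.py are specific to this setup; as a rough sketch, a distributed-TensorFlow cluster spec for the layout above might look like the following (the host names match this example, but the port numbers are placeholders):

    # Sketch of a cluster spec for the layout described above.
    # Ports are placeholders; use whatever ccvl_cluster_spec.py defines.
    cluster = {
        "ps": ["ccvl2:12200"],            # parameter server
        "worker": [                       # one learner per machine
            "ccvl1:12300",
            "ccvl2:12300",
            "ccvl3:12300",
            "ccvl4:12300",
            "ccvl5:12300",
        ],
    }
    # tf.train.ClusterSpec(cluster) is what the ps and worker servers consume.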
Start the parameter server¶
On the parameter-server machine, ccvl2, start the parameter server with
cd universe-starter-agent/
sh run_ps.sh
universe-starter-agent/run_ps.sh starts ps_run.py with the proper parameters.
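The details of ps_run.py are in the repository; conceptually it only has to create a TensorFlow server for the ps job and block, roughly like this (the import of the spec dict is a guess at the interface, not the actual code):

    import tensorflow as tf
    from ccvl_cluster_spec import cluster  # assumed name of the spec dict

    # Join the cluster as the single parameter-server task and serve
    # shared weights to the learners until the process is killed.
    server = tf.train.Server(tf.train.ClusterSpec(cluster),
                             job_name="ps", task_index=0)
    server.join()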
Start five learners¶
On each machine from ccvl1-5, start the learner with
sh run_learner.sh 0
The argument 0 is the worker id for ccvl1; use 1 for ccvl2, and so on up to 4 for ccvl5, as shown below.
The learner will wait until all actors are connected.
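Assuming the worker ids simply follow the machine numbering, the five invocations would be:

    ccvl1$ sh run_learner.sh 0
    ccvl2$ sh run_learner.sh 1
    ccvl3$ sh run_learner.sh 2
    ccvl4$ sh run_learner.sh 3
    ccvl5$ sh run_learner.sh 4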
Start all actors and start learning¶
Start the docker containers that host the neonrace virtual environment. The following script starts two docker containers on the machine it runs on, each running a neonrace virtual environment (two containers on each of the five machines give the ten actors).
sh run_docker.sh
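The exact invocation is defined in run_docker.sh; the stock openai/universe flashgames remotes are usually launched with something along these lines (the image name, tag, and port mappings here are assumptions, so check the script for the real ones):

    docker run -d -p 5900:5900 -p 15900:15900 quay.io/openai/universe.flashgames
    docker run -d -p 5901:5900 -p 15901:15900 quay.io/openai/universe.flashgames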
Start the actor code with
sh run_actor.sh
run_actor.sh runs actor.py with the proper parameters.
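actor.py wires the environment into the learner; as a standalone sanity check that the docker remotes are reachable, the environment can also be driven directly with openai/universe, roughly as follows (the remote address is a placeholder for whatever ports run_docker.sh exposes):

    import gym
    import universe  # registers the flashgames.* environments

    env = gym.make('flashgames.NeonRace-v0')
    # Connect to an already-running container instead of spawning a new one.
    env.configure(remotes='vnc://localhost:5900+15900')

    observation_n = env.reset()
    while True:
        # Hold the up-arrow key in every sub-environment.
        action_n = [[('KeyEvent', 'ArrowUp', True)] for _ in observation_n]
        observation_n, reward_n, done_n, info = env.step(action_n)
        env.render()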
Check the learning result¶
The learning procedure can be visualized by connecting to a docker container through VNC.
Use a TurboVNC client to connect to ccvl1.ccvl.jhu.edu:13000; change the address to match your own configuration.
The learned models are stored in the train-log folder. Use tensorboard to visualize the results, or load the trained model with the code in neonrace.
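Assuming train-log is the directory containing the event files, tensorboard can be pointed at it directly:

    tensorboard --logdir train-log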