Getting Started with the Stats HPC
====================================

This introductory guide explains how to use the system-wide PyTorch and TensorFlow Conda environments, using the example of running Slurm jobs on the **srf_gpu_01** cluster.

Overview
----------

The Department of Statistics HPC clusters are:

- srf_cpu_01 (shared CPU cluster)
- srf_gpu_01 (shared GPU cluster)
- swan (for research groups)

At present, two preconfigured Conda environments are available on the srf_gpu_01 cluster:

- /opt/conda/envs/pytorch-2025a (PyTorch, GPU-enabled)
- /opt/conda/envs/tensorflow-2025a (TensorFlow, GPU-enabled)

Connecting to the Stats HPC
----------------------------

To copy your files to the Stats HPC and to connect to it, please check the Intro HPC & Linux presentation slides, especially pp. 8-11.

Running your test PyTorch Slurm job
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Please copy the file **test_pytorch_gpu.sbatch** and/or **test_tensorflow_gpu.sbatch** (see below) to the **slurm-hn02** login node as shown in the Intro to HPC & Linux presentation slides. These sbatch files are Slurm job files.

.. code-block:: bash
   :caption: test_pytorch_gpu.sbatch
   :emphasize-lines: 3
   :linenos:

   #!/bin/bash
   #SBATCH --job-name=test_pytorch_gpu
   #SBATCH --mail-user=
   #SBATCH --mail-type=BEGIN,END,FAIL
   #SBATCH --partition=standard-gpu
   #SBATCH --clusters=srf_gpu_01
   #SBATCH --gres=gpu:1
   #SBATCH --cpus-per-task=2
   #SBATCH --mem=4G
   #SBATCH --time=00:10:00
   #SBATCH --output=test_pytorch_gpu_%j.out

   # Load the Conda module
   module load conda

   # Source conda.sh for non-interactive shell
   source /opt/conda/etc/profile.d/conda.sh

   # Activate PyTorch environment
   conda activate /opt/conda/envs/pytorch-2025a

   # Display GPU information
   echo "Running on host: $(hostname)"
   echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
   nvidia-smi

   # Run a PyTorch CUDA test
   python3 - <<'EOF'
   import torch
   print("PyTorch version:", torch.__version__)
   print("CUDA available:", torch.cuda.is_available())
   if torch.cuda.is_available():
       print("Using GPU:", torch.cuda.get_device_name(0))
       x = torch.rand(1000, 1000, device="cuda")
       y = torch.rand(1000, 1000, device="cuda")
       print("Tensor sum (GPU):", (x + y).sum().item())
   else:
       print("Running on CPU.")
   EOF

   sleep 300

.. code-block:: bash
   :caption: test_tensorflow_gpu.sbatch
   :emphasize-lines: 3
   :linenos:

   #!/bin/bash
   #SBATCH --job-name=test_tensorflow_gpu
   #SBATCH --mail-user=
   #SBATCH --mail-type=BEGIN,END,FAIL
   #SBATCH --partition=standard-gpu
   #SBATCH --clusters=srf_gpu_01
   #SBATCH --gres=gpu:1
   #SBATCH --cpus-per-task=2
   #SBATCH --mem=4G
   #SBATCH --time=00:10:00
   #SBATCH --output=test_tensorflow_gpu_%j.out

   # Load the Conda module
   module load conda

   # Source conda.sh for non-interactive shell
   source /opt/conda/etc/profile.d/conda.sh

   # Activate TensorFlow environment
   conda activate /opt/conda/envs/tensorflow-2025a

   # Display GPU information
   echo "Running on host: $(hostname)"
   echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
   nvidia-smi

   # Run a TensorFlow CUDA test
   python3 - <<'EOF'
   import tensorflow as tf
   print("TensorFlow version:", tf.__version__)
   print("GPU available:", tf.config.list_physical_devices('GPU'))
   if tf.config.list_physical_devices('GPU'):
       print("Using GPU:", tf.config.list_physical_devices('GPU')[0])
       with tf.device('/GPU:0'):
           a = tf.random.normal((1000, 1000))
           b = tf.random.normal((1000, 1000))
           c = tf.add(a, b)
       print("Tensor sum (GPU):", tf.reduce_sum(c).numpy())
   else:
       print("Running on CPU.")
   EOF

   sleep 300

You will need to use a command-line interface (CLI) text editor to open the files. If you have never used a Linux CLI text editor, nano is a great option for beginners. Here is a YouTube video that does a great job of introducing you to nano: https://www.youtube.com/watch?v=g2PU--TctAM

In the sbatch files above you see a number of ``#SBATCH`` lines. These are Slurm job parameters that define how Slurm is going to run your job on the cluster. You can edit these to your requirements. For now, I recommend you only change the ``--mail-user`` parameter to your email address and submit the Slurm job.

Explanation of #SBATCH Options
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``--job-name``
    Job name shown in queues/logs.
``--mail-user``
    Email address at which you want to receive Slurm notifications.

``--mail-type``
    Events that trigger a Slurm email notification (BEGIN, END, FAIL).

``--clusters``
    Cluster on which you want to run your Slurm job (e.g., srf_gpu_01).

``--partition``
    Partition/queue (e.g., standard-gpu).

``--gres=gpu:1``
    The number of GPUs you’re requesting (here, 1 GPU).

``--cpus-per-task``
    The number of CPU cores you want assigned per task.

``--mem``
    How much memory you’re requesting (e.g., 4G).

``--time``
    Maximum runtime/time limit (HH:MM:SS).

``--output``
    Where you want Slurm to write the Slurm job log file (%j = job ID).

The PyTorch test Slurm job
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

l.14 loads the cluster’s system-wide Conda module, and l.17 then sources its configuration. After that, l.20 activates the PyTorch environment. If you are new to using Conda and PyTorch in this way, it's recommended you do not change l.13-20. Ideally, the system Conda + PyTorch is sufficient for your needs. If it’s missing anything or not working as expected, please let me know at ithelp@stats.ox.ac.uk.

l.23-24 are self-explanatory, I think: ``echo`` is the print command for Linux. l.25, the ``nvidia-smi`` command, is a common way to query information about the GPUs on the cluster. You will find the output of all these commands in your Slurm job’s log file (see below).

l.28-39 is just a basic PyTorch script to check whether PyTorch detects the GPU. Later on, you would either paste your own PyTorch script into the Slurm job file, or just replace l.28-39 with a call to your Python file, e.g.:

.. code-block:: bash

   python3 my-pytorch-script.py

The ``sleep`` command in l.41 is not necessary going forward. I included it only to give you time to view your job running in the queue. The PyTorch script I used here is too basic to require much computing power, so the cluster would complete the job before you had time to run the ``squeue`` command (see below).
That is why I’ve included ``sleep 300``, which means “Do nothing, and just wait for 300 seconds.”

Submitting the Slurm job
^^^^^^^^^^^^^^^^^^^^^^^^^^^

On the login node (e.g., slurm-hn02), type the Slurm batch job submission command into the terminal.

For the PyTorch sbatch file:

.. code-block:: bash

   sbatch test_pytorch_gpu.sbatch

For the TensorFlow sbatch file:

.. code-block:: bash

   sbatch test_tensorflow_gpu.sbatch

When you press ENTER, if everything went well, the terminal will return a message like:

``Submitted batch job 9616 on cluster srf_gpu_01``

Each job is assigned a unique job ID by Slurm. In the above example, the job ID is 9616.

Viewing the Slurm job queue
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now that your job is running, you can view its status in the job queue:

.. code-block:: bash

   squeue --long -M srf_gpu_01

.. image:: img/squeue-long.png

In the screenshot you can see Slurm job 9616 running on the srf_gpu_01 cluster. If you see something similar for your job, congratulations, you’ve just submitted your first Slurm job to the Department of Statistics HPC!

If you set your email in the Slurm job file, you will receive two emails from the HPC: one when Slurm starts running the job, and one when the job completes.

**Job begins:**

.. image:: img/slurm-email-begin.png

**Job completed:**

.. image:: img/slurm-email-comp.png

Reviewing your job results
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Once this test Slurm job has completed, you can open (e.g., with nano) the Slurm job log file you defined in the job file. With a relative ``--output`` path such as the one above, Slurm writes the log to the directory the job was submitted from. In this example, it’s located in the folder **myslurmlogs/** and is called **test_pytorch_gpu_9616.out**, where 9616 is the Slurm job ID I received when I ran the job at the time of writing this tutorial. Your individual Slurm job IDs will be different and unique.

.. code-block:: bash

   nano ~/myslurmlogs/test_pytorch_gpu_......out

Output of test_pytorch_gpu_9616.out:

.. image:: img/slurm-job.out.png

The first two lines in the above screenshot are the outputs of:

- l.23 ``echo "Running on host: $(hostname)"``
- l.24 ``echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"``

Then, from “Fri Jul 25...” to “No running processes found..”, you have the output of the ``nvidia-smi`` command, which shows the GPU information for the cluster node this Slurm job ran on, in this case swangpu24.cpu.stats.ox.ac.uk. The remaining lines are from PyTorch.

If you need similar guides or software installations on the HPC, please let me know at ithelp@stats.ox.ac.uk.

Useful links on Getting Started with Linux, HPC, and Parallel Programming for AI/ML and HPC
--------------------------------------------------------------------------------------------

Linux
~~~~~~

* Introduction to Linux (LFS101)
  https://training.linuxfoundation.org/training/introduction-to-linux/

HPC
~~~~

* Introduction to Parallel Computing Tutorial
  https://hpc.llnl.gov/documentation/tutorials/introduction-parallel-computing-tutorial

Python (with PyTorch)
~~~~~~~~~~~~~~~~~~~~~~

* Getting Started with Distributed Data Parallel
  https://docs.pytorch.org/tutorials/intermediate/ddp_tutorial.html
* Introduction to High-Performance Computing in Python
  https://www.hpc-carpentry.org/hpc-python/

R (programming language)
~~~~~~~~~~~~~~~~~~~~~~~~~~

* R doParallel: A Brain-Friendly Introduction to Parallelism in R
  https://www.appsilon.com/post/r-doparallel
* CRAN Task View: High-Performance and Parallel Computing with R
  https://cran.r-project.org/web/views/HighPerformanceComputing.html