
Lab1 Construct simple cluster

Keep in mind

Always redirect the output of your commands to a file. Storing the information in files helps: you may need it later for reference or debugging.

# Redirect stdout and stderr to files
command > stdout.log 2> stderr.log

If you also want to see the output in your terminal, pipe it through tee, which prints to the terminal while writing to a file. Note that this merges stdout and stderr into a single stream:

# Merge stdout and stderr, write them to a file, and print them to the terminal
command 2>&1 | tee output.log
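If you need stdout and stderr in separate files while still seeing both live in the terminal, bash process substitution can do it. A sketch, where `demo` is a stand-in for your real command:

```shell
#!/usr/bin/env bash
# Demo command: one line to stdout, one to stderr.
demo() {
    echo "normal output"
    echo "error output" >&2
}

# Each stream goes to its own tee: stdout.log / stderr.log get separate
# copies, and both streams still reach the terminal.
demo > >(tee stdout.log) 2> >(tee stderr.log >&2)
sleep 1   # give the background tee processes time to flush
```

This requires bash (process substitution is not POSIX sh).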

0. Create a virtual machine

The VM installation procedure is skipped here.

If you can't stand the low display resolution of the VM console, see Configuring Network and SSH below and connect to the virtual machine over SSH instead.

1. Build and install OpenMPI from source

Follow the official tutorial OpenMPI: FAQ:Building Open MPI to install OpenMPI.

wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz
tar -xvf openmpi-4.1.1.tar.gz
cd openmpi-4.1.1
./configure --prefix=/usr/local
make all install

You may need sudo privileges to execute make all install because it writes to /usr/local/. After installation, the binaries are in /usr/local/bin, which is already on PATH, so no PATH changes are needed.

Remember directory paths

You'll need to configure paths to OpenMPI when compiling and running MPI programs. In my case, the paths are:

  • binary: /usr/local/bin/
  • header files: /usr/local/include
  • hostfile: /usr/local/etc/openmpi-default-hostfile

After finishing the lab, I realized I should have installed OpenMPI and the other components under a dedicated prefix, such as /usr/local/openmpi-4.1.1, so that their files are easier to locate.
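A sketch of that alternative layout (version number and paths are illustrative):

```shell
# Build into a version-specific prefix instead of plain /usr/local
./configure --prefix=/usr/local/openmpi-4.1.1
make -j"$(nproc)" all
sudo make install

# Make the tools and libraries visible (e.g. add to ~/.bashrc):
export PATH=/usr/local/openmpi-4.1.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi-4.1.1/lib:$LD_LIBRARY_PATH
```

With this layout, removing or upgrading OpenMPI is just a matter of deleting or swapping one directory.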

Test your OpenMPI installation

You can test your OpenMPI installation by executing mpirun:

mpirun --version

If you see the version information, then you've installed OpenMPI successfully.

If you want to test compiling and running MPI programs, try hello.c following the instructions in Using MPI with C.
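For reference, a minimal hello program written and run entirely from the shell — a sketch assuming mpicc and mpirun from the install above are on PATH:

```shell
# Write a minimal MPI program: each rank prints its rank and the world size.
cat > hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

mpicc hello.c -o hello    # compile with the MPI wrapper compiler
mpirun -np 4 ./hello      # launch 4 processes
```

If all four ranks print their greeting, both compilation and execution work.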

2. Build and install BLAS and CBLAS

Prepare tools:

sudo apt install build-essential gfortran cmake

Get the source archives and extract them. Note that tar -C requires the target directory to exist, and --strip-components=1 drops the versioned top-level folder so the paths below stay short:

wget "https://netlib.org/benchmark/hpl/hpl-2.3.tar.gz"
wget "http://www.netlib.org/blas/blas-3.11.0.tgz"
wget "http://www.netlib.org/blas/blast-forum/cblas.tgz"
mkdir -p hpl blas cblas
tar xvf hpl-2.3.tar.gz -C hpl --strip-components=1
tar xvf blas-3.11.0.tgz -C blas --strip-components=1
tar xvf cblas.tgz -C cblas --strip-components=1

Install BLAS

cd blas
make

After make finishes, you'll get the archive blas_LINUX.a in the folder. Copy it to /usr/local/lib (sudo may be needed):

sudo cp blas_LINUX.a /usr/local/lib/libblas.a

Run the BLAS test programs; they passed:

BLAS Test

Install CBLAS

cd ../cblas
more README

Read the instructions in README to install CBLAS.

cp Makefile.LINUX Makefile.in
vim Makefile.in

Find the line BLASLIB = ../../blas$(PLAT).a and change it to BLASLIB = /usr/local/lib/libblas.a, then save and exit. Execute make to get the archive cblas_LINUX.a in the folder, and copy it to /usr/local/lib:

make
sudo cp cblas_LINUX.a /usr/local/lib/libcblas.a
Error when compiling CBLAS

If executing make in the CBLAS folder stops at the testing stage with an error message like:

Error: rank mismatch in argument 'strue1' at (1) (rank-1 and scalar)

CBLAS Error

The reason is that the test code contains legacy Fortran constructs that gfortran 10 and later reject. You can modify the Makefile in the testing folder, adding -std=legacy or -fallow-argument-mismatch to the Fortran flags.
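For example, a sketch of the second workaround — the variable holding the Fortran flags in your CBLAS tree may be named differently (mine used FFLAGS in Makefile.in):

```shell
# Append the workaround flag to the existing Fortran flags line,
# then rebuild. Adjust the file/variable name if your tree differs.
sed -i 's/^FFLAGS *=.*/& -fallow-argument-mismatch/' Makefile.in
make
```

-fallow-argument-mismatch downgrades the mismatch error back to a warning; -std=legacy is the broader alternative.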

After the edit, run make again and you'll see the tests pass.

CBLAS Test

3. Build HPL

There are some differences between the experiment manual and the INSTALL file; follow the manual.

cp setup/Make.Linux_PII_CBLAS Make.test

Edit some lines in Make.test:

# diff from Make.test and Make.Linux_PII_CBLAS
64c64
< ARCH         = test
---
> ARCH         = Linux_PII_CBLAS
84,86c84,86
< MPdir        =
< MPinc        = -I/usr/local/include
< MPlib        = /usr/local/lib/libmpi.so
---
> MPdir        = /usr/local/mpi
> MPinc        = -I$(MPdir)/include
> MPlib        = $(MPdir)/lib/libmpich.a
95,97c95,97
< LAdir        =
< LAinc        = #/home/bowling/CBLAS/include
< LAlib        = /usr/local/lib/libcblas.a /usr/local/lib/libblas.a /usr/lib/gcc/x86_64-linux-gnu/10/libgfortran.a /usr/lib/gcc/x86_64-linux-gnu/10/libquadmath.a /usr/lib/x86_64-linux-gnu/libm.a
---
> LAdir        = $(HOME)/netlib/ARCHIVES/Linux_PII
> LAinc        =
> LAlib        = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
169c169
< CC           = /usr/local/bin/mpicc
---
> CC           = /usr/bin/gcc
176c176
< LINKER       = $(CC)
---
> LINKER       = /usr/bin/g77
Error in linking

You may encounter the error message like this:

HPC_error

Search the web and you'll find the libraries that contain the missing symbols. They are:

  • /usr/lib/gcc/x86_64-linux-gnu/10/libgfortran.a
  • /usr/lib/gcc/x86_64-linux-gnu/10/libquadmath.a
  • /usr/lib/x86_64-linux-gnu/libm.a

Tracking down the missing symbols was hard, because this kind of undefined-reference error can come from many places, and fixes that work for others may not work for you.

I'm still wondering why these libraries were not linked in automatically; they are in the linker's default search path.

After make, you'll get the binary file xhpl in the folder bin/test.
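A quick single-node smoke test of the binary — xhpl reads HPL.dat from the current directory, and 4 processes is an illustrative count:

```shell
cd bin/test                      # the HPL.dat template lives next to xhpl
mpirun -np 4 ./xhpl | tee hpl.log
```

If the configuration is sane, each test line in the output should end with PASSED.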

4. Configuring network and SSH

To construct a cluster, we need a NAT network to connect the virtual machines. The host cannot reach a machine inside the NAT network directly, but we can use port forwarding:

Port Forwarding

Now we can ssh into the master machine:

ssh -p 3022 bowling@localhost
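The forwarding rule can also be created from the host's command line instead of the GUI dialog. A sketch, where the network name NatNetwork and the guest IP 10.0.2.15 are assumptions to adjust to your setup:

```shell
# Forward host port 3022 to port 22 of the guest at 10.0.2.15
# inside the NAT network named "NatNetwork".
# Rule format: name:proto:[host-ip]:host-port:[guest-ip]:guest-port
VBoxManage natnetwork modify --netname NatNetwork \
    --port-forward-4 "ssh:tcp:[]:3022:[10.0.2.15]:22"
```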

5. Experiment

Error: cannot open file HPL.dat

You may encounter error like this:

HPL.dat

Refer to this issue:

Final result:

Result

Bonus: Docker

Using Docker, we can skip the process of building everything. We only need to configure the SSH connections and write HPL.dat, then enjoy. DOCKERS ARE AWESOME.

Find docker

I use the image from ashael/hpl - Docker Image | Docker Hub, which contains a configured hpl-2.2.

Here I've started 4 containers from the same image:

docker image

Setting up connection

Containers are connected to a bridge network by default.

docker network

Attach to each of them and get their ip address:

docker ip

Select the host at 172.17.0.5 as the master and 172.17.0.4 through 172.17.0.2 as slaves.

In master:

ssh-keygen
cat ~/.ssh/id_rsa.pub

In slaves:

mkdir /var/run/sshd
vim /etc/ssh/sshd_config

find these lines:

#ListenAddress ::
#ListenAddress 0.0.0.0
#AuthorizedKeysFile %h/.ssh/authorized_keys

uncomment them and save.

Then add master's public key to authorized_keys and start sshd.
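A sketch of those two steps on a slave — the key string below is a placeholder for the actual output of cat ~/.ssh/id_rsa.pub on the master:

```shell
# Install the master's public key with the permissions sshd insists on,
# then start the SSH daemon.
mkdir -p ~/.ssh
echo "ssh-rsa AAAA... master-key" >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
/usr/sbin/sshd
```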

docker ssh

Succeeded.

Execute experiment
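The run itself can be sketched like this, assuming passwordless SSH works and xhpl plus HPL.dat sit at the same path in every container (the IPs are from my setup above):

```shell
# Hostfile listing the four containers; slots caps the ranks per host.
cat > hosts <<'EOF'
172.17.0.5 slots=1
172.17.0.4 slots=1
172.17.0.3 slots=1
172.17.0.2 slots=1
EOF

# Launch one rank on each container from the master.
mpirun --hostfile hosts -np 4 ./xhpl
```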

docker result

I found that computing in the containers is much faster than in the VM cluster. DOCKERS ARE EXTREMELY AWESOME.

Extended

Later in the course, instructors mentioned we can try to modify HPL.dat to get better performance and configure NFS to share files between nodes. I also tried to do these things.

Tuning HPL.dat

Refer to HPL Tuning.
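In my runs, the parameters with the biggest impact were the problem size N, the block size NB, and the process grid P x Q. An illustrative fragment of HPL.dat — the values are examples, not tuned for your machine, and line order must match the original file:

```
1            # of problems sizes (N)
20000        Ns
1            # of NBs
192          NBs
1            # of process grids (P x Q)
2            Ps
2            Qs
```

Rules of thumb from the tuning guide: pick N so the matrix fills roughly 80% of total memory (N ≈ sqrt(0.8 × mem_bytes / 8) for double precision), NB usually lands between 96 and 256, and choose P ≤ Q with P × Q equal to the number of MPI ranks.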

Configuring NFS File System

NFS uses the RPC protocol to share files between a server and its clients.

Setup NFS Server and Client

Please refer to this passage: How to Set Up an NFS Mount on Debian 11 | DigitalOcean.
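For reference, the server-side steps boil down to the following sketch — 10.0.2.15 is my server, and the subnet and path are assumptions to adjust:

```shell
# On the NFS server: install the server, create the shared directory,
# export it to the NAT subnet, and apply the export table.
sudo apt install nfs-kernel-server
sudo mkdir -p /hpc
echo "/hpc 10.0.2.0/24(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra
```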

More details about NFS can be seen in:

Have a test

First install nfs-common and create mount point on clients:

sudo apt install nfs-common
sudo mkdir /hpc

mount error

If you encounter the following error:

mount: bad option; for several filesystems (e.g. nfs, cifs) you might need a /sbin/mount.<type> helper program.

You need to install nfs-common on clients.

Add the following line to /etc/fstab on each client (10.0.2.15 is the NFS server's address):

/etc/fstab
10.0.2.15:/hpc /hpc nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
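Afterwards, mount everything from fstab and verify the share on the client:

```shell
sudo mount -a       # mount all fstab entries, including the NFS share
df -h /hpc          # should list 10.0.2.15:/hpc as the filesystem
```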

Then run the program and get the result: