Lab1 Construct simple cluster¶
Keep in mind
Always redirect the output of your commands to a file. Storing the output in files helps, because you may need it later to look something up or to debug.
# Redirect stdout and stderr to files
command > stdout.log 2> stderr.log
If you want to see the information in your terminal, you can use tee to print the output to both stdout and files.
# Merge stderr into stdout, print the result to the terminal, and save a copy to a file
command 2>&1 | tee output.log
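When stdout and stderr should land in separate files while both still show up in the terminal, one tee is not enough, because a pipe carries only one stream. A minimal POSIX shell sketch of one way to do it, assuming a hypothetical stand-in function `demo_command` in place of the real command and a scratch directory for the logs:

```shell
# Hypothetical stand-in for the real command: one line on each stream.
demo_command() {
    echo "to stdout"
    echo "to stderr" >&2
}

log_dir=$(mktemp -d)   # scratch directory, only for this sketch

# fd 3 carries stdout around the inner pipe to the outer tee;
# stderr goes through the inner tee and back to the terminal's stderr.
{ demo_command 2>&1 1>&3 | tee "$log_dir/stderr.log" >&2; } 3>&1 | tee "$log_dir/stdout.log"
```

Afterwards `stdout.log` holds only the stdout line and `stderr.log` only the stderr line.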
0. Create a virtual machine¶
The installation procedure is skipped here.
If you can't bear low display resolution in terminals, see Configuring Network and SSH to use SSH to connect to the virtual machine.
1. Build and install OpenMPI from source¶
Follow the official tutorial OpenMPI: FAQ:Building Open MPI to install OpenMPI.
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz
tar -xvf openmpi-4.1.1.tar.gz
cd openmpi-4.1.1
./configure --prefix=/usr/local
make all install
You may need sudo privilege to execute make all install because it needs to change /usr/local/. After installation, the binary files are in /usr/local/bin and you don't need to change PATH.
Remember directory paths
You'll need to configure paths to OpenMPI when compiling and running MPI programs. In my case, the paths are:
- binary:
/usr/local/bin/
- header files:
/usr/local/include
- hostfile:
/usr/local/etc/openmpi-default-hostfile
After finishing the lab, I realized that I should have used a separate prefix for OpenMPI and the other components, so that their files can be found more easily. For example, I could have used /usr/local/openmpi-4.1.1 to hold the files.
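A sketch of that alternative layout, assuming the prefix /usr/local/openmpi-4.1.1 mentioned above. The variable name `MPI_HOME` is only illustrative. With a non-default prefix, the shell and the dynamic linker have to be told where the files are:

```shell
# Build step (not executed here), run inside the extracted source tree:
#   ./configure --prefix=/usr/local/openmpi-4.1.1
#   make all install

# MPI_HOME is an illustrative variable name, not something Open MPI requires.
MPI_HOME=/usr/local/openmpi-4.1.1

# Put the wrappers (mpicc, mpirun) on PATH and the shared libraries on the
# dynamic linker's search path.
export PATH="$MPI_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$MPI_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Show the first PATH entry to confirm the order.
echo "${PATH%%:*}"
```

Putting the two export lines into a shell startup file such as ~/.bashrc would make them persistent.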
Test your OpenMPI installation
You can test your OpenMPI installation by executing mpirun:
mpirun --version
If you see the version information, then you've installed OpenMPI successfully.
If you want to test the compilation and execution of MPI programs, you can try to compile and run hello.c by following the instructions in Using MPI with C.
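The hello.c from the linked tutorial is not reproduced here. As a rough stand-in, the sketch below writes a generic MPI hello-world program (my own minimal version, not the tutorial's file) into a scratch directory, then compiles and runs it only if mpicc is on PATH. The process count of 2 is an arbitrary choice.

```shell
work_dir=$(mktemp -d)   # scratch directory, only for this sketch

# Write a minimal MPI program: every rank reports its rank and the world size.
cat > "$work_dir/hello.c" <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

# Compile and run only when the Open MPI wrappers are on PATH.
if command -v mpicc >/dev/null 2>&1; then
    { mpicc "$work_dir/hello.c" -o "$work_dir/hello" \
        && mpirun -np 2 "$work_dir/hello"; } || echo "compile or run failed"
else
    echo "mpicc not found; skipping compile and run"
fi
```

If everything is installed correctly, each of the two ranks should print one greeting line.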
2. Build and install BLAS and CBLAS¶
Prepare tools:
sudo apt install build-essential gfortran cmake
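Download the source archives and extract them:
wget "https://netlib.org/benchmark/hpl/hpl-2.3.tar.gz"
wget "http://www.netlib.org/blas/blas-3.11.0.tgz"
wget "http://www.netlib.org/blas/blast-forum/cblas.tgz"
# tar -C requires the target directories to exist
mkdir -p hpl blas cblas
# --strip-components=1 drops each archive's top-level directory
tar xvf hpl-2.3.tar.gz -C hpl --strip-components=1
tar xvf blas-3.11.0.tgz -C blas --strip-components=1
tar xvf cblas.tgz -C cblas --strip-components=1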
Install BLAS¶
cd blas
make
This produces the static library blas_LINUX.a in the folder. Copy it to /usr/local/lib:
cp blas_LINUX.a /usr/local/lib/libblas.a
I ran the test program for BLAS and it passed:
Install CBLAS¶
cd ../cblas
more README
Read the instructions in README to install CBLAS.
cp Makefile.LINUX Makefile.in
vim Makefile.in
Find the line BLASLIB = ../../blas$(PLAT).a and change it to BLASLIB = /usr/local/lib/libblas.a. Then save and exit. Then execute make to get the library file cblas_LINUX.a in the folder, and copy it to /usr/local/lib:
make
cp cblas_LINUX.a /usr/local/lib/libcblas.a
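The BLASLIB edit can also be done without opening an editor. The sketch below runs against a stand-in Makefile.in in a scratch directory (reduced to two lines, not the real CBLAS file), so it is safe to try. Pointing it at the real Makefile.in is left as an assumption about your directory layout.

```shell
work_dir=$(mktemp -d)   # scratch directory, only for this sketch

# Stand-in for CBLAS's Makefile.in, reduced to the lines of interest.
printf '%s\n' 'PLAT = _LINUX' 'BLASLIB = ../../blas$(PLAT).a' > "$work_dir/Makefile.in"

# Rewrite the BLASLIB assignment; "|" as delimiter avoids escaping the slashes.
sed 's|^BLASLIB *=.*|BLASLIB = /usr/local/lib/libblas.a|' \
    "$work_dir/Makefile.in" > "$work_dir/Makefile.in.new"
mv "$work_dir/Makefile.in.new" "$work_dir/Makefile.in"

grep '^BLASLIB' "$work_dir/Makefile.in"
```

Writing to a temporary file and moving it into place avoids relying on `sed -i`, whose syntax differs between GNU and BSD sed.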
Error when compiling CBLAS
If make in the CBLAS folder stops during the testing step with an error message like:
Error: rank mismatch in argument 'strue1' at (1) (rank-1 and scalar)
The cause is legacy Fortran code that gfortran 10 and later reject by default. You can modify the Makefile in the test folder, adding -std=legacy or -fallow-argument-mismatch to the flags.
Here is more information:
After editing, run make again and you'll see the tests pass.
3. Build HPL¶
There are some differences between the experiment manual and the INSTALL file. Where they differ, follow the manual.
cp setup/Make.Linux_PII_CBLAS Make.test
Edit some lines in Make.test:
# diff from Make.test and Make.Linux_PII_CBLAS
64c64
< ARCH = test
---
> ARCH = Linux_PII_CBLAS
84,86c84,86
< MPdir =
< MPinc = -I/usr/local/include
< MPlib = /usr/local/lib/libmpi.so
---
> MPdir = /usr/local/mpi
> MPinc = -I$(MPdir)/include
> MPlib = $(MPdir)/lib/libmpich.a
95,97c95,97
< LAdir =
< LAinc = #/home/bowling/CBLAS/include
< LAlib = /usr/local/lib/libcblas.a /usr/local/lib/libblas.a /usr/lib/gcc/x86_64-linux-gnu/10/libgfortran.a /usr/lib/gcc/x86_64-linux-gnu/10/libquadmath.a /usr/lib/x86_64-linux-gnu/libm.a
---
> LAdir = $(HOME)/netlib/ARCHIVES/Linux_PII
> LAinc =
> LAlib = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
169c169
< CC = /usr/local/bin/mpicc
---
> CC = /usr/bin/gcc
176c176
< LINKER = $(CC)
---
> LINKER = /usr/bin/g77
Error in linking
You may encounter an error message like this:
Search the web and you'll find the libs containing missing symbols. They are:
/usr/lib/gcc/x86_64-linux-gnu/10/libgfortran.a
/usr/lib/gcc/x86_64-linux-gnu/10/libquadmath.a
/usr/lib/x86_64-linux-gnu/libm.a
Finding the missing libraries was hard, because this kind of error shows up in many different setups, and solutions that work for others may not work for you.
I'm still wondering why these libraries were not linked automatically, since they are in the default PATH.
After make, you'll get the binary file xhpl in the folder bin/test.
4. Configuring network and SSH¶
To construct a cluster, we need a NAT network to connect the virtual machines together. The host cannot connect directly to a virtual machine inside the NAT network, but we can use port forwarding:
Now we can ssh into the master machine:
ssh -p 3022 bowling@localhost
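To avoid retyping the port and user name, the same connection can be described in an SSH client config entry. The sketch below writes the entry to a scratch file rather than to ~/.ssh/config. The alias `master` is an illustrative name, while port 3022 and user bowling are taken from the command above.

```shell
cfg=$(mktemp)   # scratch file; the usual location would be ~/.ssh/config

# "master" is only an alias chosen for this sketch.
cat > "$cfg" <<'EOF'
Host master
    HostName localhost
    Port 3022
    User bowling
EOF

# With the entry in place, the connection command becomes:
echo "ssh -F $cfg master"
```

Once the entry lives in ~/.ssh/config, the `-F` option is no longer needed and `ssh master` is enough.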
5. Experiment¶
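xhpl reads its parameters from a file named HPL.dat in its working directory. The sketch below writes a small single-configuration HPL.dat into a scratch directory, following the 31-line layout of the sample file shipped with HPL. The values N=5000, NB=128 and a 2 x 2 process grid are assumptions chosen for a quick test, not the settings used in this lab.

```shell
hpl_dir=$(mktemp -d)   # scratch directory; the real file sits next to xhpl

cat > "$hpl_dir/HPL.dat" <<'EOF'
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
5000         Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
EOF

# HPL needs at least P x Q MPI processes, so the grid is usually chosen
# to match the value passed to mpirun -np.
p=$(awk '$2 == "Ps" { print $1 }' "$hpl_dir/HPL.dat")
q=$(awk '$2 == "Qs" { print $1 }' "$hpl_dir/HPL.dat")
echo "process grid: $p x $q = $((p * q)) processes"
```

With this file in the same directory as xhpl, a run with four processes would use the whole grid.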
Error: cannot open file HPL.dat
You may encounter an error like this:
Refer to this issue:
Final result:
Bonus: Docker¶
Using Docker, we can skip the process of building everything. We only need to configure the SSH connection and write HPL.dat, then enjoy. DOCKERS ARE AWESOME.
Find docker¶
I use the Docker image ashael/hpl from Docker Hub, which contains a preconfigured hpl-2.2.
Here I've opened 4 containers from the same image:
Setting up connection¶
Docker containers are connected to a bridge network by default.
Attach to each of them and get its IP address:
Select the host at 172.17.0.5 as master, and 172.17.0.2 through 172.17.0.4 as slaves.
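These four addresses can be collected in a hostfile for the later benchmark run. The sketch below assumes the MPI inside the image accepts an Open MPI style hostfile, and `slots=1` per container is my assumption rather than a value from the lab.

```shell
hostfile=$(mktemp)   # scratch file, only for this sketch

# One line per container; "slots" is the number of processes that may be
# started on that host. slots=1 is an assumption, not a value from the lab.
cat > "$hostfile" <<'EOF'
172.17.0.5 slots=1
172.17.0.4 slots=1
172.17.0.3 slots=1
172.17.0.2 slots=1
EOF

# The total number of slots is the process count to request with -np.
np=$(awk -F'slots=' '{ total += $2 } END { print total }' "$hostfile")
echo "mpirun --hostfile $hostfile -np $np ./xhpl"
```

The echoed command is what would be run on the master once SSH between the containers works.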
In master:
ssh-keygen
cat ~/.ssh/id_rsa.pub
In slaves:
mkdir /var/run/sshd
vim /etc/ssh/sshd_config
Find these lines:
#ListenAddress ::
#ListenAddress 0.0.0.0
#AuthorizedKeysFile %h/.ssh/authorized_keys
Uncomment them and save.
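With three slaves to configure, the uncommenting can be scripted instead of done by hand in vim. The sketch below works on a stand-in copy of the relevant lines in a scratch file, not on the real /etc/ssh/sshd_config. Note that the OpenSSH option is spelled AuthorizedKeysFile.

```shell
cfg=$(mktemp)   # scratch copy; the real file is /etc/ssh/sshd_config

# Stand-in for the relevant part of sshd_config.
cat > "$cfg" <<'EOF'
#Port 22
#ListenAddress 0.0.0.0
#ListenAddress ::
#AuthorizedKeysFile %h/.ssh/authorized_keys
EOF

# Strip the leading "#" from the ListenAddress and AuthorizedKeysFile
# directives only; every other line is left untouched.
sed -e 's|^#ListenAddress|ListenAddress|' \
    -e 's|^#AuthorizedKeysFile|AuthorizedKeysFile|' \
    "$cfg" > "$cfg.new"
mv "$cfg.new" "$cfg"

cat "$cfg"
```

The `#Port 22` line stays commented, which shows that only the targeted directives are changed.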
Then add master's public key to authorized_keys and start sshd.
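The key installation can be sketched as follows. A scratch directory stands in for the slave's home directory and the key string is a placeholder, so nothing on a real system is touched. The permissions matter because sshd, with its default StrictModes setting, refuses keys when the directory or file is writable by group or others.

```shell
fake_home=$(mktemp -d)   # stands in for the slave's home directory
pubkey='ssh-rsa AAAAB3...placeholder root@master'   # placeholder, not a real key

# Create ~/.ssh with owner-only access, then append the master's public key.
mkdir -p "$fake_home/.ssh"
chmod 700 "$fake_home/.ssh"
printf '%s\n' "$pubkey" >> "$fake_home/.ssh/authorized_keys"
chmod 600 "$fake_home/.ssh/authorized_keys"

# On the real slave, sshd is then started by its absolute path, for example:
#   /usr/sbin/sshd
ls -l "$fake_home/.ssh/authorized_keys"
```

On a real slave, the placeholder would be replaced by the output of `cat ~/.ssh/id_rsa.pub` from the master.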
Succeeded.
Execute experiment¶
I found that computation in the Docker containers is much faster than in the virtual machine cluster. DOCKERS ARE EXTREMELY AWESOME.
Extended¶
Later in the course, the instructors mentioned that we can try to modify HPL.dat to get better performance and configure NFS to share files between nodes. I also tried to do these things.
Tuning HPL.dat¶
Refer to HPL Tuning.
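One commonly quoted rule of thumb is to pick the problem size N so that the N x N matrix of 8-byte doubles fills roughly 80% of the total memory, then round N down to a multiple of the block size NB. The sketch below works through that arithmetic. The 4 GiB of total memory and NB=192 are assumed values for illustration, not measurements from this lab.

```shell
mem_bytes=$((4 * 1024 * 1024 * 1024))   # assumed total memory across all nodes: 4 GiB
nb=192                                   # assumed block size NB

# The matrix should use about 80% of memory:
#   8 * N^2 = 0.8 * mem_bytes  =>  N = sqrt(0.8 * mem_bytes / 8)
# N is then rounded down to a multiple of NB.
n=$(awk -v mem="$mem_bytes" -v nb="$nb" \
    'BEGIN { n = int(sqrt(0.8 * mem / 8)); print n - (n % nb) }')

echo "Ns = $n"   # prints "Ns = 20544"
```

The result is only a starting point. The best N and NB for a given cluster still have to be found by experiment.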
Configuring NFS File System¶
NFS uses the RPC protocol to share files between server and client.
Setup NFS Server and Client¶
Please refer to this article: How to Set Up an NFS Mount on Debian 11 | DigitalOcean.
More details about NFS can be seen in:
Have a test¶
First install nfs-common and create the mount point on the clients:
sudo apt install nfs-common
sudo mkdir /hpc
mount error
If you encounter the following error:
mount: bad option; for several filesystems (e.g. nfs, cifs) you might need a /sbin/mount.<type> helper program.
You need to install nfs-common on clients.
Add the following line to /etc/fstab on the clients:
10.0.2.15:/hpc /hpc nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
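The line has the six-field layout of an /etc/fstab entry. The sketch below only splits it into named fields to show what each part means. It does not mount anything, and the variable names are illustrative.

```shell
entry='10.0.2.15:/hpc /hpc nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0'

# fstab fields: device, mount point, filesystem type, options, dump, fsck order.
read -r device mount_point fs_type options dump pass <<EOF
$entry
EOF

echo "device:      $device"        # NFS server address and exported path
echo "mount point: $mount_point"   # local directory created with mkdir
echo "type:        $fs_type"
echo "options:     $options"
echo "dump/pass:   $dump $pass"    # 0 0 = no dump backup, no fsck at boot

# After the line is in /etc/fstab, the share can be mounted without a reboot:
#   sudo mount -a
```

Here 10.0.2.15 is the NFS server and /hpc is both the exported directory and the local mount point.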
Then run the program and get the result: