Recommendations for Monitoring the HPC Platform¶

Setting Up Monitor Server (Ganglia & Nagios)¶

On Master Node¶

Create /opt/vm/monitor.xml for deploying the storage VM (Available here)

Create disk image for the monitor VM:

qemu-img create -f qcow2 monitor.qcow2 80G

Define the VM:
```
virsh define monitor.xml
```

On Controller VM¶

Create a group for the monitor VM (add at least monitor1 as a node in the group, set additional groups of services,cluster,domain allows for more diverse group management):
```
metal configure group monitor
```
Customise monitor1 node configuration (set the primary IP address to 10.10.0.5):
```
metal configure node monitor1
```
Create /var/lib/metalware/repo/config/monitor1.yaml with the following network and server definition:
```
ganglia:
  is_server: true

nagios:
  is_server: true
```

Add the following to /var/lib/metalware/repo/config/domain.yaml:

ganglia:
  server: 10.10.0.5
  is_server: false
nagios:
  is_server: false

Additionally, add the following to the setup: namespace list in /var/lib/metalware/repo/config/domain.yaml:
```
- /opt/alces/install/scripts/03-ganglia.sh
- /opt/alces/install/scripts/04-nagios.sh
```

Download the ganglia.sh and nagios.sh scripts to the above location:

mkdir -p /opt/alces/install/scripts/
cd /opt/alces/install/scripts/
wget -O 03-ganglia.sh https://raw.githubusercontent.com/alces-software/knowledgebase/release/2017.1/epel/7/ganglia/ganglia.sh
wget -O 04-nagios.sh https://raw.githubusercontent.com/alces-software/knowledgebase/release/2017.1/epel/7/nagios/nagios.sh

Follow Client Deployment Example to setup the compute nodes

This will setup minimal installations of both Ganglia and Nagios. All nodes within the domain will be built to connect to these services such that they can be monitored. It is possible to expand upon the metrics monitored and notification preferences.