SA 256
Original text by D.Shin, revisions by S. Kirklin
A more in-depth, and commensurately more complicated, guide to system administration on our group clusters.
Job management
On the master node, pbs_server must be running to accept jobs, and pbs_mom must be running on all compute nodes. Do NOT restart or kill a running pbs_server daemon unless it is really necessary: doing so will reset all running jobs.
josquin ~ # ps aux|grep pbs_server
root 6728 0.0 0.0 15500 3560 ? Ss Jan02 16:39 pbs_server
If PBS-related commands such as qstat, qsub, and qalter are not working, use the command above to confirm that pbs_server is not running, then launch pbs_server manually. DO NOT restart the entire PBS service with /etc/init.d/pbs restart.
josquin ~ # /usr/local/sbin/pbs_server
Maui is the scheduler for job management on our clusters. It starts at boot along with pbs_server. If the Maui commands, such as showq, diagnose -f (aliased as fs), diagnose -p (aliased as p), and showstart, are all not working, relaunch maui with:
josquin ~ # /usr/local/maui/sbin/maui
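The check-then-launch procedure above can be sketched as a small shell helper. This is a minimal sketch rather than the group's actual tooling: the function name start_if_stopped is hypothetical, and it assumes pgrep is installed on the master node. It launches the binary only when no process with that exact name exists, so a running pbs_server is never touched.

```shell
#!/bin/sh
# Hypothetical helper: launch a daemon only if it is not already running.
#   $1 = exact process name, $2 = path to the daemon binary
start_if_stopped() {
    name=$1
    binary=$2
    # Assumption: pgrep is available; bail out rather than guess if it is not.
    if ! command -v pgrep >/dev/null 2>&1; then
        echo "pgrep not found; check with 'ps aux | grep $name' instead" >&2
        return 2
    fi
    if pgrep -x "$name" >/dev/null; then
        echo "$name is already running; leaving it alone"
    else
        echo "$name is not running; starting $binary"
        "$binary"
    fi
}

# Usage on the master node (as root), with the paths given above:
#   start_if_stopped pbs_server /usr/local/sbin/pbs_server
#   start_if_stopped maui /usr/local/maui/sbin/maui
```

Checking first keeps the helper safe to run at any time, which matters because restarting a live pbs_server resets running jobs.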
Fairshare Scheme
The fairshare scheme is applied on all clusters, and its settings are in the maui.cfg file in /usr/local/maui. Jobs are launched in order of priority, which is a weighted combination of several factors, such as fairshare and the resources requested.
NEED TO UPDATE THE FAIRSHARE SCHEME THEN PUT THAT INFORMATION HERE.
Package management
To install, remove, or upgrade a program on a cluster, use the package manager of that cluster's operating system. Wikipedia has a good summary of the various package management systems on Linux.
Victoria
Victoria runs CentOS 5.2, which uses rpm, the most common Linux package management system. To search for and install packages, use the yum front end:
victoria ~ # yum search scipy
Loading "fastestmirror" plugin
Loading "priorities" plugin
Loading "downloadonly" plugin
Loading mirror speeds from cached hostfile
* rpmforge: fr2.rpmfind.net
* base: yum.singlehop.com
* updates: mirror.sanctuaryhost.com
* addons: mirror.team-cymru.org
* extras: pubmirrors.reflected.net
rpmforge 100% |=========================| 1.1 kB 00:00
base 100% |=========================| 2.1 kB 00:00
updates 100% |=========================| 1.9 kB 00:00
addons 100% |=========================| 951 B 00:00
extras 100% |=========================| 2.1 kB 00:00
Excluding Packages in global exclude list
Finished
0 packages excluded due to repository priority protections
python-numpy.x86_64 : Fast multidimensional array facility for Python
python-numpy.x86_64 : Fast multidimensional array facility for Python
victoria ~ # yum install python-numpy
Josquin/Byrd/Palestrina
Gentoo is installed on palestrina, josquin, and byrd; it uses Portage for package management. To search the available software and install a package:
palestrina ~ # emerge -s scipy
Searching...
[ Results for search key : scipy ]
[ Applications found : 1 ]
* sci-libs/scipy
Latest version available: 0.7.2-r1
Latest version installed: 0.7.2-r1
Size of files: 13,340 kB
Homepage: http://www.scipy.org/ http://pypi.python.org/pypi/scipy
Description: Scientific algorithms library for Python
License: BSD
palestrina ~ # emerge scipy
Encina
Ubuntu 8.04 server is installed on encina, and it uses APT for package management.
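No session is shown for encina, so as a rough side-by-side, the search and install commands for the three package managers can be summarized in one helper. This is only a sketch: the function name pkg_commands is hypothetical, it prints commands rather than running them, and the package name reported by a search may differ from the search term.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the search and install commands
# for a given package manager.
#   $1 = yum | portage | apt, $2 = package name or search term
pkg_commands() {
    manager=$1
    package=$2
    case $manager in
        yum)      # victoria (CentOS)
            echo "yum search $package"
            echo "yum install $package" ;;
        portage)  # josquin, byrd, palestrina (Gentoo)
            echo "emerge -s $package"
            echo "emerge $package" ;;
        apt)      # encina (Ubuntu)
            echo "apt-cache search $package"
            echo "apt-get install $package" ;;
        *)
            echo "unknown package manager: $manager" >&2
            return 1 ;;
    esac
}

# Show the APT equivalents of the yum and emerge sessions above.
pkg_commands apt scipy
```

The install commands need root, so on encina they would normally be run through sudo.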
Accessibility
iptables
iptables is a kernel-level firewall that blocks access to any port that has not been opened.
On the Wolverton clusters, all ports are closed except 22 (ssh), 25 (mail), 80 (http), 443 (https), and 3573 (DevMan[2]). Rule files are /etc/iptables.bak (josquin, byrd, palestrina) and /etc/sysconfig/iptable.save (victoria).
kaien@josquin ~$ sudo /sbin/iptables -L
Password:
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:ssh
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:smtp
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:http
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:https
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:3573
DROP all -- anywhere anywhere
Chain FORWARD (policy ACCEPT)
target prot opt source destination
DROP tcp -- anywhere anywhere tcp spt:31337 dpt:31337
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
DROP tcp -- anywhere anywhere tcp spt:31337 dpt:31337
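To read off the open ports from a listing like the one above without scanning it by eye, the ACCEPT rules can be filtered with awk. This is a minimal sketch: the function name open_ports is hypothetical, and it assumes the default `iptables -L` column layout shown above, where the destination port appears as a `dpt:` field.

```shell
#!/bin/sh
# Hypothetical helper: list destination ports accepted in `iptables -L`
# output read from stdin. Rules without a dpt: field are ignored.
open_ports() {
    awk '$1 == "ACCEPT" {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^dpt:/) print substr($i, 5)
         }'
}

# Usage (requires root):
#   sudo /sbin/iptables -L INPUT | open_ports
```

For the INPUT chain shown above, this would print ssh, smtp, http, https, and 3573, one per line.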
Fail2ban
Fail2ban is a program that bans an IP address once it has made more than a set number of malicious attempts. It works by adding rules to iptables. Its configuration file, jail.conf, is in the /etc/fail2ban directory.
kaien@josquin /etc/fail2ban $ sudo /sbin/iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
fail2ban-BadBots tcp -- anywhere anywhere multiport dports http,https
fail2ban-SSH tcp -- anywhere anywhere tcp dpt:ssh
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:ssh
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:smtp
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:http
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:https
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:3573
DROP all -- anywhere anywhere
Chain FORWARD (policy ACCEPT)
target prot opt source destination
DROP tcp -- anywhere anywhere tcp spt:31337 dpt:31337
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
DROP tcp -- anywhere anywhere tcp spt:31337 dpt:31337
Chain fail2ban-BadBots (1 references)
target prot opt source destination
RETURN all -- anywhere anywhere
Chain fail2ban-SSH (1 references)
target prot opt source destination
RETURN all -- anywhere anywhere
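In the listing above both fail2ban chains hold only a RETURN rule, meaning nothing is currently banned. As a sketch of how bans could be listed, the helper below prints the source address of every DROP or REJECT rule inside a fail2ban-* chain. It rests on an assumption that has not been checked against the jail.conf on these clusters: that bans are implemented by inserting DROP (or REJECT) rules into those chains. The function name banned_ips is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list banned source addresses per fail2ban chain,
# reading `iptables -L -n` output from stdin.
# Assumption: each ban is a DROP or REJECT rule inside a fail2ban-* chain.
banned_ips() {
    awk '/^Chain / { chain = $2; in_f2b = ($2 ~ /^fail2ban-/); next }
         in_f2b && ($1 == "DROP" || $1 == "REJECT") { print chain, $4 }'
}

# Usage (requires root; -n keeps addresses numeric):
#   sudo /sbin/iptables -L -n | banned_ips
```

Rules in the INPUT, FORWARD, and OUTPUT chains are skipped, so the final DROP rule of the firewall itself is not reported as a ban.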
/etc/hosts.allow, /etc/hosts.deny
Access to the Wolverton clusters is allowed only from the IP addresses listed in /etc/hosts.allow. A group member's IP address can be added to this file to grant access from it.
#
# hosts.allow This file describes the names of the hosts which are
# allowed to use the local INET services, as decided
# by the '/usr/sbin/tcpd' server.
#
#sshd: *.northwestern.edu: allow
#sshd: phasepusan.metsce.psu.edu: allow
#
# encina
sshd: 129.105.92.49: allow
# byrd
sshd: 165.124.29.202: allow
# victoria
sshd: 165.124.29.204: allow
# morales
sshd: 129.105.12.20: allow
# guerrero
sshd: 129.105.12.19 : allow
# tallis
sshd: 165.124.29.197: allow
# quest
sshd: 165.124.130.5: allow
sshd: 165.124.130.6: allow
sshd: 165.124.130.7: allow
sshd: 165.124.130.8: allow
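Adding a new address means appending a comment line and an sshd line in the same style as the entries above. A minimal sketch follows; the function name allow_entry is hypothetical, and the address check only verifies the general dotted-quad shape, not that each number is in range.

```shell
#!/bin/sh
# Hypothetical helper: print a hosts.allow entry in the style used above.
#   $1 = label for the comment line, $2 = IPv4 address
allow_entry() {
    label=$1
    ip=$2
    # Shape check only: four dot-separated groups of one to three digits.
    if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
        echo "not an IPv4 address: $ip" >&2
        return 1
    fi
    printf '# %s\nsshd: %s: allow\n' "$label" "$ip"
}

# Usage (label and address are placeholders, not real hosts):
#   allow_entry new-member-laptop 192.0.2.15 | sudo tee -a /etc/hosts.allow
```

Printing the entry first and piping it to `tee -a` makes it easy to review the lines before they are appended.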
Services
Linux runs certain services for users, such as web and ssh. They can be started, stopped, restarted, or queried with:
$ /etc/init.d/service_name [start/stop/restart/status]
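Because the init script names differ between hosts, it can help to query several services in one pass and skip any script a host does not have. The loop below is a sketch under assumptions: the function name service_status and the INIT_DIR variable are hypothetical, and each script is assumed to support the status action.

```shell
#!/bin/sh
# Hypothetical helper: report the status of several init scripts,
# skipping any that are absent on this host.
# INIT_DIR defaults to /etc/init.d; it is a variable only so the loop
# can be tried against a scratch directory first.
service_status() {
    init_dir=${INIT_DIR:-/etc/init.d}
    for svc in "$@"; do
        if [ -x "$init_dir/$svc" ]; then
            "$init_dir/$svc" status
        else
            echo "$svc: no init script in $init_dir"
        fi
    done
}

# Usage, with script names taken from the list in this section:
#   service_status apache2 httpd gmond gmetad
```

Only the read-only status action is used, so running the loop does not change the state of any service.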
Web via apache2 server
$ /etc/init.d/apache2
(josquin/byrd/palestrina)
$ /etc/init.d/httpd
(victoria)
SSH (Secure shell)
$ /etc/init.d/ssh
Nodewatch
$ /etc/init.d/ssh
Ganglia
$ /etc/init.d/gmond
(nodes)
$ /etc/init.d/gmetad
(master)
Pathscale subscription server
There is only one seat for the PathScale compiler suite, and encina serves as the license server. The license file is /opt/pathscale/lib/3.2/pscsubscription-7104.xml.
$ /etc/init.d/pathsub
(only on encina)