LCDP at Sheridan
   over a high-speed network

Linux Cluster on Red Hat 6.2

A High Performance Diskless Linux Cluster



General Info


This cluster is formed by one central server (master) and forty diskless clients (node1, ..., node40), all of which are based on Intel hardware and run Linux (Red Hat 6.2). For a detailed description of the hardware setup, please check here.

The server provides disk space, the network, and even the operating system to the nodes. Clients initially boot from a floppy disk. On the server side, the TFTP daemon is installed to provide the clients with the necessary data.
The server supplies each diskless client with its network configuration data. The client then mounts its root file system via NFS and uses part of the server's directory tree (/tftpboot/node...).
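
On Red Hat 6.2, a setup like this is normally driven by a BOOTP/DHCP entry per node plus an NFS export of each node's root tree. The snippet below is only an illustrative sketch; the host name, MAC address, 192.168.1.x addresses, and export options are hypothetical, not the actual LCDP configuration:

    # /etc/dhcpd.conf - one fixed address per diskless node (example values)
    subnet 192.168.1.0 netmask 255.255.255.0 {
        host node1 {
            hardware ethernet 00:50:56:00:00:01;    # hypothetical MAC address
            fixed-address 192.168.1.101;
            option root-path "/tftpboot/node1";
        }
    }

    # /etc/exports - let each node mount its root tree via NFS (example)
    /tftpboot/node1    node1(rw,no_root_squash)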


Server Setup


The server is booted from the /boot sub-directory using kernel 2.4.6 (Master kernel configuration).
The Diskless Cluster Suite is installed to set up the master system, and additional file modifications are made:
Files /etc/hosts and /etc/hosts.equiv include the name of every node.
File /root/.rhosts on all machines has "-rw-------" (600) permissions.
File /etc/pam.d/rlogin has its first line commented out:
#auth       required     /lib/security/pam_securetty.so
For MPI programs, the file /usr/local/mpich-1.2.1/share/machines.LINUX includes the name of every node.
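
For reference, the edited files follow a simple pattern. The listing below is an illustrative sketch with hypothetical addresses, not a copy of the actual files:

    # /etc/hosts (the same node names also appear in /etc/hosts.equiv)
    192.168.1.100    master
    192.168.1.101    node1
    ...
    192.168.1.140    node40

    # restrict /root/.rhosts to the owner (shows as -rw-------)
    chmod 600 /root/.rhosts

    # /usr/local/mpich-1.2.1/share/machines.LINUX - one node name per line
    node1
    node2
    ...
    node40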



Client Setup


Clients are booted from the floppy disk using kernel 2.4.4 (Floppy kernel configuration).
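
A boot floppy of this kind is usually made by copying the compiled client kernel straight onto the diskette. The commands below are only a minimal sketch; the kernel source path is an assumption, and the kernel must be built with NFS-root and kernel-level IP autoconfiguration (DHCP/BOOTP) support:

    # after building the 2.4.4 client kernel:
    dd if=/usr/src/linux-2.4.4/arch/i386/boot/bzImage of=/dev/fd0
    # the kernel's root device must point at /dev/nfs, set either in the
    # kernel configuration or with rdev before copying the image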



Management Software


SCE (Scalable Cluster Environment) from Kasetsart University is used for cluster monitoring.

The Scalable Cluster Environment is a set of interoperable open-source tools that enables users to build and use a Beowulf cluster effectively.


This set of tools consists of:

  • Beowulf Diskless Cluster Suite, a utility that helps you build a diskless Beowulf cluster easily.
  • SCMS (SMILE Cluster Management System), a powerful set of system administration tools for Beowulf-style clusters.
  • KCAP, a web- and VRML-based system monitoring and navigation tool for large-scale clusters.


Benchmarks


Results of tests performed in Lab S142
(December 05, 2001):



Known Bugs/Problems


  • Kernel-related problems
    • The latest kernel (2.4.6) was compiled with a custom configuration.

  • Red Hat distribution-related problems
    • Unable to remote shell into the nodes from the master.
      • Solution [a] - edited {hosts, .rhosts, hosts.equiv} to ensure all nodes and the master were listed.
      • Solution [b] - the 'auth required /lib/security/pam_securetty.so' line was commented out of the file [/etc/pam.d/rlogin].
    • Rebooting nodes remotely from the master was causing the nodes to stop responding.
      • Solution - The appropriate system initialization script was identified and then modified.
        [/etc/rc.d/init.d/network]
    • MPI-Povray, the program used for benchmarking, would not work on multiple nodes.
      • Solution - A patch was applied to the program to fix this problem.
        [gzip -dc mpi-povray*.patch.gz | patch -p1]


  • Miscellaneous problems
    • Parallel programs would not run on multiple processors (nodes).
      • Solution - Modified the default 'machines.LINUX' file required by MPI to list all nodes in the cluster.
    • When the cluster was set up for testing in a new lab with additional nodes, the DHCP client IPs were non-sequential and unpredictable.
      • Solution - The 'dhcpd.leases' file was cleared of all old leases, and new leases were assigned by the DHCP daemon when the nodes were rebooted (see the sketch after this list).
    • Parallel commands from the SCMS Management Tools wouldn't work on all nodes.
      • Solution - Changed the default configuration file to list all current nodes.
    • The web-based cluster monitoring tool (KCAP) encounters the Java error "unable to connect to RMI", which limits the functionality of this tool.
      • Solution - None yet; this is an ongoing problem.
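
As noted in the DHCP item above, clearing the old leases amounts to something like the following on the master. The lease-file path shown is the usual Red Hat 6.2 location and may differ on this system:

    /etc/rc.d/init.d/dhcpd stop
    > /var/lib/dhcp/dhcpd.leases    # empty the lease database
    /etc/rc.d/init.d/dhcpd start
    # then reboot the nodes so they request fresh leases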



