The HERMES Linux Production Cluster

Since September 6th, 1996, Hermes has had a new production farm for extracting the physics from the events occurring inside the Hermes detector. Please feel free to read more about this interesting experiment on the Hermes Homepage.



Requirements

The requirements for CPU power in an HEP (High Energy Physics) experiment are high. To get an impression: per day an average data volume of 33 GB is recorded, which adds up to several TB stored on robotic tape drives per year. On the other hand, results have to be published soon after the data have been recorded. Since precise knowledge of all detector parameters can be obtained only after several analysis iterations, our goal is to run one iteration over one year's data in less than 2 months.

For comparison: the batch processing of the 1995 data (2.5 TB) was done on a 28-processor SGI Challenge XL system. As the data volumes increased (1996: 3.5 TB, 1997: 6.5 TB) it was decided to extend the computing power for Hermes and to move the data processing to a separate system. The interactive physics analysis will still be done on the SGI system.

The experiment's raw data is stored on fast tapes and must be read at a speed of up to 2 MB/s in order not to block the tape drives unnecessarily long. The I/O bandwidth inside the production system does not have to be high: only average rates of 200-800 KB/s have to be handled.
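
A quick back-of-envelope check of these numbers (the shell one-liners below are purely illustrative and are not part of the Hermes software):

    # Aggregate input rate needed to run one iteration over the 1997
    # data set (~6.5 TB) within the 2-month goal:
    echo "6.5 * 1024 * 1024 / (60 * 86400)" | bc -l   # => ~1.3 MB/s
    # Spread over a 10-node farm this is of the order of 0.1-0.2 MB/s
    # per machine, well below the 2 MB/s tape-read speed.

    # 6.5 TB recorded at 33 GB/day corresponds to roughly 200 days of
    # data taking:
    echo "6.5 * 1024 / 33" | bc -l                    # => ~202 days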

Choosing a new platform also has to take the available analysis software into account. All the software for HERMES was developed on Unix systems. Since several different Unix platforms were in use by the collaboration, an effort was made to keep the code (Unix-)system independent. Code exists in C, C++, Fortran77, bash, perl and tk/tcl dialects.

The cluster components

Shifting the batch production away from the SGI multiprocessor system while keeping the production speed was achieved by installing a cluster of 10 PentiumPro 200 systems running Linux. (Later - see below - each PC was upgraded to a dual-CPU PPro system.) The data access to the tape silos is done via Fast Ethernet (100baseTX) and an FDDI/100baseTX switch. Recently the switch has been replaced by one of the PCs configured as an FDDI/Fast-Ethernet router. The machines were bought at COMPTRONIC, a Hamburg computer dealer. Their excellent support justifies not only this link...
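
For illustration only: under Linux, a PC acting as such an FDDI/Fast-Ethernet router needs little more than two configured network interfaces and IP forwarding enabled in the kernel. The interface names and addresses below are invented; this is just a minimal sketch, not the actual Hermes configuration:

    # Minimal two-interface router sketch (example addresses only)
    ifconfig fddi0 192.168.1.1 netmask 255.255.255.0 up  # FDDI side (tape silos)
    ifconfig eth0  192.168.2.1 netmask 255.255.255.0 up  # Fast-Ethernet side (farm)
    route add -net 192.168.1.0 netmask 255.255.255.0 dev fddi0
    route add -net 192.168.2.0 netmask 255.255.255.0 dev eth0
    # IP forwarding is a compile-time option on 2.0 kernels; later
    # kernels can also switch it at run time:
    echo 1 > /proc/sys/net/ipv4/ip_forward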

Currently - and this situation is unlikely to change in the near future - hardware of this type offers a price/performance ratio a factor of 2 to 5 better than traditional workstation systems.

The operating system

There is a variety of operating systems for PCs. The most popular ones are certainly MS-DOS and other Microsoft products (mainly WinNT), IBM's OS/2 and various Unix dialects: SCO, Solaris, FreeBSD and Linux. A rather famous paper comparing different Unix flavours for PCs is the Lai-Baker paper (Proceedings of USENIX96). The Linux version used in that comparison is still 1.2, so be sure to read to the end of the paper, where the development and progress of newer kernels (1.3) is mentioned.

High energy physics software is strongly tied to the CERN program library, which offers a huge number of (mostly) f77 library routines. The CERN libraries are available for only a few of the PC operating systems, so the choice becomes easier: only WinNT and Linux are left.

Porting the Hermes software (500,000 lines of code) to New Technologies like NT seems to be a challenging task. Several similar efforts are currently under investigation, and a realistic timescale for the case of Hermes is in the range of 0.5 to 1 man-years. Compared to this, the additional cost for the OS, development software and extra network licences (we need more than 10 Internet connections per process) is rather small.

So the choice was obvious: Linux, a free Unix operating system with, in addition, excellent performance. Porting involved the design of an f77-standardising filter (f77reorder), since our code used many non-standard extensions to f77 that are handled by most compilers but not by f2c or early g77 versions, the free Fortran compilers for Linux. With an effort of 2 man-weeks the whole Hermes source tree was translated and brought into operation.

The current status

16.Sep.96
Well, the cluster is still young. Currently it is busy with Monte-Carlo production (computer-simulated events in the detector), as data processing requires a new setup of the production daemon, which is able to distribute tasks to the different cluster members. (At this time the machines were still single-CPU machines.)
22.Sep.96
On Saturday and Sunday (21-22 Sept. 96) a first test production of raw 96 data was successfully run on the cluster. The processing ran at the expected speed, about a factor of 2.2 faster than on a single SGI Challenge MIPS R4400/200 MHz CPU.
27.Sep.96
The 3Com 3c905 Fast Ethernet cards were replaced by SMC EtherPower 10/100 cards, which now allow full use of the 100 Mbit/s bandwidth of Fast Ethernet.
15.Nov.96
Monte-Carlo production has finished and new calibration data for 96 is available. We start the first test production on 100 runs of 96. A central job and disk manager controls the job execution on the cluster via the Hermes DAD slow-control scheme. (Diagram: production control layout.)

Job execution can be monitored using PINK clients.

The results of the production look promising. Together with code optimisation, the speed gain per CPU with respect to the last production on our large SMP machine is of the order of 3.
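
The actual distribution of work is handled by the DAD-based job and disk manager mentioned above. Purely as a sketch of the idea, a central dispatcher handing out runs to the farm nodes could look like the following shell fragment (the node names, the run list runs96.list and the produce_run command are invented for this example):

    #!/bin/bash
    # Toy dispatcher: hand out runs round-robin over the farm via rsh.
    # This is NOT the real DAD/PINK manager - just an illustration.
    NODES=(pc01 pc02 pc03 pc04 pc05 pc06 pc07 pc08 pc09 pc10)
    i=0
    while read run; do
        node=${NODES[$((i % ${#NODES[@]}))]}
        echo "dispatching run $run to $node"
        rsh "$node" "cd /data/production && ./produce_run $run" &
        i=$((i + 1))
        # keep at most one job per node in flight
        if [ $((i % ${#NODES[@]})) -eq 0 ]; then wait; fi
    done < runs96.list
    wait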

27.Nov.96
Now that everything has been shown to work reliably, we have to 'sell' the product: some transparencies (656KB) from a talk given at CERN about the Linux production cluster.
20.Dec.96
The data production is running after some final corrections to the programs and the detector calibration have been applied. However, we experience problems with our Fast-Ethernet/FDDI switch: after an average of 12 hours the Fast-Ethernet part of the switch stops working under heavy load on the Fast-Ethernet segment.
23.Dec.96
We replace the faulty switch with one of the PCs, which is now equipped with a DEC DC21140 FDDI adapter. (Thanks to DEC for the driver development!) Stable operation now seems to be possible.

Hermes members can view the cluster status screen using the clusterstatus.pink command on our SGI.

12.Feb.97
The first iteration of 96 data on the PC farm has finished in time.
28.Mar.97
Up to now the machines have been 100% stable - no crashes since 23.Dec.96. This results in an integrated uptime (summed over all 10 machines) of about 2.5 years for the cluster. The systems were fully loaded for 95% of the time.
29.Mar.97
Never touch a running system - well...
The SMP upgrade arrives and the sorrow begins. The first impression of the dual-CPU systems is great: they run at 199% CPU usage and do in fact double the available resources. The price for the upgrade is small.
But Linux at this time contains a lot of deadlocks in its kernel and is not SMP-safe.
Various iterations of the tulip driver (which was believed to be the culprit in the beginning) are tried. Various kernel releases are installed, and Oopses are recorded and sent to the developers.
20.May.97
Finally we got a release of Linux that is reasonably stable for our environment (uptimes of about one week). The long period of constant rebooting (uptimes of 12 hours and shorter on 10 machines are quite annoying ;-) is over.
HERMES runs a 4 GIPS cluster for less than $60k.
30.Jun.97
W.Wander leaves HERMES and heads towards MIT/Boston. Andrei Shevel takes over.
11.Oct.97
A.Shevel returns to St. Petersburg. Alexander Kisselev becomes the new maintainer of the farm.
Nov-Dec.97
Extensive tests of the two recently released Linux kernels (2.0.31 and 2.0.32) showed excellent stability (uptimes limited only by manual reboots or the occasional CPU overheating due to a broken fan).


Wolfgang Wander, wwc@ralph2.mit.edu,
Alexander Kisselev
Last modified: Mon Feb 23 11:47:18 1998