The HERMES Linux Production Cluster
Since Sep. 6th 1996
Hermes has been running a new production farm
for extracting the physics out of the events occurring inside the
Hermes detector.
Please feel free to read more about this interesting experiment on
the Hermes Homepage.
Requirements
The requirements for CPU power in an HEP (High Energy Physics)
experiment are high. To get an impression: an average data volume of
33GB is recorded per day, which adds up to several TB stored on
robot tape drives per year. On the other hand, results have to be
published soon after the data has been recorded. As precise
knowledge of all detector parameters can be obtained only after
several analysis iterations, our goal is to run one iteration over one
year's data in less than 2 months.
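As a rough back-of-envelope estimate (ours, for orientation only), this goal
translates into a fairly modest sustained input rate even for the 1997 data
volume:

    # 6.5 TB of raw data processed within 2 months corresponds to a sustained
    # aggregate rate of only about 1.3 MB/s:
    echo "scale=2; 6.5 * 10^6 / (60 * 86400)" | bc    # MB/s, prints ~1.25

The real challenge is therefore CPU power rather than raw I/O bandwidth,
which matches the I/O figures quoted below.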
For comparison: the batch data processing of the 1995 data (2.5 TB)
was done on a 28-processor SGI Challenge XL system. As the data
volumes increased (1996: 3.5TB, 1997: 6.5TB) it was decided to
extend the computing power for Hermes and to move the data processing to
a separate system. The interactive physics analysis will still be done
on the SGI system.
The experiment's raw data is stored on fast tapes and must be read
at a speed of up to 2MB/s in order not to block the tape drives
unnecessarily long. The I/O bandwidth inside the production system does not
have to be high: only average rates of 200-800KB/s throughput have to be
handled.
The choice of a new platform also has to take the existing analysis
software into account. All the software for HERMES was developed
on Unix systems. As several different Unix platforms were in use within the
collaboration, an effort was made to keep the code (Unix-)system
independent. Code is available in C, C++, Fortran77, bash, perl
and tk/tcl dialects.
The cluster components
Shifting the batch production away from the SGI multiprocessor system while
maintaining the production speed was achieved by installing a
cluster of 10 PentiumPro200 systems running Linux. (Later - see below - each
PC was upgraded to a dual-CPU PPro system.) Data access to
the tape silos is done via Fast Ethernet (100baseTX) and an
FDDI/100baseTX switch. Recently the switch has been replaced by one
of the PCs configured as an FDDI/Fast-Ethernet router (a minimal
configuration sketch is given at the end of this section). Each PC
is now equipped with
- Asus Motherboard P65UP5 with 2xPPro 200 CPU (256K Cache),
- 4x32MB EDO RAM (60ns),
- NCR SCSI Controller,
- 2x4GB IBM DFRS-34320 FastSCSI2 Disks,
- S3 TRIO Graphics card (only for installation),
- 3.5'' Floppy Disk (installation),
- SMC EtherPower 10/100 (DEC Tulip Chip) Fast Ethernet card.
The machines were bought at COMPTRONIC,
a Hamburg computer dealer. Their excellent support justifies
more than just this link...
Currently - and this situation is unlikely to change in the near future -
hardware of this type offers a price/performance ratio a factor of 2 to 5
better than traditional workstation systems.
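As mentioned above, one of the PCs acts as the FDDI/Fast-Ethernet router
towards the tape silos. The following is a minimal sketch of such a setup on
Linux; the interface names, addresses and netmasks are assumptions for
illustration only and the actual HERMES configuration differs. On the 2.0.x
kernels used here IP forwarding is a compile-time kernel option; later
kernels can also toggle it at run time.

    # Minimal sketch of a PC routing between FDDI and Fast Ethernet
    # (illustrative addresses; IP forwarding enabled in the kernel).
    /sbin/ifconfig fddi0 192.168.1.1 netmask 255.255.255.0 up   # FDDI side, towards the tape silos
    /sbin/ifconfig eth0  192.168.2.1 netmask 255.255.255.0 up   # Fast Ethernet side, towards the cluster
    # The cluster nodes then simply use this machine as their gateway, e.g.
    #   route add default gw 192.168.2.1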
The operating system
There is a variety of operating systems for PCs. The most popular
ones certainly are MS-DOS and other Microsoft products (mainly WinNT),
IBM's OS/2 and various Unix dialects: SCO, Solaris, FreeBSD and Linux.
A rather famous paper comparing different Unix flavours for PCs is the
Lai-Baker paper (Proceedings of USENIX96).
The Linux version used for this comparison is still 1.2, so be sure
to read to the end of the paper where the development and progress
of the newer kernels (1.3) are mentioned.
High Energy Physics software is strongly tied to the
CERN software library which offers a huge
number of (mostly) f77 library routines. The CERN libraries are available for
only a few of the PC operating systems - so the choice becomes
easier: only WinNT and Linux are left.
Porting the Hermes software (500,000 lines of code)
to New Technologies like NT seems to be a challenging task. Several
similar efforts are currently under investigation, and a realistic timescale
for the case of Hermes is in the area of 0.5 to 1 man-years.
Compared to this, the additional cost for the OS, development software and
extra network licences (we need more than 10 Internet connections per
process) is rather small.
So the choice was obvious: Linux, a free Unix operating system that in
addition offers excellent performance. Porting involved the design of an
f77 standardising filter
(f77reorder),
as our code
used many non-standard extensions to f77 that are handled by most commercial
compilers but not by f2c or early g77 versions, the free Fortran compilers
for Linux.
With an effort of 2 man-weeks the whole Hermes source tree was translated
and brought into operation.
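The filter itself is linked above. Purely to illustrate how such a
standardising pass slots into a build (this is not the real f77reorder; the
hypothetical standardise_f77 below merely stands in for it), every source
file is piped through the filter and only the standardised copy is handed to
the free compiler:

    #!/bin/sh
    # Illustrative wrapper only: rewrite non-standard f77 constructs before
    # the free compiler (f2c or g77) ever sees the source.
    FILTER=./standardise_f77            # hypothetical stand-in for f77reorder
    for src in "$@"; do
        obj=`basename "$src" .f`.o
        tmp=/tmp/`basename "$src"`
        $FILTER < "$src" > "$tmp" || exit 1    # produce standard-conforming f77
        g77 -c -o "$obj" "$tmp"   || exit 1    # compile the filtered copy
    done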
The current status
- 16.Sep.96
- Well, the cluster is still young. Currently it is busy with Monte-Carlo
production (computer-simulated events in the detector), as data processing
requires a new setup of the production daemon which is able to distribute
tasks to the different cluster members. (At this time the machines
were still single-CPU machines.)
- 22.Sep.96
- On Saturday and Sunday (21-22 Sept. 96) a first test production of raw 96
data was successfully run on the cluster. The processing went at
the expected speed, about a factor of 2.2 faster than on a single
SGI Challenge MIPS 4400/200MHz CPU.
- 27.Sep.96
- The 3Com905 Fast Ethernet cards were replaced by SMC EtherPower 10/100
cards which now allow full use of the 100Mbit/s bandwidth of
Fast Ethernet.
- 15.Nov.96
- Monte-Carlo production has finished and new calibration data for 96
is available. We start the first test production on 100 runs of 96.
A central job and disk manager controls the job execution on the
cluster via the Hermes
DAD
Slow-Control Scheme (a generic sketch of this kind of task distribution
is given below, after the status list). Click on this graph for the
production control layout:

Job execution can be monitored using
PINK
clients:

The results of the production look promising. Together with code
optimisation, the speed gain per CPU with respect to the last production
on our large SMP machine is of the order of 3.
- 27.Nov.96
- Now that everything has been shown to work reliably we have to 'sell'
the product: some transparencies (656KB)
from a talk given at CERN about the Linux production cluster.
- 20.Dec.96
- The data production is running after some final corrections to the programs
and the detector calibration have been applied. However, we experience
problems with our Fast-Ethernet/FDDI switch: after an average of 12 hours the
Fast-Ethernet part of the switch stops working under heavy load on
the Fast-Ethernet segment.
- 23.Dec.96
- We replace the faulty switch by one of the PCs, which is now equipped
with a DEC DC21140 FDDI adapter
(thanks to DEC for the driver development!).
Now stable operation seems to be possible:

Hermes members can run the above screen using the clusterstatus.pink
command on our SGI.
- 12.Feb.97
- The first iteration of 96 data on the PC farm has finished in time.
- 28.Mar.97
- Up to now the machines have been 100% stable - no crashes since
23.Dec.96. This results in an integrated uptime of 2.5 years for
the cluster (10 machines times roughly three months each). The systems
were fully loaded for 95% of the time.
- 29.Mar.97
-
Never touch a running system - well...
The SMP upgrade arrives and the trouble begins. The first impression
of the dual-CPU systems is great: they run at 199% CPU
usage and do in fact double the available resources, and the price
for the upgrade is small.
But Linux at this time contains a lot of deadlocks in its kernel
and is not SMP safe.
Various iterations of the tulip driver (which in the beginning was believed
to be the faulty component) are tried, various kernel releases
are installed, and Oopses are recorded and sent to the developers.
- 20.May.97
- Finally we got a release of Linux that is reasonably stable for our
environment (uptimes of about one week).
A long period of rebooting (uptimes of 12 hours and shorter on 10 machines
are quite annoying ;-) is over.
HERMES runs a 4GIPS cluster for less than 60k$.
- 30.Jun.97
- W.Wander leaves HERMES and heads towards MIT/Boston. Andrei Shevel
takes over.
- 11.Oct.97
- A.Shevel returns to St.Petersburg. Alexander Kisselev becomes the new
maintainer of the farm.
- Nov-Dec.97
- Extensive tests of the two recently released Linux kernels (2.0.31 and
2.0.32) showed excellent stability (uptimes limited only by manual reboots
or the occasional CPU overheating due to a broken fan).
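As an aside to the 15.Nov.96 entry above: the real production is controlled
through the DAD slow-control scheme and the central job and disk manager, but
the basic idea of handing runs out to the cluster nodes can be sketched
generically. The host names, run numbers and the hermes_reconstruct command
below are made up for illustration only.

    #!/bin/sh
    # Generic illustration of round-robin task distribution, not the DAD scheme.
    NODES="pc01 pc02 pc03 pc04"           # hypothetical cluster host names
    RUNS="1001 1002 1003 1004 1005 1006"  # hypothetical run numbers to process
    set -- $NODES
    NNODES=$#
    i=0
    for run in $RUNS; do
        n=`expr $i % $NNODES + 1`               # pick the next node, round-robin
        node=`echo $NODES | cut -d' ' -f$n`
        rsh $node "hermes_reconstruct $run" &   # hypothetical reconstruction job
        i=`expr $i + 1`
    done
    wait                                        # wait for all remote jobs to finish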
Wolfgang Wander,
wwc@ralph2.mit.edu,
Alexander Kisselev
Last modified: Mon Feb 23 11:47:18 1998