Monday, February 13, 2006

Condor is now installed on the Grid nodes

Condor version and platform:
$CondorVersion: 6.6.10 Jun 13 2005 $
$CondorPlatform: I386-LINUX_RH9 $


Condor is now installed on Grid4, 5, 6, 7 and 9. All the nodes are both compute and submit nodes; Grid9 is the Central Manager.
Grid8 is down due to a hard disk failure.

[condor@grid4 condor]$ condor_status

Name          OpSys  Arch   State      Activity  LoadAv  Mem  ActvtyTime

vm1@grid4.bgu LINUX  INTEL  Owner      Idle      0.000   501  [?????]
vm2@grid4.bgu LINUX  INTEL  Unclaimed  Idle      0.000   501  0+00:39:33
vm1@grid5.bgu LINUX  INTEL  Owner      Idle      0.000   249  [?????]
vm2@grid5.bgu LINUX  INTEL  Owner      Idle      0.000   249  [?????]
vm1@grid6.bgu LINUX  INTEL  Owner      Idle      0.000   501  [?????]
vm2@grid6.bgu LINUX  INTEL  Owner      Idle      0.000   501  [?????]
vm1@grid7.bgu LINUX  INTEL  Owner      Idle      0.000   501  [?????]
vm2@grid7.bgu LINUX  INTEL  Owner      Idle      0.000   501  [?????]
vm1@grid9.bgu LINUX  INTEL  Owner      Idle      0.000   501  0+00:15:09
vm2@grid9.bgu LINUX  INTEL  Unclaimed  Idle      0.000   501  0+00:00:05

             Machines  Owner  Claimed  Unclaimed  Matched  Preempting

 INTEL/LINUX       10      8        0          2        0           0

       Total       10      8        0          2        0           0
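
The totals above can be cross-checked by tallying the State column of the listing. This is only a minimal sketch: the function name slot_states is made up for illustration, and it assumes every slot name contains an "@", which holds for the vmN@host names in this pool but not necessarily for single-VM machines.

```shell
#!/bin/sh
# slot_states (hypothetical helper): read condor_status text on stdin and
# count how many slots are in each State.
# Assumption: slot lines look like "vm1@grid4.bgu LINUX INTEL Owner ...",
# so the name contains "@" and State is the 4th whitespace-separated field.
slot_states() {
    awk '$1 ~ /@/ { n[$4]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Usage on a node where Condor is running:
#   condor_status | slot_states
```

For the listing above this should report 8 slots in Owner and 2 in Unclaimed, matching the Total line.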


Condor_view is available here.

The Grid computers specifications

Grid4: Processor: Dual AMD Athlon(tm) MP 2000+, 1666MHz, cache: 256 KB, RAM: 1GB, HD: 40GB, OS: Scientific Linux 3.03, Kernel: 2.4.21-20.ELsmp

Grid5: Processor: Intel(R) Pentium(R) 4 CPU 3.00GHz, cache: 1MB, HD: 80GB, OS: Scientific Linux, Kernel: 2.4.21-37.ELsmp

Grid6: Processor: Dual AMD Athlon(tm) MP 1900+, 1600MHz, cache: 256 KB, RAM: 1GB, HD: 40GB, OS: Scientific Linux, Kernel: 2.4.21-20.ELsmp

Grid7: Processor: Dual AMD Athlon(tm) MP 2000+, 1666MHz, cache: 256 KB, RAM: 1GB, HD: 40GB, OS: Scientific Linux 3.03, Kernel: 2.4.21-20.ELsmp

Grid8: Down!!!

Grid9: Processor: Dual AMD Athlon(tm) MP 2000+, 1666MHz, cache: 256 KB, RAM: 1GB, HD: 40GB, OS: Scientific Linux 3.03, Kernel: 2.4.21-20.ELsmp

A New Condor Application at the BGU

Collaboration with Chen Keasar
In the forthcoming weeks I will try to "Condorize" his code, Meshi.

BGU grid computers maintenance

Welcome to my Grid Computing and other stuff blog!
Your comments will be most appreciated.

Today's activities
1) Set ssh access without password between nodes.
2) Install Condor 6.6.10 on grid8.bgu.ac.il, which will be the central manager.
Working under /usr/local:
[root@grid8 local]# gzip -d condor-6.6.10-linux-x86-glibc23-dynamic.tar.gz
[root@grid8 local]# tar xvf ./condor-6.6.10-linux-x86-glibc23-dynamic.tar
[root@grid8 local]# hostname grid8.bgu.ac.il
[root@grid8 local]# cd condor-6.6.10
[root@grid8 condor-6.6.10]# ./condor_install
My answers to the Condor installer:
Full installation.
Multiple machines.
Machines do not share files via a file server.
There is no release dir yet.
Installation dir: /usr/local/condor
Create that directory.
Notify by Email to: tel-zur@ee.bgu.ac.il
Mail path: /bin/mail
Are all the machines from the domain "bgu.ac.il"? Yes.
Unique UID - No.
Enable Java support: Yes
Java exists under: /usr/bin/java
Create links to other directories: Yes
"bin" will go to /usr/local/bin
Full name of the central manager: grid8.bgu.ac.il (this node)
Condor directories will go to: /home/condor
Local config file:
Creating config files in "/home/condor" ... done.

Configuring global condor config file ... done.
Created /usr/local/condor/etc/condor_config.

Pool name: "BGU grid"
Should I put in a soft link from /home/condor/condor_config to
/usr/local/condor/etc/condor_config [yes] yes

As "root" start up Condor: /usr/local/condor/etc/examples/condor.boot start
Unfortunately, Condor did not start.
It seems that the HD has a problem. Here are a few lines from /var/log/messages:
Feb 13 09:58:58 grid8 kernel: end_request: I/O error, dev 03:02 (hda), sector 591512
Feb 13 10:04:02 grid8 kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Feb 13 10:04:02 grid8 kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=800358, sector=591512
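
A check like this could be run before the next installation instead of finding out afterwards. The sketch below counts kernel-reported disk errors per IDE device in a syslog-style file. The function name disk_errors is hypothetical, and the patterns are taken only from the lines quoted above, so other kernels or drivers may word their errors differently.

```shell
#!/bin/sh
# disk_errors (hypothetical helper): count kernel disk-error lines per device.
# $1: log file to scan (defaults to /var/log/messages).
# Assumption: errors look like the lines quoted in this post, i.e. they
# contain "I/O error" or "UncorrectableError" and name an IDE device (hdX).
disk_errors() {
    awk '/kernel: .*(I\/O error|UncorrectableError)/ {
             # Pull out the first hdX device name mentioned on the line.
             if (match($0, /hd[a-z]/))
                 count[substr($0, RSTART, RLENGTH)]++
         }
         END { for (dev in count) print dev, count[dev] }' "${1:-/var/log/messages}"
}

# Usage (as root, since /var/log/messages is usually not world-readable):
#   disk_errors
#   disk_errors /var/log/messages.1
```

Any output at all is a reason to look at the disk before installing onto it.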

The hard disk will be replaced, and I lost a working hour :(

The Condor central manager installation will be repeated on Grid9.
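
For reference, step 1 above (ssh access without a password between nodes) can be sketched roughly as follows. This is not a record of the exact commands used: the account name, the key path and the node list are assumptions, and the function only prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch: print the commands that set up password-less ssh from this node
# to the others. Nothing is executed here; review the output first.
# Assumptions (not from the post): the "condor" account, an RSA key in the
# default location, and this list of nodes.
NODES="grid4 grid5 grid6 grid7 grid9"
DOMAIN="bgu.ac.il"
SSH_USER="condor"
KEY='$HOME/.ssh/id_rsa'    # single quotes: $HOME is expanded when the printed command runs

print_ssh_setup() {
    # Generate the key pair once, with an empty passphrase.
    echo "ssh-keygen -t rsa -N '' -f $KEY"
    # Append the public key to authorized_keys on every node.
    for node in $NODES; do
        echo "cat $KEY.pub | ssh $SSH_USER@$node.$DOMAIN 'mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'"
    done
}

print_ssh_setup
```

Once the printed commands look right, they can be run with "print_ssh_setup | sh"; each node asks for the password one last time. An empty passphrase is convenient on a closed cluster but leaves the private key unprotected, so that is a trade-off to decide per site.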