1) if the master node fails...will the whole cluster fail?
In general, with one control node, if the main node goes down, the cluster is inaccessible from the outside world. There are three solutions to this problem:
- Have two (or more) control nodes, see (a). (we don't do this)
- Back up to a large medium on a regular basis, and be able to "quickly" turn a compute node into the control node while the control node is out.
- Have the master node run with RAID 1 -- and be able to move the mirror and second NIC to a compute node "quickly," see (2). (we don't do this either, but I've been pushing for it)
We have one "master" node which, among other things, holds the compiler (licensed to only one machine), the backup (of the whole cluster), the interface to the outside world, and the control node role for most of our parallel work. However, once compiled, most or all of the simulations we run can be run using another node as the control node. If the control node went down, we could replace it with a compute node in the short term by moving the NIC (but we wouldn't be able to compile in parallel). While in this state we can fix the main node (and, if necessary, restore its contents to a new hard drive -- this takes a very long time, but the restored system is EXACTLY as it was at 1:00am of the day it crashed). Work is hindered, but not stopped.
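For what it's worth, the "which node should be control right now" decision can be sketched in a few lines of Python. This is only an illustration, not what we run: the function names, the use of a TCP connect to port 22 as the liveness check, and the idea of an ordered standby list are all my assumptions. It only tells you which node to promote; moving the NIC is still done by hand.

```python
import socket

def is_reachable(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 22 (ssh) is an assumption; use whatever service your nodes expose.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_control(control, standbys, alive=is_reachable):
    """Return the node that should act as control node right now.

    Prefers the normal control node; otherwise the first standby compute
    node that answers. Returns None if nothing answers at all.
    """
    if alive(control):
        return control
    for node in standbys:
        if alive(node):
            return node
    return None
```

You would call it with your own (here hypothetical) hostnames, e.g. pick_control("master", ["node01", "node02"]), from a cron job or by hand when something looks wrong.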
a) if so, is it possible to have two master nodes in one cluster?
"Master" is an interesting word. There are three different types of nodes a cluster can have:
compute - dumb node that just runs the parallel code
control - the "master" or control for the parallel processes
display/output - the node which collects/displays output from parallel processes, stats about system health, CPU usage and the like.
In general a "master" is thought to be just the second and third rolled together. There is nothing stopping you from separating that functionality or having multiple of each.
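One way to picture the roles is as flags that can be combined on a single node. The following is a small Python sketch of that idea only; the Role names and the node names in the example layout are made up for illustration and don't correspond to any particular cluster software.

```python
from enum import Flag, auto

class Role(Flag):
    COMPUTE = auto()            # just runs the parallel code
    CONTROL = auto()            # controls the parallel processes
    DISPLAY = auto()            # collects/displays output and health stats
    MASTER = CONTROL | DISPLAY  # the usual "master": control + display

# Hypothetical layout: one classic master, plus a compute node that can
# also act as a second control node.
cluster = {
    "node00": Role.MASTER,
    "node01": Role.COMPUTE,
    "node02": Role.COMPUTE | Role.CONTROL,
}

def nodes_with(role, layout):
    """Return the sorted names of nodes whose roles include `role`."""
    return sorted(name for name, roles in layout.items() if role in roles)
```

With this layout, nodes_with(Role.CONTROL, cluster) lists both node00 and node02, which is the "two masters" case from the question.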
2) is it possible to have nodes mirror each other in a cluster? (ie, mysql database, qmail)
Yes, but it is easier to use RAID 1. When the system goes down, just move the mirror drive to another computer and stick in a second NIC.
But all this is just my humble opinion as the sys admin of a 22-node Beowulf for going on 2 years.