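I. ZooKeeper elects its leader by counting votes. The logic is as follows:
Comparison criteria:
1. Epoch: the leader's term. A higher epoch has higher priority, and the other nodes vote for the node with the higher epoch.
2. ZXID: the ZooKeeper transaction ID. A larger value means newer data; when epochs are equal, the zxid is compared.
3. SID: the unique number of each node in the cluster. When both epoch and zxid are equal, the sid is compared, and the node with the larger sid gets the other nodes' votes.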
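The three criteria above can be sketched as one comparison function. This is a simplified illustration, not the ZooKeeper source: the class name `VoteOrderSketch` and the method `beats` are made up for this example, and the parameter order (id, zxid, epoch) is only chosen to resemble the `totalOrderPredicate` call that appears later in this article.

```java
// Simplified sketch of the vote ordering: epoch first, then zxid, then sid.
// VoteOrderSketch and beats() are hypothetical names used only for illustration.
final class VoteOrderSketch {

    /** Returns true if the "new" vote should replace the "current" vote. */
    static boolean beats(long newId, long newZxid, long newEpoch,
                         long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;   // 1. higher epoch wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;     // 2. same epoch: newer data wins
        }
        return newId > curId;             // 3. same epoch and zxid: larger sid wins
    }

    public static void main(String[] args) {
        // Higher epoch wins even with an older zxid and a smaller sid.
        System.out.println(beats(1, 0x100000001L, 3, 9, 0x200000005L, 2)); // true
        // Identical votes: the new one does not beat the current one.
        System.out.println(beats(2, 7, 1, 2, 7, 1));                       // false
    }
}
```

Note that an identical vote does not win, so a node only changes its proposal when it sees a strictly better one.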
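II. ZooKeeper leader election source code analysis:
public Vote lookForLeader() throws InterruptedException
The election starts in this method (FastLeaderElection.lookForLeader), and its key opening step is: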
synchronized (this) {
    logicalclock.incrementAndGet();
    updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}
Here logicalclock is the election round; it is incremented every time a new election starts. Next, step into updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch()):
synchronized void updateProposal(long leader, long zxid, long epoch) {
    LOG.debug(
        "Updating proposal: {} (newleader), 0x{} (newzxid), {} (oldleader), 0x{} (oldzxid)",
        leader,
        Long.toHexString(zxid),
        proposedLeader,
        Long.toHexString(proposedZxid));
    proposedLeader = leader;
    proposedZxid = zxid;
    proposedEpoch = epoch;
}
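The start of a round can be modelled in a few lines. The following is a minimal sketch, not ZooKeeper code: the class `ElectionRoundSketch`, its fields, and `startRound` are hypothetical, and it assumes the peer is a voting member, so its initial proposal is simply itself.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified model of the state touched when an election round starts.
final class ElectionRoundSketch {
    private final AtomicLong logicalClock = new AtomicLong(); // election round counter
    private final long myId;        // this peer's sid
    private final long myLastZxid;  // last transaction id this peer has logged
    private final long myEpoch;     // this peer's current epoch

    long proposedLeader = -1;
    long proposedZxid = -1;
    long proposedEpoch = -1;

    ElectionRoundSketch(long myId, long myLastZxid, long myEpoch) {
        this.myId = myId;
        this.myLastZxid = myLastZxid;
        this.myEpoch = myEpoch;
    }

    /** Starts a new round: bump the round counter, then propose ourselves. */
    synchronized long startRound() {
        long round = logicalClock.incrementAndGet();
        proposedLeader = myId;
        proposedZxid = myLastZxid;
        proposedEpoch = myEpoch;
        return round;
    }

    public static void main(String[] args) {
        ElectionRoundSketch peer = new ElectionRoundSketch(3, 0x100000007L, 1);
        long round = peer.startRound();
        System.out.println("round=" + round + ", proposedLeader=" + peer.proposedLeader);
    }
}
```

Incrementing the counter and resetting the proposal happen under one lock, which mirrors the synchronized block at the entry of the election: a reader never sees a new round paired with a proposal from the previous one.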
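As the code shows, when voting starts this step only initializes the election state: the node proposes itself. The votes are actually sent by sendNotifications(), which queues one notification carrying the current proposal for every voter in the current and next configuration, not only for the node it is voting for. Its code is: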
private void sendNotifications() {
    for (long sid : self.getCurrentAndNextConfigVoters()) {
        QuorumVerifier qv = self.getQuorumVerifier();
        ToSend notmsg = new ToSend(
            ToSend.mType.notification,
            proposedLeader,
            proposedZxid,
            logicalclock.get(),
            QuorumPeer.ServerState.LOOKING,
            sid,
            proposedEpoch,
            qv.toString().getBytes(UTF_8));
        LOG.debug(
            "Sending Notification: {} (n.leader), 0x{} (n.zxid), 0x{} (n.round), {} (recipient),"
                + " {} (myid), 0x{} (n.peerEpoch) ",
            proposedLeader,
            Long.toHexString(proposedZxid),
            Long.toHexString(logicalclock.get()),
            sid,
            self.getId(),
            Long.toHexString(proposedEpoch));
        sendqueue.offer(notmsg);
    }
}
Stepping into offer (sendqueue is a java.util.concurrent.LinkedBlockingQueue, so this is JDK code):
public boolean offer(E e) {
    if (e == null) throw new NullPointerException();
    final AtomicInteger count = this.count;
    if (count.get() == capacity)
        return false;
    final int c;
    final Node<E> node = new Node<E>(e);
    final ReentrantLock putLock = this.putLock;
    putLock.lock();
    try {
        if (count.get() == capacity)
            return false;
        enqueue(node);
        c = count.getAndIncrement();
        if (c + 1 < capacity)
            notFull.signal();
    } finally {
        putLock.unlock();
    }
    if (c == 0)
        signalNotEmpty();
    return true;
}
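All offer does is append the vote message to the queue without blocking (it returns false if the queue is full). The message then waits in the queue until a sender thread (the WorkerSender in FastLeaderElection) picks it up.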
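The hand-off through the queue can be sketched with the JDK class alone. This is an illustrative example rather than ZooKeeper code: `SendQueueSketch`, `broadcast`, and `drain` are hypothetical names, messages are plain strings instead of ToSend objects, and the queue is deliberately bounded to show when offer returns false.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical producer/consumer sketch around LinkedBlockingQueue.offer/poll.
final class SendQueueSketch {

    /** Producer side: queue one message per voter; returns how many were accepted. */
    static int broadcast(LinkedBlockingQueue<String> sendQueue, long[] voterSids, String vote) {
        int accepted = 0;
        for (long sid : voterSids) {
            // offer never blocks: it returns false when a bounded queue is full.
            if (sendQueue.offer(vote + " -> sid " + sid)) {
                accepted++;
            }
        }
        return accepted;
    }

    /** Consumer side: take everything currently queued, in FIFO order. */
    static List<String> drain(LinkedBlockingQueue<String> sendQueue) {
        List<String> sent = new ArrayList<>();
        String msg;
        while ((msg = sendQueue.poll()) != null) { // poll returns null when the queue is empty
            sent.add(msg);
        }
        return sent;
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(2); // capacity 2 for the demo
        int accepted = broadcast(queue, new long[] {1, 2, 3}, "vote(leader=3)");
        System.out.println("accepted=" + accepted);   // accepted=2, the third offer is rejected
        System.out.println(drain(queue));
    }
}
```

A queue created with the no-argument constructor has a capacity of Integer.MAX_VALUE, so in that configuration the false branch of offer is effectively unreachable.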
Election code for a notification from a peer that is still in the LOOKING state:
case LOOKING:
    if (getInitLastLoggedZxid() == -1) {
        LOG.debug("Ignoring notification as our zxid is -1");
        break;
    }
    if (n.zxid == -1) {
        LOG.debug("Ignoring notification from member with -1 zxid {}", n.sid);
        break;
    }
    if (n.electionEpoch > logicalclock.get()) {
        logicalclock.set(n.electionEpoch);
        recvset.clear();
        if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
            updateProposal(n.leader, n.zxid, n.peerEpoch);
        } else {
            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
        }
        sendNotifications();
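If the current node finds that its own election round (logicalclock) is lower than the round in the received notification (n.electionEpoch), it adopts the higher round, clears recvset, and compares the received vote with its own initial vote using totalOrderPredicate. If the received vote wins (higher epoch, then higher zxid, then higher sid), the node adopts that vote; otherwise it proposes itself again. Either way it rebroadcasts its proposal with sendNotifications(). In the branches that follow this excerpt in the ZooKeeper source, a notification from a lower round than logicalclock is simply ignored: its sender is behind, possibly because of a network partition, and its vote may be stale, so that sender has to catch up first. When the two rounds are equal, the node switches its proposal and rebroadcasts only if the received vote wins the comparison.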
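The three cases (notification round higher, lower, or equal) can be put together in a small self-contained model. This is a sketch under simplifying assumptions, not the ZooKeeper implementation: `LookingStepSketch` and its members are hypothetical, rebroadcasting is left out, and received votes are stored as plain arrays.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of how a LOOKING peer reacts to one notification from another LOOKING peer.
final class LookingStepSketch {
    long logicalClock = 1;                        // our election round
    final long myId, myZxid, myEpoch;             // our initial vote (ourselves)
    long proposedLeader, proposedZxid, proposedEpoch;
    final Map<Long, long[]> recvSet = new HashMap<>(); // sender sid -> {leader, zxid, peerEpoch}

    LookingStepSketch(long myId, long myZxid, long myEpoch) {
        this.myId = myId;
        this.myZxid = myZxid;
        this.myEpoch = myEpoch;
        adopt(myId, myZxid, myEpoch);
    }

    static boolean beats(long newId, long newZxid, long newEpoch,
                         long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) return newEpoch > curEpoch;
        if (newZxid != curZxid) return newZxid > curZxid;
        return newId > curId;
    }

    private void adopt(long leader, long zxid, long epoch) {
        proposedLeader = leader;
        proposedZxid = zxid;
        proposedEpoch = epoch;
    }

    /** Returns true if the notification was counted, false if it was ignored as stale. */
    boolean onNotification(long fromSid, long leader, long zxid, long peerEpoch, long round) {
        if (round > logicalClock) {
            // We are behind: jump to the newer round and forget the votes of the old one.
            logicalClock = round;
            recvSet.clear();
            if (beats(leader, zxid, peerEpoch, myId, myZxid, myEpoch)) {
                adopt(leader, zxid, peerEpoch);
            } else {
                adopt(myId, myZxid, myEpoch);
            }
            // (a real peer would rebroadcast its proposal here)
        } else if (round < logicalClock) {
            return false;                         // the sender is behind: ignore its vote
        } else if (beats(leader, zxid, peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
            adopt(leader, zxid, peerEpoch);       // same round, better vote: switch to it
            // (a real peer would rebroadcast its proposal here)
        }
        recvSet.put(fromSid, new long[] {leader, zxid, peerEpoch});
        return true;
    }

    public static void main(String[] args) {
        LookingStepSketch peer = new LookingStepSketch(1, 5, 1);
        peer.onNotification(2, 2, 5, 1, 1);       // same round, larger sid wins
        System.out.println("proposedLeader=" + peer.proposedLeader); // proposedLeader=2
    }
}
```

Note the asymmetry in the first branch: after jumping to a newer round, the received vote is compared against the peer's own initial vote, not against whatever it had proposed in the abandoned round.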
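After that, the received vote is recorded in recvset, and a vote tracker is built to check whether the current proposal is backed by a quorum: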
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));
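What such a tracker has to decide can be sketched as a plain majority count. This is an assumption-laden simplification: `MajoritySketch` and `hasMajority` are hypothetical, votes are reduced to the proposed leader's sid, and a simple majority of the ensemble is assumed, whereas ZooKeeper delegates the decision to its configured quorum verifier.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical majority check over the received votes (sender sid -> sid voted for).
final class MajoritySketch {

    /** True if strictly more than half of the ensemble voted for proposedLeader. */
    static boolean hasMajority(Map<Long, Long> recvSet, long proposedLeader, int ensembleSize) {
        int acks = 0;
        for (long votedFor : recvSet.values()) {
            if (votedFor == proposedLeader) {
                acks++;
            }
        }
        return acks > ensembleSize / 2;
    }

    public static void main(String[] args) {
        Map<Long, Long> recvSet = new HashMap<>();
        recvSet.put(1L, 3L); // sid 1 votes for 3
        recvSet.put(2L, 3L); // sid 2 votes for 3
        recvSet.put(3L, 3L); // sid 3 votes for itself
        System.out.println(hasMajority(recvSet, 3L, 5)); // true: 3 of 5
    }
}
```

With an ensemble of five, three matching votes are enough; two are not, which is why a partitioned minority cannot elect a leader on its own.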
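If a leader has already been elected but the current node does not know it yet, a notification from that leader takes the node into the following logic: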
private Vote receivedLeadingNotification(Map<Long, Vote> recvset, Map<Long, Vote> outofelection, SyncedLearnerTracker voteSet, Notification n) {
    Vote result = receivedFollowingNotification(recvset, outofelection, voteSet, n);
    if (result == null) {
        if (self.getQuorumVerifier().getNeedOracle() && !self.getQuorumVerifier().askOracle()) {
            LOG.info("Oracle indicates to follow");
            setPeerState(n.leader, voteSet);
            Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
            leaveInstance(endVote);
            return endVote;
        } else {
            LOG.info("Oracle indicates not to follow");
            return null;
        }
    } else {
        return result;
    }
}