[Big Data] Hadoop Source Code Analysis (26)


ZKFC Source Code Analysis

Part (25) analyzed the startup of ZKFC. During startup, ZKFC connects to ZooKeeper and to the NameNode, and runs health checks against the NameNode.

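The NameNode health check is performed by calling one of the NameNode's own methods over RPC. The main method involved is monitorHealth. The earlier analysis of NameNode startup showed that the class serving the NameNode's RPC requests is NameNodeRpcServer, whose implementation of this method is as follows: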

  public synchronized void monitorHealth() throws HealthCheckFailedException,
      AccessControlException, IOException {
    checkNNStartup();
    nn.monitorHealth();
  }

The key here is the NameNode's own monitorHealth method, called on line 4. Its content is as follows:

synchronized void monitorHealth() 
      throws HealthCheckFailedException, AccessControlException {
    namesystem.checkSuperuserPrivilege();
    if (!haEnabled) {
      return; // no-op, if HA is not enabled
    }
    getNamesystem().checkAvailableResources();
    if (!getNamesystem().nameNodeHasResourcesAvailable()) {
      throw new HealthCheckFailedException(
          "The NameNode has no resources available");
    }
  }
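
The way a caller might turn the outcome of monitorHealth into a health state can be sketched as follows. This is a minimal, self-contained illustration and not Hadoop's code: HealthCheckTarget, HealthProbe and the local HealthCheckFailedException are hypothetical stand-ins, and the mapping of failures to SERVICE_UNHEALTHY and SERVICE_NOT_RESPONDING is an assumption based on the state names that appear later in this article.

```java
import java.io.IOException;

// Hypothetical stand-in for the Hadoop exception of the same name.
class HealthCheckFailedException extends IOException {
    HealthCheckFailedException(String msg) { super(msg); }
}

// Hypothetical interface mirroring the shape of monitorHealth():
// it returns normally when the service is healthy.
interface HealthCheckTarget {
    void monitorHealth() throws IOException;
}

enum HealthState { SERVICE_HEALTHY, SERVICE_UNHEALTHY, SERVICE_NOT_RESPONDING }

class HealthProbe {
    // Runs one health check and maps the outcome to a state.
    static HealthState probe(HealthCheckTarget target) {
        try {
            target.monitorHealth();
            return HealthState.SERVICE_HEALTHY;
        } catch (HealthCheckFailedException e) {
            // The service answered but reported a problem (e.g. no resources).
            return HealthState.SERVICE_UNHEALTHY;
        } catch (IOException | RuntimeException e) {
            // The call itself failed, e.g. the connection was refused.
            return HealthState.SERVICE_NOT_RESPONDING;
        }
    }

    public static void main(String[] args) {
        System.out.println(probe(() -> { }));
        System.out.println(probe(() -> {
            throw new HealthCheckFailedException("The NameNode has no resources available");
        }));
        System.out.println(probe(() -> { throw new IOException("connection refused"); }));
    }
}
```

Keeping the mapping in one small function makes the three outcomes easy to test without any RPC machinery.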

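After the health check succeeds, control returns to the ZKFC client that made the remote call. Part (25) showed that once this call returns, an enterState method is executed to enter the corresponding state. The call site looks like this: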

[Figure: code snippet showing the call to enterState]

The enterState method is as follows:

private synchronized void enterState(State newState) {
    if (newState != state) {
      LOG.info("Entering state " + newState);
      state = newState;
      synchronized (callbacks) {
        for (Callback cb : callbacks) {
          cb.enteredState(newState);
        }
      }
    }
  }

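The enterState method is simple. If the new state differs from the current one, it assigns the new state to state, then iterates over the objects in callbacks and calls each one's enteredState method. As analyzed in Part (25), when the healthMonitor is created, addCallback is called to register a callback in this callbacks list. The callback registered there is an instance of the HealthCallbacks class, whose enteredState method is as follows: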

  public void enteredState(HealthMonitor.State newState) {
      setLastHealthState(newState);
      recheckElectability();
    }
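
The callback mechanism described above (register with addCallback, notify from enterState) can be sketched in isolation. MiniHealthMonitor is a hypothetical, stripped-down class written for this illustration; it only mirrors the structure of the listing and omits logging and the actual health checks.

```java
import java.util.ArrayList;
import java.util.List;

class MiniHealthMonitor {
    enum State { INITIALIZING, SERVICE_HEALTHY, SERVICE_UNHEALTHY }

    interface Callback { void enteredState(State newState); }

    private final List<Callback> callbacks = new ArrayList<>();
    private State state = State.INITIALIZING;

    void addCallback(Callback cb) {
        synchronized (callbacks) { callbacks.add(cb); }
    }

    // Notifies listeners only when the state actually changes.
    synchronized void enterState(State newState) {
        if (newState != state) {
            state = newState;
            synchronized (callbacks) {
                for (Callback cb : callbacks) {
                    cb.enteredState(newState);
                }
            }
        }
    }

    public static void main(String[] args) {
        MiniHealthMonitor monitor = new MiniHealthMonitor();
        List<State> seen = new ArrayList<>();
        monitor.addCallback(seen::add);
        monitor.enterState(State.SERVICE_HEALTHY);
        monitor.enterState(State.SERVICE_HEALTHY);   // unchanged, so no callback
        monitor.enterState(State.SERVICE_UNHEALTHY);
        System.out.println(seen); // [SERVICE_HEALTHY, SERVICE_UNHEALTHY]
    }
}
```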

The key call in enteredState is recheckElectability on line 3, whose content is as follows:

private void recheckElectability() {
    // Maintain lock ordering of elector -> ZKFC
    synchronized (elector) {
      synchronized (this) {
        boolean healthy = lastHealthState == State.SERVICE_HEALTHY;

        long remainingDelay = delayJoiningUntilNanotime - System.nanoTime(); 
        if (remainingDelay > 0) {
          if (healthy) {
            LOG.info("Would have joined master election, but this node is " +
                "prohibited from doing so for " +
                TimeUnit.NANOSECONDS.toMillis(remainingDelay) + " more ms");
          }
          scheduleRecheck(remainingDelay);
          return;
        }

        switch (lastHealthState) {
        case SERVICE_HEALTHY:
          elector.joinElection(targetToData(localTarget));
          if (quitElectionOnBadState) {
            quitElectionOnBadState = false;
          }
          break;

        case INITIALIZING:
          LOG.info("Ensuring that " + localTarget + " does not " +
              "participate in active master election");
          elector.quitElection(false);
          serviceState = HAServiceState.INITIALIZING;
          break;

        case SERVICE_UNHEALTHY:
        case SERVICE_NOT_RESPONDING:
          LOG.info("Quitting master election for " + localTarget +
              " and marking that fencing is necessary");
          elector.quitElection(true);
          serviceState = HAServiceState.INITIALIZING;
          break;

        case HEALTH_MONITOR_FAILED:
          fatalError("Health monitor failed!");
          break;

        default:
          throw new IllegalArgumentException("Unhandled state:" + lastHealthState);
        }
      }
    }
  }
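
The decision made by recheckElectability can be condensed into a pure function from (health state, remaining delay) to an action. The sketch below is an illustration only: the Action enum and the decide method are names invented here, and the real method performs the actions directly instead of returning a value.

```java
class ElectabilitySketch {
    enum Health {
        INITIALIZING, SERVICE_HEALTHY, SERVICE_UNHEALTHY,
        SERVICE_NOT_RESPONDING, HEALTH_MONITOR_FAILED
    }

    // Hypothetical labels for what the listing does in each branch.
    enum Action { RECHECK_LATER, JOIN_ELECTION, QUIT_WITHOUT_FENCING, QUIT_NEEDS_FENCING, FATAL }

    static Action decide(Health health, long remainingDelayNanos) {
        // While the node is still barred from joining, only schedule a recheck.
        if (remainingDelayNanos > 0) {
            return Action.RECHECK_LATER;
        }
        switch (health) {
            case SERVICE_HEALTHY:
                return Action.JOIN_ELECTION;
            case INITIALIZING:
                return Action.QUIT_WITHOUT_FENCING;  // quitElection(false)
            case SERVICE_UNHEALTHY:
            case SERVICE_NOT_RESPONDING:
                return Action.QUIT_NEEDS_FENCING;    // quitElection(true)
            case HEALTH_MONITOR_FAILED:
                return Action.FATAL;
            default:
                throw new IllegalArgumentException("Unhandled state:" + health);
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(Health.SERVICE_HEALTHY, 0));     // JOIN_ELECTION
        System.out.println(decide(Health.SERVICE_HEALTHY, 5_000)); // RECHECK_LATER
        System.out.println(decide(Health.SERVICE_UNHEALTHY, 0));   // QUIT_NEEDS_FENCING
    }
}
```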

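In recheckElectability, the key part is the switch statement on line 18, which runs a different branch depending on the value of lastHealthState. That value is the state passed to enterState after the health check, which, as noted above, is SERVICE_HEALTHY. So lines 19 to 24 are executed, and the important call is elector.joinElection on line 20. As mentioned in the analysis of ZKFC initialization, elector is an instance of the ActiveStandbyElector class. Its joinElection method is as follows: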

 public synchronized void joinElection(byte[] data)
      throws HadoopIllegalArgumentException {

    if (data == null) {
      throw new HadoopIllegalArgumentException("data cannot be null");
    }

    if (wantToBeInElection) {
      LOG.info("Already in election. Not re-connecting.");
      return;
    }

    appData = new byte[data.length];
    System.arraycopy(data, 0, appData, 0, data.length);

    LOG.debug("Attempting active election for " + this);
    joinElectionInternal();
  }

joinElection mostly validates and copies its argument. The key call is joinElectionInternal on line 17, whose content is as follows:

  private void joinElectionInternal() {
    Preconditions.checkState(appData != null,
        "trying to join election without any app data");
    if (zkClient == null) {
      if (!reEstablishSession()) {
        fatalError("Failed to reEstablish connection with ZooKeeper");
        return;
      }
    }

    createRetryCount = 0;
    wantToBeInElection = true;
    createLockNodeAsync();
  }

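joinElectionInternal first checks whether zkClient is null (re-establishing the session if so), then sets a few fields, and finally calls createLockNodeAsync, whose content is as follows: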

  private void createLockNodeAsync() {
    zkClient.create(zkLockFilePath, appData, zkAcl, CreateMode.EPHEMERAL,
        this, zkClient);
  }

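As the code above shows, the election comes down to creating a znode at a designated path in ZooKeeper. If the creation succeeds, this node has been elected as the master; if it fails, it becomes a standby. Among the arguments, zkLockFilePath is the path of the lock znode, and CreateMode.EPHEMERAL is the znode type (an ephemeral znode is deleted when the session of the client that created it ends, for example after the client disconnects). The argument this is the elector itself, an ActiveStandbyElector, passed in as the callback. This works because ActiveStandbyElector implements both the StatCallback and StringCallback interfaces.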
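
The idea of electing a leader by racing to create an ephemeral node can be sketched without a ZooKeeper server. FakeLockStore below is a hypothetical in-memory stand-in written for this illustration, not the ZooKeeper API; the path and session names are made up as well.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a ZooKeeper ensemble.
class FakeLockStore {
    private final Map<String, String> ownerByPath = new HashMap<>();

    // Creates an "ephemeral" node owned by a session; fails if the path exists.
    synchronized boolean createEphemeral(String path, String sessionId) {
        return ownerByPath.putIfAbsent(path, sessionId) == null;
    }

    // Ending a session deletes every ephemeral node that session owns.
    synchronized void closeSession(String sessionId) {
        ownerByPath.values().removeIf(sessionId::equals);
    }
}

class ElectionSketch {
    static final String LOCK_PATH = "/election/lock"; // illustrative path

    // Whoever creates the lock node first is active; everyone else is standby.
    static String join(FakeLockStore store, String sessionId) {
        return store.createEphemeral(LOCK_PATH, sessionId) ? "ACTIVE" : "STANDBY";
    }

    public static void main(String[] args) {
        FakeLockStore store = new FakeLockStore();
        System.out.println("nn1 -> " + join(store, "nn1")); // nn1 -> ACTIVE
        System.out.println("nn2 -> " + join(store, "nn2")); // nn2 -> STANDBY
        store.closeSession("nn1");                          // nn1's lock node disappears
        System.out.println("nn2 -> " + join(store, "nn2")); // nn2 -> ACTIVE
    }
}
```

Because the lock node vanishes with its owner's session, a crashed active node frees the lock without any explicit cleanup, which is what lets a standby take over.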

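Whether the creation succeeds or fails, ZooKeeper calls the processResult method of ActiveStandbyElector. This method checks the result: on success, the corresponding NameNode is made active; otherwise it becomes standby. The processResult method is as follows: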

public synchronized void processResult(int rc, String path, Object ctx,
      String name) {
    if (isStaleClient(ctx)) return;
    LOG.debug("CreateNode result: " + rc + " for path: " + path
        + " connectionState: " + zkConnectionState +
        "  for " + this);

    Code code = Code.get(rc);
    if (isSuccess(code)) {
      // we successfully created the znode. we are the leader. start monitoring
      if (becomeActive()) {
        monitorActiveStatus();
      } else {
        reJoinElectionAfterFailureToBecomeActive();
      }
      return;
    }

    if (isNodeExists(code)) {
      if (createRetryCount == 0) {
        // znode exists and we did not retry the operation. so a different
        // instance has created it. become standby and monitor lock.
        becomeStandby();
      }
      // if we had retried then the znode could have been created by our first
      // attempt to the server (that we lost) and this node exists response is
      // for the second attempt. verify this case via ephemeral node owner. this
      // will happen on the callback for monitoring the lock.
      monitorActiveStatus();
      return;
    }

    String errorMessage = "Received create error from Zookeeper. code:"
        + code.toString() + " for path " + path;
    LOG.debug(errorMessage);

    if (shouldRetry(code)) {
      if (createRetryCount < maxRetryNum) {
        LOG.debug("Retrying createNode createRetryCount: " + createRetryCount);
        ++createRetryCount;
        createLockNodeAsync();
        return;
      }
      errorMessage = errorMessage
          + ". Not retrying further znode create connection errors.";
    } else if (isSessionExpired(code)) {
      // This isn't fatal - the client Watcher will re-join the election
      LOG.warn("Lock acquisition failed because session was lost");
      return;
    }

    fatalError(errorMessage);
  }

Two calls in processResult matter most: becomeActive on line 11 and becomeStandby on line 23. These two methods switch the state of the NameNode.
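
The branching in processResult can be summarized as a small decision function. The sketch below is a simplification under stated assumptions: the Code and Next enums are invented for this illustration, and treating CONNECTIONLOSS as the retryable code is an assumption, since the listing does not show what shouldRetry checks.

```java
class CreateResultSketch {
    // Simplified result codes; only the names relevant to the listing are modeled.
    enum Code { OK, NODEEXISTS, CONNECTIONLOSS, SESSIONEXPIRED, OTHER }

    // Hypothetical labels for what the listing does next in each branch.
    enum Next {
        TRY_BECOME_ACTIVE, BECOME_STANDBY_AND_MONITOR, MONITOR_ONLY,
        RETRY_CREATE, WAIT_FOR_REJOIN, FATAL
    }

    static Next decide(Code code, int createRetryCount, int maxRetryNum) {
        if (code == Code.OK) {
            return Next.TRY_BECOME_ACTIVE;
        }
        if (code == Code.NODEEXISTS) {
            // Without a retry, someone else must own the lock: become standby.
            // After a retry, the node may be ours, so only monitor and verify.
            return createRetryCount == 0 ? Next.BECOME_STANDBY_AND_MONITOR : Next.MONITOR_ONLY;
        }
        if (code == Code.CONNECTIONLOSS) { // assumed to be the retryable case
            return createRetryCount < maxRetryNum ? Next.RETRY_CREATE : Next.FATAL;
        }
        if (code == Code.SESSIONEXPIRED) {
            return Next.WAIT_FOR_REJOIN; // not fatal; the watcher re-joins later
        }
        return Next.FATAL;
    }

    public static void main(String[] args) {
        System.out.println(decide(Code.OK, 0, 3));             // TRY_BECOME_ACTIVE
        System.out.println(decide(Code.NODEEXISTS, 0, 3));     // BECOME_STANDBY_AND_MONITOR
        System.out.println(decide(Code.CONNECTIONLOSS, 3, 3)); // FATAL
    }
}
```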

First, the becomeActive method:

private boolean becomeActive() {
    assert wantToBeInElection;
    if (state == State.ACTIVE) {
      // already active
      return true;
    }
    try {
      Stat oldBreadcrumbStat = fenceOldActive();
      writeBreadCrumbNode(oldBreadcrumbStat);

      LOG.debug("Becoming active for " + this);
      appClient.becomeActive();
      state = State.ACTIVE;
      return true;
    } catch (Exception e) {
      LOG.warn("Exception handling the winning of election", e);
      // Caller will handle quitting and rejoining the election.
      return false;
    }
  }

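This method makes two important calls. The first is fenceOldActive on line 8, which deals with the node that was active before the switch. The second is appClient.becomeActive on line 12.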

The focus here is appClient's becomeActive method, which is as follows:

   public void becomeActive() throws ServiceFailedException {
      ZKFailoverController.this.becomeActive();
    }

It simply delegates to the becomeActive method of ZKFailoverController, which is as follows:

private synchronized void becomeActive() throws ServiceFailedException {
    LOG.info("Trying to make " + localTarget + " active...");
    try {
      HAServiceProtocolHelper.transitionToActive(localTarget.getProxy(
          conf, FailoverController.getRpcTimeoutToNewActive(conf)),
          createReqInfo());
      String msg = "Successfully transitioned " + localTarget +
          " to active state";
      LOG.info(msg);
      serviceState = HAServiceState.ACTIVE;
      recordActiveAttempt(new ActiveAttemptRecord(true, msg));

    } catch (Throwable t) {
      String msg = "Couldn't make " + localTarget + " active";
      LOG.fatal(msg, t);

      recordActiveAttempt(new ActiveAttemptRecord(false, msg + "\n" +
          StringUtils.stringifyException(t)));

      if (t instanceof ServiceFailedException) {
        throw (ServiceFailedException)t;
      } else {
        throw new ServiceFailedException("Couldn't transition to active",
            t);
      }
    }
  }

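The key call is on line 4, which takes two arguments. The first is obtained from localTarget's getProxy method, which, as analyzed in Part (25), returns a proxy object for the NameNode. The transitionToActive method being called is as follows: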

  public static void transitionToActive(HAServiceProtocol svc,
      StateChangeRequestInfo reqInfo)
      throws IOException {
    try {
      svc.transitionToActive(reqInfo);
    } catch (RemoteException e) {
      throw e.unwrapRemoteException(ServiceFailedException.class);
    }
  }

Line 5 calls the proxy object's transitionToActive method directly, which goes over RPC to the corresponding method of NameNodeRpcServer. That method is as follows:

 public synchronized void transitionToActive(StateChangeRequestInfo req) 
      throws ServiceFailedException, AccessControlException, IOException {
    checkNNStartup();
    nn.checkHaStateChange(req);
    nn.transitionToActive();
  }

The key call is nn.transitionToActive on line 5, whose content is as follows:

  synchronized void transitionToActive() 
      throws ServiceFailedException, AccessControlException {
    namesystem.checkSuperuserPrivilege();
    if (!haEnabled) {
      throw new ServiceFailedException("HA for namenode is not enabled");
    }
    state.setState(haContext, ACTIVE_STATE);
  }

The key call is setState on line 7, which changes the state held in state. Earlier parts showed that a NameNode always starts in the standby state, so state here is the standby state. Its setState method is as follows:

  public void setState(HAContext context, HAState s) throws ServiceFailedException {
    if (s == NameNode.ACTIVE_STATE) {
      setStateInternal(context, s);
      return;
    }
    super.setState(context, s);
  }

The value of s passed in is ACTIVE_STATE, so the if condition on line 2 is true and setStateInternal on line 3 is executed. That method is as follows:

protected final void setStateInternal(final HAContext context, final HAState s)
      throws ServiceFailedException {
    prepareToExitState(context);
    s.prepareToEnterState(context);
    context.writeLock();
    try {
      exitState(context);
      context.setState(s);
      s.enterState(context);
      s.updateLastHATransitionTime();
    } finally {
      context.writeUnlock();
    }
  }

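The logic of setStateInternal is straightforward. It first prepares to exit the current state and to enter the new one (lines 3 and 4). If that succeeds, it takes the context's write lock, exits the current state (line 7), sets the new state (line 8), and enters the new state (line 9).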
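
The ordering of these steps can be sketched with a tiny state holder that records what happens and when. TransitionSketch and NamedState are hypothetical names used only for this illustration, and the ReentrantReadWriteLock stands in for the context's write lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class TransitionSketch {
    static final class NamedState {
        final String name;
        NamedState(String name) { this.name = name; }
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> trace = new ArrayList<>();
    private NamedState current;

    TransitionSketch(NamedState initial) { this.current = initial; }

    void transitionTo(NamedState next) {
        // Preparation happens before the lock is taken.
        trace.add("prepareToExit " + current.name);
        trace.add("prepareToEnter " + next.name);
        lock.writeLock().lock();
        try {
            // Exit, switch and enter happen atomically under the write lock.
            trace.add("exit " + current.name);
            current = next;
            trace.add("enter " + next.name);
        } finally {
            lock.writeLock().unlock();
        }
    }

    List<String> trace() { return trace; }

    String currentName() { return current.name; }

    public static void main(String[] args) {
        TransitionSketch ctx = new TransitionSketch(new NamedState("standby"));
        ctx.transitionTo(new NamedState("active"));
        // [prepareToExit standby, prepareToEnter active, exit standby, enter active]
        System.out.println(ctx.trace());
    }
}
```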

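The exit step calls exitState, which here means leaving the standby state. Part (24) showed that entering the standby state mainly starts two threads, one that syncs data from the active node and one that performs checkpoints. Leaving the standby state mainly means stopping those two threads. The exitState method called here is as follows: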

 public void exitState(HAContext context) throws ServiceFailedException {
    try {
      context.stopStandbyServices();
    } catch (IOException e) {
      throw new ServiceFailedException("Failed to stop standby services", e);
    }
  }

It in turn calls the context's stopStandbyServices method, which is as follows:

   public void stopStandbyServices() throws IOException {
      try {
        if (namesystem != null) {
          namesystem.stopStandbyServices();
        }
      } catch (Throwable t) {
        doImmediateShutdown(t);
      }
    }

The key is line 4, which calls namesystem's stopStandbyServices method. Its content is as follows:

  void stopStandbyServices() throws IOException {
    LOG.info("Stopping services started for standby state");
    if (standbyCheckpointer != null) {
      standbyCheckpointer.stop();
    }
    if (editLogTailer != null) {
      editLogTailer.stop();
    }
    if (dir != null && getFSImage() != null && getFSImage().editLog != null) {
      getFSImage().editLog.close();
    }
  }

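On lines 4 and 7, this method stops the two threads mentioned above, standbyCheckpointer and editLogTailer. It then closes the edit log if one is present (line 10).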
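
Stopping a background thread of this kind usually means clearing a run flag, interrupting the thread, and waiting for it to finish. The sketch below shows that general pattern with a hypothetical StoppableWorker class; it is not how standbyCheckpointer or editLogTailer are implemented, which the listings in this article do not show.

```java
class StoppableWorker {
    private final Thread thread;
    private volatile boolean shouldRun = true;

    StoppableWorker(String name) {
        thread = new Thread(() -> {
            while (shouldRun) {
                try {
                    Thread.sleep(50); // stands in for tailing edits or checkpointing
                } catch (InterruptedException e) {
                    // Woken by stop(); the loop condition decides whether to exit.
                }
            }
        }, name);
        thread.setDaemon(true);
    }

    void start() { thread.start(); }

    // Signals the worker to finish and waits until it has exited.
    void stop() {
        shouldRun = false;
        thread.interrupt();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        }
    }

    boolean isAlive() { return thread.isAlive(); }

    public static void main(String[] args) {
        StoppableWorker checkpointer = new StoppableWorker("checkpointer");
        StoppableWorker tailer = new StoppableWorker("tailer");
        checkpointer.start();
        tailer.start();
        checkpointer.stop();
        tailer.stop();
        System.out.println(checkpointer.isAlive() + " " + tailer.isAlive()); // false false
    }
}
```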

Next is the enterState method for the new state. The new state is active, so the active state's enterState is called. Its content is as follows:

public void enterState(HAContext context) throws ServiceFailedException {
    try {
      context.startActiveServices();
    } catch (IOException e) {
      throw new ServiceFailedException("Failed to start active services", e);
    }
  }

  public void startActiveServices() throws IOException {
      try {
        namesystem.startActiveServices();
      } catch (Throwable t) {
        doImmediateShutdown(t);
      }
    }

As before, the calls are delegated level by level. The startActiveServices method that is finally called is as follows:

void startActiveServices() throws IOException {
    startingActiveService = true;
    LOG.info("Starting services required for active state");
    writeLock();
    try {
      FSEditLog editLog = getFSImage().getEditLog();

      if (!editLog.isOpenForWrite()) {
        // During startup, we're already open for write during initialization.
        editLog.initJournalsForWrite();
        // May need to recover
        editLog.recoverUnclosedStreams();

        LOG.info("Catching up to latest edits from old active before " +
            "taking over writer role in edits logs");
        editLogTailer.catchupDuringFailover();

        blockManager.setPostponeBlocksFromFuture(false);
        blockManager.getDatanodeManager().markAllDatanodesStale();
        blockManager.clearQueues();
        blockManager.processAllPendingDNMessages();

        // Only need to re-process the queue, If not in SafeMode.
        if (!isInSafeMode()) {
          LOG.info("Reprocessing replication and invalidation queues");
          initializeReplQueues();
        }

        if (LOG.isDebugEnabled()) {
          LOG.debug("NameNode metadata after re-processing " +
              "replication and invalidation queues during failover:\n" +
              metaSaveAsString());
        }

        long nextTxId = getFSImage().getLastAppliedTxId() + 1;
        LOG.info("Will take over writing edit logs at txnid " + 
            nextTxId);
        editLog.setNextTxId(nextTxId);

        getFSImage().editLog.openForWrite();
      }

      // Enable quota checks.
      dir.enableQuotaChecks();
      if (haEnabled) {
        // Renew all of the leases before becoming active.
        // This is because, while we were in standby mode,
        // the leases weren't getting renewed on this NN.
        // Give them all a fresh start here.
        leaseManager.renewAllLeases();
      }
      leaseManager.startMonitor();
      startSecretManagerIfNecessary();

      //ResourceMonitor required only at ActiveNN. See HDFS-2914
      this.nnrmthread = new Daemon(new NameNodeResourceMonitor());
      nnrmthread.start();

      nnEditLogRoller = new Daemon(new NameNodeEditLogRoller(
          editLogRollerThreshold, editLogRollerInterval));
      nnEditLogRoller.start();

      if (lazyPersistFileScrubIntervalSec > 0) {
        lazyPersistFileScrubber = new Daemon(new LazyPersistFileScrubber(
            lazyPersistFileScrubIntervalSec));
        lazyPersistFileScrubber.start();
      }

      cacheManager.startMonitorThread();
      blockManager.getDatanodeManager().setShouldSendCachingCommands(true);
    } finally {
      startingActiveService = false;
      checkSafeMode();
      writeUnlock("startActiveServices");
    }
  }