Custom Accumulator in Spark 2.1

An accumulator can sum or count values from Spark tasks across all nodes and then return the final result to the driver; LongAccumulator is a built-in example.
When values need to be accumulated in a custom way, this can be done by extending AccumulatorV2 in Spark 2.1.
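
For comparison, counting with the built-in LongAccumulator is a one-liner (the RDD and names here are illustrative):

val counter = sc.longAccumulator("recordCounter")   // built-in since Spark 2.0
someRdd.foreach(_ => counter.add(1))                // updated on the executors
println(counter.value)                              // read back on the driver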

Below, a MapLongAccumulator is implemented to count the occurrences of several distinct values separately.

import java.util
import java.util.Collections

import org.apache.spark.util.AccumulatorV2
import scala.collection.JavaConversions._


// Accumulates a per-key count: each added value of type T increments its entry
// in a Map[T, Long].
class MapLongAccumulator[T] extends AccumulatorV2[T, java.util.Map[T, java.lang.Long]] {
    private val _map: java.util.Map[T, java.lang.Long] = Collections.synchronizedMap(new util.HashMap[T, java.lang.Long]())

    override def isZero: Boolean = _map.isEmpty

    override def copyAndReset(): MapLongAccumulator[T] = new MapLongAccumulator[T]

    override def copy(): MapLongAccumulator[T] = {
        val newAcc = new MapLongAccumulator[T]
        _map.synchronized {
            newAcc._map.putAll(_map)
        }
        newAcc
    }

    override def reset(): Unit = _map.clear()

    // Called on the executors for every element: increment the count for v.
    override def add(v: T): Unit = _map.synchronized {
        val old = _map.put(v, 1L)
        if (old != null) {
            _map.put(v, old + 1)
        }
    }

    // Called on the driver to fold another task's partial result into this one.
    override def merge(other: AccumulatorV2[T, java.util.Map[T, java.lang.Long]]): Unit = other match {
        case o: MapLongAccumulator[T] =>
            for ((k, v) <- o.value) {
                val old = _map.put(k, v)
                if (old != null) {
                    _map.put(k, old.longValue() + v)
                }
            }
        case _ => throw new UnsupportedOperationException(
            s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
    }

    // Return an immutable snapshot so the driver can read it safely at any time.
    override def value: java.util.Map[T, java.lang.Long] = _map.synchronized {
        Collections.unmodifiableMap(new util.HashMap[T, java.lang.Long](_map))
    }

    def setValue(newValue: java.util.Map[T, java.lang.Long]): Unit = {
        _map.clear()
        _map.putAll(newValue)
    }
}
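
A note on the design: the internal map is wrapped in Collections.synchronizedMap and value hands out an unmodifiable snapshot, because (per the AccumulatorV2 contract quoted below) the OUT value may be read from other threads while tasks are still updating the accumulator.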

Use case:

val accumulator = new MapLongAccumulator[String]()
sc.register(accumulator, "CustomAccumulator")
someRdd.map(a => {
            ...
            val convertResult = convert(a)
            accumulator.add(convertResult.toString)
            ...
        }).repartition(1).saveAsTextFile(outputPath)
println(accumulator.value)   // each distinct convert result is counted separately
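
With, say, three distinct convert results, the printed value might look like {resultA=120, resultB=34, resultC=6} (hypothetical keys, depending on what convert returns). Note also that because the accumulator is updated inside a transformation (map) rather than an action, Spark may apply a task's updates more than once if tasks are re-executed; exactly-once accumulator semantics are only guaranteed inside actions such as foreach.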

AccumulatorV2 is the new accumulator base class, available since Spark 2.0.0.

public abstract class AccumulatorV2<IN,OUT>
extends Object
implements scala.Serializable

The base class for accumulators, that can accumulate inputs of type IN, and produce output of type OUT.
OUT should be a type that can be read atomically (e.g., Int, Long), or thread-safely (e.g., synchronized collections) because it will be read from other threads.

Methods of AccumulatorV2:
The merge method does not need to be thread-safe; it is called from a single thread when a task completes (handleTaskCompletion in DAGScheduler):

org.apache.spark.scheduler.DAGScheduler.scala
/**
   * Responds to a task finishing. This is called inside the event loop so it assumes that it can
   * modify the scheduler's internal state. Use taskEnded() to post a task end event from outside.
   */
  private[scheduler] def handleTaskCompletion(event: CompletionEvent) {
    val task = event.task
    val taskId = event.taskInfo.id
    val stageId = task.stageId
    val taskType = Utils.getFormattedClassName(task)

    outputCommitCoordinator.taskCompleted(
      stageId,
      task.partitionId,
      event.taskInfo.attemptNumber, // this is a task attempt number
      event.reason)

    // Reconstruct task metrics. Note: this may be null if the task has failed.
    val taskMetrics: TaskMetrics =
      if (event.accumUpdates.nonEmpty) {
        try {
          TaskMetrics.fromAccumulators(event.accumUpdates)
        } catch {
          case NonFatal(e) =>
            logError(s"Error when attempting to reconstruct metrics for task $taskId", e)
            null
        }
      } else {
        null
      }

    ......
    val stage = stageIdToStage(task.stageId)
    event.reason match {
      case Success =>
        stage.pendingPartitions -= task.partitionId
        task match {
          case rt: ResultTask[_, _] =>
            // Cast to ResultStage here because it's part of the ResultTask
            // TODO Refactor this out to a function that accepts a ResultStage
            val resultStage = stage.asInstanceOf[ResultStage]
            resultStage.activeJob match {
              case Some(job) =>
                if (!job.finished(rt.outputId)) {
                  updateAccumulators(event)
                  ......
          case smt: ShuffleMapTask =>
            val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
            updateAccumulators(event)
            val status = event.result.asInstanceOf[MapStatus]
            val execId = status.location.executorId
            logDebug("ShuffleMapTask finished on " + execId)
            if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
              logInfo(s"Ignoring possibly bogus $smt completion from executor $execId")
            } else {
              shuffleStage.addOutputLoc(smt.partitionId, status)
            }
            ......
      case exceptionFailure: ExceptionFailure =>
        // Tasks failed with exceptions might still have accumulator updates.
        updateAccumulators(event)
    ......
}


/**
   * Merge local values from a task into the corresponding accumulators previously registered
   * here on the driver.
   *
   * Although accumulators themselves are not thread-safe, this method is called only from one
   * thread, the one that runs the scheduling loop. This means we only handle one task
   * completion event at a time so we don't need to worry about locking the accumulators.
   * This still doesn't stop the caller from updating the accumulator outside the scheduler,
   * but that's not our problem since there's nothing we can do about that.
   */
  private def updateAccumulators(event: CompletionEvent): Unit = {
    val task = event.task
    val stage = stageIdToStage(task.stageId)
    try {
      event.accumUpdates.foreach { updates =>
        val id = updates.id
        // Find the corresponding accumulator on the driver and update it
        val acc: AccumulatorV2[Any, Any] = AccumulatorContext.get(id) match {
          case Some(accum) => accum.asInstanceOf[AccumulatorV2[Any, Any]]
          case None =>
            throw new SparkException(s"attempted to access non-existent accumulator $id")
        }
        acc.merge(updates.asInstanceOf[AccumulatorV2[Any, Any]])
        // To avoid UI cruft, ignore cases where value wasn't updated
        if (acc.name.isDefined && !updates.isZero) {
          stage.latestInfo.accumulables(id) = acc.toInfo(None, Some(acc.value))
          event.taskInfo.accumulables += acc.toInfo(Some(updates.value), Some(acc.value))
        }
      }
    } catch {
      case NonFatal(e) =>
        logError(s"Failed to update accumulators for task ${task.partitionId}", e)
    }
  }
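
Note how updateAccumulators looks up the registered accumulator by id on the driver and calls acc.merge(updates) on it: merge is the driver-side hook through which each task's partial result is folded in, one completion event at a time.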

org.apache.spark.executor.TaskMetrics.scala
  /**
   * Construct a [[TaskMetrics]] object from a list of accumulator updates, called on driver only.
   */
  def fromAccumulators(accums: Seq[AccumulatorV2[_, _]]): TaskMetrics = {
    val tm = new TaskMetrics
    val (internalAccums, externalAccums) =
      accums.partition(a => a.name.isDefined && tm.nameToAccums.contains(a.name.get))

    internalAccums.foreach { acc =>
      val tmAcc = tm.nameToAccums(acc.name.get).asInstanceOf[AccumulatorV2[Any, Any]]
      tmAcc.metadata = acc.metadata
      tmAcc.merge(acc.asInstanceOf[AccumulatorV2[Any, Any]])
    }

    tm.externalAccums ++= externalAccums
    tm
  }
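
The merge semantics can be checked without a cluster; a minimal driver-side sketch using only local objects:

val a = new MapLongAccumulator[String]
val b = new MapLongAccumulator[String]
a.add("x"); a.add("x")   // a.value == {x=2}
b.add("x"); b.add("y")   // b.value == {x=1, y=1}
a.merge(b)
println(a.value)         // {x=3, y=1}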