
[spark-src-core] 6. checkpoint in spark


   Like other big data technologies, checkpointing is a well-known way to keep a snapshot of data and speed up failover: a job restores to the most recent checkpointed state, so you do not need to recompute the RDD from the beginning of the job.

  In fact, the checkpoint operation cuts off the relationships to all parent RDDs, so the current RDD becomes the head of the data lineage; it is backed by a CheckpointRDD to achieve this goal. Moreover, RDDCheckpointData is another wrapper around the CheckpointRDD.
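The lineage cut described above can be illustrated with a toy model (plain Scala, not Spark code; the `Lineage` class and its names are invented for illustration): each node keeps references to its parents, and checkpointing simply drops them, making the node the head of the lineage.

```scala
// Toy model of lineage truncation: checkpoint() replaces the parent list
// with Nil, so this node no longer depends on any ancestor.
case class Lineage(name: String, var parents: List[Lineage]) {
  def checkpoint(): Unit = parents = Nil // cut links to all ancestors
  def depth: Int =
    if (parents.isEmpty) 1 else 1 + parents.map(_.depth).max
}

val a = Lineage("textFile", Nil)
val b = Lineage("map", List(a))
val c = Lineage("filter", List(b))

val before = c.depth // 3: c -> b -> a
c.checkpoint()
val after = c.depth  // 1: c is now the head of the lineage
```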

 

1. How to

  In Spark (1.4.1), checkpointing is done in the steps below:

  

a. Set up the checkpoint directory via SparkContext.setCheckpointDir(...)
b. Mark the RDD to snapshot the data at this point in the lineage: rdd.checkpoint()
c. The real checkpoint operation runs at the end of the job (by default)
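The three steps above can be sketched against the Spark API roughly as follows (a sketch only; it assumes a running SparkContext `sc`, and the directory path is a placeholder):

```scala
// step a: set the checkpoint directory on the SparkContext
sc.setCheckpointDir("hdfs:///tmp/ckpt")

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.cache()      // recommended: avoid recomputing the RDD for the checkpoint
rdd.checkpoint() // step b: only marks the RDD for checkpointing

rdd.count()      // step c: the first job; the real checkpoint runs after it
```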

  Now let's go through these steps in more detail.

   For step 'b', the source is implemented along this code path:

 /**
   * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
   * directory set with SparkContext.setCheckpointDir() and all references to its parent
   * RDDs will be removed. This function must be called before any job has been
   * executed on this RDD. It is strongly recommended that this RDD is persisted in
   * memory, otherwise saving it on a file will require recomputation. (cf. RDDCheckpointData#doCheckpoint())
   */
  def checkpoint() {
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some(new RDDCheckpointData(this))
      checkpointData.get.markForCheckpoint()
    }
  }

   Reading the comment, you may be curious: why is it necessary to persist the RDD, and why in memory?

   Diving into the source, we find that the checkpoint operation is really a second job run over this RDD to save its result to a file, so the RDD is computed one more time if it is not persisted.
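This cost can be made concrete with a toy model (plain Scala, not Spark code; the `Dataset` class and its method names are invented for illustration): a counter tracks how many times the underlying computation runs, with and without persisting.

```scala
// Toy model: compute() is counted; persist() caches the result on first
// materialization, so the extra "checkpoint job" hits the cache instead
// of recomputing.
class Dataset(compute: () => Seq[Int]) {
  var computeCount = 0
  private var shouldCache = false
  private var cached: Option[Seq[Int]] = None

  def persist(): this.type = { shouldCache = true; this }

  private def run(): Seq[Int] = cached.getOrElse {
    computeCount += 1
    val out = compute()
    if (shouldCache) cached = Some(out)
    out
  }

  def collect(): Seq[Int] = run()      // the user's job
  def doCheckpoint(): Seq[Int] = run() // the extra job that writes the snapshot
}

val unpersisted = new Dataset(() => (1 to 3).map(_ * 2))
unpersisted.collect()
unpersisted.doCheckpoint() // computed twice: the job, then the checkpoint

val persisted = new Dataset(() => (1 to 3).map(_ * 2)).persist()
persisted.collect()
persisted.doCheckpoint()   // computed once: the checkpoint reads the cache
```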

  On the other hand, why is it recommended to persist this RDD in memory rather than on disk? In fact, there is only a small difference between data kept in memory and data kept in a file (perhaps the data format). Therefore, I think the author's emphasis is not on where to persist, but on the act of persisting itself.

  

2. FAQ

 

a. How to use a checkpoint to restore data

  From StreamingContext, we know there is a function named 'getOrCreate(...)' that uses the checkpoint directory defined earlier: if any snapshot data exists there, it is read back into the RDD.
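The restore-or-build behavior of getOrCreate can be modeled in plain Scala (a toy model, not the real StreamingContext API; the map standing in for the checkpoint directory and all names here are invented for illustration):

```scala
// Toy model of the getOrCreate(...) pattern: if a snapshot exists under the
// checkpoint "directory", restore it; otherwise build a fresh context with
// the supplied factory function and write the snapshot.
val checkpointStore = scala.collection.mutable.Map[String, String]()

def getOrCreate(dir: String, create: () => String): String =
  checkpointStore.get(dir) match {
    case Some(restored) => restored // recover from the snapshot
    case None =>
      val fresh = create()
      checkpointStore(dir) = fresh  // first run: write the snapshot
      fresh
  }

val first  = getOrCreate("/ckpt/app", () => "fresh-context")
val second = getOrCreate("/ckpt/app", () => "should-not-run") // restored
```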

 

b. Why not save the computed results when the RDD runs the first time?

   Hm... no doubt, the checkpoint operation really means running a second, identical job over this RDD. So why not save the results to a file at the same time as the first run?

  First, only one anonymous function is passed to any runJob(...), so no extra parameter can be accepted besides the user function.

  Second, keeping the user function separate from the checkpoint save operation makes the code clearer to debug, maintain, and so on.

 

 
